Listing Criteria for SNP Inclusion
into the ISOGG Y-DNA Haplogroup Tree - 2017
The entire work is identified by the Version
Number and date given on the
Main Page.
Directions for citing the document are given at
the bottom of the Main
Page.
Version History
Last revision date for this specific page: 1 January 2017
These recommendations are to assure that there is a uniform
set of criteria for accepting new mutations for inclusion
on the ISOGG Y-DNA haplogroup tree.
Because of the abundance of alternatives now available, only
single nucleotide polymorphisms (SNPs), and not insertions or
deletions (indels), are being accepted for new additions.
In exceptional cases other variants may be
considered for inclusion on a case-by-case basis if they can be clearly
demonstrated to have equivalent properties
to SNPs, but the burden of proof required will be much higher and at
the discretion of the committee.
The quality guidelines for chromosome positions and reads at
the bottom of this page are in addition to other requirements.
Special Coding for Interpreting SNP
status
Added SNPs
are color coded red and defined as SNPs that have met all of the
criteria for inclusion and did not appear on last year's tree.
SNPs under
Investigation are color coded pink and are SNPs that have not yet been
placed on the tree because additional testing is needed to confirm
adequate positive samples and/or correct placement on the tree.
SNPs found solely from next generation
sequencing are colored either black or red and shown in italics; they
indicate quality, consistent reads found in Y sequencing. These are not
confirmed by Sanger sequencing or microarray testing and sometimes may
not be amenable to either process.
SNP(s) printed in bold in a subclade:
The criterion for a representative SNP printed in bold for a subclade is
that it has traditionally represented that subgroup or, if in italics,
seems the most promising representative. Bolded items were frequently
confirmed by Sanger sequencing.
Identical SNPs are SNPs that have the same y-position,
mutation, and subclade within a haplogroup and were discovered in
different labs. They are listed in alphabetical order (not necessarily
in the order of discovery) and are separated by "/". Examples:
P257/U6, L31/S149.
Mutation names followed by ^ represent ones from
next-generation sequencing which do not yet meet quality guidelines for
minimum number of reads. Those with ^^ represent mutations that do not
meet quality guidelines but may be a helpful identifier. ~ indicates a
subgroup whose position on the tree is only approximate.
General Requirements for SNP
Validation
The requirements listed here in this General Requirements
section apply to validating SNPs discussed in
Requirements of Specific Type of Testing in the next section below.
Inserting a SNP Creating a Non-Terminal Branch to
the ISOGG Tree
The supporting information provided by the proposer should demonstrate
that the new SNP is downstream of an established tree mutation. It must
also be shown that the SNP was tested in individuals from all
parallel subgroups on the tree. In cases where relevant existing tree
subgroups are from rare populations and based solely on old research
listing only one sample proving the existence of the SNP, an exception
may be granted for testing of the old subgroup. The mutations of the
existing subgroup will then be listed temporarily as position
undetermined.
Example: Suppose that a new subgroup named Q18 is
being added. Fictional
example:
G-L140
    G-L13
    G-L1266
    G-Q18
        G-L1268
Then the evidence for Q18 must show that a man is derived for both Q18
and L140. Simultaneously one man each from L1266 and L13 must be
ancestral for Q18. In addition, one man derived for Q18 must be derived
for L1268, and a second Q18 man ancestral for L1268. Derived means the
mutation is present; ancestral means it is absent.
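The evidence pattern in this example can be written as a small check. The following is only an illustrative sketch, not an ISOGG tool: the function names, the data layout (one dict of SNP calls per tested man) and the "derived"/"ancestral" strings are assumptions made for the example.

```python
def has_man(samples, calls):
    """True if at least one tested man shows every call in `calls`."""
    return any(all(s.get(snp) == state for snp, state in calls.items())
               for s in samples)


def check_non_terminal(samples, new_snp, upstream, parallels, downstream):
    """Return a list of unmet evidence requirements (empty if all are met).

    samples: list of dicts mapping SNP name -> "derived" or "ancestral"
             (hypothetical layout; untested SNPs are simply absent).
    """
    missing = []
    # A man must be derived for both the new SNP and the upstream SNP.
    if not has_man(samples, {new_snp: "derived", upstream: "derived"}):
        missing.append(f"no man derived for both {new_snp} and {upstream}")
    # One man from each parallel subgroup must be ancestral for the new SNP.
    for p in parallels:
        if not has_man(samples, {p: "derived", new_snp: "ancestral"}):
            missing.append(f"no {p} man ancestral for {new_snp}")
    # Men with the new SNP must be seen both with and without the downstream SNP.
    if not has_man(samples, {new_snp: "derived", downstream: "derived"}):
        missing.append(f"no {new_snp} man derived for {downstream}")
    if not has_man(samples, {new_snp: "derived", downstream: "ancestral"}):
        missing.append(f"no {new_snp} man ancestral for {downstream}")
    return missing


# The fictional Q18 example from the text.
men = [
    {"L140": "derived", "Q18": "derived", "L1268": "derived"},
    {"Q18": "derived", "L1268": "ancestral"},
    {"L13": "derived", "Q18": "ancestral"},
    {"L1266": "derived", "Q18": "ancestral"},
]
print(check_non_terminal(men, "Q18", "L140", ["L13", "L1266"], "L1268"))  # []
```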
Adding a SNP Representing a New Terminal Branch to
the ISOGG Tree
Where the new SNP defines a new terminal branch of an existing
branch:
at least one individual who has the new SNP is found
also to have a SNP defining the immediate upstream subgroup.
at least one individual from any parallel subgroup to
the new subgroup is found also to lack the submitted SNP.
at least one individual from the new subgroup is found
to lack the SNP(s) defining the parallel subgroup(s).
Example: Suppose that a new subgroup is
being added with name QQ12. Fictional
example:
G-L5432
    G-P343
    G-QQ12
Then the evidence for QQ12 must show that two men are derived for QQ12.
Simultaneously one man from P343 must be ancestral for QQ12. Also, one
of the QQ12 men must be derived for L5432 and ancestral for P343.
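The three conditions for a new terminal branch, together with the two-men requirement from the example, can be sketched as follows. This is a hypothetical helper for illustration only: the function name, the per-man dict layout and the "derived"/"ancestral" strings are assumptions, not part of any ISOGG procedure.

```python
def check_terminal(samples, new_snp, upstream, parallels):
    """Return a list of unmet requirements for a new terminal branch.

    samples: list of dicts mapping SNP name -> "derived" or "ancestral"
             (assumed layout; untested SNPs are absent from the dict).
    """
    carriers = [s for s in samples if s.get(new_snp) == "derived"]
    problems = []
    # The example requires two men derived for the new SNP.
    if len(carriers) < 2:
        problems.append(f"fewer than two men derived for {new_snp}")
    # At least one carrier also has the SNP of the immediate upstream subgroup.
    if not any(s.get(upstream) == "derived" for s in carriers):
        problems.append(f"no {new_snp} man derived for {upstream}")
    # At least one man from a parallel subgroup lacks the submitted SNP.
    if not any(s.get(p) == "derived" and s.get(new_snp) == "ancestral"
               for s in samples for p in parallels):
        problems.append(f"no parallel-subgroup man ancestral for {new_snp}")
    # At least one carrier lacks the SNP(s) defining the parallel subgroup(s).
    if not any(all(s.get(p) == "ancestral" for p in parallels)
               for s in carriers):
        problems.append(f"no {new_snp} man ancestral for {', '.join(parallels)}")
    return problems


# The fictional QQ12 example from the text.
qq12_men = [
    {"QQ12": "derived", "L5432": "derived", "P343": "ancestral"},
    {"QQ12": "derived"},
    {"P343": "derived", "QQ12": "ancestral"},
]
print(check_terminal(qq12_men, "QQ12", "L5432", ["P343"]))  # []
```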
Sanger Sequencing
Examples of Sanger sequencing are the tests at the company ySeq and the
Advanced Tests (SNP) at Family Tree DNA. STR testing is available, for
instance, at Genebase and Family Tree DNA. Acceptable testing for this
category consists of Sanger sequencing which targets a short segment of
Y-DNA.
The objective of the ISOGG Tree at this time is to include all SNPs
that arose prior to about the year 1500 C.E. This guideline may be
measured through STR diversity or alternative evidence.
Where a new terminal subgroup is being added, STR marker results or
other evidence described below for two men with the new SNP are needed.
STR Diversity
To be accepted the SNP must be observed in at least two individuals and
must meet the STR diversity requirement. A SNP that does not meet this
requirement will be classified as a Private SNP (see definition above).
The STR diversity requirement is met if the following conditions are
satisfied:
If the SNP is a Non-Terminal Branch SNP, no further
proof of diversity is required.
Genetic distance is calculated using the Infinite
Alleles Model (IAM), which most laboratories use. A marker for which
there is a null value in one sample must be discarded from the
calculations.
All markers tested by both individuals must be compared.
If 74 markers (or fewer) are compared, the minimum
genetic distance to meet the diversity requirement is 5.
If 75 (or more) markers are compared, the diversity
requirement is a minimum of 7%, computed by dividing the genetic
distance by the number of markers compared and rounding to the nearest
whole percent.
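The STR diversity conditions above can be sketched in a few lines. This is an illustration under stated assumptions, not an official calculator: haplotypes are assumed to be dicts of marker name to allele value, a null value is represented as None, each marker (including multi-copy markers) is treated as a single value, and "nearest" is taken to mean rounding halves up.

```python
def iam_distance(a, b):
    """Genetic distance under the Infinite Alleles Model.

    Each marker that differs counts as one step, whatever the size of the
    difference. Only markers tested in both samples are compared, and a
    marker with a null value (None) in either sample is discarded.
    Returns (genetic_distance, markers_compared).
    """
    shared = [m for m in a
              if m in b and a[m] is not None and b[m] is not None]
    distance = sum(1 for m in shared if a[m] != b[m])
    return distance, len(shared)


def meets_str_diversity(a, b):
    """Apply the 74-marker / 75-marker thresholds from the text."""
    distance, compared = iam_distance(a, b)
    if compared == 0:
        return False
    if compared <= 74:
        return distance >= 5
    # Whole-number percentage, rounding halves up (an assumption).
    percent = (200 * distance + compared) // (2 * compared)
    return percent >= 7


hap_a = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": None}
hap_b = {"DYS393": 13, "DYS390": 25, "DYS19": 15, "DYS391": 10}
print(iam_distance(hap_a, hap_b))  # (2, 3)
```

With these assumptions, a genetic distance of 5 over 75 or 76 compared markers rounds to 7% and passes, while 5 over 77 markers rounds to 6% and does not.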
Alternative Evidence
If the submitter can otherwise provide evidence that the common
ancestor of the two samples can be reasonably expected to have lived
more than 500 years ago, this will also be considered.
Next Generation Sequencing
Next generation sequencing is available to the genealogical community
from Full Genomes Corporation and through Family Tree DNA's Big Y test.
Next generation sequencing has the largest coverage of any type of SNP
testing currently available.
The committee recognizes that there is a wide variety of
ways in which sequencing information is available. Because of this, no
specific criteria for sequencing information are provided here except
the new, tentative quality guidelines in the next section. The goal of
the reviewers of the sequencing submissions – at one extreme – will be
to easily accept quality SNPs from old, root branches found in many
samples within all the downstream branches. At the opposite extreme, it
is unlikely reviewers will accept SNPs near or in terminal branches
whose positions depend on the results from one sample.
The submitter can use raw data report(s) pertaining to
the sequencing when they provide the needed information. Two
examples of raw data reports are a VCF file showing the usual
quality scores, DP scores for depth of reads, etc. for the involved
sample and pertinent additional ones, including ones from other
haplogroups, or instead the so-called “haplogroup compare report” from
Full Genomes Corp. Results from Sanger sequencing or from microarray
products, such as Geno 2.0 or Chromo 2.0, might be acceptable
comparative information in certain cases. Having a large number of
pertinent comparative samples on a VCF report can improve the scoring
information.
The reviewer will have to take into consideration the
coverage of the next generation sequencing, varied quality scorings,
position of the site on the chromosome, the percentage of samples with
clean reads at the site in question, possible indel relationships to
the SNP, geographical separation of the samples, non-next generation
sequencing testing, results for the SNP site in other reports, and
other factors in making a complex judgment as to whether the submitted
SNP is almost certain to show the same results in next generation
sequencing of new comparable samples.
More precise criteria for next generation sequencing
submissions may be provided as evidence accumulates. Addendum: these
are now included in the tentative quality guidelines below.
When a new SNP creating a new terminal branch is being
added to the tree, at least two of the submitted samples must each have
an average of 3 unique (singleton) SNPs per 10 million base pairs of
sequencing coverage. Reviewers will determine uniqueness according to
comparisons to all available sequencing results rather than samples
tested at a particular laboratory.
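The singleton requirement is a simple rate calculation. A minimal sketch follows, assuming that "an average of 3" is read as "at least 3" and that the covered base pairs for each sample are known; the function names and the input layout are hypothetical.

```python
def singleton_rate_ok(singletons, covered_bp, minimum=3, per_bp=10_000_000):
    """True if the sample has at least `minimum` singleton SNPs per
    `per_bp` base pairs of sequencing coverage (integer arithmetic)."""
    if covered_bp <= 0:
        raise ValueError("covered_bp must be positive")
    return singletons * per_bp >= minimum * covered_bp


def enough_diverse_samples(samples):
    """samples: list of (singleton_count, covered_bp) pairs.
    The text requires at least two submitted samples to qualify."""
    return sum(singleton_rate_ok(n, bp) for n, bp in samples) >= 2


# 5 singletons over 14 million bp is about 3.6 per 10 million: passes.
# 4 singletons over 14 million bp is about 2.9 per 10 million: fails.
print(enough_diverse_samples([(5, 14_000_000), (4, 14_000_000)]))  # False
```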
If the evidence for the SNP is based solely on next
generation sequencing, the SNP will appear in italics on the tree.
Microarray Chip-based Genotyping
Examples of microarray chip-based genotyping are Geno 2.0 or Geno 2.0
Next-Generation test, 23andMe, Chromo 2.0 and Family Tree DNA's Deep
Clade panels. Microarray chips target a selected group of SNPs.
Novel SNPs found in microarray products without a
presence also in other qualifying sources - such as Sanger sequencing
or next generation sequencing - cannot be submitted. However,
chip-based genotyping results can be used in combination with Sanger
sequencing and/or next generation sequencing results as validating
evidence for one of the samples. If chip-based genotyping is part of
the evidence, the approved SNP will be listed in regular type, rather
than italics, even if the other evidence is from next generation
sequencing.
Samples from chip-based genotyping used to prove a new
terminal branch must meet the criteria for STR diversity described in
the Sanger sequencing section.
Tentative Quality Guidelines
Recognizing that some guidelines are needed, these are presented here
tentatively. These are approximations of the border between reliable Y
chromosome sites or reads and those unreliable or inconclusive. The
guidelines are described as
tentative because they are not based on scientific studies but rather
on imprecise approximations from experience working with results. These
guidelines will be amended as better information is developed. All
guidelines must be met.
Where the mutation or mutations being submitted
to the ISOGG tree are based only on next-generation sequencing,
the mutation site and its results must meet the following
criteria, which pertain to the findings in each sample from an
individual who has the mutation:
1. The total number of reads for that site in a sample must be at least
four.
2. The percentage of reads showing the mutation must be 100% for fewer
than 21 reads. One divergent read is allowed for 21-40 reads and two
divergent reads for 41-50 reads; for more than 50 reads, at least 95%
must show the mutation. Any reads with a mapping quality score less than
10 can be ignored in meeting the criteria of this paragraph.
3. The total number of reads cannot exceed four times the coverage of
the testing. For example, for 50x coverage, the total number of reads
at the site cannot exceed 200. If the laboratory is providing different
coverage than advertised, the total number of reads should be adjusted
accordingly.
4. The mutation site may already be listed on the ISOGG tree in no more
than three locations.
5. The mapping quality for the site must average at least 10. The
percentage of reads with mapping quality less than 10 at the site must
not exceed 10% of the total reads.
6. When 500 adjacent base pairs are viewed with the mutation site in
the center, the same sequence cannot appear at another chromosome site
where 95.5% or more of the base pairs are in the same sequence. This
applies only to those displayed comparisons where the number of base
pairs compared is 500 or almost 500, and not for smaller
numbers.
7. There must be no additional called mutations for that individual
within 20 base pairs of the submitted mutation site.
8. The mutation site must not be part of a series of repeated alleles.
And if part of a segment where the same allele is repeated, this
segment must not exceed 6 alleles of the same type.
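The read-count and mapping-quality criteria (items 1, 2, 3 and 5 above) lend themselves to a mechanical check. The sketch below is illustrative only and rests on several assumptions: each read at the site is given as a (shows_mutation, mapping_quality) pair, the read-count brackets in item 2 are applied after dropping reads with mapping quality below 10, and a site with no usable reads fails item 2. The function name and input layout are hypothetical.

```python
def failed_read_criteria(reads, coverage):
    """Return the numbers of the criteria (1, 2, 3, 5) that a site fails.

    reads:    list of (shows_mutation: bool, mapping_quality: int) pairs,
              one per read covering the site in one sample.
    coverage: coverage of the test, e.g. 50 for 50x.
    """
    failed = []
    total = len(reads)

    # Criterion 1: at least four reads at the site.
    if total < 4:
        failed.append(1)

    # Criterion 2: divergent reads allowed depend on the number of reads;
    # reads with mapping quality below 10 are ignored here.
    counted = [shows for shows, mapq in reads if mapq >= 10]
    n = len(counted)
    divergent = n - sum(counted)
    if n < 21:
        ok = divergent == 0
    elif n <= 40:
        ok = divergent <= 1
    elif n <= 50:
        ok = divergent <= 2
    else:
        ok = 20 * divergent <= n  # at least 95% show the mutation
    if n == 0 or not ok:
        failed.append(2)

    # Criterion 3: no more than four times the coverage.
    if total > 4 * coverage:
        failed.append(3)

    # Criterion 5: mean mapping quality at least 10, and at most 10% of
    # reads with mapping quality below 10.
    low = sum(1 for _, mapq in reads if mapq < 10)
    mapq_sum = sum(mapq for _, mapq in reads)
    if total == 0 or mapq_sum < 10 * total or 10 * low > total:
        failed.append(5)

    return failed


site_reads = [(True, 60)] * 30 + [(False, 60)]
print(failed_read_criteria(site_reads, coverage=50))  # []
```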
If Sanger sequencing results show next-generation sequencing
information to be incorrect, an item submitted under this section is to
be removed from the tree.
If all criteria under this section are met except for minimum number of
reads in one of the samples, the mutation may be added to the tree, but
^ is to follow the name of the mutation and this will be defined on the
page as not yet having minimum number of reads.
Where the mutation or mutations being submitted
to the ISOGG tree are based on Sanger sequencing or qualifying
microarray testing, the mutation site and its results must
meet the following criteria:
1. The mutation site may already be listed on the ISOGG tree in no more
than three locations.
2. When the site in BAM files is viewed in samples with different
coverage and from different labs, the mapping quality for the site must
average at least 10. The percentage of reads with mapping quality less
than 10 at the site must not exceed 10% of the total reads. An
exception to using samples with different coverage and from different
labs would be using the next-generation sequencing BAM file for the
same individual who had the Sanger sequencing or microarray testing of
this site. In this latter case, it is preferred that just the
individual’s BAM results meet the criteria of this paragraph.
3. When 500 adjacent base pairs are viewed with the mutation site in
the center, the same sequence cannot appear at another chromosome site
where 95.5% or more of the base pairs are in the same sequence. This
applies only to those displayed comparisons where the number of base
pairs compared is 500 or almost 500, and not for smaller numbers.
4. The mutation site must not be part of a series of repeated alleles.
And if part of a segment where the same allele is repeated, this
segment must not exceed 6 alleles of the same type.
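The repeated-allele criterion can be checked against the reference sequence around the site. The following is a rough sketch under assumptions: the criterion is read as "the run of identical bases containing the site must be no longer than 6", the run is measured on the reference base, and repeats of longer motifs are not examined. The function names are hypothetical.

```python
def same_allele_run_length(sequence, index):
    """Length of the run of identical bases that contains sequence[index]."""
    base = sequence[index].upper()
    left = index
    while left > 0 and sequence[left - 1].upper() == base:
        left -= 1
    right = index
    while right < len(sequence) - 1 and sequence[right + 1].upper() == base:
        right += 1
    return right - left + 1


def repeat_criterion_ok(sequence, index, max_run=6):
    """True if the site is not inside a same-allele run longer than max_run."""
    return same_allele_run_length(sequence, index) <= max_run


# A site in the middle of seven consecutive A's does not qualify.
print(repeat_criterion_ok("ACGTAAAAAAAGT", 6))  # False
```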
The supplement of the 2015 Y-DNA study by Karmin et al. provides a
listing of the highest quality Y-DNA chromosome sites, but there are
acceptable additional sites not included. Karmin, M. et al. "A recent
bottleneck of Y chromosome diversity
coincides with a global change in culture." Genome Research,
25: 1-8, 2015.
Submissions of haplogroup A0 and A00 items must be handled on a
case-by-case basis because these haplogroups were not known when the Y
reference samples were chosen.
Where a shared variant fails to meet the quality guidelines above, it
can still be added to the tree followed by two ^^ symbols and the
following note in exceptional circumstances where evidence suggests it
occurs in at least 5 persons, is stable, does not occur elsewhere in
the tree and serves an important purpose:
^^does not meet quality guidelines but may be a helpful identifier.
Resources pertaining to the guidelines:
At YBrowse
one can use the chromosome number to search. Then in Scroll/Zoom there,
choose 500bp and hit the Go
button. This produces the 500 base pairs. Copy these and paste into BLAT.
The default settings of Human, Feb 2009, BLAT's guess, query/score and
hyperlink are retained. Then Submit.
The list of possible duplications or near duplications appears.
At the Broad Institute site one can download free software for the IGV
reader for BAM files (IGV Software). Several small files are needed for
use and can be provided upon request to Ray Banks.
Reading BAM files requires an index file which ends in .bai. If this is
not available, the free BamView software (BAM Indexer) will index the
file.
BAM files are large and may require unzipping depending on the source.
Acceptance Process for Placing a SNP
on the ISOGG Y-DNA Haplotree
The discoverer of the SNP (or a knowledgeable third party) can email
the Contact Person listed on the appropriate
haplogroup page and describe where the new SNP fits in the
tree. The haplogroup experts will evaluate the evidence
for inclusion on the tree. If the information on tree placement is
insufficient, it will be listed as
investigational in the section under the tree. If the Contact Person is
not available, contact
Ray Banks.