Listing Criteria for SNP Inclusion into the ISOGG Y-DNA Haplogroup Tree - 2016
The entire work is identified by the Version Number and date given on the Main Page. Directions for citing the
document are given at the bottom of the Main Page.
Last revision date for this specific page: 5 June 2016
These recommendations are intended to ensure that a uniform set of criteria is applied when accepting new mutations
for inclusion on the ISOGG Y-DNA haplogroup tree.
Because of the abundance of alternatives now available, only single nucleotide polymorphisms (SNPs) are being
accepted for new additions; insertions and deletions (indels) are not. In exceptional cases other variants may be
considered for inclusion on a case-by-case basis if they can be clearly demonstrated to have equivalent properties
to SNPs, but the burden of proof required will be much higher and at the discretion of the committee.
The quality guidelines for chromosome positions and reads at the bottom of this page are in addition to other requirements.
Special Coding for Interpreting SNP status
Added SNPs are color coded red and defined as SNPs that have met all of the
criteria for inclusion and did not appear on last year's tree.
SNPs under Investigation are color coded pink and are SNPs that have not yet
been placed on the tree because additional testing is needed to confirm adequate positive samples and/or
correct placement on the tree.
SNPs found solely from next generation sequencing are colored either black or red and shown in
italics; they indicate quality, consistent reads found in Y sequencing. These are not confirmed by Sanger
sequencing or microarray testing and sometimes may not be amenable to either process.
SNP(s) printed in bold in a subclade: The criterion for printing a representative SNP in bold for a
subclade is that it has traditionally represented that subgroup or, if in italics, seems the most promising
representative. Bolded items frequently were confirmed by Sanger sequencing.
Identical SNPs are SNPs that have the same y-position, mutation, and subclade within a haplogroup and were
discovered in different labs. They are listed in alphabetical order (not necessarily in the order of
discovery) and are separated by "/". Examples: P257/U6, L31/S149.
Mutation names followed by ^ represent ones from next-generation sequencing which do not yet meet
quality guidelines for minimum number of reads. Those with ^^ represent mutations that do not meet
quality guidelines but may be a helpful identifier. ~ indicates a subgroup whose position on
the tree is only approximate.
General Requirements for SNP Validation
The requirements listed here in this General Requirements section apply to validating SNPs discussed in
Requirements of Specific Type of Testing in the next section below.
Inserting a SNP Creating a Non-Terminal Branch to the ISOGG Tree
The supporting information provided by the proposer should demonstrate that the new SNP is downstream of
an established tree mutation. It must also be shown that the SNP was tested in individuals from all
parallel subgroups on the tree. In cases where relevant existing tree subgroups are from rare populations
and based solely on old research listing only one sample proving the existence of the SNP, an exception may
be granted for testing of the old subgroup. The mutations of the existing subgroup will then be listed
temporarily as position undetermined.
Example: Suppose that a new subgroup is being added with name of Q18. Fictional example:
G-L140
    G-L13
    G-L1266
    G-Q18
        G-L1268
Then the evidence for Q18 must show that a man is derived for both Q18 and L140. Simultaneously one man each
from L1266 and L13 must be ancestral for Q18. In addition, one man derived for Q18 must be derived for L1268,
and a second Q18 man ancestral for L1268. Derived means the mutation is present; ancestral means it is absent.
Adding a SNP Representing a New Terminal Branch to the ISOGG Tree
In the case where the new SNP is the terminal branch of an existing branch then:
at least one individual who has the new SNP is found also to have a SNP defining the immediate
upstream subgroup.
at least one individual from any parallel subgroup to the new subgroup is found also to lack
the submitted SNP.
at least one individual from the new subgroup is found to lack the SNP(s) defining the parallel
subgroup(s).
Example: Suppose that a new subgroup is being added with name QQ12. Fictional example:
G-L5432
    G-P343
    G-QQ12
Then the evidence for QQ12 must show that two men are derived for QQ12. Simultaneously one man from P343
must be ancestral for QQ12. Also, one of the QQ12 men must be derived for L5432 and ancestral for P343.
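The three conditions listed above for a new terminal branch can be checked mechanically once each tested man's results are tabulated. The following Python sketch is illustrative only and is not part of the ISOGG review procedure. The data layout (one dict per man, mapping SNP name to "derived" or "ancestral") and the function name terminal_branch_evidence_ok are assumptions, as is the reading that one man from one parallel subgroup lacking the new SNP is sufficient.

```python
DERIVED, ANCESTRAL = "derived", "ancestral"

def terminal_branch_evidence_ok(samples, new_snp, upstream_snp, parallel_snps):
    """Check the three listed conditions for a new terminal branch.

    samples: list of dicts mapping SNP name -> "derived" or "ancestral";
    an untested SNP is simply absent from a man's dict.
    Assumes at least one parallel subgroup exists.
    """
    # 1. A man with the new SNP also has the immediate upstream SNP.
    has_upstream = any(
        s.get(new_snp) == DERIVED and s.get(upstream_snp) == DERIVED
        for s in samples
    )
    # 2. A man from a parallel subgroup lacks the new SNP.
    parallel_lacks_new = any(
        s.get(p) == DERIVED and s.get(new_snp) == ANCESTRAL
        for s in samples
        for p in parallel_snps
    )
    # 3. A man from the new subgroup lacks the SNP(s) of the parallel subgroup(s).
    new_lacks_parallel = any(
        s.get(new_snp) == DERIVED
        and all(s.get(p) == ANCESTRAL for p in parallel_snps)
        for s in samples
    )
    return has_upstream and parallel_lacks_new and new_lacks_parallel

# The fictional QQ12 example: two QQ12 men and one P343 man.
example_samples = [
    {"QQ12": DERIVED, "L5432": DERIVED, "P343": ANCESTRAL},
    {"QQ12": DERIVED},
    {"P343": DERIVED, "QQ12": ANCESTRAL},
]
print(terminal_branch_evidence_ok(example_samples, "QQ12", "L5432", ["P343"]))
```

The sketch covers only the three listed conditions; it does not count how many men carry the new SNP and does not apply any of the quality or diversity requirements described elsewhere on this page.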
Sanger Sequencing
Examples of Sanger sequencing are the tests at the company ySeq and the Advanced Tests (SNP) at Family Tree DNA.
STR testing is available, for instance, at Genebase and Family Tree DNA. Acceptable testing for this category
consists of Sanger sequencing which targets a short segment of Y-DNA.
The objective of the ISOGG Tree at this time is to include all SNPs that arose prior to about the year 1500 C.E.
This guideline may be measured through STR diversity or alternative evidence.
Where a new terminal subgroup is being added, STR marker results or other evidence described below for two men
with the new SNP are needed.
STR Diversity
To be accepted the SNP must be observed in at least two individuals and must meet the STR diversity requirement.
A SNP that does not meet this requirement will be classified as a Private SNP (see definition above).
The STR diversity requirement is met if the following conditions are satisfied:
If the SNP is a Non-Terminal Branch SNP, no further proof of diversity is required.
Genetic distance is calculated using the Infinite Alleles Model (IAM). A marker for which there is a null value
in one sample must be discarded from the calculations. Otherwise, most laboratories use the IAM.
All markers tested by both individuals must be compared.
If 74 markers (or fewer) are compared, the minimum genetic distance to meet the diversity
requirement is 5.
If 75 (or more) markers are compared, the diversity requirement is a minimum of 7%, computed by
dividing the genetic distance by the number of markers compared and rounding the resulting percentage to the
nearest integer value.
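The STR diversity conditions above amount to a short calculation. The sketch below is one possible reading, not an official ISOGG tool: the function names and the dict-based marker layout are assumptions, multi-copy markers are treated as a single opaque value, and a percentage falling exactly on a half is assumed to round up.

```python
def iam_genetic_distance(a, b):
    """Return (distance, markers_compared) under the Infinite Alleles Model.

    a, b: dicts mapping marker name -> allele value, with None for a null value.
    Under the IAM each differing marker counts as 1, whatever the size of the
    difference.
    """
    distance = 0
    compared = 0
    for marker in a.keys() & b.keys():      # only markers tested by both men
        x, y = a[marker], b[marker]
        if x is None or y is None:          # null value: discard the marker
            continue
        compared += 1
        if x != y:
            distance += 1
    return distance, compared

def meets_str_diversity(a, b):
    """Apply the 74-or-fewer and 75-or-more marker rules."""
    distance, compared = iam_genetic_distance(a, b)
    if compared == 0:
        return False
    if compared <= 74:
        return distance >= 5
    # Integer form of round(100 * distance / compared), halves rounding up.
    percent = (200 * distance + compared) // (2 * compared)
    return percent >= 7

# Hypothetical marker names and values, for illustration only.
man_a = {f"marker{i}": 12 for i in range(111)}
man_b = dict(man_a)
for i in range(8):
    man_b[f"marker{i}"] = 13               # 8 of 111 markers differ
print(meets_str_diversity(man_a, man_b))   # 8/111 is about 7.2%, which rounds to 7
```

With 111 markers compared, a genetic distance of 7 gives about 6.3%, which rounds to 6 and falls short, while a distance of 8 rounds to 7 and qualifies.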
Alternative Evidence
If the submitter can otherwise provide evidence that the common ancestor of the two samples can be reasonably
expected to have lived more than 500 years ago, this will also be considered.
Next Generation Sequencing
Next generation sequencing is available to the genealogical community from Full Genomes Corporation and through
Family Tree DNA's Big Y test. Next generation sequencing has the largest coverage of any type of SNP
testing currently available.
The committee recognizes there are a wide variety of ways in which sequencing information is
available. Because of this, no specific criteria for sequencing information are provided here except the new, tentative
quality guidelines in the next section. The goal of
the reviewers of the sequencing submissions – at one extreme – will be to easily accept quality SNPs from old,
root branches found in many samples within all the downstream branches. At the opposite extreme, it is unlikely
reviewers will accept SNPs near or in terminal branches whose positions depend on the results from one sample.
The submitter can use raw data report(s) pertaining to the sequencing when they provide the needed information.
Two examples of raw data reports are a vcf file showing the usual quality scores, DP scores for depth of
reads, etc. for the involved sample and pertinent additional ones, including ones from other haplogroups, or
the so-called “haplogroup compare report” from Full Genomes Corp. Results from Sanger sequencing or
from microarray products, such as Geno 2.0 or Chromo 2.0, might be acceptable comparative information in
certain cases. Having a large number of pertinent comparative samples on a vcf report can improve the
scoring information.
The reviewer will have to take into consideration the coverage of the next generation sequencing,
varied quality scorings, position of the site on the chromosome, the percentage of samples with clean reads at
the site in question, possible indel relationships to the SNP, geographical separation of the samples, non-next
generation sequencing testing, results for the SNP site in other reports, and other factors in making a complex
judgment as to whether the submitted SNP is almost certain to show the same results in next generation sequencing
of new comparable samples.
More precise criteria for next generation sequencing submissions may be provided as evidence
accumulates. Addendum: these are now included in the tentative quality guidelines below.
When a new SNP creating a new terminal branch is being added to the tree, at least two of the
submitted samples must each have an average of 3 unique (singleton) SNPs per 10 million base pairs of sequencing
coverage. Reviewers will determine uniqueness according to comparisons to all available sequencing results
rather than samples tested at a particular laboratory.
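The singleton requirement above is a density threshold and can be checked with simple arithmetic. This is a minimal sketch under stated assumptions: "an average of 3" is read as "at least 3" per 10 million base pairs, the function names are hypothetical, and the singleton counts are taken as already determined by the reviewers' comparisons.

```python
def meets_singleton_density(singleton_count, covered_bp, per_10mbp=3):
    """True if singletons per 10 million covered base pairs reach the threshold."""
    # Cross-multiplied integer comparison, equivalent to
    # singleton_count / covered_bp >= per_10mbp / 10_000_000.
    return singleton_count * 10_000_000 >= per_10mbp * covered_bp

def terminal_branch_samples_ok(samples, required=2):
    """samples: list of (singleton_count, covered_bp) pairs, one per man."""
    passing = sum(
        1 for count, covered_bp in samples
        if meets_singleton_density(count, covered_bp)
    )
    return passing >= required

# Hypothetical figures: 4 singletons in 12 Mbp and 5 singletons in 14 Mbp.
print(terminal_branch_samples_ok([(4, 12_000_000), (5, 14_000_000)]))
```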
If the evidence for the SNP is based solely on next generation sequencing, the SNP will appear
in italics on the tree.
Microarray Chip-based Genotyping
Examples of microarray chip-based genotyping are the Geno 2.0 or Geno 2.0 Next-Generation test, 23andMe, Chromo 2.0 and
Family Tree DNA's Deep Clade panels. Microarray chips target a selected group of SNPs.
Novel SNPs found in microarray products without a presence also in other qualifying sources -
such as Sanger sequencing or next generation sequencing - cannot be submitted. However, chip-based genotyping
results can be used in combination with Sanger sequencing and/or next generation sequencing results as validating
evidence for one of the samples. If chip-based genotyping is part of the evidence, the approved SNP will be
listed in regular type, rather than italics, even if the other evidence is from next generation sequencing.
Samples from chip-based genotyping used to prove a new terminal branch must meet the criteria
for STR diversity described in the Sanger sequencing section.
Tentative Quality Guidelines
Recognizing that some guidelines are needed, these are presented here tentatively. These are approximations of the border
between reliable Y chromosome sites or reads and those unreliable or inconclusive. The guidelines are described as
tentative because they are not based on scientific studies but rather on imprecise approximations from experience working
with results. These guidelines will be amended as better information is developed. All guidelines must be met.
Where the mutation or mutations being submitted to the ISOGG tree are based only on next-generation sequencing,
the mutation site and its results must meet the following criteria, which pertain to the findings in each sample
from the individual or individuals who have the mutation:
1. The total number of reads for that site in a sample must be at least four.
2. The percentage of reads showing the mutation must be 100% when there are fewer than 21 reads. One divergent
read is allowed for 21-40 reads and two divergent reads for 41-50 reads; for more than 50 reads, at least 95% must
show the mutation. Any reads with a mapping quality score less than 10 can be ignored in meeting the criteria of this paragraph.
3. The total number of reads cannot exceed four times the coverage of the testing. For example, for 50x coverage,
the total number of reads at the site cannot exceed 200. If the laboratory is providing different coverage than
advertised, the total number of reads should be adjusted accordingly.
4. The mutation site may already be listed on the ISOGG tree in no more than three locations.
5. The mapping quality for the site must average at least 10. The percentage of reads with mapping quality less than
10 at the site must not exceed 10% of the total reads.
6. When 500 adjacent base pairs are viewed with the mutation site in the center, the same sequence cannot appear at
another chromosome site where 95.5% or more of the base pairs are in the same sequence. This applies only to those displayed
comparisons where the number of base pairs compared is 500 or almost 500, and not for smaller numbers.
7. There must be no additional called mutations for that individual within 20 base pairs of the submitted mutation site.
8. The mutation site must not be part of a series of repeated alleles. If it is part of a segment where the same allele is repeated,
this segment must not exceed 6 alleles of the same type.
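The purely numeric criteria in the list above (1, 2, 3 and 5) can be sketched in Python as follows. This is an illustration under stated assumptions, not the committee's procedure: the Read class and function names are hypothetical, the total read count is taken to include low-quality reads, and the read-count brackets of criterion 2 are applied after reads with mapping quality below 10 have been set aside.

```python
from dataclasses import dataclass

@dataclass
class Read:
    shows_mutation: bool   # True if the read carries the submitted allele
    mapq: int              # mapping quality score of the read

def allowed_divergent(n_reads):
    """Divergent reads tolerated by criterion 2 for n_reads usable reads."""
    if n_reads < 21:
        return 0           # 100% of reads must show the mutation
    if n_reads <= 40:
        return 1
    if n_reads <= 50:
        return 2
    return n_reads // 20   # at least 95% must show the mutation

def failed_read_criteria(reads, coverage):
    """Return the numbers of the criteria (1, 2, 3, 5) that the site fails.

    coverage: the coverage actually provided by the test, e.g. 50 for 50x.
    """
    failed = []
    total = len(reads)
    if total < 4:                                   # criterion 1
        failed.append(1)
    usable = [r for r in reads if r.mapq >= 10]     # criterion 2 may ignore MAPQ < 10
    divergent = sum(1 for r in usable if not r.shows_mutation)
    if divergent > allowed_divergent(len(usable)):  # criterion 2
        failed.append(2)
    if total > 4 * coverage:                        # criterion 3
        failed.append(3)
    low = total - len(usable)
    average_too_low = sum(r.mapq for r in reads) < 10 * total
    too_many_low = 10 * low > total                 # more than 10% below MAPQ 10
    if average_too_low or too_many_low:             # criterion 5
        failed.append(5)
    return failed

# Hypothetical site: 30 reads at MAPQ 60, one of them divergent, 50x coverage.
site = [Read(True, 60)] * 29 + [Read(False, 60)]
print(failed_read_criteria(site, coverage=50))
```

An empty list means the site passes these four criteria; criteria 4, 6, 7 and 8 depend on the tree and on the surrounding sequence and are not covered here.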
If Sanger sequencing results show next-generation sequencing information to be incorrect, an item submitted under this
section is to be removed from the tree.
If all criteria under this section are met except for minimum number of reads in one of the samples, the mutation
may be added to the tree, but ^ is to follow the name of the mutation and this will be defined on the page
as not yet having minimum number of reads.
Where the mutation or mutations being submitted to the ISOGG tree are based on Sanger sequencing or qualifying
microarray testing, the mutation site and its results must meet the following criteria:
1. The mutation site may already be listed on the ISOGG tree in no more than three locations.
2. When the site in BAM files is viewed in samples with different coverage and from different labs, the mapping quality
for the site must average at least 10. The percentage of reads with mapping quality less than 10 at the site must not
exceed 10% of the total reads. An exception to using samples with different coverage and from different labs would be
using the next-generation sequencing BAM file for the same individual who had the Sanger sequencing or microarray
testing of this site. In this latter case, it is preferred that
just the individual’s BAM results meet the criteria of this paragraph.
3. When 500 adjacent base pairs are viewed with the mutation site in the center, the same sequence cannot appear at
another chromosome site where 95.5% or more of the base pairs are in the same sequence. This applies only to those displayed
comparisons where the number of base pairs compared is 500 or almost 500, and not for smaller numbers.
4. The mutation site must not be part of a series of repeated alleles. If it is part of a segment where the same allele is repeated,
this segment must not exceed 6 alleles of the same type.
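The repeated-allele criterion, which appears in both lists, can be approximated by measuring the run of identical bases that contains the mutation site. The sketch below assumes that "a segment where the same allele is repeated" means consecutive identical bases (a homopolymer run); it does not detect repeats of longer motifs, and the function names are hypothetical.

```python
def run_length_at(sequence, index):
    """Length of the run of identical bases that contains sequence[index]."""
    base = sequence[index]
    start = index
    while start > 0 and sequence[start - 1] == base:
        start -= 1
    end = index
    while end + 1 < len(sequence) and sequence[end + 1] == base:
        end += 1
    return end - start + 1

def run_within_limit(sequence, index, max_run=6):
    """True if the run containing the site does not exceed max_run bases."""
    return run_length_at(sequence, index) <= max_run

# Hypothetical flanking sequence with the site inside a run of seven T's.
flank = "ACG" + "T" * 7 + "AC"
print(run_length_at(flank, 5), run_within_limit(flank, 5))
```

Because the source does not say whether the ancestral or the derived allele should be used at the site, a cautious check would run the function on both versions of the sequence.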
The supplement of the 2015 Y-DNA study by Karmin et al. provides a listing of the highest quality Y-DNA chromosome sites,
but there are acceptable additional sites not included. Karmin, M. et al. "A recent bottleneck of Y chromosome diversity
coincides with a global change in culture." Genome Research, 25: 1-8, 2015.
Submissions of haplogroup A0 and A00 items must be handled on a case-by-case basis because these haplogroups
were not known when the Y reference samples were chosen.
Where a shared variant fails to meet the quality guidelines above, it can still be added to the tree followed by two ^^ symbols
and the following note in exceptional circumstances where evidence suggests it occurs in at least 5 persons, is stable,
does not occur elsewhere in the tree and serves an important purpose:
^^does not meet quality guidelines but may be a helpful identifier.
Resources pertaining to the guidelines:
At YBrowse one can use the chromosome number to search. Then in Scroll/Zoom there, choose 500bp and hit the Go
button. This produces the 500 base pairs. Copy these and paste into BLAT.
The default settings of Human, Feb 2009, BLAT's guess, query/score and hyperlink are retained. Then Submit.
The list of possible duplications or near duplications appears.
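Once the list of hits has been copied out by hand, the 95.5% rule from the quality guidelines can be applied to it as in the following sketch. The Hit fields, the function name and the min_span value of 490 are assumptions made for illustration; in particular, 490 is only a stand-in for "almost 500" and is not a figure from the guidelines.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    chrom: str        # chromosome of the hit, e.g. "chrY"
    start: int        # start position of the hit
    end: int          # end position of the hit
    identity: float   # percent identity reported for the hit
    span: int         # number of base pairs compared

def near_duplicates(hits, query_chrom, query_start, query_end,
                    min_identity=95.5, min_span=490):
    """Return hits elsewhere in the genome that match the 500 bp window too closely."""
    flagged = []
    for hit in hits:
        # Skip the hit that is simply the query window itself.
        is_self = (
            hit.chrom == query_chrom
            and hit.start <= query_end
            and hit.end >= query_start
        )
        if is_self:
            continue
        if hit.span >= min_span and hit.identity >= min_identity:
            flagged.append(hit)
    return flagged

# Hypothetical hits for a window at chrY:1000-1499.
hits = [
    Hit("chrY", 1000, 1499, 100.0, 500),   # the window itself
    Hit("chrX", 5000, 5499, 96.2, 500),    # long and highly similar: flagged
    Hit("chrY", 9000, 9120, 99.0, 121),    # similar but far shorter than 500
    Hit("chr3", 200, 699, 94.0, 500),      # long but below 95.5% identity
]
print(near_duplicates(hits, "chrY", 1000, 1499))
```

A non-empty result suggests that the site fails the duplication criterion and needs closer review.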
At the Broad Institute site one can download the free IGV software for reading BAM files.
Several small files are needed for use and can be provided upon request to Ray Banks.
Reading BAM files requires an index file which ends in .bai. If this is not available,
the free BamView software will index the file.
BAM files are large and may require unzipping depending on the source.
Acceptance Process for Placing a SNP on the ISOGG Y-DNA Haplotree
The discoverer of the SNP (or a knowledgeable third party) can email the Contact Person listed on the appropriate
haplogroup page and describe where the new SNP fits in the tree. The haplogroup experts will evaluate the evidence
for inclusion on the tree. If the information on tree placement is insufficient, the SNP will be listed as
investigational in the section under the tree. If the Contact Person is not available, contact
Ray Banks.
Corrections/Additions made since 1 January 2016:
Changed some wording in next-generation sequencing section to be compatible with the tentative quality guidelines.
Added wording to section discussing adding a SNP representing a new terminal branch on 5 June 2016.