Autosomal DNA statistics
From ISOGG Wiki
Autosomal DNA statistics describe the connection between the genealogical relationship between two people and the amount of autosomal DNA which they share. Understanding this connection is critical to interpreting the results of an autosomal DNA test.
Autosomal DNA is inherited equally from both parents. The autosomal DNA inherited from more distant ancestors is randomly shuffled in a process called recombination, and the percentage of autosomal DNA coming from each ancestor is diluted with each new generation.
When interpreting autosomal DNA statistics, one must be careful to distinguish between the distribution of shared DNA for given relationships and the distribution of relationships for given amounts of shared DNA. For example, known second cousins on average share 212.5 centiMorgans (cMs), but in extreme cases can actually share as little as 47 cMs or as much as 760 cMs. Conversely, the relationship between pairs of individuals sharing 212.5 cMs has been found to be anywhere between aunt-or-uncle/niece-or-nephew and third cousin once removed.
Autosomal DNA tests for finding cousins and verifying relationships for genetic genealogy purposes are offered by 23andMe, AncestryDNA and Family Tree DNA (the Family Finder test). For comparisons of the different services see Tim Janzen's autosomal DNA testing comparison chart.
- 1 Distribution of shared DNA for given relationships
- 2 Ranges of sharing percentage
- 3 Shared SNPs
- 4 Distribution of genealogical relationships for given amounts of shared DNA
- 5 Simulations
- 6 Blog posts
- 7 Charts and tools
- 8 Scientific papers
- 9 See also
- 10 References
Distribution of shared DNA for given relationships
There are two simple mathematical methods of calculating the percentages of autosomal DNA shared by two individuals. Both methods give the same results except in the cases of parent/child comparisons, full siblings, double cousins, or any two individuals who are each related to the other through both parents.
The autosomal DNA of two related individuals will be half-identical in regions where each has inherited the same DNA from one parent, and ultimately from one common ancestor. In the cases of siblings and double cousins, their autosomal DNA will be fully identical in regions where each has inherited the same DNA from both parents or from two more distant common ancestors respectively. Full siblings are half-identical on regions where each has inherited the same DNA from exactly one parent and fully identical on regions where each has inherited the same DNA from both parents.
The first method of calculating percentages (Method I, displayed by 23andMe) expresses the aggregate length of the shared segments as a percentage of the aggregate length of the paternal and maternal autosomes. The aggregate length of the shared segments is the aggregate length of the half-identical regions, where there is one shared segment, plus twice the aggregate length of the fully identical regions, where there are two shared segments, one paternal and one maternal. Using this method, full siblings (excluding identical twins), who are expected to be half-identical on 50% of their autosomal DNA and fully identical on a further 25% of their autosomal DNA, will on average appear to have 50% shared.
The second method of calculating percentages (Method II, to which those relying on FTDNA or GEDmatch must resort) expresses the aggregate length of the half-identical (or better) regions as a percentage of the aggregate length of both sets of autosomes (paternal and maternal). The maximum value that the numerator in this percentage can take is the length of one set of autosomes (say the paternal); the denominator is the length of two sets of autosomes (maternal plus paternal). Thus, the percentages calculated by this method cannot exceed 50%, which is the value that it takes for a parent/child comparison (half-identical at all locations) or a comparison between identical twins (fully identical at all locations). Using Method II, full siblings (other than identical twins) will on average appear to have only 37.5% shared. Whenever there are fully identical regions, the calculated percentages will be smaller than for Method I, as half-identical and fully identical regions cannot be distinguished from the available data and must be given equal weight in the calculation.
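The two methods can be compared side by side with a short sketch. It assumes the round figure of 3400 cM per set of autosomes used elsewhere on this page; the function name and its parameters are illustrative only and do not correspond to any company's software.

```python
def shared_percentages(hir_only_cm, fir_cm, autosome_set_cm=3400.0):
    """Return (Method I %, Method II %) for one pair of individuals.

    hir_only_cm: aggregate length of regions that are half-identical only
    fir_cm: aggregate length of fully identical regions
    autosome_set_cm: assumed length of ONE set of autosomes (round figure)
    """
    denominator = 2 * autosome_set_cm  # paternal plus maternal autosomes
    # Method I counts fully identical regions twice (two shared segments).
    method_i = 100 * (hir_only_cm + 2 * fir_cm) / denominator
    # Method II gives half-identical and fully identical regions equal weight.
    method_ii = 100 * (hir_only_cm + fir_cm) / denominator
    return method_i, method_ii


print(shared_percentages(1700, 850))  # average full siblings -> (50.0, 37.5)
print(shared_percentages(3400, 0))    # parent/child -> (50.0, 50.0)
print(shared_percentages(0, 3400))    # identical twins -> (100.0, 50.0)
```

The two results differ only when the second argument is non-zero, which is why the methods agree for most relationships.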
The first column in the table below shows the average percentages for different relationships and methods of calculation. The calculations assume that every child gets 50% from its mother and 50% from its father and in turn 25% from each of its four grandparents. The actual percentages vary from the average in individual cases. For example, a person might share 27% of his DNA with one nephew and only 23% with another. Because of the random way that autosomal DNA is inherited, third, fourth and more distant cousins will not necessarily have any detectable half-identical regions. According to Family Tree DNA's figures there is a 90% chance that third cousins will share enough DNA for the relationship to be detected, but there is only a 50% chance that you will share enough DNA with a fourth cousin for the relationship to be identified.
The degree of sharing is also displayed by the DNA companies in units of genetic distance known as centiMorgans (cMs), although in practice the total number of shared centiMorgans is less significant than the number and lengths of individual shared segments. The second column in the table below shows the aggregate lengths in cM of the half-identical (or better) regions shared on average by various pairs of relatives. It assumes that the aggregate length of each set of autosomal chromosomes is 3400cM, and thus that each individual inherits 6800cM of autosomal DNA, 3400cM from each parent. Different DNA companies use different autosomal DNA match thresholds, so that the actual cM figures, as displayed by 23andMe's Family Inheritance: Advanced, FTDNA and GEDmatch, may be slightly different from these round numbers, even before allowing for random variation around the averages in individual cases.
The reason for the different results from Method I and Method II in the case of siblings and double cousins is that the cM lengths displayed by FTDNA and in the free GEDmatch utility (and, indeed, 23andMe's own Family Inheritance: Advanced) do not distinguish between half-identical and fully identical regions. The best place to see the distinction between half-identical regions (HIRs) and fully identical regions (FIRs) is in the optional graphical output of the one-to-one comparisons at GEDmatch.com, where FIRs are displayed in green and HIRs are displayed in yellow. It is also possible to see the fully identical regions at 23andMe by using the Family Traits chromosome browser (accessed via the Family and Friends menu).
When using Family Finder data, the percentages based on Method II can be calculated from the cM lengths by dividing the displayed Shared cM by 68.
Note that the FTDNA figures exclude the X-chromosome cMs but the 23andMe figures include them. Males have one X-chromosome and females have two X-chromosomes. To include the X-chromosome in the calculations, divide by 68.81065 instead of 68. Note that the expected shared percentages of X-DNA depend not only on the genealogical relationship between two people, but also on the numbers of males and females in the two paths to their common ancestor.
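As a rough illustration, the conversion from shared cM to a Method II percentage is a single division. The two divisors are the ones quoted above; the function and constant names are hypothetical.

```python
AUTOSOMAL_DIVISOR = 68.0             # 6800 cM / 100, autosomes only
AUTOSOMAL_PLUS_X_DIVISOR = 68.81065  # divisor quoted above for atDNA plus X


def cm_to_method_ii_percent(shared_cm, include_x=False):
    """Convert a displayed 'Shared cM' total into a Method II percentage."""
    divisor = AUTOSOMAL_PLUS_X_DIVISOR if include_x else AUTOSOMAL_DIVISOR
    return shared_cm / divisor


print(cm_to_method_ii_percent(850))    # 12.5, the first cousin average
print(cm_to_method_ii_percent(212.5))  # 3.125, the second cousin average
```

Passing `include_x=True` is only meaningful when the cM total being converted also includes X-chromosome segments.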
23andMe include the X-chromosome in their calculations, so their cM figures will be higher than those provided by FTDNA. 23andMe made adjustments to the cM count in June 2013 so the number of cMs will vary slightly depending on when the test was taken.
- For females using 23andMe data prior to June 2013, there were 7494.8cMs when combining the paternal and maternal autosomal DNA and the two X-chromosomes per Family Inheritance: Advanced.
- For females using 23andMe data after June 2013 there were 7438.6cMs when combining the paternal and maternal autosomal DNA and the two X-chromosomes per Family Inheritance: Advanced.
- There are 7074.6 autosomal cMs per 23andMe.
- For males using 23andMe data there are 7256.8 cMs when combining the atDNA with the single X-chromosome.
Note that AncestryDNA do not provide information on the lengths of half-identical (or better) regions in either centiMorgans or percentages. However, AncestryDNA customers can upload their raw data to the free GEDmatch utility in order to extract the necessary cM data for making comparisons and to check the relationship predictions. David Pike's tools can also be used.
The following table shows the average amount of DNA shared by pairs of relatives in percentages and centiMorgans. All relationships up to and including second cousins can be detected by autosomal DNA tests. However, for third cousins and more distant cousins, some relationships will not be detected purely because of the random nature of autosomal DNA inheritance which means that we do not inherit DNA segments from every genealogical ancestor. The number of detectable relationships decreases with each generation. At ten generations we have approximately 1024 ancestors although there is generally some overlap as a result of pedigree collapse. While all these ancestors can potentially be documented in our genealogical tree we only inherit segments of DNA from a small subset of these ancestors.
The averages shown here are therefore only for relationships which have been detected. For the expected percentages of detectable relationships at different levels see the article on cousin statistics.
|% shared||Total cM shared half-identical (or better)||Relationship||Notes|
|100% (Method I)/50% (Method II)||3400.00||Identical twins (monozygotic twins)||Fully identical everywhere.|
|50% (Method I)/37.5% (Method II)||2550.00||Full siblings||Half-identical on 50%/1700 cM and fully identical on a further 25%/850 cM.|
|25%||1700.00||Grandparent/grandchild, aunt-or-uncle/niece-or-nephew, half-siblings|
|25% (Method I)/23.4375% (Method II)||1593.75||Double first cousins||Half-identical on 21.875%/1487.5 cM and fully identical on a further 1.5625%/106.25 cM|
|12.5%||850.00||First cousins, great-grandparent/great-grandchild, great-uncle or aunt/great-nephew or niece, half-uncle or aunt/half-nephew or niece|
|6.25%||425.00||First cousins once removed, half first cousins, great-great-grandparent/great-great-grandchild, great-great-aunt/uncle, half great-aunt/uncle|
|6.25%||425.00||Double second cousins|
|3.125%||212.50||Second cousins, first cousins twice removed, half first cousin once removed, half great-great-aunt/uncle, great-great-great-grandparent/great-great-great-grandchild|
|1.563%||106.25||Second cousins once removed, half second cousins, first cousin three times removed, half first cousin twice removed|
|0.781%||53.13||Third cousins, second cousins twice removed|
|0.391%||26.56||Third cousins once removed|
|0.195%||13.28||Fourth cousins, third cousins twice removed|
|0.0977%||6.64||Fourth cousins once removed, third cousins three times removed|
|0.0244%||1.66||Fifth cousins once removed|
|0.0061%||0.42||Sixth cousins once removed|
|0.001525%||0.10||Seventh cousins once removed|
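The cousin rows of the table above follow a simple halving pattern, which the sketch below reproduces. It assumes a single (not double) relationship, two common ancestors for full cousins and one for half cousins, and the 6800 cM total used on this page; the function name and parameters are illustrative.

```python
def expected_cousin_share(degree, removed=0, half=False, total_cm=6800.0):
    """Return (expected % shared, expected cM) for cousins.

    degree: 1 for first cousins, 2 for second cousins, and so on
    removed: number of generations removed (0 if none)
    half: True for half cousins (one common ancestor instead of a couple)
    """
    if degree < 1 or removed < 0:
        raise ValueError("degree must be >= 1 and removed must be >= 0")
    # Transmissions on the path linking the two cousins via a common ancestor.
    meioses = 2 * (degree + 1) + removed
    common_ancestors = 1 if half else 2
    # Each transmission halves the expected share.
    fraction = common_ancestors * 0.5 ** meioses
    return 100 * fraction, total_cm * fraction


print(expected_cousin_share(1))             # first cousins -> (12.5, 850.0)
print(expected_cousin_share(2))             # second cousins -> (3.125, 212.5)
print(expected_cousin_share(1, half=True))  # half first cousins -> (6.25, 425.0)
print(expected_cousin_share(5, removed=1))  # fifth cousins once removed
```

These are averages only; as the notes below explain, recombination produces random variation around them in individual cases.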
Notes to Table
- There is no variation between families in the parent/child or identical twins shared cM figures; beyond these immediate relationships, recombination results in random variation around the average figures above from one pair of individuals to another.
- When a grandchild is compared to a grandparent, the shared cM with the other grandparent on the same side is easily inferred. The grandchild gets all 3400cM of, say, his paternal autosomes from his father. If it is seen that 1600cM of this came from the paternal grandfather, then the other 1800cM must have come from the paternal grandmother. The initial estimate of 1700cM shared by grandchild and paternal grandmother can thus be updated to 1800cM when it has been ascertained that grandchild and paternal grandfather share only a below average 1600cM.
- When the subjects of the comparison descend from identical twin children of their most recent common ancestral couple, then the figures in the above table should be doubled.
- The expected % shared for a half-relationship will always be exactly half of the expected % shared for the corresponding full relationship.
- A similar method to that used for full siblings and for double first cousins can be used to compute expected shared percentages for any two subjects of comparison who are doubly related. However, the expected % shared for a double relationship can be slightly less than the sum of the expected % shared for the appropriate single relationships.
- If Jack is related to both of Jill's parents, then Method I and Method II will give slightly different figures, as double cousins of this type are expected to be fully identical in some regions.
- If Jill is a more remote descendant of spouses who are both related to Jack, then Jill will clearly have inherited at most one of the two segments in regions where the child of those spouses was fully identical to Jack. This reduces Jack and Jill's expected % shared slightly from the ballpark figure obtained by adding the expected % shared for the two relationships.
- For example, double second cousins, where the double relationship arises because at least one is related on both the paternal side and the maternal side to the other, are expected to share 3.125% (1/32) on each side, or 6.25% (1/16) in total, using Method I. Using Method II, a small adjustment must be made to allow for regions where they are fully identical (1/1024 or approximately 0.098%), so that they are expected to be half-identical or better on 63/1024 or approximately 6.152%.
- On the other hand, double second cousins who are children of double first cousins are expected to be half-identical on a quarter of the approximately 23.438% on which their parents are half-identical or better, in other words on approximately 5.859%.
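The grandparent inference described in the second note above is a subtraction from the length of one parental set of autosomes. A minimal sketch, assuming the round figure of 3400 cM and a hypothetical function name:

```python
def infer_other_grandparent_cm(observed_cm, parental_set_cm=3400.0):
    """Infer the cM shared with the untested grandparent on the same side.

    observed_cm: cM shared with the tested grandparent
    parental_set_cm: assumed length of the autosomes inherited from one parent
    """
    if not 0 <= observed_cm <= parental_set_cm:
        raise ValueError("observed_cm must lie between 0 and parental_set_cm")
    # Everything on that side not inherited from one grandparent
    # must have come from the other.
    return parental_set_cm - observed_cm


print(infer_other_grandparent_cm(1600))  # 1800.0, as in the example above
```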
The chart below (courtesy Dimario, Wikimedia Commons) shows the average amount of autosomal DNA inherited by all close relations up to the third cousin level.
Ranges of sharing percentage
Figures from 23andMe's Relative Finder:
- Parent/child: 47.54% (for father/son pairs, who do not share the X-chromosome) to ~50%
- 1st cousins: 7.31-13.8%
- 1st cousins once removed: 3.3-8.51%
- 2nd cousins: 2.85-5.04%
- 2nd cousins once removed: 0.57-2.54%
- 3rd cousins: ca. 0.3-2.0%
- 3rd cousins once removed: 0.11-1.32%
- 4th and more distant cousins: 0.07-0.5%
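Because these ranges overlap, a single sharing percentage is usually compatible with more than one relationship. The sketch below looks up every cousin range from the list above that contains a given percentage. The parent/child entry is left out because its upper bound is only approximate, and the names used are illustrative.

```python
# Cousin ranges (% shared) copied from the Relative Finder list above.
RELATIVE_FINDER_RANGES = {
    "1st cousins": (7.31, 13.8),
    "1st cousins once removed": (3.3, 8.51),
    "2nd cousins": (2.85, 5.04),
    "2nd cousins once removed": (0.57, 2.54),
    "3rd cousins": (0.3, 2.0),
    "3rd cousins once removed": (0.11, 1.32),
    "4th and more distant cousins": (0.07, 0.5),
}


def candidate_relationships(percent_shared):
    """List the relationships whose observed range includes percent_shared."""
    return [
        name
        for name, (low, high) in RELATIVE_FINDER_RANGES.items()
        if low <= percent_shared <= high
    ]


print(candidate_relationships(4.0))
# ['1st cousins once removed', '2nd cousins']
print(candidate_relationships(1.0))
# ['2nd cousins once removed', '3rd cousins', '3rd cousins once removed']
```

A lookup of this kind only says which relationships are possible; it says nothing about how likely each one is.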
Shared SNPs
Figures from 23andMe Compare Genes function (from Tim Janzen's data):
- Parent-child pairs share between 83.94% and 84.20% of SNPs (50% of DNA in common)
- Siblings share between 83.81% and 87.47% of SNPs (50% of DNA in common)
- Uncle/aunt-niece/nephew pairs share between 78.48% and 79.57% of SNPs (25% of DNA in common)
- Grandparent-grandchild pairs share between 77.96% and 80.59% of SNPs (25% of DNA in common)
- First cousins and great uncle/great aunt-grandniece/grandnephew pairs share between 75.78% and 77.03% of SNPs (12.5% of DNA in common)
- First cousins once removed share ca 75.5% of SNPs (6.25% of DNA in common)
- Second cousins and first cousins twice removed share ca 75% of SNPs (3.125% of DNA in common)
- Unrelated people of European descent share 73-74.6% of SNPs
Distribution of genealogical relationships for given amounts of shared DNA
In order to help people who have taken an autosomal DNA test gain greater insight into the genealogical relationships implied by the resultant data, a number of genetic genealogists have been collecting statistics on the amount of DNA shared for known relationships.
Blaine Bettinger has been collecting statistics from the genetic genealogy community on the number of shared centiMorgans for known genealogical relationships as part of his shared cM project. The chart below visualises the range of shared centiMorgans using data submitted to the project for over 25,000 known relationships. The chart is made available under a Creative Commons Licence. You are free to share and use the information for non-commercial purposes, as long as you give proper attribution and release anything you create under the same licence. Additional information including histograms and the breakdown of companies is provided in the PDF download. Data is still being collected for the project and you can add your own statistics using this form on GoogleDocs.
Leah Larkin has written a blog post on The limits of predicting relationships using DNA which includes a table of relationship ranges from the DNA Detectives group and a chart showing ranges based on simulations by AncestryDNA.
Tim Janzen has created three charts that provide statistical information in various categories. The charts provide statistics on close relatives, distant endogamous relatives and distant non-endogamous relatives. The charts were originally designed for use with 23andMe data but now also incorporate data from FTDNA's Family Finder test. The charts are organized by the degree of relationship, with the most closely related people (parents and children, full siblings) being listed at the top and more distant cousins being listed at the bottom. The statistics are based on information from real people who have been tested by 23andMe and Family Tree DNA and who have a known genealogical relationship to someone else who has also been tested by the same company. The charts also include information on the median and the average number of shared cMs for people who are related to each other from the first cousin once removed level of relationship to the 5th cousin level of relationship. The charts can be downloaded from Anabaptist Genetic Genealogy website.
Tim Janzen has also compiled a chart showing the probability of a given genealogical relationship for each cM threshold, rising in 1 cM increments from 6 cMs up to 200 cMs. The chart may be downloaded as an Excel file from this link: File:Relationship prediction chart based on shared autosomal DNA.xlsx. This chart applies to non-endogamous populations. When using the chart to predict relationships from Family Finder data you will need to remove the data for all segments under 4 or 5 cMs.
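Removing small segments before looking up a total might be sketched as follows. The segment lengths are made-up illustrative values, the function name is hypothetical, and the default cut-off of 5 cM is one of the two values mentioned above.

```python
def total_shared_cm(segment_lengths_cm, min_segment_cm=5.0):
    """Sum shared segments, ignoring any shorter than min_segment_cm."""
    return sum(
        length for length in segment_lengths_cm if length >= min_segment_cm
    )


# Hypothetical segment list for one match, in cM.
segments = [22.5, 11.0, 4.5, 3.25, 1.5]

print(total_shared_cm(segments))                      # 33.5 (5 cM cut-off)
print(total_shared_cm(segments, min_segment_cm=4.0))  # 38.0 (4 cM cut-off)
print(sum(segments))                                  # 42.75 (no cut-off)
```

The choice of cut-off changes the total noticeably for distant matches, so the same threshold should be used consistently when comparing matches against the chart.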
The DNA Adoption team have provided a DNA Prediction Chart and a Relationship Estimator spreadsheet based on community-provided data for known relationships. For details see the guest blog post by Karin Corbeil on the DNAeXplained blog on Demystifying Ancestry's relationship predictions inspires new relationship estimator tool.
An unidentified author has also provided a spreadsheet on DNA Inheritance Statistics to which anyone can add their data. The spreadsheet can be found here.
- AncestryDNA has produced a table showing the cM range they use for predicted relationships. See the article The science behind a more precise DNA matching algorithm by Anna Swayne, published on 3 May 2016. A chart showing empirical data from the AncestryDNA database is given in Figure 5.2 in the AncestryDNA Matching White Paper.
- 23andMe has an FAQ showing the range of sharing for different relationships in percentages. See What's the average % DNA shared for different types of cousins.
Simulations
The following studies have been published online by mathematician and genetic genealogist Paul Rakow:
- DNA Simulation Version 1a, published online on 27 September 2015. A simulation of the inheritance of DNA over several generations, to calculate how likely or unlikely matches of a given strength will be, for a given relationship.
- Ancestral segments, published online on 27 June 2016. A simulation which examines the size and distribution of ancestral segments across 20 generations.
See also the Wiki page on cousin statistics.
Blog posts
- Why your genetic tree is not the same as your family tree by Diahan Southard, Lisa Louise Cooke's Genealogy Gems, 7 May 2017.
- Face it: DNA cannot find all your relatives DNA.Land blog, 27 February 2016.
- Visualizing distributions for the shared cM project by Blaine Bettinger, The Genetic Genealogist, 23 December 2015.
- An analysis of fourth cousins and other near distant relationships by Jim Owston, Owston/Ouston One-Name Study blog, 10 August 2015.
- How many genomic blocks do you share with a cousin? by Graham Coop, The Coop Lab blog, 2 December 2013.
- How many genetic ancestors do I have? by Graham Coop, The Coop Lab blog, 11 November 2013.
- How much of your genome do you inherit from a particular ancestor? by Graham Coop, The Coop Lab blog, 4 November 2013.
- How much of your genome do you inherit from a particular grandparent? by Graham Coop, The Coop Lab blog, 20 October 2013.
- Widen the net by Judy Russell, The Legal Genealogist, 7 April 2013. A cautionary tale about third cousin matches.
- Genomic variation in sharing between siblings by Graham Coop, The Coop Lab blog, 26 January 2014.
- DNA portraits: second cousins by Jim Owston, Lineal Arboretum blog, 10 April 2012.
- Relatedness Qs and As. Ask a Geneticist, Understanding Genetics, Stanford at the Tech, 2 November 2011.
- The DNA numbers game by Lindsay Greenawalt, Confessions of a Cryokid Blog, 5 September 2011.
- Genetic genealogy and the single segment by Steve Mount, On Genetics blog, 19 February 2011.
- Known Relative Studies by CeCe Moore, "Your Genetic Genealogist" blog, 26 September 2010 (this is a series).
- Q&A: Everyone has two family trees - a genealogical tree and a genetic tree by Blaine Bettinger, The Genetic Genealogist, 10 November 2009.
- Unequal genetic similarity between one's mother and father by Maciamo, Eupedia forum, 8 August 2009.
Charts and tools
- Nolan Lawson's Relatedness Calculator. The tool provides information on the percentage of DNA shared, the degree of relatedness and the relatedness coefficient.
- atDNA Table-based Predictions free from Gliesian, LLC.
- atDNA Grid-based Predictions free from Gliesian, LLC.
- atDNA Gauges-based Predictions free from Gliesian, LLC.
- atDNA Common Ranges free from Gliesian, LLC.
- atDNA Statistics Diagram free from Gliesian, LLC.
- Cousins Confidence Calculations free from Gliesian, LLC.
- MRCA Calculations free from Gliesian, LLC.
- Cousin calculator Free download
- Cousin relationship calculator from Ancestor Search
- Genealogy relationship chart from About.com
- Ancestor chart from Hope Carnicle showing all the percentages up to the ninth cousin level
- Ancestor chart from Hope Carnicle. The same ancestor chart from Hope Carnicle in a diamond shape
Scientific papers
- Hill, WG & Weir, BS (2011). Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genetics Research, vol 93, no. 1, pp. 47-64. (See in particular Figure 5 which shows the distribution of actual genome sharing for different degrees of pedigree relationship.)
See also
- Autosomal DNA portal
- Autosomal DNA
- Autosomal DNA match thresholds
- Autosomal DNA testing comparison chart
- Autosomal DNA tools
- Coefficient of relationship
- Cousin statistics
- Fully identical region
- Half-identical region
- Identical by descent
- Pedigree collapse
- Understanding genetic ancestry testing
References
- In genetic genealogy, the verb 'test' is used loosely to describe the process of submitting a DNA sample and receiving results such as lists of possible relatives. This meaning should not be confused with the precise use of the same word by statisticians referring to the 'testing' of a hypothesis, which is either accepted or rejected based on the statistics observed.
- Tiny differences between identical twins can now be detected by next generation sequencing. See: Weber-Lehmann et al 2014. Finding the needle in the haystack: Differentiating "identical" twins in paternity testing and forensics by ultra-deep next generation sequencing. Forensic Science International: Genetics; 9: 42-46. See also the editorial by Bruce Budowle in Investigative Genetics: Molecular genetic investigative leads to differentiate monozygotic twins.
- Bettinger B. August 2017 update to the Shared cM Project. The Genetic Genealogist, 26 August 2017.