Autosomal DNA statistics

From ISOGG Wiki

Autosomal DNA statistics describe the connection between the genealogical relationship of two people and the amount of autosomal DNA they share. Understanding this connection is critical to interpreting the results of an autosomal DNA test.[1]

Autosomal DNA is inherited equally from both parents. The amount of autosomal DNA inherited from more distant ancestors is randomly shuffled up in a process called recombination and the percentage of autosomal DNA coming from each ancestor is diluted with each new generation.

When interpreting autosomal DNA statistics, one must be careful to distinguish between the distribution of shared DNA for given relationships and the distribution of relationships for given amounts of shared DNA. For example, known second cousins on average share 212.5 centiMorgans (cMs), but in extreme cases can actually share as little as 47 cMs or as much as 760 cMs. Conversely, the relationship between pairs of individuals sharing 212.5 cMs has been found to be anywhere between aunt-or-uncle/niece-or-nephew and third cousin once removed.

Autosomal DNA tests for finding cousins and verifying relationships for genetic genealogy purposes are offered by 23andMe, AncestryDNA, Family Tree DNA (the Family Finder test), Living DNA and MyHeritageDNA. For comparisons of the different services see Tim Janzen's autosomal DNA testing comparison chart.

Distribution of shared DNA for given relationships

There are two simple mathematical methods of calculating the percentages of autosomal DNA shared by two individuals. Both methods give the same results except in the cases of parent/child comparisons, full siblings, double cousins, or any two individuals who are each related to the other through both parents.

The autosomal DNA of two related individuals will be half-identical in regions where each has inherited the same DNA from one parent, and ultimately from one common ancestor. In the cases of siblings and double cousins, their autosomal DNA will be fully identical in regions where each has inherited the same DNA from both parents or from two more distant common ancestors respectively. Full siblings are half-identical on regions where each has inherited the same DNA from exactly one parent and fully identical on regions where each has inherited the same DNA from both parents.

Method I

The first method of calculating percentages (displayed by 23andMe) expresses the aggregate length of the shared segments (i.e. the aggregate length of the half-identical regions, where there is one shared segment, plus twice the aggregate length of the fully identical regions, where there are two shared segments, one paternal and one maternal) as a percentage of the aggregate length of the paternal and maternal autosomes. Using this method, full siblings (excluding identical twins), who are expected to be half-identical on 50% of their autosomal DNA and fully identical on a further 25% of their autosomal DNA, will on average appear to have 50% shared.

Method II

The second method of calculating percentages (to which those relying on FTDNA or GEDmatch must resort) expresses the aggregate length of the half-identical (or better) regions as a percentage of the aggregate length of both sets of autosomes (paternal and maternal). The maximum value that the numerator in this percentage can take is the length of one set of autosomes (say the paternal); the denominator is the length of two sets of autosomes (maternal plus paternal). Thus, the percentages calculated by this method cannot exceed 50%, which is the value that it takes for a parent/child comparison (half-identical at all locations) or a comparison between identical twins (fully identical at all locations). Using Method II, full siblings (other than identical twins) will on average appear to have only 37.5% shared. Whenever there are fully identical regions, the calculated percentages will be smaller than for Method I as half-identical and fully identical regions cannot be distinguished from the available data and must be given equal weight in the calculation.
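
The two methods can be compared side by side with a minimal Python sketch. It assumes a length of 3400 cM for each set of autosomes (the round figure used for the table in this article); the function name `percent_shared` and its parameters are hypothetical and chosen only for illustration.

```python
AUTOSOME_SET_CM = 3400.0  # assumed length of one set of autosomes, in cM


def percent_shared(hir_only_cm, fir_cm, set_cm=AUTOSOME_SET_CM):
    """Return (Method I %, Method II %) for one pair of relatives.

    hir_only_cm: aggregate length of regions that are half-identical only
    fir_cm:      aggregate length of fully identical regions
    """
    total_cm = 2 * set_cm  # paternal plus maternal autosomes
    # Method I counts fully identical regions twice (two shared segments).
    method_i = 100 * (hir_only_cm + 2 * fir_cm) / total_cm
    # Method II gives half-identical and fully identical regions equal weight.
    method_ii = 100 * (hir_only_cm + fir_cm) / total_cm
    return method_i, method_ii


# Average full siblings: half-identical on 50% (1700 cM), fully identical on 25% (850 cM).
siblings = percent_shared(hir_only_cm=1700.0, fir_cm=850.0)
parent_child = percent_shared(hir_only_cm=3400.0, fir_cm=0.0)
identical_twins = percent_shared(hir_only_cm=0.0, fir_cm=3400.0)

print(siblings)         # (50.0, 37.5)
print(parent_child)     # (50.0, 50.0)
print(identical_twins)  # (100.0, 50.0)
```

The two results differ only when `fir_cm` is greater than zero, which is why the methods agree for most relationships.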

The first column in the table below shows the average percentages for different relationships and methods of calculation. The calculations assume that every child gets 50% from its mother and 50% from its father and in turn 25% from each of its four grandparents. The actual percentages vary from the average in individual cases. For example, a person might share 27% of his DNA with one nephew and only 23% with another. Because of the random way that autosomal DNA is inherited, third, fourth and more distant cousins will not necessarily have any detectable half-identical regions. According to Family Tree DNA's figures there is a 90% chance that third cousins will share enough DNA for the relationship to be detected, but there is only a 50% chance that you will share enough DNA with a fourth cousin for the relationship to be identified.

The degree of sharing is also displayed by the DNA companies in units of genetic distance known as centiMorgans (cMs), although in practice the total number of shared centiMorgans is less significant than the number and lengths of individual shared segments. The second column in the table below shows the aggregate lengths in cM of the half-identical (or better) regions shared on average by various pairs of relatives. It assumes that the aggregate length of each set of autosomal chromosomes is 3400cM, and thus that each individual inherits 6800cM of autosomal DNA, 3400cM from each parent. Different DNA companies use different autosomal DNA match thresholds, so that the actual cM figures provided by different companies may be slightly different from these round numbers, even before allowing for random variation around the averages in individual cases.

The reason for the different results from Method I and Method II in the case of siblings and double cousins is that the cM lengths displayed by FTDNA and in the free GEDmatch utility (and, indeed, 23andMe's own Family Inheritance: Advanced) do not distinguish between half-identical and fully identical regions. The best place to see the distinction between half-identical regions and fully-identical regions is in the optional graphical output of the one-to-one comparisons at GEDmatch.com, where FIRs are displayed in green and HIRs are displayed in yellow. It is also possible to see the fully identical regions at 23andMe by using the Family Traits chromosome browser (accessed via the Family and Friends menu).

When using Family Finder data, the percentages based on Method II can be calculated from the cM lengths by dividing the displayed Shared cM by 68.

Note that the X-chromosome is excluded from the total cM shared for all companies except for 23andMe. Males have one X-chromosome and females have two X-chromosomes. To include the X-chromosome in the calculations, divide the combined atDNA and X-chromosome total by 68.81065 instead of by 68. Note that the expected shared percentages of X-DNA depend not only on the genealogical relationship between two people, but also on the numbers of males and females in the two paths to their common ancestor.
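
As a rough illustration of the conversion described above, the following Python sketch turns a displayed Shared cM figure into a Method II percentage. The function name `family_finder_percent` and the `include_x` flag are hypothetical; the divisors 68 and 68.81065 are the ones quoted in the text.

```python
def family_finder_percent(shared_cm, include_x=False):
    """Convert a total shared cM figure into a Method II percentage.

    Dividing by 68 is the same as dividing by 6800 cM (both sets of
    autosomes) and multiplying by 100. With include_x=True, shared_cm is
    assumed to already include the cM shared on the X-chromosome.
    """
    divisor = 68.81065 if include_x else 68.0
    return shared_cm / divisor


print(family_finder_percent(850.0))   # 12.5 (average first cousins)
print(family_finder_percent(2550.0))  # 37.5 (average full siblings)
```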

23andMe include the X-chromosome in their calculations, so their cM figures will be higher than those provided by FTDNA. 23andMe made adjustments to the cM count in June 2013 so the number of cMs will vary slightly depending on when the test was taken.

  • For females using 23andMe data prior to June 2013, there were 7494.8cMs when combining the paternal and maternal autosomal DNA and the two X-chromosomes per Family Inheritance: Advanced.
  • For females using 23andMe data after June 2013 there were 7438.6cMs when combining the paternal and maternal autosomal DNA and the two X-chromosomes per Family Inheritance: Advanced.
  • There are 7074.6 autosomal cMs per 23andMe.
  • For males using 23andMe data there are 7256.8 cMs when combining the atDNA with the single X-chromosome.

Note that AncestryDNA do not provide information on the lengths of half-identical (or better) regions in either centiMorgans or percentages on specific chromosomes. However, AncestryDNA customers can upload their raw data to the free GEDmatch utility in order to extract the necessary cM data for making comparisons and to check the relationship predictions. David Pike's tools can also be used.

Table

The following table shows the average amount of DNA shared by pairs of relatives in percentages and centiMorgans. All relationships up to and including second cousins can be detected by autosomal DNA tests. However, some genuine genealogical third cousins will not show up as a match because, although both cousins will have inherited DNA from their common ancestors, they do not share any overlapping DNA segments. Beyond about five or six generations we will have some genealogical ancestors with whom we share no DNA. They are our genealogical ancestors but not our genetic ancestors.[2][3] If we go back 450 years we have over 32,000 genealogical ancestors but we will have DNA from only around 1000 of these ancestors.[4] The averages shown here for relationships beyond the second cousin level reflect the fact that an increasing proportion of cousins will not share any DNA. For cousins who do share detectable DNA, the expected amount shared will therefore be higher than the average shown. For the expected percentages of detectable relationships at different levels see the article on cousin statistics.
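
The figure of over 32,000 genealogical ancestors is consistent with simple doubling: 2^15 = 32,768 ancestor slots fifteen generations back. The Python sketch below shows that arithmetic. It assumes 30 years per generation (an assumption made here for illustration, not a figure from the sources cited) and ignores pedigree collapse, where the same person fills more than one slot.

```python
YEARS_PER_GENERATION = 30  # assumption for illustration only


def genealogical_ancestors(generations_back):
    """Number of ancestor slots in one generation, ignoring pedigree collapse."""
    return 2 ** generations_back


def average_percent_from_each(generations_back):
    """Average percentage of autosomal DNA traced to each ancestor in that generation."""
    return 100 / genealogical_ancestors(generations_back)


generations = 450 // YEARS_PER_GENERATION
print(generations, genealogical_ancestors(generations))  # 15 32768
print(average_percent_from_each(2))                      # 25.0 (each grandparent)
```

The average per ancestor shrinks quickly, but the sketch says nothing about how many of those ancestors actually contribute DNA; that depends on recombination and is the subject of the simulations cited in the text.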

Average autosomal DNA shared by pairs of relatives, in percentages and centiMorgans

| % shared | Total cM shared half-identical (or better) | Relationship | Notes |
|---|---|---|---|
| 100% (Method I)/50% (Method II) | 3400.00 | Identical twins (monozygotic twins) | Fully identical everywhere.[5] |
| 50% | 3400.00 | Parent/child | Half-identical everywhere |
| 50% (Method I)/37.5% (Method II) | 2550.00 | Full siblings | Half-identical on 50%/1700 cM and fully identical on a further 25%/850 cM. |
| 25% | 1700.00 | Grandparent/grandchild, aunt-or-uncle/niece-or-nephew, half-siblings | |
| 25% (Method I)/23.4375% (Method II) | 1593.75 | Double first cousins | Half-identical on 21.875%/1487.5 cM and fully identical on a further 1.5625%/106.25 cM |
| 12.5% | 850.00 | First cousins, great-grandparent/great-grandchild, great-uncle or aunt/great-nephew or niece, half-uncle or aunt/half-nephew or niece | |
| 6.25% | 425.00 | First cousins once removed, half first cousins, great-great-grandparent/great-great-grandchild, great-great-aunt/uncle, half great-aunt/uncle | |
| 6.25% | 425.00 | Double second cousins | |
| 3.125% | 212.50 | Second cousins, first cousins twice removed, half first cousin once removed, half great-great-aunt/uncle, great-great-great-grandparent/great-great-great-grandchild | |
| 1.563% | 106.25 | Second cousins once removed, half second cousins, first cousin three times removed, half first cousin twice removed | |
| 0.781% | 53.13 | Third cousins, second cousins twice removed | Up to 10% of third cousins will not share enough DNA to show up as a match. See cousin statistics |
| 0.391% | 26.56 | Third cousins once removed | |
| 0.195% | 13.28 | Fourth cousins, third cousins twice removed | Up to 50% of fourth cousins will not share enough DNA to show up as a match. See cousin statistics |
| 0.0977% | 6.64 | Fourth cousins once removed, third cousins three times removed | |
| 0.0488% | 3.32 | Fifth cousins | Only between 15% and 32% of fifth cousins will share enough DNA to show up as a match. See cousin statistics |
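
The cousin rows of the table follow a halving pattern, which the Python sketch below reproduces. The formula 1/2^(2d + 1 + r) for d-th cousins r times removed is inferred here from the table's figures rather than quoted from a source, and the function name `average_cousin_share` is hypothetical. The sketch uses the table's assumption of 6800 cM of autosomal DNA per person and covers simple (not double) cousin relationships only.

```python
from fractions import Fraction

TOTAL_AUTOSOMAL_CM = 6800  # 3400 cM from each parent, as assumed for the table


def average_cousin_share(degree, removed=0, half=False):
    """Return (average % shared, average cM shared) for a cousin relationship.

    degree:  1 for first cousins, 2 for second cousins, and so on
    removed: number of times removed
    half:    True for half cousins (one common ancestor instead of a couple)
    """
    fraction = Fraction(1, 2 ** (2 * degree + 1 + removed))
    if half:
        fraction /= 2  # a half-relationship shares half the full figure
    return float(fraction * 100), float(fraction * TOTAL_AUTOSOMAL_CM)


print(average_cousin_share(1))             # (12.5, 850.0)
print(average_cousin_share(2))             # (3.125, 212.5)
print(average_cousin_share(2, removed=1))  # (1.5625, 106.25)
print(average_cousin_share(1, half=True))  # (6.25, 425.0)
```

These are averages over all pairs, including pairs who share no detectable DNA, so they should not be read as the amount a matching pair will typically show.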

Notes to Table

  • There is no variation between families in the parent/child or identical twins shared cM figures; beyond these immediate relationships, recombination results in random variation around the average figures above from one pair of individuals to another.
  • When a grandchild is compared to a grandparent, the shared cM with the other grandparent on the same side is easily inferred. The grandchild gets all 3400cM of, say, his paternal autosomes from his father. If it is seen that 1600cM of this came from the paternal grandfather, then the other 1800cM must have come from the paternal grandmother. The initial estimate of 1700cM shared by grandchild and paternal grandmother can thus be updated to 1800cM when it has been ascertained that grandchild and paternal grandfather share only a below average 1600cM.
  • When the subjects of the comparison descend from identical twin children of their most recent common ancestral couple, then the figures in the above table should be doubled.
  • The expected % shared for a half-relationship will always be exactly half of the expected % shared for the corresponding full relationship.
  • A similar method to that used for full siblings and for double first cousins can be used to compute expected shared percentages for any two subjects of comparison who are doubly related. However, the expected % shared for a double relationship can be slightly less than the sum of the expected % shared for the appropriate single relationships.
    • If Jack is related to both of Jill's parents, then Method I and Method II will give slightly different figures, as double cousins of this type are expected to be fully identical in some regions.
    • If Jill is a more remote descendant of spouses who are both related to Jack, then Jill will clearly have inherited at most one of the two segments in regions where the child of those spouses was fully identical to Jack. This reduces Jack and Jill's expected % shared slightly from the ballpark figure obtained by adding the expected % shared for the two relationships.
    • For example, double second cousins, where the double relationship arises because at least one is related on both the paternal side and the maternal side to the other, are expected to share 3.125% (1/32) on each side, or 6.25% (1/16) in total, using Method I. Using Method II, a small adjustment must be made to allow for regions where they are fully identical (1/1024 or approximately 0.098%), so that they are expected to be half-identical or better on 63/1024 or approximately 6.152%.
    • On the other hand, double second cousins who are children of double first cousins are expected to be half-identical on a quarter of the approximately 23.438% on which their parents are half-identical or better, in other words on approximately 5.859%.
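
The grandparent inference described in the notes above is a simple subtraction, sketched here in Python. The function name `other_grandparent_cm` is hypothetical, and 3400 cM is the assumed length of one set of autosomes used throughout this article.

```python
AUTOSOME_SET_CM = 3400.0  # assumed length of one parental set of autosomes, in cM


def other_grandparent_cm(observed_cm, set_cm=AUTOSOME_SET_CM):
    """Infer the cM shared with the untested grandparent on the same side.

    The grandchild's autosomes from one parent come entirely from that
    parent's two parents, so the two grandparental shares add up to set_cm.
    """
    if not 0 <= observed_cm <= set_cm:
        raise ValueError("observed_cm must lie between 0 and set_cm")
    return set_cm - observed_cm


# Grandchild shares a below-average 1600 cM with the paternal grandfather.
print(other_grandparent_cm(1600.0))  # 1800.0
```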

Chart

The chart below (courtesy Dimario, Wikimedia Commons) shows the average amount of autosomal DNA inherited by all close relations up to the third cousin level.

[Image: Cousin tree (with genetic kinship).png]

Distribution of genealogical relationships for given amounts of shared DNA

When interpreting DNA test results it is helpful to have an understanding of the statistics relating to the amount of DNA shared for known relationships. There is a limited amount of data available in the peer-reviewed scientific literature. This has been supplemented with citizen science projects, company data and computer simulations by geneticists and genetic genealogists.

Scientific papers

A 2012 paper by Henn, Hon and Macpherson et al looked at IBD (identical by descent) sharing in the 23andMe database.[6] The paper included a figure (reproduced below) based on computer simulations showing the range of sharing for relationships up to the fifth to eighth cousin level.

[Image: Figure 3A Henn et al 2012.png]

The relationship between the degree of cousinship and IBDhalf metrics. The authors used pedigree-based simulations to characterize the relationship between IBDhalf metrics and degrees of cousinship for multiple population samples. Genomic data from a European sample were used to simulate an 11-generation pedigree. The joint distribution of IBDhalf and number of IBDhalf segments is shown for each pairwise comparison from the pedigree simulations. GP/GC indicates grandparent/grandchild pairs. Simulations were run on phased samples from several HGDP-CEPH population samples and European, Asian and Ashkenazi samples from a 23andMe customer dataset. Simulations were conducted by specifying an extended pedigree structure and simulating genomes for the pedigree by mating individuals drawn from a pool of empirical genomes. Reproduced from Henn et al 2012[6] under a Creative Commons Licence.

A paper published by Hill and Weir (2011) looked at the variation in actual relationship as a consequence of Mendelian sampling and linkage.[7] Figure 5 in this paper shows the distribution of actual genome sharing for different degrees of pedigree relationships.

Visscher et al (2006) looked at the range of IBD sharing in full siblings.[8] See Figure 1 which has a histogram showing the range of sharing for full siblings between ~37% and ~62%.

Company stats

AncestryDNA

23andMe

The following table, based on figures provided by 23andMe, shows the expected range of DNA shared for different relationships in both percentage terms and total cM shared, including the X-chromosome.[9][10] Note that for all relationships at the first cousin level or more distant, the percentages are likely to be averages over all possible relationships, and the average percentage of the X-chromosome shared is the same as the average percentage of the autosomes. 23andMe is the only company which includes the X-chromosome in the total cM shared. Note that for many relationships the ranges overlap.

| Relationship | Range in percentages | Range in centiMorgans |
|---|---|---|
| Parent / child | 50% | ~3719 cM |
| Father / son (no X sharing) | 47.5% | ~3536 cM |
| Full sibling | 38% - 61% | 2826 - 4537 cM |
| Grandparent / grandchild; Aunt / uncle; Niece / nephew; Half siblings | 17% - 34% | 1264 - 2529 cM |
| 1st cousin; Great-grandparent / Great-grandchild; Great-uncle / aunt; Great-nephew / niece | 4% - 23% | 298 - 1710 cM |
| 1st cousin once removed; Half 1st cousin | 2% - 11.5% | 149 - 855 cM |
| 2nd cousin | 2% - 6% | 149 - 446 cM |
| 2nd cousin once removed; Half 2nd cousin | 0.6% - 2.5% | 45 - 186 cM |
| 3rd cousin | 0% - 2.2% | 0 - 164 cM |
| 4th cousin | 0% - 0.8% | 0 - 60 cM |
| 5th cousin to distant cousin | Variable | Variable |
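
According to the note attached to this table, the centiMorgan ranges were derived from 23andMe's percentages by multiplying by a genome size of 7438 cM (autosomes plus X-chromosome). The Python sketch below repeats that conversion; the function name `percent_to_cm` is hypothetical. A few entries in the table differ from this arithmetic by a centiMorgan or so, presumably because of rounding in the source.

```python
GENOME_CM_23ANDME = 7438  # autosomes plus X-chromosome, per the note on this table


def percent_to_cm(percent):
    """Convert a percentage of DNA shared into an approximate cM total."""
    return round(percent / 100 * GENOME_CM_23ANDME)


print(percent_to_cm(50))                     # 3719 (parent / child)
print(percent_to_cm(38), percent_to_cm(61))  # 2826 4537 (full siblings)
print(percent_to_cm(2), percent_to_cm(6))    # 149 446 (2nd cousins)
```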

MyHeritage DNA

MyHeritage DNA have published a short table showing the range of sharing for close relations up to the first cousin level. See the article What are shared DNA segments? in their Help Centre.

Living DNA

Living DNA report matches up to the ninth degree (fourth cousin). See the article What does my relationship prediction mean? in their Support Centre, which shows the ranges for all relationships up to the fourth cousin level.

Shared cM Project

Blaine Bettinger has been collecting empirical (real life) data from the genetic genealogy community on the number of shared centiMorgans for known genealogical relationships as part of his shared cM project.[11] The chart below is taken from the latest update of the project in March 2020 which was based on submissions for over 60,000 known relationships.[12] The chart is made available under a Creative Commons Licence. You are free to share and use the information for non-commercial purposes, as long as you give proper attribution and release anything you create under the same licence. Additional information including histograms and the breakdown of companies is provided in the PDF download. Data is still being collected for the project and you can add your own statistics using this form on GoogleDocs.

Jonny Perl has created a tool which incorporates data from the Shared cM project which allows the user to enter the total cM shared and get a report on the number of possible relationships. From an interactive chart you can click through and view the histograms from the Shared cM Project for each relationship. The tool can be found on the DNA Painter website.

[Image: Shared cM Project v4.png]

Ashkenazi Shared DNA Survey

Lara Diamond set up an Ashkenazic Jewish Shared DNA Survey in January 2018, inspired by Blaine Bettinger's Shared cM Project. As of 1 September 2019 she had 5537 data points from the Ashkenazi community on the amount of DNA shared for different relationships. Further submissions are invited. She plans to issue follow-on surveys for Sephardic and other Jewish subgroups.

For details of the project see Lara's blog post Ashkenazic Jewish Shared DNA Survey (14 January 2018).

For the latest results see Lara's blog post Ashkenazic Shared DNA Survey - August 2022 Update (1 August 2022).

Relationship data from Tim Janzen

Tim Janzen has provided two charts that provide statistical information about the amount of DNA sharing for various relationships from the first cousin once removed level and upwards for endogamous and non-endogamous populations. The statistics are based on information from real people who have been tested by 23andMe and Family Tree DNA and who have a known genealogical relationship to someone else who has also been tested by the same company. The charts were originally designed for use with 23andMe data but now also incorporate data from FTDNA's Family Finder test. The charts are organized by the degree of relationship with the most closely related people being listed at the top and more distant cousins being listed at the bottom. The charts also include information on the median and the average number of shared cMs for people who are related to each other from the first cousin once removed level of relationship to the 5th cousin level of relationship. The charts can be downloaded from the following links:

The following table provided by Tim Janzen shows the ranges of total centiMorgans shared and number of segments shared for different relationships based on data from the Mennonite DNA Project.[13] The cM totals are based on 6761 cMs in FTDNA's Family Finder test.

| Relationship | Range | Expected | Range of number of shared segments |
|---|---|---|---|
| Parent/child | 3539-3748 cMs | | 23-29 |
| First cousins | 548-1139 cMs | 888 cMs | 17-32 |
| First cousins once removed | 220-638 cMs | 444 cMs | 12-23 |
| Second cousins | 86-426 cMs | 222 cMs | 10-18 |
| Second cousins once removed | 19-197 cMs | 111 cMs | 4-12 |
| Third cousins | 16-111 cMs | 55.4 cMs | 2-6? |
| Third cousins once removed | 0-99 cMs | 27.8 cMs | 1-4 |
| Fourth cousins | 0-54 cMs | 13.8 cMs | 0-2 |
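
Because the ranges overlap, a given total of shared cM is usually compatible with several relationships. The Python sketch below illustrates this by looking up a total against the ranges in the table above. It is only an illustration: it covers just the eight relationships listed, the function name `candidate_relationships` is hypothetical, and it says nothing about which candidate is most probable.

```python
# Ranges of total shared cM copied from the table above (Mennonite DNA Project data).
JANZEN_RANGES_CM = {
    "Parent/child": (3539, 3748),
    "First cousins": (548, 1139),
    "First cousins once removed": (220, 638),
    "Second cousins": (86, 426),
    "Second cousins once removed": (19, 197),
    "Third cousins": (16, 111),
    "Third cousins once removed": (0, 99),
    "Fourth cousins": (0, 54),
}


def candidate_relationships(shared_cm):
    """List the tabulated relationships whose observed range includes shared_cm."""
    return [
        name
        for name, (low, high) in JANZEN_RANGES_CM.items()
        if low <= shared_cm <= high
    ]


print(candidate_relationships(100))
# ['Second cousins', 'Second cousins once removed', 'Third cousins']
```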

Tim Janzen has also compiled a chart showing the probability of a given genealogical relationship for each cM threshold, in 1 cM increments from 6 cMs up to 200 cMs. The chart may be downloaded as an Excel file from this link. This chart applies to non-endogamous populations. When using the chart to predict relationship from Family Finder data you will need to remove the data for all segments under 7 cMs.

Tim Janzen has also provided data on the number of SNPs shared for different relationships based on the 23andMe Compare Genes function:

  • Parent-child pairs share between 83.94% and 84.20% of SNPs (50% of DNA in common)
  • Siblings share between 83.81% and 87.47% of SNPs (50% of DNA in common)
  • Uncle/aunt-niece/nephew pairs share between 78.48% and 79.57% of SNPs (25% of DNA in common)
  • Grandparent-grandchild pairs share between 77.96% and 80.59% of SNPs (25% of DNA in common)
  • First cousins and great uncle/great aunt-grandniece/grandnephew pairs share between 75.78% and 77.03% of SNPs (12.5% of DNA in common)
  • First cousins once removed share ca 75.5% of SNPs (6.25% of DNA in common)
  • Second cousins and first cousins twice removed share ca 75% of SNPs (3.125% of DNA in common)
  • Unrelated people of European descent share 73-74.6% of SNPs

Other sources for relationship ranges

Leah Larkin has written a blog post on The limits of predicting relationships using DNA which includes a table of relationship ranges from the DNA Detectives group and a chart showing ranges based on simulations by AncestryDNA.

The DNA Adoption team have provided a DNA Prediction Chart and a Relationship Estimator spreadsheet based on community-provided data for known relationships. For details see the guest blog post by Karin Corbeil on the DNAeXplained blog on Demystifying Ancestry's relationship predictions inspires new relationship estimator tool.

An unidentified author has also provided a spreadsheet on DNA Inheritance Statistics to which anyone can add their data. The spreadsheet can be found here.

Simulations

The geneticist Graham Coop has published a number of useful articles based on simulations on the subject of autosomal DNA inheritance:

Amy Williams and colleagues at Cornell University have provided some useful charts based on simulations which show the likelihood of matching with cousins of different degrees of relationship and the number of segments shared:

For the scientific details of the simulations see: Caballero M, Seidman DN, Qiao Y et al (2019). Crossover interference and sex-specific genetic maps shape identical by descent sharing in close relatives. PLoS Genetics 15(12): e1007979.

The following studies have been published online by mathematician and genetic genealogist Paul Rakow:

  • DNA Simulation Version 1a, published online on 27 September 2015. A simulation of the inheritance of DNA over several generations, to calculate how likely or unlikely matches of a given strength will be, for a given relationship.
  • Ancestral segments, published online on 27 June 2016. A simulation which examines the size and distribution of ancestral segments across 20 generations.

Genetic genealogist Philip Gammon has published some first cousin match simulations:

See also the Wiki page on cousin statistics.

References

  1. In genetic genealogy, the verb 'test' is used loosely to describe the process of submitting a DNA sample and receiving results such as lists of possible relatives. This meaning should not be confused with the precise use of the same word by statisticians referring to the 'testing' of a hypothesis, which is either accepted or rejected based on the statistics observed.
  2. Bettinger B. Q&A: Everyone has two family trees - a genealogical tree and a genetic tree. The Genetic Genealogist, 10 November 2009.
  3. Coop G. How many genetic ancestors do I have?. The Coop Lab Blog, 11 November 2013.
  4. Coop G. Genetic ancestry groups and genetic similarity. arXiv:2207.11595v1 [q-bio.PE], 23 July 2022.
  5. Tiny differences between identical twins can now be detected by next generation sequencing. See: Weber-Lehmann et al 2014. Finding the needle in the haystack: Differentiating "identical" twins in paternity testing and forensics by ultra-deep next generation sequencing. Forensic Science International: Genetics; 9: 42-46. See also the editorial by Bruce Budowle in Investigative Genetics: Molecular genetic investigative leads to differentiate monozygotic twins.
  6. Henn BM, Hon L, Macpherson JM et al (2012). Cryptic distant relatives are common in both isolated and cosmopolitan genetic samples. PLOS One 3 April 2012.
  7. Hill, WG & Weir, BS (2011). Variation in actual relationship as a consequence of Mendelian sampling and linkage. Genetics Research, vol 93, no. 1, pp. 47-64.
  8. Visscher PM, Medland SE, Ferreira MAR et al (2006). Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLOS Genetics 24 March 2006.
  9. 23andMe Customer Care. DNA Relatives: detecting relatives and predicting relationships
  10. The original chart only provided percentages. An approximation of the cM ranges has been calculated by using Ann Turner’s data giving us a figure of 3719 cM (autosomes and X-chromosome) for a mother/child relationship which converts into a figure of 7438 cM for the size of the genome. The percentages were then converted into centiMorgans by multiplying by 7438.
  11. Bettinger B. The Shared cM Project: a demonstration of the power of citizen science. Journal of Genetic Genealogy 2016 8(1):38-42.
  12. Bettinger B. Version 4.0! March 2020 Update to the Shared cM Project!. The Genetic Genealogist, 27 March 2020.
  13. The information in this table was included in Tim Janzen's presentation "Discovering and Verifying your Ancestry using Family Finder" at the 2014 Family Tree DNA Conference on Genetic Genealogy held in Houston, Texas, on 11 October 2014. The slides can be downloaded from http://bit.ly/2EyT36N.