From ISOGG Wiki
Phasing is the task or process of assigning alleles (the As, Cs, Ts and Gs) to the paternal and maternal chromosomes. The term is usually applied to types of DNA that recombine, such as autosomal DNA or the X-chromosome. Phasing can help to determine whether matches are on the paternal side or the maternal side, on both sides or on neither side. Phasing can also help with the process of chromosome mapping – assigning segments to specific ancestors. The use of phased data reduces the number of false positive matches, particularly for smaller segments under 15 centiMorgans (cMs).
- 1 Trio phasing
- 2 Phasing with data from one parent or other family members
- 3 Statistical phasing
- 4 Genetic genealogy companies
- 5 The future
- 6 Phasing tools
- 7 Scientific papers
- 8 Articles
- 9 Blog posts
- 10 Videos
- 11 See also
- 12 References
Trio phasing – using data from a child and both parents – is the gold standard for phasing. It is possible to phase about 94% of the alleles in an autosomal dataset using a two parent/one child trio. The number of alleles that can be phased is marginally increased if siblings are also tested. Roach et al found that they were able to phase 98.8% of the alleles by using data from two parents and four children.
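The trio logic can be illustrated with a short sketch. The Python function below is a minimal illustration for a single SNP, not the method used by any particular tool. The function name and the two-letter genotype strings are assumptions made for the example. A site is phased when only one assignment of the child's alleles is consistent with the parents' genotypes. It is left unphased when all three people are heterozygous, when there is a no-call, or when the genotypes are inconsistent with Mendelian inheritance.

```python
def phase_trio_site(child, father, mother):
    """Phase one SNP for a child using both parents (illustrative sketch).

    Each argument is an unordered two-letter genotype such as "AG".
    Returns (paternal_allele, maternal_allele), or None if the site
    cannot be phased.
    """
    a, b = child
    options = set()
    # Try both orientations: a from the father and b from the mother,
    # then the reverse.
    for paternal, maternal in ((a, b), (b, a)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    if len(options) == 1:
        return options.pop()
    # Zero options: no-call or Mendelian inconsistency.
    # Two options: child and both parents are heterozygous.
    return None


print(phase_trio_site("AG", "AA", "GG"))  # ('A', 'G')
print(phase_trio_site("AG", "AG", "GG"))  # ('A', 'G')
print(phase_trio_site("AG", "AG", "AG"))  # None
```

The sites where all three members of the trio are heterozygous are the ones that remain unphased, which is why extra siblings can resolve a few more positions.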
Phasing with data from one parent or other family members
If only one parent is available for testing, first test the parent and all of that parent's children. Then test at least one of the parent's grandchildren through each of the parent's children who had children. It would also be reasonable to test the spouses of the parent's children, since that increases the amount of data you can phase.
If no parents are available for testing, first test all children of the family, up to at least five (assuming five or more are available for testing). Then test at least one of the parents' grandchildren through each of the parents' children who had children. It would also be reasonable to test the spouses of the parents' children, since that increases the amount of data you can phase.
Once you have done the above then start concentrating on testing first and second cousins of the parents. There will be a diminishing return after about five or so first cousins, but it makes sense to test as many first cousins as you can afford to test up to some limit.
Some phasing can also be done using siblings, aunts and uncles or other close relatives as a proxy for a parent. This is sometimes known as poor man's phasing.
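For the single-parent case, a rough sketch of the per-SNP reasoning is given below. It is an illustration under simple assumptions (one tested parent, two-letter genotype strings, a hypothetical function name), not a description of any specific tool. Wherever the child is heterozygous and the tested parent is homozygous, the allele from that parent is known, and the other allele must have come from the untested parent.

```python
def phase_with_one_parent(child, parent):
    """Phase one SNP for a child using a single tested parent (sketch).

    Returns (allele_from_tested_parent, allele_from_other_parent),
    or None if the site cannot be resolved from this parent alone.
    """
    a, b = child
    if a == b:
        # A homozygous child is trivially phased, provided the tested
        # parent could actually have passed on that allele.
        return (a, b) if a in parent else None
    # Heterozygous child: keep the orientations in which the tested
    # parent carries the allele assigned to them.
    options = {(x, y) for x, y in ((a, b), (b, a)) if x in parent}
    return options.pop() if len(options) == 1 else None


print(phase_with_one_parent("AG", "AA"))  # ('A', 'G')
print(phase_with_one_parent("AG", "AG"))  # None
```

Sites where both the child and the tested parent are heterozygous stay unresolved in this sketch, which is where data from further relatives becomes useful.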
It is not always possible to obtain trios for phasing and, even if it were, it is not economical or computationally feasible to phase large trio datasets. Sophisticated statistical algorithms have been developed which phase the data based on allele frequencies derived from reference populations. A number of programs are available, such as Beagle and FastIBD. Phasing can be done with a high degree of accuracy if large enough reference cohorts are available which are representative of the populations being studied. However, with genotype data the current methodologies are not able to reliably phase small segments under 5 cM. One study reported a false positive rate of over 67% for 2-4 cM segments when compared with trios.
Statistical or population-based phasing works because our DNA is all very similar and because it's passed on in chunks. Think of it like trying to read a sentence when some of the letters are missing. There are only so many combinations that will fit in the available spaces. If you saw these words:
R-d is my f-v--r-t- c-l--r
You would probably be able to work out that the sentence should read:
Red is my favourite colour
There are regional variations in the "sentences" but even if there were a couple of "deletions" you'd still be able to work it out:
Red is my favorite color
Difficulties arise when you have a short word without the context of a full sentence. R-d on its own could be red, rid, or rod.
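The same idea can be shown with a toy genetic example. The sketch below is a deliberately simplified illustration and is not how Beagle or any other real program works; real programs use far more sophisticated statistical models. The function name, the made-up reference haplotypes and the exact-match rule are all assumptions for the example. It lists the pairs of reference haplotypes that could explain a run of unphased genotypes.

```python
from itertools import combinations_with_replacement


def compatible_pairs(genotypes, reference_haplotypes):
    """Return pairs of reference haplotypes consistent with the genotypes.

    genotypes: unordered two-letter genotypes, one per SNP, e.g. ["AG", "CT"].
    reference_haplotypes: equal-length strings with one allele per SNP.
    """
    wanted = [sorted(g) for g in genotypes]
    pairs = []
    for h1, h2 in combinations_with_replacement(reference_haplotypes, 2):
        # The pair fits if, at every SNP, its two alleles equal the genotype.
        if all(sorted((x, y)) == w for x, y, w in zip(h1, h2, wanted)):
            pairs.append((h1, h2))
    return pairs


# Made-up reference panel covering four SNPs.
reference = ["ACTG", "GTCA", "ATTG", "GCCG"]
genotypes = ["AG", "CT", "CT", "AG"]

# Four SNPs of context: only one pair of haplotypes fits.
print(compatible_pairs(genotypes, reference))
# [('ACTG', 'GTCA')]

# Two SNPs of context: two different phasings fit equally well.
print(compatible_pairs(genotypes[:2], [h[:2] for h in reference]))
# [('AC', 'GT'), ('AT', 'GC')]
```

As with the short word R-d, the short window is ambiguous: with only two SNPs, A could travel with either C or T.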
Genetic genealogy companies
The raw genotype data generated by the Illumina microarray chips used for the autosomal DNA tests from the genetic genealogy companies is unphased and therefore does not distinguish the alleles on the maternal and paternal chromosomes. Customers who download their raw data file will observe that in the genotype column there are two DNA letters for each SNP. These letters are unsorted and could have come from either parent.
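The unordered nature of the genotype column can be made explicit when reading a raw data file. The sketch below assumes a tab-separated layout with four columns (SNP identifier, chromosome, position, genotype) and comment lines starting with "#". That layout, the function name and the sample identifiers are assumptions for illustration, and actual file layouts differ between companies.

```python
def read_unphased(lines):
    """Map each SNP identifier to an unordered genotype (sketch).

    The genotype is stored as a sorted tuple so that "AG" and "GA"
    are treated as the same unphased call.
    """
    genotypes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        snp_id, chromosome, position, genotype = line.split("\t")
        if len(genotype) != 2:
            continue  # skip anything that is not a two-letter call
        genotypes[snp_id] = tuple(sorted(genotype))
    return genotypes


# Made-up sample lines in the assumed layout.
sample = [
    "# rsid\tchromosome\tposition\tgenotype",
    "rs1\t1\t1000\tGA",
    "rs2\t1\t2000\tAG",
    "rs3\t1\t3000\tCC",
]
calls = read_unphased(sample)
print(calls["rs1"] == calls["rs2"])  # True
```

Nothing in the file says which of the two letters came from which parent, which is the information that phasing adds.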
AncestryDNA and MyHeritage DNA are currently the only two companies which phase the data before assigning matches. Ancestry has developed its own phasing algorithm known as Underdog. The technical details are provided in the AncestryDNA Matching White Paper. They claim to have an error rate of under 1%, and the error rate improves as the size of the training reference dataset increases. As of the beginning of 2016, AncestryDNA uses a reference panel of more than 300,000 genotypes. The details of MyHeritage DNA's phasing are given in their blog post on major updates and improvements to MyHeritage DNA matching. See also the presentation given by Yaniv Erlich, MyHeritage DNA's Chief Scientific Officer, at RootsTech 2018: MyHeritage DNA 1010: from test to results.
Note, however, that if you download the raw data from AncestryDNA or MyHeritage to upload to third-party sites you will receive a file of unphased data.
The 23andMe test and the Family Finder test from Family Tree DNA do not phase the data before assigning matches. However, 23andMe uses statistical phasing for its Ancestry Composition. If one or both parents have been tested at 23andMe, Ancestry Composition can determine which ancestral segments have been inherited from each parent. For a detailed explanation see the 23andMe article on The phasing process.
None of the companies currently provide a facility for customers who have tested their parents to phase their data, and none of the companies allow customers to upload their own phased file.
The free GedMatch website provides a Phasing Data Generator which allows the user to generate phased maternal and paternal data files. The algorithm was developed by John S Walden and implemented by John Olson. Phased paternal files have the prefix P. Phased maternal files have the prefix M. The phased kits can be compared in the GedMatch database in the usual way. For a detailed explanation see the GedMatch Wiki page on phasing.
David Pike's phasing utilities
David Pike has developed two tools for phasing which can be accessed from his website:
- Phase a child when given data for child and both parents
- Phase siblings, with data available from both parents.
Felix Immanuel's phasing utility
Felix Immanuel has written his own phasing utility which can be downloaded from his Genetic Genealogy Tools website.
Oxford Statistics Phasing Server
Oxford Statistics provides a free phasing server for phasing whole genomes using VCF files. For details see the Oxford Statistics website.
Early pioneers of autosomal phasing, like Whit Athey and Tim Janzen, used Microsoft Excel. (NOTE: Do not use versions of Excel prior to 2007 since they will not have enough rows. Phasing can also be done with the free and open-source office suite LibreOffice.)
Tim Janzen's Excel program will phase either 23andMe or Family Finder data from two parents and one of their children. The program can do multiple or all of the autosomal chromosomes at once assuming that your computer can handle a large Excel file with all of the data in it. The program can be downloaded from Tim's Dropbox account at: http://dl.dropbox.com/u/21841126/phasing%20program%20%28small%20version%29.xls.
Instructions on how to use the program may be found at: http://dl.dropbox.com/u/21841126/phasing%20program%20instructions.rtf.
Tim has also uploaded a small version of the program that includes sample data from two parents and one of their children for 500 SNPs which will give people an idea of what the output looks like on a small scale. The program can be downloaded here.
For instructions on phasing see the article on the phasing process, which outlines Tim Janzen's methodology.
- Choi Y, Chan AP, Kirkness E, Telenti A and Schork NJ. Comparison of phasing strategies for whole genomes. PLOS Genetics, 5 April 2018.
- Loh PR, Palamara PF, Price AL (2016). Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics 48(7): 811-816. Epub 2016 Jun 6. Preprint available here.
- O'Connell J, Sharp K, Shrine N et al (2016). Haplotype estimation for biobank-scale data sets. Nature Genetics 48(7): 817-820. Epub 2016 Jun 6.
- Browning SR, Browning BL (2011). Haplotype phasing: existing methods and new developments. Nature Reviews Genetics 12(10): 703-714. A good review article summarising the currently available methodologies for phasing.
- Williams AL, Patterson N, Glessner J, Hakonarson H, Reich D. Phasing of many thousands of genotyped samples. American Journal of Human Genetics 2012 Aug 10; 91(2): 238-251.
- Tewhey R, Bansal V, Torkamani A, Topol EJ, Schork NJ. (2011). The importance of phase information for human genomics. Nature Reviews Genetics 12(3): 215-223.
- Roach JC, Glusman G, Hubley R et al (2011). Chromosomal haplotypes by genetic phasing of human families. American Journal of Human Genetics 89 (3): 382-397.
- Slatkin M (2008). Genotype data and haplotype phase. From the article Linkage disequilibrium — understanding the evolutionary past and mapping the medical future. Nature Reviews Genetics 9, 477-485 (June 2008).
- Phasing the chromosomes of a family group when one parent is missing T. Whit Athey.
- Going through a phase: haplotyping the female X chromosomes Ann Turner.
- From the Director: why we need phasing ISOGG Newsletter Jan/Feb 2011.
- Bettinger B. The effect of phasing on reducing false distant matches (or, phasing a parent using GEDmatch). The Genetic Genealogist 26 July 2017.
- Turner A. What a difference a phase makes. Guest post on The Genetic Genealogist Blog, 30 March 2015.
- Rose K. A study on French Canadian DNA and the implications for genetic genealogy (Internet Archive version). DNA Genealogy blog, 16 May 2014. The article includes a discussion of the merits of the different phasing engines.
- Handy S. Autosomal DNA testing: phasing. DNA Genealogical Experiences and Tutorials blog, 3 November 2012.
- Use Family SNP Data to Phase Your Own Genome The Chromosome Chronicles, 30 September 2009.
- Phasing: determining which SNPs are inherited together The Chromosome Chronicles, 8 September 2009.
David Pike gave a presentation on "The use of phasing in genetic genealogy" at the Institute for Genetic Genealogy held in Maryland in 2014. The lecture can be viewed online for a small fee.
A guide to phasing from Illumina:
- Autosomal DNA
- Autosomal DNA match thresholds
- Autosomal DNA tools
- Chromosome mapping
- Identical by descent
- The phasing process
- Visual phasing
- Roach JC, Glusman G, Hubley R et al. Chromosomal haplotypes by genetic phasing of human families. American Journal of Human Genetics Volume 89, Issue 3, 382-397.
- Durand EY, Eriksson N, McLean CY (2014). Reducing pervasive false positive identical-by-descent segments detected by large-scale pedigree analysis. Molecular Biology and Evolution 2014 doi: 10.1093/molbev/msu151.