ChrisR/Y-SNP checking and certification
From ISOGG Wiki
Collection of good practices to check novel SNPs and prepare them for haplogroup certification, Sanger sequencing submission, etc. For now this is a personal page and makes no claim to show the best or ISOGG-preferred way.
Sources of Y-SNPs
- FTDNA BigY Results > Novel Variants: the filtering process does not exclude SNPs in many untrusted Y-chromosome areas and offers only a Confidence field (High, Medium), which is not very informative because the underlying process is not reproducible. Also available to FTDNA Project Admins.
- FTDNA BigY Results > Download Raw Data > Download VCF: ZIP including a .vcf file (Variant Call Format; variants against hg build 37.3, both PASS and REJECTED included) and a .bed file (regions targeted and passing QC). This is a more informative collection from the FTDNA BAM-file-to-SNP pipeline. Available for download only to kit owners (with kit login). Used, for example, to build The Big Tree (Alex Williamson).
- FGC Interpretation Results ZIP file > YSNP > *.variantCompare.* : usually for Y-Elite but also for BigY and other NGS BAM file interpretation. Reliable novel SNPs are registered in the FGC series. The detection and filtering algorithm is well balanced and, together with the YFull pipeline, gives one of the most optimized result sets for Y-chromosome variants.
- YFull > Novel SNPs > Download .CSV: variants with quality descriptors (Best qual, Acceptable qual, Ambiguous qual, Low qual, One reading, INDELs); the number of reads from the BAM file is also included.
- Third party creation of variants from BAM files: Galaxy (Online, Wiki), samtools (Linux), BAM Analysis Kit (Windows-Toolset), etc.
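As a rough illustration of working with the VCF source listed above, the following sketch keeps only single-base variants whose FILTER column is PASS. The function name `read_pass_snps` and the example records are hypothetical; the column order is the standard VCF one (CHROM, POS, ID, REF, ALT, QUAL, FILTER), and the chromosome label "chrY" is an assumption about how the file names the Y chromosome.

```python
def read_pass_snps(vcf_lines):
    """Return (chrom, pos, ref, alt) for single-base variants with FILTER == PASS."""
    snps = []
    for line in vcf_lines:
        # Skip blank lines and VCF header/meta lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
        if flt != "PASS":
            continue  # drops REJECTED and any other non-PASS records
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue  # drops INDELs and reference-only calls
        snps.append((chrom, int(pos), ref, alt))
    return snps


# Hypothetical records, for illustration only (not real BigY output).
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrY\t2650000\t.\tA\tG\t99\tPASS\t.",
    "chrY\t2650100\t.\tC\tT\t12\tREJECTED\t.",
    "chrY\t2650200\t.\tAT\tA\t99\tPASS\t.",
    "chrY\t2650300\t.\tG\t.\t99\tPASS\t.",
]

for snp in read_pass_snps(example):
    print(snp)
```

In practice the lines would come from the unzipped .vcf file, for example via `open(path)`.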
chrY region reliability
- classic nine unique reliable regions, ~8.97 Mbp, from Wei et al. 2012 (Suppl. Table S1); chrY-graph with the regions, which are fully or partly used in later papers as reference for analysis.
- combBED (857 regions, 8.4 Mbp) as researched by Adamov et al. 2015 and used by YFull as reliable areas with constant mutation rates for TMRCA calculations.
- unreliable area: 22216158-22513120 b37/hg19
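A region list such as combBED can be checked programmatically. The sketch below is one possible approach, not the YFull method: it loads BED-style intervals and tests whether a 1-based SNP coordinate falls inside any of them. The function names and the example intervals are hypothetical, and the code assumes the usual BED convention (0-based start, exclusive end) and non-overlapping intervals.

```python
from bisect import bisect_right


def load_bed_intervals(bed_lines, chrom="chrY"):
    """Parse BED lines into a sorted list of (start, end) for one chromosome."""
    intervals = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != chrom:
            continue
        intervals.append((int(fields[1]), int(fields[2])))
    intervals.sort()
    return intervals


def in_bed(pos, intervals):
    """True if the 1-based position lies in one of the 0-based half-open intervals."""
    p = pos - 1  # convert to 0-based
    # Index of the last interval whose start is <= p (assumes no overlaps).
    i = bisect_right(intervals, (p, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= p < intervals[i][1]


# Hypothetical intervals, for illustration only (not real combBED data).
bed = ["chrY\t100\t200", "chrY\t300\t400", "chrX\t0\t1000"]
regions = load_bed_intervals(bed)
print(in_bed(150, regions), in_bed(250, regions))
```

The binary search keeps lookups fast even with several hundred regions and many candidate SNPs.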
SNPs: quality, uniqueness and name registration
The following mainly manual workflow is used by Chris R. for J2-M172 research. Quicker results through scripts and automatic processing would be of great help.
Filtering
Necessary for SNP lists from FTDNA or other outputs not optimized (prefiltered) for the Y.
- Exclude all SNPs in the "bad region" b37/hg19 22216158-22513120. Conditional formatting in spreadsheets (GDocs etc.) is useful.
- Exclude all SNPs already registered in YBrowse unless they were registered from a sample or haplogroup under analysis: direct YBrowse check
You may use a spreadsheet formula, for example in GDocs: =CONCATENATE("http://ybrowse.org/gb2/gbrowse/chrY/?name=ChrY%3A",COORDFIELD-20,"..",COORDFIELD+20)
For every SNP, note whether it comes from an STR area or has an rs-number. SNPs that are included in popular SNP chips (with high uniqueness, see below) and are not found in more than one other haplogroup (YFull SNP search) may be kept for further research.
- Exclude all SNPs that have no stable reads across a wide number of samples, for example by using the YFull Y-Chr browser.
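The first two filtering steps could be scripted roughly as follows. This is a minimal sketch, not an established tool: it drops positions inside the "bad region" and builds the same YBrowse lookup URL as the spreadsheet formula, so each remaining SNP can still be checked by hand. The function names and the example positions are hypothetical, and the region bounds are assumed to be inclusive.

```python
# "Bad region" on b37/hg19 as given in the text; inclusive bounds are an assumption.
BAD_REGION_B37 = (22216158, 22513120)


def in_bad_region(pos):
    start, end = BAD_REGION_B37
    return start <= pos <= end


def ybrowse_url(pos, flank=20):
    """Same URL as the spreadsheet formula: a window of +/- flank bases around pos."""
    return ("http://ybrowse.org/gb2/gbrowse/chrY/?name=ChrY%3A"
            f"{pos - flank}..{pos + flank}")


def filter_candidates(positions):
    """Keep positions outside the bad region, paired with their YBrowse URL."""
    return [(pos, ybrowse_url(pos)) for pos in positions if not in_bad_region(pos)]


# Hypothetical b37 positions, for illustration only.
candidates = [2650000, 22300000, 22513121]
for pos, url in filter_candidates(candidates):
    print(pos, url)
```

The YBrowse registration check, the STR and rs-number notes, and the read-stability check remain manual steps in this sketch.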
Sorting by uniqueness
- Get 500bp reference sequence from YBrowse FastaDumper
You may use a spreadsheet formula, for example in GDocs: =CONCATENATE("http://ybrowse.org/gb2/gbrowse/chrY/?plugin=FastaDumper;plugin_action=Go;view_start=",COORDFIELD-250,";view_stop=",COORDFIELD+250)
- Find sequence similarities in the genome by using BLAT; remember to set the correct reference Assembly: currently Feb. 2009 (GRCh37/hg19)
From the second row (most similar sequence) save the whole output beginning with SCORE: 252 1 498 501 83.4% Y - 25886166 25886609 444
for later analysis (when SCORE is under 100 I add "0" so that it sorts better in the spreadsheet later).
- Sort all remaining SNPs per candidate haplogroup by BLAT score to the most similar sequence, from lowest to highest.
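The uniqueness sorting could be sketched in a script like this. It builds the FastaDumper URL from the formula above, parses a saved BLAT row, and sorts candidates by score from lowest to highest. The field names in `BlatHit` follow the column order of the BLAT web result list (SCORE, START, END, QSIZE, IDENTITY, CHROM, STRAND, START, END, SPAN); the function names, the SNP labels and the second example row are hypothetical. Sorting numerically also makes the leading "0" workaround unnecessary.

```python
from typing import NamedTuple


class BlatHit(NamedTuple):
    score: int
    q_start: int
    q_end: int
    q_size: int
    identity: float  # percent
    chrom: str
    strand: str
    t_start: int
    t_end: int
    span: int


def fastadumper_url(pos, flank=250):
    """Same URL as the spreadsheet formula: about 500bp centred on pos."""
    return ("http://ybrowse.org/gb2/gbrowse/chrY/?plugin=FastaDumper;"
            f"plugin_action=Go;view_start={pos - flank};view_stop={pos + flank}")


def parse_blat_row(row):
    """Parse a saved row such as 'SCORE: 252 1 498 501 83.4% Y - 25886166 25886609 444'."""
    f = row.split()
    if f and f[0].upper().startswith("SCORE"):
        f = f[1:]  # drop the 'SCORE:' label if it was saved with the row
    return BlatHit(int(f[0]), int(f[1]), int(f[2]), int(f[3]),
                   float(f[4].rstrip("%")), f[5], f[6],
                   int(f[7]), int(f[8]), int(f[9]))


# First row is the example from the text; the second row and both labels are made up.
saved = {
    "candidate_A": "SCORE: 252 1 498 501 83.4% Y - 25886166 25886609 444",
    "candidate_B": "SCORE: 95 10 120 501 90.0% Y + 1000 1110 111",
}
hits = {name: parse_blat_row(row) for name, row in saved.items()}
for name in sorted(hits, key=lambda n: hits[n].score):
    print(name, hits[name].score, hits[name].identity)
```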
Registration / publication
- use this sorted list for Z-registration (naming) in the Supplement to Community Spreadsheet (assign/reserve Z-numbers in Public 'whole genome' chrY data analysis). Currently the following persons in J2 research (current and past) have access: Chris Rottensteiner, Al Aburto, Vince Tilroe, Bonnie Schrack. The spreadsheet is mainly managed by the ISOGG Y Haplogroup curator (Ray Banks).
- after registration send the new SNPs to YBrowse (Admin Thomas Krahn) for public ISOGG registration (used by FGC, YFull, etc.)
- find SNPs best suited for primary haplogroup definition: check BLAT IDENTITY percentage and SPAN length (under 95.5% per 500bp) as well as coordinates in the Wei et al. 2012 regions and, if possible, availability by Sanger sequencing (YSEQ etc.). Include those SNPs (when names are public) in research trees and use them for FTDNA J2 Tree updating (Admin Michael Sager) as well as other important updates.
- certify new haplogroups for ISOGG by using ISOGG Listing Criteria for SNP Inclusion
- after the discovery of multiple new haplogroups, update FTDNA SNP Packs and YSEQ SNP Panels as well as other SNP tests open for updates