User

ChrisR/Y-SNP checking and certification

From ISOGG Wiki

Collection of good practices to check novel SNPs and prepare them for haplogroup certification, Sanger sequencing submission, etc. For now this is a personal page and makes no claim to show the best or ISOGG-preferred way.

Sources of Y-SNPs

chrY region reliability

SNPs: quality, uniqueness and name registration

The following mainly manual workflow is used by Chris R. for J2-M172 research. Quicker results through scripts and automatic processing would be of great help.

Filtering

Filtering is necessary for SNP lists from FTDNA or other outputs that are not optimized (prefiltered) for the Y.

  • Exclude all SNPs in the "bad region" b37/hg19 22216158-22513120. Conditional formatting in spreadsheets (GDocs etc.) is useful for this.
  • Exclude all SNPs already registered in YBrowse unless they were registered from a sample or haplogroup under analysis: direct YBrowse check
    You may use a spreadsheet formula such as this one for GDocs: =CONCATENATE("http://ybrowse.org/gb2/gbrowse/chrY/?name=ChrY%3A",COORDFIELD-20,"..",COORDFIELD+20)
    For every SNP, note whether it lies in an STR area and whether it has an rs-number. SNPs that are included in popular SNP chips (with high uniqueness, see below) and are found in no more than one other haplogroup (YFull SNP search) may be kept for further research.
  • Exclude all SNPs that have no stable reads in a large number of samples; for example, check with the YFull Y-Chr browser.
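
The first two filtering steps can be partly scripted. The following is a minimal Python sketch, not an ISOGG-endorsed tool: it drops positions inside the "bad region" quoted above and builds the same YBrowse link as the spreadsheet formula. The function names, the inclusive treatment of the region bounds, and the example positions are assumptions made for illustration.

```python
# Bounds of the "bad region" quoted above (b37/hg19 coordinates).
BAD_REGION = (22216158, 22513120)

# Same base URL as in the spreadsheet formula above.
YBROWSE_BASE = "http://ybrowse.org/gb2/gbrowse/chrY/?name=ChrY%3A"


def in_bad_region(pos, region=BAD_REGION):
    """Return True if an hg19 position lies inside the region.

    Treating both bounds as inclusive is an assumption.
    """
    start, end = region
    return start <= pos <= end


def ybrowse_url(pos, flank=20):
    """Build the YBrowse link for pos-20 .. pos+20, like the formula."""
    return f"{YBROWSE_BASE}{pos - flank}..{pos + flank}"


def filter_positions(positions):
    """Drop bad-region positions and pair the rest with a YBrowse link."""
    return [(p, ybrowse_url(p)) for p in positions if not in_bad_region(p)]


if __name__ == "__main__":
    # Hypothetical example positions, not real SNPs.
    for pos, url in filter_positions([14000000, 22300000, 23000000]):
        print(pos, url)
```

The YBrowse, YFull and STR/rs-number checks still have to be done by hand; the script only prepares the links and removes the bad-region positions.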

Sorting by uniqueness

Sorting follows the ISOGG Listing Criteria for SNP Inclusion (Tentative Quality Guidelines for next-generation sequencing, item 6: the 500 adjacent base pairs may have at most 95.5% similarity to another genome region).

  • Get 500bp reference sequence from YBrowse FastaDumper
    You may use a spreadsheet formula such as this one for GDocs: =CONCATENATE("http://ybrowse.org/gb2/gbrowse/chrY/?plugin=FastaDumper;plugin_action=Go;view_start=",COORDFIELD-250,";view_stop=",COORDFIELD+250)
  • Find sequence similarities in the genome using BLAT; remember to set the correct reference assembly, currently Feb. 2009 (GRCh37/hg19).
    From the second row (most similar sequence), save the whole output beginning with SCORE, for example: 252 1 498 501 83.4% Y - 25886166 25886609 444
    Keep it for later analysis (when SCORE is under 100, I pad it with a "0" so that it sorts better in the spreadsheet later).
  • Sort all remaining SNPs per candidate haplogroup by BLAT score to the most similar sequence, from lowest to highest.
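
The saved BLAT rows can also be parsed and sorted by script, which makes the zero-padding unnecessary because scores are compared as numbers. The sketch below assumes each saved row has the ten whitespace-separated fields of the example above, starting at SCORE; the field names, the helper names and the second sample row are hypothetical.

```python
from typing import NamedTuple

FASTA_BASE = ("http://ybrowse.org/gb2/gbrowse/chrY/"
              "?plugin=FastaDumper;plugin_action=Go;view_start=")


class BlatHit(NamedTuple):
    """One saved BLAT row, starting at the SCORE column (names assumed)."""
    score: int
    q_start: int
    q_end: int
    q_size: int
    identity: float  # percent, e.g. 83.4
    chrom: str
    strand: str
    t_start: int
    t_end: int
    span: int


def parse_blat_row(row):
    """Parse a row such as '252 1 498 501 83.4% Y - 25886166 25886609 444'."""
    f = row.split()
    if len(f) != 10:
        raise ValueError(f"expected 10 fields, got {len(f)}: {row!r}")
    return BlatHit(int(f[0]), int(f[1]), int(f[2]), int(f[3]),
                   float(f[4].rstrip("%")), f[5], f[6],
                   int(f[7]), int(f[8]), int(f[9]))


def fasta_dumper_url(pos, flank=250):
    """Build the FastaDumper link for pos-250 .. pos+250, like the formula."""
    return f"{FASTA_BASE}{pos - flank};view_stop={pos + flank}"


def sort_by_score(rows_by_snp):
    """Return (name, hit) pairs ordered from lowest to highest BLAT score."""
    parsed = [(name, parse_blat_row(row)) for name, row in rows_by_snp.items()]
    return sorted(parsed, key=lambda item: item[1].score)


if __name__ == "__main__":
    rows = {
        "SNP_A": "252 1 498 501 83.4% Y - 25886166 25886609 444",
        # Hypothetical second row, for illustration only.
        "SNP_B": "96 10 120 501 90.1% 3 + 1000 1110 111",
    }
    for name, hit in sort_by_score(rows):
        print(name, hit.score, hit.identity, hit.span)
```

Running BLAT itself and picking the second row remain manual steps here; the script only handles the rows once they have been saved.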

Registration / publication

  • Use this sorted list for Z-registration (naming) in the Supplement to Community Spreadsheet (assign/reserve Z-numbers in Public 'whole genome' chrY data analysis). Currently the following persons in J2 research (current and past) have access: Chris Rottensteiner, Al Aburto, Vince Tilroe, Bonnie Schrack. The spreadsheet is mainly managed by the ISOGG Y Haplogroup curator (Ray Banks).
  • After registration, send the new SNPs to YBrowse (Admin Thomas Krahn) for public ISOGG registration (used by FGC, YFull, etc.).
  • Find the SNPs best suited for primary haplogroup definition: check the BLAT IDENTITY percentage and SPAN length (under 95.5% per 500bp), whether the coordinates fall in the Wei et al. 2012 region and, if possible, availability for Sanger sequencing (YSEQ etc.). Include those SNPs (when names are public) in research trees and use them for updating the FTDNA J2 Tree (Admin Michael Sager) as well as for other important updates.
  • Certify new haplogroups for ISOGG by using the ISOGG Listing Criteria for SNP Inclusion.
  • After several new haplogroups have been discovered, update FTDNA SNP Packs and YSEQ SNP Panels as well as other SNP tests open for updates.
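
The uniqueness check used when choosing SNPs for primary haplogroup definition could be scripted along these lines. This is only a sketch under stated assumptions: it treats "max. 95.5%" as inclusive, looks only at the IDENTITY value of the most similar BLAT hit, and ranks Sanger-testable SNPs first. The ranking rule, the record layout and the sample values are hypothetical and not part of the ISOGG criteria.

```python
MAX_IDENTITY = 95.5  # percent, threshold quoted from the ISOGG criteria above


def passes_uniqueness(identity_percent, max_identity=MAX_IDENTITY):
    """True if the most similar hit is at most 95.5% identical.

    Counting exactly 95.5% as passing is an assumption.
    """
    return identity_percent <= max_identity


def primary_candidates(snps):
    """Keep SNPs passing the threshold and order them for review.

    Each record is a dict with 'name', 'identity' and 'sanger_available'.
    Ordering (Sanger-testable first, then lowest identity) is a
    hypothetical convenience, not an ISOGG rule.
    """
    passing = [s for s in snps if passes_uniqueness(s["identity"])]
    return sorted(passing,
                  key=lambda s: (not s["sanger_available"], s["identity"]))


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    snps = [
        {"name": "SNP_A", "identity": 83.4, "sanger_available": False},
        {"name": "SNP_B", "identity": 99.0, "sanger_available": True},
        {"name": "SNP_C", "identity": 90.0, "sanger_available": True},
    ]
    for s in primary_candidates(snps):
        print(s["name"], s["identity"], s["sanger_available"])
```

The SPAN length and the Wei et al. 2012 region check are left out of this sketch and would still be reviewed by hand.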