Geno 2.0 raw data
From ISOGG Wiki
This article explains how to manipulate the Geno 2.0 raw data files from the Genographic Project.
Download raw data
- Go to the Genographic Project Results page
- Log in under Registered Users by entering your User Name and Password. If you are not registered yet, register first with your Geno 2.0 ID Code.
- Go to Profile
- Scroll to Expert Options and click the download consent
- Click on Download Genetic Data
- Save the YOURGPID.csv.gz file to a local folder
- If you are a man and would like to contribute to and participate in Y-DNA haplogroup research, send the file to the appropriate citizen scientist.
Extract .csv file
To extract YOURGPID.csv.gz files:
- on Windows you can use the GZ decompression of the free 7-Zip, or of WinZip or WinRAR.
- on Linux GZ extraction (for example with gunzip) is available on most systems
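If none of these tools is at hand, the extraction can also be scripted. The following is a minimal sketch using only the Python standard library; the function name extract_gz is hypothetical, and YOURGPID.csv.gz stands for whatever file name you downloaded.

```python
import gzip
import shutil
from pathlib import Path


def extract_gz(gz_path):
    """Decompress e.g. YOURGPID.csv.gz next to itself and return the new path."""
    gz_path = Path(gz_path)
    # with_suffix("") drops only the trailing ".gz", leaving "YOURGPID.csv".
    csv_path = gz_path.with_suffix("")
    with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream the data, no full load in memory
    return csv_path


if __name__ == "__main__":
    # Placeholder file name: replace it with the file you actually downloaded.
    source = Path("YOURGPID.csv.gz")
    if source.exists():
        print("Extracted to", extract_gz(source))
```

The original .gz file is left in place, so the extraction can be repeated if the .csv copy is edited by mistake.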
View and edit .csv file
A .csv file is a spreadsheet format that can be opened by Excel 2007 and later versions, OpenOffice Calc, LibreOffice Calc, etc. Excel 2003 and earlier versions are limited to 65,536 rows or fewer, so they cannot open files with more rows than that. As the .csv file is a text file format, it can also be opened by text editors such as Notepad++.
Until summer 2013 all information was stored in a single file. From then until summer 2014 the SNP information was stored in four files: all.csv, autosomal.csv (Autosomal DNA + X-DNA), mtdna.csv (mtDNA), ychromo.csv (Y-DNA). Since then the information has again been made available in a single file:
- YOURGPID.csv - All SNPs
- Autosomal DNA: chr 1-22 ~126,307 SNPs
- X-DNA: chr X ~3,803 SNPs
- Y-DNA: chr Y ~12,064 SNPs
- mtDNA: chr 0 Differences are reported from the Revised Cambridge Reference Sequence. The number of SNPs will vary from one individual to the next.
If you would like to save a copy with the SNP data from only a certain region (for example Y), delete all lines except those with that chromosome id in the second (or, in the first data format, third) column. An easy way to do that is to sort the file in descending order on the column with the id; you can then delete all the other lines.
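The same selection can be done without a spreadsheet. The sketch below is one possible approach, not an official tool: it assumes a file in the current format (one header row, chromosome id in the second column) and guesses the delimiter from the header line, since semicolon-separated and comma-separated samples both occur. The function name and file names are hypothetical.

```python
import csv


def filter_chromosome(in_path, out_path, chrom="Y", chr_col=1):
    """Copy the header row plus every row whose chromosome column equals chrom.

    chr_col is zero-based: 1 means the second column (current format).
    Returns the number of data rows written.
    """
    with open(in_path, newline="") as src:
        first_line = src.readline()
        # Assumption: the delimiter is ';' if the header contains one, else ','.
        delimiter = ";" if ";" in first_line else ","
        src.seek(0)
        reader = csv.reader(src, delimiter=delimiter)
        header = next(reader)
        kept = [row for row in reader
                if len(row) > chr_col and row[chr_col] == chrom]
    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=delimiter)
        writer.writerow(header)
        writer.writerows(kept)
    return len(kept)
```

For example, filter_chromosome("YOURGPID.csv", "YOURGPID_Y.csv", chrom="Y") would write a Y-only copy and leave the original file unchanged. Files in the first data format start with a [Header] block instead of a single header row, so they need different handling.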
Current Data format
In use at least since April 2013, probably since January 2013. The chromosome identifier is in the second column.
SNP;Chr;Allele1;Allele2
CTS100;Y;C;C
CTS10004;Y;G;G
CTS10009;Y;G;G
...
First Data format
Used in December 2012. The chromosome identifier is in the third column.
[Header]
GSGT Version,1.9.4
Processing Date,11/25/2012 0:01 PM
Content,,NGS_iSelect_v1_15030891_2012_B-wRS-2.bpm
Num SNPs,154476
Total SNPs,169786
Num Samples,1850
Total Samples,2796
File ,1680 of 1850
[Data]
GRC12122435_ChipNGv1_37760_F02,101SNP8856FG_A,0,G,G
GRC12122435_ChipNGv1_37760_F02,101SNP8860FA_G,0,C,C
GRC12122435_ChipNGv1_37760_F02,102SNP8856FG_A,0,G,G
...
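Because files in this first format begin with a [Header] block, a script has to skip ahead to the [Data] marker before reading SNP rows. The following is a minimal sketch under the assumption that every data row has exactly five comma-separated fields (sample id, SNP name, chromosome, allele 1, allele 2), as in the sample above; the function name is hypothetical.

```python
def read_first_format(lines):
    """Yield (sample_id, snp, chrom, allele1, allele2) from the [Data] section."""
    in_data = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "[Data]":
            in_data = True
            continue
        if not in_data:
            continue  # still inside the [Header] block
        fields = line.split(",")
        # Assumption: valid data rows have exactly five fields; skip the rest.
        if len(fields) == 5:
            yield tuple(fields)


if __name__ == "__main__":
    sample = [
        "[Header]",
        "GSGT Version,1.9.4",
        "Num SNPs,154476",
        "[Data]",
        "GRC12122435_ChipNGv1_37760_F02,101SNP8856FG_A,0,G,G",
        "GRC12122435_ChipNGv1_37760_F02,101SNP8860FA_G,0,C,C",
    ]
    for sample_id, snp, chrom, a1, a2 in read_first_format(sample):
        print(snp, chrom, a1, a2)
```

Any iterable of text lines works as input, including an open file object.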
Calls are usually homozygous, showing A, C, G or T:
kgp2184507,1,C,C
rs10490098,2,G,G
Z715,Y,T,T
10915,Mt,A,A
INDELs have the value I (insertion) or D (deletion).
A no-read is reported as "-".
Almost all of the SNPs listed in the results file show two identical values (homozygous). A very small number show two different values (heterozygous). Heterozygous Y allele calls are interpreted as "derived" because what shows up as an AB allele is really a BB allele. This turned out to be quite reliable on many difficult Y-SNPs. 
kgp30024,15,T,C
PF3518,Y,T,C
PF2600,Y,A,G
- Thomas Krahn, Nov. 2012, http://tech.groups.yahoo.com/group/R1b-L21-Project/message/12752
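The value types described above (nucleotide calls, I/D for INDELs, "-" for no-reads) can be told apart with a few comparisons. This is a minimal sketch, not part of any Genographic tool: the function name and the category labels are hypothetical, and it assumes that a "-" in either allele marks the whole call as a no-read and that an I or D in either allele marks it as an INDEL.

```python
def classify_call(allele1, allele2):
    """Return 'no-read', 'indel', 'homozygous' or 'heterozygous' for one call."""
    alleles = (allele1.strip().upper(), allele2.strip().upper())
    # Assumption: "-" in either position means the SNP was not read.
    if "-" in alleles:
        return "no-read"
    # Assumption: I/D in either position marks an insertion/deletion call.
    if any(a in ("I", "D") for a in alleles):
        return "indel"
    return "homozygous" if alleles[0] == alleles[1] else "heterozygous"


if __name__ == "__main__":
    for row in ("Z715,Y,T,T", "PF3518,Y,T,C"):
        snp, chrom, a1, a2 = row.split(",")
        print(snp, chrom, classify_call(a1, a2))
```

Listing the rows that come back as heterozygous on chromosome Y gives the calls to which the interpretation quoted above would apply.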