Geno 2.0 raw data

From ISOGG Wiki

This article explains how to manipulate the Geno 2.0 raw data files from the Genographic Project.

Download raw data

Extract .csv file

To extract YOURGPID.csv.gz files:

  • On Windows, you can use the GZ decompression in the free 7-Zip, or in WinZip or WinRAR.
  • On Linux, GZ extraction is built into most systems (for example, the gunzip command).
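
If you prefer a scripted route that works the same way on any platform, the extraction can also be sketched with Python's standard gzip module. This is only an illustration: the function name gunzip is made up for this example, and YOURGPID.csv.gz stands in for your own file name.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the extracted file.

    `gunzip` is a hypothetical helper name, not part of any Genographic tool.
    """
    src = Path(src)
    # Default output name: drop the trailing ".gz"
    # (YOURGPID.csv.gz becomes YOURGPID.csv).
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        # Stream the data so large files are not loaded into memory at once.
        shutil.copyfileobj(f_in, f_out)
    return dest
```

Calling gunzip("YOURGPID.csv.gz") would write YOURGPID.csv next to the compressed file.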

View and edit .csv file

A .csv file is a plain-text table format that can be opened by Excel 2007 and later versions, OpenOffice Calc, LibreOffice Calc, etc. Excel 2003 and earlier versions are limited to 65,536 rows or fewer, so they cannot open big files. Because a .csv file is plain text, it can also be opened in text editors such as Notepad++.

Until summer 2013, all information was stored in a single file. Until summer 2014, the SNP information was stored in four files: all.csv, autosomal.csv (Autosomal DNA + X-DNA), mtdna.csv (mtDNA), ychromo.csv (Y-DNA). Since then, the information has again been made available in a single file:

  • YOURGPID.csv - All SNPs

You can identify the location of each SNP from the chromosome id, which is in the antepenultimate (third from last) column:

  • Autosomal DNA: chr 1-22 ~126,307 SNPs
  • X-DNA: chr X ~3,803 SNPs
  • Y-DNA: chr Y ~12,064 SNPs
  • mtDNA: chr 0. Differences are reported relative to the Revised Cambridge Reference Sequence, so the number of SNPs varies from one individual to the next.
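
The mapping from chromosome id to region listed above can be sketched as a small lookup. The names region and tally_regions are invented for this example. The list gives 0 as the mtDNA id, while the sample rows under "SNP values" show Mt, so accepting both is an assumption made here rather than something the file format documents.

```python
from collections import Counter

# Chromosome ids "1" to "22" are the autosomes.
AUTOSOMES = {str(n) for n in range(1, 23)}


def region(chrom):
    """Map a chromosome id from the raw data file to a region label."""
    chrom = chrom.strip()
    if chrom in AUTOSOMES:
        return "autosomal"
    if chrom.upper() == "X":
        return "X-DNA"
    if chrom.upper() == "Y":
        return "Y-DNA"
    # Assumption: both "0" and "Mt" denote mitochondrial DNA.
    if chrom == "0" or chrom.upper() == "MT":
        return "mtDNA"
    return "unknown"


def tally_regions(chrom_ids):
    """Count how many SNP rows fall into each region."""
    return Counter(region(c) for c in chrom_ids)
```

Feeding the chromosome column of your own file to tally_regions lets you compare your counts with the approximate figures in the list.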

If you would like to save a copy containing only the SNP data from a certain region (for example Y), delete all lines except those with the matching chromosome id in the second (or third) column, depending on the data format. An easy way to do that is to sort the file by the chromosome id column in descending order: the lines you want are then grouped together, and you can delete all the others.
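
As an alternative to sorting and deleting by hand, the same filtering can be sketched in Python with the standard csv module. The function name filter_chromosome and its parameters are illustrative only. The sketch assumes a header row, semicolon-separated fields and the chromosome id in the second column (index 1), as in the current data format; for the first data format you would pass chr_column=2 and delimiter="," and handle the header block separately.

```python
import csv
import io


def filter_chromosome(text, chrom, chr_column=1, delimiter=";"):
    """Return the header row plus every row whose chromosome id equals `chrom`.

    `text` is the full content of the .csv file as a string.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    # Skip short or malformed rows instead of raising an IndexError.
    kept = [r for r in data if len(r) > chr_column and r[chr_column] == chrom]
    return [header] + kept


sample = "SNP;Chr;Allele1;Allele2\nCTS100;Y;C;C\nrs10490098;2;G;G\nCTS10004;Y;G;G\n"
y_rows = filter_chromosome(sample, "Y")
```

The resulting rows can be written back out with csv.writer using the same delimiter to produce a Y-only copy.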

Current Data format

Used at least since April 2013, and probably since January 2013. The chromosome identifier is in the second column.

SNP;Chr;Allele1;Allele2
CTS100;Y;C;C
CTS10004;Y;G;G
CTS10009;Y;G;G
...

First Data format

Used in December 2012. The chromosome identifier is in the third column.

[Header]
GSGT Version,1.9.4
Processing Date,11/25/2012 0:01 PM
Content,,NGS_iSelect_v1_15030891_2012_B-wRS-2.bpm
Num SNPs,154476
Total SNPs,169786
Num Samples,1850
Total Samples,2796
File ,1680 of 1850
[Data]
GRC12122435_ChipNGv1_37760_F02,101SNP8856FG_A,0,G,G
GRC12122435_ChipNGv1_37760_F02,101SNP8860FA_G,0,C,C
GRC12122435_ChipNGv1_37760_F02,102SNP8856FG_A,0,G,G
...
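
A file in this older layout has a [Header] block of key/value lines followed by a [Data] block of comma-separated rows. The following sketch separates the two. The function name parse_first_format is invented here, and apart from the chromosome identifier in the third column, the meaning of the data columns is not stated in this article, so the rows are returned as plain lists.

```python
def parse_first_format(text):
    """Split a first-format file into (header dict, list of data rows)."""
    header, data, section = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in ("[Header]", "[Data]"):
            # Remember which block the following lines belong to.
            section = line
        elif section == "[Header]":
            # Split at the first comma; some lines (e.g. "Content,,...")
            # carry an extra leading comma in the value, which is dropped.
            key, _, value = line.partition(",")
            header[key.strip()] = value.lstrip(",").strip()
        elif section == "[Data]":
            data.append(line.split(","))
    return header, data
```

Each data row then has the chromosome identifier at index 2 and the two allele values at indexes 3 and 4.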

SNP values

Most values are homozygous and show one of A, C, G or T twice:

kgp2184507,1,C,C
rs10490098,2,G,G
Z715,Y,T,T
10915,Mt,A,A

Indels are reported as I (insertion) or D (deletion):

Z77,Y,D,D
Z86,Y,I,I

A no-read is shown as "-":

kgp10414556,4,-,-
PF5203,Y,-,-

Heterozygous values

Almost all of the SNPs listed in the results file show two identical values (homozygous). A very small number show two different values (heterozygous). Since a male carries only one Y chromosome, a heterozygous Y call cannot be taken literally. Heterozygous Y allele calls are interpreted as "derived", because what shows up as an AB call is really a BB call. This interpretation has turned out to be quite reliable for many difficult Y-SNPs. [1]

kgp30024,15,T,C
PF3518,Y,T,C
PF2600,Y,A,G
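
The kinds of values described in this section can be told apart with a few comparisons. The sketch below is illustrative only: the function name classify_call and the returned labels are made up for this example, and it takes the two allele columns of one row as input.

```python
def classify_call(allele1, allele2):
    """Label one genotype call using the two allele columns of a row."""
    if allele1 == "-" or allele2 == "-":
        return "no-read"
    if allele1 != allele2:
        # On the Y chromosome, the article treats such calls as derived.
        return "heterozygous"
    if allele1 in ("I", "D"):
        return "indel"
    if allele1 in ("A", "C", "G", "T"):
        return "homozygous"
    # Anything else is outside the values described in this article.
    return "unknown"
```

For example, the row PF3518,Y,T,C would be labeled heterozygous and Z77,Y,D,D would be labeled indel.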

References