[adegenet-forum] Quality control

Federico Calboli f.calboli at imperial.ac.uk
Mon Dec 10 11:43:31 CET 2012


Greg, 

this is not strictly an adegenet question, it's more of an r-help one.

> I am immersed in my first batch of SNPs and I've got to do some serious quality control. I've got over 500 individuals and 34,000 SNPs, so my dataset is quite bulky.

Some would say your data is very light, ymmv!
> 
> My query: First, I need to develop some script that will allow me to eliminate individuals from my dataset that have less than 70% calls for each loci in my dataset.
> Second, I need to remove all loci with >30% NaN or no calls.

I assume you have one (1) standardized way of coding failed SNPs, such as NA, -9, whatever.  Using NA for SNPs that have not been called is by a mile the sanest way forward.
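
A minimal sketch of recoding everything to NA, assuming the genotypes sit in a character matrix. The toy data and the set of missing codes (missing.codes) are made up for illustration and are not taken from Greg's file:

```r
# Toy genotype table (made-up data): rows are individuals, columns are SNPs.
geno <- matrix(c("0102", "0000", "0101",
                 "-9",   "0202", "0000"),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("ind1", "ind2"), c("snp1", "snp2", "snp3")))

# Hypothetical set of codes that mean "no call" in this toy table.
missing.codes <- c("0000", "-9", "NaN")

# Recode every missing code to NA so that is.na() finds all failed calls.
geno[geno %in% missing.codes] <- NA

# Total number of failed calls in the table.
sum(is.na(geno))
```

If the data come from a text file, the na.strings argument of read.table() can probably do the same recoding at read time.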

> I can't figure out how to target the "NaN" values or "0000" genotype for removal.

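If you have NaN (and why would you have NaN for a missing SNP?) or 0000 (which would then have to be read as character to stay four zeros), something like this gives the proportion of missing calls per individual:

# count the NaN calls in each row (one row per individual)
ind.qc = apply(my.data, 1, function(x) sum(is.nan(x)))
# divide by the number of SNPs to get the proportion of missing calls
ind.qc = ind.qc/dim(my.data)[2]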

[the above assumes the odd NaN for missing, and a numeric my.data; with NA for missing, use is.na(x) instead, which catches NaN as well]

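I guess you can then take over and just adapt the code above for 0000 (test x == "0000" instead of is.nan(x)) and for the QC of every SNP rather than every individual (apply over columns, i.e. 2 instead of 1, and divide by dim(my.data)[1]).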
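
A sketch of that adaptation, assuming "0000" is the missing code and my.data is a character matrix. The toy data are made up, and filtering individuals before SNPs is only one possible order:

```r
# Toy data (made up): 4 individuals x 5 SNPs, "0000" marks a failed call.
my.data <- matrix(c("0101", "0102", "0000", "0202", "0101",
                    "0000", "0000", "0000", "0102", "0000",
                    "0101", "0101", "0000", "0202", "0102",
                    "0102", "0101", "0101", "0202", "0101"),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:5)))

# Proportion of missing calls per individual (rows).
ind.miss <- rowMeans(my.data == "0000")
# Keep individuals with at least 70% calls, i.e. at most 30% missing.
keep.ind <- ind.miss <= 0.30
step1 <- my.data[keep.ind, , drop = FALSE]

# Proportion of missing calls per SNP (columns), among retained individuals.
snp.miss <- colMeans(step1 == "0000")
# Drop SNPs with more than 30% missing.
keep.snp <- snp.miss <= 0.30
clean <- step1[, keep.snp, drop = FALSE]

dim(clean)
```

rowMeans() and colMeans() on a logical matrix give the same proportions as the apply() version divided by the number of columns or rows.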

Incidentally, the code above also gives you the QC summary.
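
Since the QC result is just a numeric vector of proportions, documenting what a cutoff removes could look like the sketch below. The values in ind.qc and snp.qc are made up, and snp.qc is a hypothetical per-SNP counterpart of ind.qc:

```r
# Made-up proportions of missing calls, standing in for the real QC vectors.
ind.qc <- c(ind1 = 0.05, ind2 = 0.45, ind3 = 0.10, ind4 = 0.31)
snp.qc <- c(snp1 = 0.00, snp2 = 0.50, snp3 = 0.25)

# Distribution of missingness across individuals.
summary(ind.qc)

# How many individuals fail the 30% missing cutoff, and which ones.
table(removed = ind.qc > 0.30)
removed.ind <- names(ind.qc)[ind.qc > 0.30]
removed.ind

# Number of SNPs that fail the same cutoff.
n.removed.snp <- sum(snp.qc > 0.30)
n.removed.snp
```

With 34,000 SNPs the counts and the summary() output are what you would report, rather than a per-SNP listing.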

BW

F



> For the sake of documentation I'd also like to know how many of the loci or individuals were removed according to the above criteria but obviously I can't view a summary of each of the 34,000 SNPs. Can I produce a summary of the aforementioned editing exercises?
> 
> Thanks in advance.
> 
> Greg Puncher, PhD Student
> Molecular Genetics for Environmental & Fishery Resources Laboratory (GenMAP)
> University of Bologna
> Via S. Alberto 163, 48123 Ravenna (Italy)
> Ph: 39(0)544/937311  Fax: 39(0)544937411
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum


--
Federico C. F. Calboli
Neuroepidemiology and Ageing Research
Imperial College, St. Mary's Campus
Norfolk Place, London W2 1PG

Tel +44 (0)20 75941602   Fax +44 (0)20 75943193

f.calboli [.a.t] imperial.ac.uk
f.calboli [.a.t] gmail.com


