[adegenet-forum] Quality control

Jombart, Thibaut t.jombart at imperial.ac.uk
Mon Dec 10 12:50:50 CET 2012


Hello, 

if your dataset is a genlight object, have a look at the vignette 'adegenet-genomics', especially the function glNA and the accessor NA.posi.
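For illustration, here is a minimal sketch of how glNA and NA.posi could be used for this kind of filtering. The toy genotypes, the object names (x, x.kept, x.clean) and the 30% cut-offs are assumptions made up for the example; it also assumes adegenet is installed and that missing calls are coded as NA.

```r
library(adegenet)

# Toy data (made up): 3 individuals x 4 SNPs, NA = no call
dat <- list(ind1 = c(0, 1, NA, 2),
            ind2 = c(1, NA, NA, 0),
            ind3 = c(2, 2, 1, 0))
x <- new("genlight", dat, ploidy = 2, parallel = FALSE)

# NA.posi() returns, for each individual, the indices of its missing loci
ind.missing <- sapply(NA.posi(x), length) / nLoc(x)

# Keep individuals with at most 30% missing calls
x.kept <- x[which(unname(ind.missing <= 0.30)), ]

# glNA() counts missing values per locus; alleleAsUnit = FALSE counts
# individuals rather than alleles
na.per.locus <- glNA(x, alleleAsUnit = FALSE)
loc.missing <- glNA(x.kept, alleleAsUnit = FALSE) / nInd(x.kept)

# Keep loci with at most 30% missing calls among the retained individuals
x.clean <- x.kept[, which(unname(loc.missing <= 0.30))]
```

Here the loci are filtered after the individuals; whether that order suits your data is a judgment call.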

This is the appropriate forum for this question BTW; an alternative would be R-sig-genetics, but both places are more appropriate than R-help.

Cheers

Thibaut


________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Federico Calboli [f.calboli at imperial.ac.uk]
Sent: 10 December 2012 10:43
To: Gregory Neils Puncher
Cc: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] Quality control

Greg,

this is not strictly an adegenet question, it's more an R-help one.

> I am immersed in my first batch of SNPs and I've got to do some serious quality control. I've got over 500 individuals and 34,000 SNPs, so my dataset is quite bulky.

some would say your data is very light, ymmv!
>
> My query: First, I need to develop some script that will allow me to eliminate individuals from my dataset that have less than 70% calls for each loci in my dataset.
> Second, I need to remove all loci with >30% NaN or no calls.

I assume you have one (1) standardized way of coding failed SNPs, such as NA, -9, whatever.  Using NA for SNPs that have not been called is by a mile the sanest way forward.

> I can't figure out how to target the "NaN" values or "0000" genotype for removal.

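if you have NaN (and why would you have NaN for a missing SNP?) or 0000 (which would then have to be a character type to be read as 4 zeros), something like this gives the proportion of missing calls per individual:

ind.qc = apply(my.data, 1, function(x) sum(is.nan(x)))  # number of NaN per row (individual)
ind.qc = ind.qc/dim(my.data)[2]  # divide by the number of SNPs (columns)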

[the above assumes the odd NaN for missing]

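I guess you can then take over and adapt the code above for 0000 (compare against the string "0000" instead of using is.nan) and for the QC of every SNP rather than every individual (apply over columns, i.e. margin 2, and divide by the number of rows).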

Incidentally, the code above also gives you the QC summary.
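As a rough sketch of the whole exercise in base R (not a definitive recipe): recode every flavour of "missing" to NA, filter individuals, then filter SNPs, and count what was dropped. The toy matrix, the object names and the order of the two filters are assumptions made up for the example; the thresholds are the ones quoted in the question.

```r
# Toy genotype table (made up): rows = individuals, columns = SNPs,
# "0000" = no call
my.data <- matrix(c("0101", "0000", "0102", "0202",
                    "0000", "0000", "0000", "0101",
                    "0101", "0102", "0202", "0202",
                    "0102", "0101", "0000", "0101"),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4)))

# Recode all missing-data codes to NA
my.data[my.data %in% c("0000", "NaN")] <- NA

# Step 1: drop individuals with less than 70% calls (more than 30% missing)
ind.missing <- rowMeans(is.na(my.data))
keep.ind <- ind.missing <= 0.30
filtered <- my.data[keep.ind, , drop = FALSE]

# Step 2: drop SNPs with more than 30% missing among the retained individuals
snp.missing <- colMeans(is.na(filtered))
keep.snp <- snp.missing <= 0.30
filtered <- filtered[, keep.snp, drop = FALSE]

# Summary for documentation: how many were removed, and which ones
qc.summary <- c(individuals.removed = sum(!keep.ind),
                snps.removed = sum(!keep.snp))
removed.ind <- rownames(my.data)[!keep.ind]
removed.snp <- names(keep.snp)[!keep.snp]
```

With 34,000 SNPs, summary(snp.missing) or hist(snp.missing) gives an overview of the missingness without having to look at each SNP.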

BW

F



> For the sake of documentation I'd also like to know how many of the loci or individuals were removed according to the above criteria but obviously I can't view a summary of each of the 34,000 SNPs. Can I produce a summary of the aforementioned editing exercises?
>
> Thanks in advance.
>
> Greg Puncher, PhD Student
> Molecular Genetics for Environmental & Fishery Resources Laboratory (GenMAP)
> University of Bologna
> Via S. Alberto 163, 48123 Ravenna (Italy)
> Ph: 39(0)544/937311  Fax: 39(0)544937411
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum


--
Federico C. F. Calboli
Neuroepidemiology and Ageing Research
Imperial College, St. Mary's Campus
Norfolk Place, London W2 1PG

Tel +44 (0)20 75941602   Fax +44 (0)20 75943193

f.calboli [.a.t] imperial.ac.uk
f.calboli [.a.t] gmail.com
