[adegenet-forum] Quality control

Federico Calboli f.calboli at imperial.ac.uk
Mon Dec 10 12:54:14 CET 2012


On 10 Dec 2012, at 11:50, "Jombart, Thibaut" <t.jombart at imperial.ac.uk> wrote:

> Hello, 
> 
> if your dataset is a genlight object, have a look at the vignette 'adegenet-genomics', especially the function glNA and the accessor NA.posi.
> 
> This is the appropriate forum for this question BTW; an alternative would be R-sig-genetics, but both places are more appropriate than R-help.

that obviously depends on how the data is stored ;)  unless you want to answer questions about apply/tapply/sapply on adegenet-sig.

F




> 
> Cheers
> 
> Thibaut
> 
> 
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Federico Calboli [f.calboli at imperial.ac.uk]
> Sent: 10 December 2012 10:43
> To: Gregory Neils Puncher
> Cc: adegenet-forum at lists.r-forge.r-project.org
> Subject: Re: [adegenet-forum] Quality control
> 
> Greg,
> 
> this is not strictly an adegenet question, it's more of an R-help one.
> 
>> I am immersed in my first batch of SNPs and I've got to do some serious quality control. I've got over 500 individuals and 34,000 SNPs, so my dataset is quite bulky.
> 
> some would say your data is very light, ymmv!
>> 
>> My query: First, I need to develop some script that will allow me to eliminate individuals from my dataset that have calls at less than 70% of the loci in my dataset.
>> Second, I need to remove all loci with >30% NaN or no calls.
> 
> I assume you have one (1) standardized way of having failed SNPs, such as NA, -9, whatever.  Using NA for SNPs that have not been called is by a mile the sanest way forward.
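
One way to get there at import time, sketched in base R under the assumption that the genotypes sit in a whitespace-delimited text table with one row per individual: read.table takes an na.strings argument, so every failure code can be mapped to NA in a single pass. The toy data, the object names raw and geno, and the list of codes are illustrative only, not Greg's actual file.

```r
# Toy input (hypothetical): 3 individuals x 3 SNPs, with "0000" and "-9"
# standing in for failed calls.
raw <- "ind snp1 snp2 snp3
i1 0101 0000 0102
i2 0000 0000 0202
i3 0102 0101 -9"

# Read every column as character so genotypes such as 0102 keep their
# leading zero, and turn all the failure codes into NA while reading.
geno <- read.table(text = raw, header = TRUE, row.names = 1,
                   colClasses = "character",
                   na.strings = c("NA", "NaN", "-9", "0000"))

# Logical matrix marking the missing calls
is.na(geno)
```

For a real file, text = raw would be replaced by the file path; the rest stays the same.
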
> 
>> I can't figure out how to target the "NaN" values or "0000" genotype for removal.
> 
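> if you have NaN (and why would you have NaN for a missing SNP?) or 0000 (which should then be a character type to be read as 4 zeros), you can count the missing calls per individual and turn the count into a proportion:
> 
> # count the NaN entries in each row (one row per individual)
> ind.qc = apply(my.data, 1, function(x) sum(is.nan(x)))
> # divide by the number of SNPs (columns) to get the proportion missing
> ind.qc = ind.qc/dim(my.data)[2]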
> 
> [the above assumes the odd NaN for missing]
> 
> I guess you can then take over and just adapt the code above for 0000 (test x == "0000" instead of is.nan(x)) and for getting the QC of every SNP rather than every individual (apply over columns, i.e. margin 2 instead of 1).
> 
> Incidentally, the code above also gives you the QC summary.
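
Spelled out for both of Greg's thresholds, here is a minimal sketch in base R. It assumes missing calls are already coded as NA and that the data is a matrix with individuals in rows and SNPs in columns. The toy matrix and the names geno, keep.ind, keep.loc and qc.summary are made up for illustration. Filtering individuals first and loci second is also an assumption, since the order changes the per-locus rates.

```r
# Toy genotype matrix (hypothetical): 4 individuals x 4 SNPs, NA = no call
geno <- matrix(c("0101", NA,     "0102", "0202",
                 NA,     NA,     NA,     "0101",
                 "0102", "0101", "0202", "0202",
                 "0101", NA,     "0101", "0102"),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4)))

miss <- is.na(geno)

# Step 1: keep individuals called at 70% or more of the loci
ind.call <- 1 - rowMeans(miss)
keep.ind <- ind.call >= 0.70

# Step 2: among the retained individuals, keep loci with at most 30% missing
loc.miss <- colMeans(miss[keep.ind, , drop = FALSE])
keep.loc <- loc.miss <= 0.30

geno.qc <- geno[keep.ind, keep.loc, drop = FALSE]

# Summary for documentation: how many were removed, and which ones
qc.summary <- c(individuals.removed = sum(!keep.ind),
                loci.removed        = sum(!keep.loc))
print(qc.summary)
names(which(!keep.ind))
names(which(!keep.loc))
```

rowMeans and colMeans on a logical matrix give the proportion of TRUE values directly, so no explicit apply loop is needed, and the same lines scale to 500 x 34,000 without changes.
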
> 
> BW
> 
> F
> 
> 
> 
>> For the sake of documentation I'd also like to know how many of the loci or individuals were removed according to the above criteria but obviously I can't view a summary of each of the 34,000 SNPs. Can I produce a summary of the aforementioned editing exercises?
>> 
>> Thanks in advance.
>> 
>> Greg Puncher, PhD Student
>> Molecular Genetics for Environmental & Fishery Resources Laboratory (GenMAP)
>> University of Bologna
>> Via S. Alberto 163, 48123 Ravenna (Italy)
>> Ph: 39(0)544/937311  Fax: 39(0)544937411
>> _______________________________________________
>> adegenet-forum mailing list
>> adegenet-forum at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
> 
> 
> --
> Federico C. F. Calboli
> Neuroepidemiology and Ageing Research
> Imperial College, St. Mary's Campus
> Norfolk Place, London W2 1PG
> 
> Tel +44 (0)20 75941602   Fax +44 (0)20 75943193
> 
> f.calboli [.a.t] imperial.ac.uk
> f.calboli [.a.t] gmail.com
> 




