[adegenet-forum] Quality control
Federico Calboli
f.calboli at imperial.ac.uk
Mon Dec 10 12:54:14 CET 2012
On 10 Dec 2012, at 11:50, "Jombart, Thibaut" <t.jombart at imperial.ac.uk> wrote:
> Hello,
>
> if your dataset is a genlight object, have a look at the vignette 'adegenet-genomics', especially the function glNA and the accessor NA.posi.
>
> This is the appropriate forum for this question BTW; an alternative would be R-sig-genetics, but both places are more appropriate than R-help.
that obviously depends on how the data is stored ;) unless you want to answer questions about apply/tapply/sapply on adegenet-sig.
F
>
> Cheers
>
> Thibaut
>
>
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Federico Calboli [f.calboli at imperial.ac.uk]
> Sent: 10 December 2012 10:43
> To: Gregory Neils Puncher
> Cc: adegenet-forum at lists.r-forge.r-project.org
> Subject: Re: [adegenet-forum] Quality control
>
> Greg,
>
> this is not strictly an adegenet question, it's more an r-help one.
>
>> I am immersed in my first batch of SNPs and I've got to do some serious quality control. I've got over 500 individuals and 34,000 SNPs, so my dataset is quite bulky.
>
> some would say your data is very light, ymmv!
>>
>> My query: First, I need to develop a script that will let me remove individuals that have calls at less than 70% of the loci in my dataset.
>> Second, I need to remove all loci with >30% NaN or no calls.
>
> I assume you have one (1) standardized way of coding failed SNPs, such as NA, -9, whatever. Using NA for SNPs that have not been called is by a mile the sanest way forward.
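A minimal sketch of recoding every missing-call marker to NA, assuming the genotypes sit in a character matrix with individuals in rows and SNPs in columns. The object names (geno, missing.codes) and the toy genotype codes are made up for illustration; they are not taken from Greg's data.

```r
# Toy genotype matrix: 3 individuals (rows) x 4 SNPs (columns).
# "0000", "-9" and "NaN" all stand for a failed call here (assumed codes).
geno <- matrix(c("0101", "0000", "0102", "0202",
                 "-9",   "0101", "NaN",  "0102",
                 "0101", "0102", "0202", "0101"),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("ind", 1:3), paste0("snp", 1:4)))

# Every code that means "no call"; adjust to whatever the data actually uses.
missing.codes <- c("0000", "-9", "NaN")

# Replace all of them with NA in one step (the matrix keeps its dimensions).
geno[geno %in% missing.codes] <- NA

# Number of missing calls per individual.
n.missing <- rowSums(is.na(geno))
```

If the data come from a text file, the na.strings argument of read.table can probably do the same recoding at import time, so that is.na() is all you need afterwards.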
>
>> I can't figure out how to target the "NaN" values or "0000" genotype for removal.
>
> if you have NaN (and why would you have NaN for a missing SNP?) or 0000 (which would have to be stored as character to be read as four zeros), something like this gives the proportion of missing calls per individual:
>
> # proportion of NaN calls per individual (individuals in rows)
> ind.qc = apply(my.data, 1, function(x) sum(is.nan(x)))
> ind.qc = ind.qc/dim(my.data)[2]
>
> [the above assumes the odd NaN for missing]
>
> I guess you can then take over: for 0000, test x == "0000" instead of is.nan(x); for the QC of every SNP rather than every individual, apply over columns (MARGIN = 2) and divide by dim(my.data)[1].
>
> Incidentally, the code above also gives you the QC summary.
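Putting the pieces together, here is a rough sketch of both filters plus the removal counts Greg asks for. It assumes missing calls are already coded as NA and uses the 30% threshold from his message. The simulated matrix and all object names are illustrative only. Whether individuals or loci are filtered first is a choice, and the two orders can give different results; here both missing rates are computed on the full data.

```r
set.seed(1)

# Simulated calls: 20 individuals (rows) x 50 SNPs (columns), roughly 10% missing.
geno <- matrix(sample(c("AA", "AB", "BB", NA), 20 * 50, replace = TRUE,
                      prob = c(0.3, 0.3, 0.3, 0.1)),
               nrow = 20)

# Force one clearly bad individual and one clearly bad SNP.
geno[1, 1:25] <- NA   # individual 1: at least 50% missing
geno[, 2] <- NA       # SNP 2: 100% missing

max.missing <- 0.30   # threshold taken from the question

# Proportion of missing calls per individual and per SNP.
ind.miss <- rowMeans(is.na(geno))
snp.miss <- colMeans(is.na(geno))

# Keep only rows and columns at or below the threshold.
keep.ind <- ind.miss <= max.missing
keep.snp <- snp.miss <= max.missing
geno.qc <- geno[keep.ind, keep.snp, drop = FALSE]

# Counts for documentation, plus a distribution summary of the missing rates.
qc.summary <- c(individuals.removed = sum(!keep.ind),
                snps.removed = sum(!keep.snp))
print(qc.summary)
print(summary(ind.miss))
print(summary(snp.miss))
```

With 34,000 SNPs, summary() or hist() on snp.miss is a more practical report than listing every locus, and which(!keep.snp) gives the indices of the dropped ones should you need them.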
>
> BW
>
> F
>
>
>
>> For the sake of documentation I'd also like to know how many of the loci or individuals were removed according to the above criteria but obviously I can't view a summary of each of the 34,000 SNPs. Can I produce a summary of the aforementioned editing exercises?
>>
>> Thanks in advance.
>>
>> Greg Puncher, PhD Student
>> Molecular Genetics for Environmental & Fishery Resources Laboratory (GenMAP)
>> University of Bologna
>> Via S. Alberto 163, 48123 Ravenna (Italy)
>> Ph: 39(0)544/937311 Fax: 39(0)544937411
>> _______________________________________________
>> adegenet-forum mailing list
>> adegenet-forum at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>
>
> --
> Federico C. F. Calboli
> Neuroepidemiology and Ageing Research
> Imperial College, St. Mary's Campus
> Norfolk Place, London W2 1PG
>
> Tel +44 (0)20 75941602 Fax +44 (0)20 75943193
>
> f.calboli [.a.t] imperial.ac.uk
> f.calboli [.a.t] gmail.com
>