[adegenet-forum] how do I know if missing data is affecting PCA or DAPC results

Federico Calboli f.calboli at imperial.ac.uk
Tue Sep 22 21:30:11 CEST 2015

> On 22 Sep 2015, at 22:16, Ella Bowles <ebowles at ucalgary.ca> wrote:
> Hello,
> I'm attempting to do a PCA and a DAPC on genomic data: 186 individuals spread over 11 putative populations, with just over 4000 loci. I have converted the data to a genlight object. I know that I have some missing data (each marker is present in at least 65% of individuals), and the adegenet manual notes that missing data could bias results. How do I know if I have too much missing data? Or should I just remove all the loci that have missing values before doing the analysis?

As a general rule you should QC your data in some way, say by removing all SNPs with more than X% missing data. Allowing up to 35% missing looks very generous to me; I would personally use a 5% threshold. One way of testing the effect of your missing data is to run the PCA and DAPC multiple times, starting with a dataset that has no missing data and relaxing the threshold a little each subsequent time, until your results become unacceptably different from those obtained with the no-missing dataset.
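
To make the threshold sweep concrete, here is a minimal sketch in plain Python of the filtering step only. It is not adegenet code: the function names, the toy genotype matrix and the list of thresholds are all made up for illustration, and missing calls are assumed to be coded as None. The PCA and DAPC would still be run in adegenet on each retained set of loci.

```python
# Illustrative sketch only: hypothetical helper names, toy data.
# Genotypes: rows = individuals, columns = loci; None marks a missing call.

def missing_rate_per_locus(genotypes):
    """Fraction of individuals with a missing call, for each locus."""
    n_individuals = len(genotypes)
    n_loci = len(genotypes[0])
    return [
        sum(row[j] is None for row in genotypes) / n_individuals
        for j in range(n_loci)
    ]

def filter_loci(genotypes, max_missing):
    """Keep loci whose missing rate is at most max_missing (0.05 = 5%)."""
    rates = missing_rate_per_locus(genotypes)
    keep = [j for j, rate in enumerate(rates) if rate <= max_missing]
    filtered = [[row[j] for j in keep] for row in genotypes]
    return filtered, keep

def threshold_sweep(genotypes, thresholds):
    """Map each threshold to the indices of the loci it retains."""
    return {t: filter_loci(genotypes, t)[1] for t in thresholds}

# Toy matrix: 4 individuals x 4 loci (assumed data, not from the thread).
toy = [
    [0, 1, None, 2],
    [1, None, None, 0],
    [2, 0, 1, 1],
    [0, 1, None, 2],
]

# Start strict (no missing data allowed) and relax step by step.
retained = threshold_sweep(toy, [0.0, 0.05, 0.35, 1.0])
for threshold, loci in retained.items():
    print(f"max missing {threshold:.2f}: {len(loci)} loci kept {loci}")
```

Each retained set would then go through the same PCA and DAPC, and the results (for example the individual coordinates or the cluster assignments) would be compared against the run with no missing data.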



> With thanks,
> Ella 
> -- 
> Ella Bowles
> PhD Candidate 
> Biological Sciences
> University of Calgary
> e-mail: ebowles at ucalgary.ca, bowlese at gmail.com
> website: http://ellabowlesphd.wordpress.com/
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
