[adegenet-forum] how do I know if missing data is affecting PCA or DAPC results

Jombart, Thibaut t.jombart at imperial.ac.uk
Wed Sep 23 16:01:30 CEST 2015

Dear Ella,

there is no one-size-fits-all answer to this question, but some general ideas may be useful.

Missing data should ideally be i) not too numerous and ii) randomly distributed in the dataset. In a situation like yours (186 individuals, just over 4000 loci), individuals are more precious than markers, so I would discard loci with a majority of NAs, and then briefly check the structure of the remaining missing entries.
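
As an illustration of the filtering step, here is a minimal NumPy sketch, not adegenet code. It assumes the genotypes have been exported to an individuals x loci array with NaN for missing entries; the function name drop_sparse_loci and the 0.5 threshold are hypothetical choices made for this example.

```python
import numpy as np

def drop_sparse_loci(geno, max_missing=0.5):
    """Drop loci (columns) whose fraction of NaN entries exceeds max_missing."""
    locus_missing = np.isnan(geno).mean(axis=0)  # fraction of NAs per locus
    keep = locus_missing <= max_missing
    return geno[:, keep], locus_missing

# Toy data (made up): 4 individuals x 3 loci, NaN marks a missing genotype.
geno = np.array([
    [0.0, np.nan, 2.0],
    [1.0, np.nan, np.nan],
    [2.0, np.nan, 1.0],
    [0.0, 1.0,    0.0],
])

filtered, locus_missing = drop_sparse_loci(geno)

# Quick look at what is left: fraction of NAs per individual after filtering.
ind_missing = np.isnan(filtered).mean(axis=1)
print(locus_missing, filtered.shape, ind_missing)
```

The per-individual fractions give a first impression of whether the remaining NAs are concentrated in a few samples.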

NAs are basically replaced by the mean allele frequency of the locus. Once the data are centred, these entries become zero, so individuals with many NAs will tend to be placed closer to the origin. Also, individuals with similar patterns of NAs will be seen as more similar than they probably are in reality.
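
The pull toward the origin can be seen on a toy example. The sketch below is plain NumPy, not adegenet's implementation; it assumes an individuals x loci array with NaN for missing entries, and impute_and_center is a hypothetical helper name.

```python
import numpy as np

def impute_and_center(geno):
    """Replace NaN by the per-locus mean of observed values, then center."""
    col_means = np.nanmean(geno, axis=0)
    filled = np.where(np.isnan(geno), col_means, geno)
    return filled - col_means  # imputed entries become exactly 0

# Toy data (made up): the last individual has no observed genotype at all.
geno = np.array([
    [0.0,    2.0,    0.0],
    [2.0,    0.0,    2.0],
    [2.0,    np.nan, 0.0],
    [np.nan, np.nan, np.nan],
])

centered = impute_and_center(geno)

# Distance of each individual from the origin in the centered space.
dist = np.linalg.norm(centered, axis=1)
print(dist)
```

Every imputed entry contributes nothing to the distance, so the more NAs an individual has, the fewer loci can move it away from the origin.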

If you really have a big missing value problem, and a lot of NAs you cannot discard, one possibility would be to build a matrix of 1s and 0s where '1' indicates an NA, and do a PCA of this matrix. If you obtain a structure, then this is a sign of a problem: your NAs are not randomly distributed.
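
A minimal sketch of this check, again in plain NumPy rather than adegenet: the PCA is done by SVD of the centered indicator matrix, and pca_of_missingness and the toy data are hypothetical, chosen so that the first three individuals share a block of missing loci.

```python
import numpy as np

def pca_of_missingness(geno, n_axes=2):
    """PCA (via SVD) of the 0/1 matrix in which 1 marks a missing entry."""
    indicator = np.isnan(geno).astype(float)
    centered = indicator - indicator.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_axes] * s[:n_axes]  # coordinates of the individuals
    total = (s ** 2).sum()
    explained = s[:n_axes] ** 2 / total if total > 0 else np.zeros(n_axes)
    return scores, explained

# Toy data (made up): individuals 0-2 lack loci 0 and 1; one stray NA elsewhere.
geno = np.array([
    [np.nan, np.nan, 0.0, 1.0],
    [np.nan, np.nan, 2.0, 0.0],
    [np.nan, np.nan, 1.0, 2.0],
    [0.0,    1.0,    0.0, 1.0],
    [2.0,    0.0,    1.0, np.nan],
    [1.0,    2.0,    2.0, 0.0],
])

scores, explained = pca_of_missingness(geno)
print(scores[:, 0], explained)
```

In this toy case the first axis separates individuals 0-2 from the rest and carries most of the variance, which is the kind of structure that signals non-random NAs. With randomly scattered NAs, no axis would be expected to stand out in this way.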

Hope this helps.


Dr Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
Norfolk Place, London W2 1PG, UK
Tel. : 0044 (0)20 7594 3658
Twitter: @thibautjombart

From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Ella Bowles [ebowles at ucalgary.ca]
Sent: 22 September 2015 20:16
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] how do I know if missing data is affecting PCA or DAPC results


I'm attempting to do a PCA and a DAPC on genomic data: 186 individuals spread over 11 putative populations, with just over 4000 loci. I have converted the data to a genlight object. I know that I have some missing data (each marker is present in at least 65% of individuals). The adegenet manual specifies that missing data could bias results. How do I know if I have too much missing data, or should I just get rid of all the loci that have missing values before doing the analysis?

With thanks,

Ella Bowles
PhD Candidate
Biological Sciences
University of Calgary

e-mail: ebowles at ucalgary.ca, bowlese at gmail.com
website: http://ellabowlesphd.wordpress.com/