[adegenet-forum] PCA sensitive to order of samples?

Tue Oct 14 12:31:52 CEST 2014

Hi there,

No, PCA is not sensitive to the ordering of samples.
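
A quick way to convince yourself is to run a PCA twice, once on the original matrix and once on a row-permuted copy, and compare the scores of the same individuals. The sketch below is only an illustration: it uses simulated genotypes and base R's prcomp rather than glPca, and all object names are made up for the example.

```r
# Sketch: PCA scores do not depend on the order of the rows (samples).
# Simulated 0/1/2 genotypes; prcomp() stands in for glPca() here.
set.seed(1)
n_ind <- 30
n_snp <- 200
X <- matrix(rbinom(n_ind * n_snp, size = 2, prob = 0.3), nrow = n_ind)
rownames(X) <- paste0("ind", seq_len(n_ind))

perm <- sample(n_ind)  # random reordering of the samples
pca_a <- prcomp(X, center = TRUE, scale. = FALSE)
pca_b <- prcomp(X[perm, ], center = TRUE, scale. = FALSE)

# Match individuals by name; an axis may flip sign, so compare absolute values.
scores_a <- abs(pca_a$x[rownames(X), 1:2])
scores_b <- abs(pca_b$x[rownames(X), 1:2])
max_diff <- max(abs(scores_a - scores_b))
print(max_diff)  # expected to be at the level of numerical noise
```

If the first rows of a matrix always stand out whatever individuals they contain, the cause is more likely to be in how the data are read or converted than in the PCA itself.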

Note: given the size of the dataset, it is probably easier to use the basic PCA procedure (dudi.pca). genlight objects are meant to be used whenever your computer could not otherwise store the data.
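
As a rough sketch of what that could look like, assuming the ade4 package is installed and that the genotypes are held in a numeric individuals x SNPs matrix. The matrix below is simulated as a stand-in for the real file, and replacing NAs by the SNP mean is one simple choice among several, needed here because dudi.pca does not accept missing values.

```r
library(ade4)  # provides dudi.pca

# Hypothetical stand-in for the real 0/1/2/NA genotype matrix
set.seed(1)
X <- matrix(rbinom(140 * 500, size = 2, prob = 0.3), nrow = 140)
X[sample(length(X), 0.1 * length(X))] <- NA  # 10% missing at random

# Replace each NA by the mean of its SNP (column)
snp_means <- colMeans(X, na.rm = TRUE)
na_idx <- which(is.na(X), arr.ind = TRUE)
X[na_idx] <- snp_means[na_idx[, "col"]]

# Centred, unscaled PCA; scannf = FALSE skips the interactive screeplot prompt
pca1 <- dudi.pca(as.data.frame(X), center = TRUE, scale = FALSE,
                 scannf = FALSE, nf = 2)
head(pca1$li)  # coordinates of the individuals on the first two axes
```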

If your missing data are not randomly distributed, then having many NAs is a problem: individuals with similar missing data will be seen as artificially similar, and SNPs with similar NAs will be seen as artificially correlated.
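
The effect can be illustrated on simulated data with no real structure at all. The sketch below assumes missing genotypes are replaced by the SNP mean (a common convention, used here only for illustration); five individuals are given the same block of missing SNPs and end up closer to each other than unrelated individuals normally are.

```r
# Sketch: shared missing data makes individuals look artificially similar.
set.seed(42)
n_ind <- 20
n_snp <- 400
X <- matrix(rbinom(n_ind * n_snp, size = 2, prob = 0.3), nrow = n_ind)

# Individuals 1-5 share the same block of missing SNPs
X[1:5, 1:200] <- NA

# Mean imputation: each NA becomes the mean of its SNP (column)
snp_means <- colMeans(X, na.rm = TRUE)
na_idx <- which(is.na(X), arr.ind = TRUE)
X[na_idx] <- snp_means[na_idx[, "col"]]

# Average pairwise Euclidean distance within each group
d <- as.matrix(dist(X))
d_na <- d[1:5, 1:5]
d_rest <- d[6:20, 6:20]
within_na <- mean(d_na[upper.tri(d_na)])
within_rest <- mean(d_rest[upper.tri(d_rest)])
print(c(shared_NAs = within_na, others = within_rest))
```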

It is safer to use less data, of better quality. In this case, you may want to remove SNPs with many NAs.
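
A minimal sketch of such a filter in base R, again on a simulated individuals x SNPs matrix. The 20% cut-off (max_na) is an arbitrary value chosen for the example, not a recommendation.

```r
# Sketch: drop SNPs whose proportion of missing genotypes exceeds a threshold.
set.seed(1)
X <- matrix(rbinom(140 * 300, size = 2, prob = 0.3), nrow = 140)

# Give each SNP its own missing rate between 0 and 50%
miss_rate <- runif(ncol(X), min = 0, max = 0.5)
for (j in seq_len(ncol(X))) {
  X[runif(nrow(X)) < miss_rate[j], j] <- NA
}

na_per_snp <- colMeans(is.na(X))  # proportion of NAs per SNP
max_na <- 0.2                     # hypothetical threshold
X_filtered <- X[, na_per_snp <= max_na, drop = FALSE]

dim(X)           # before filtering
dim(X_filtered)  # after filtering
```

The same idea applies per individual, using rowMeans(is.na(X)) instead.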

Cheers
Thibaut

________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of zuzmus [zuzmus at gmail.com]
Sent: 09 October 2014 10:55
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] PCA sensitive to order of samples?

Dear colleagues,

I would like to perform a PCA with the adegenet package and managed to run the procedure through to the end. The problem is that the results don't make sense, and I see an obvious bias related to the order of the samples in the input matrix.

The matrix has 140 samples from 11 putative species and ca. 2800 SNPs obtained with the RAD-seq method (only biallelic SNPs included; coded 0 - more frequent allele, 1 - heterozygote, 2 - rarer allele, NA - missing data).

I used the following code:

> library(adegenet)  # provides the genlight class and glPca
> data <- read.table("/Users/zuzana/Matrix_for_adegenet_cutSNPsTo2484_NoHybrids.txt")
> x <- new("genlight", data)
> pca1 <- glPca(x)
> scatter(pca1, posi="bottomleft")

The results always show the first 5-7 individuals as strongly separated along PC1 and PC2, while the rest form one cluster. When I repeated the same analysis after removing the first few individuals from the matrix, the pattern stayed the same: the new first individuals became separated.

[Inserted image 1]

I also tried most of the options of the glPca command, following the manual and the help in R, but always got similar results...

Another issue is that I have quite a lot of missing data in my matrix (10 - 35% per SNP, and ca. 10 - 50% per individual), but this was the trade-off of the experimental design ("sequence as much as possible as cheaply as possible..."). However, the first individuals in the list are quite well sequenced, so they are not the worst in terms of missing data...

I wonder whether I missed some basics, whether I did something wrong, or whether there really could be a bias due to the order of the samples in the matrix. I would be very happy if somebody could help me find out how to solve this issue.

Thank you very much for any help and suggestions! :-)

With regards,

Zuzana

---
Zuzana Musilova, PhD.
Zoological Institute
University of Basel
Vesalgasse 1 | 4051 Basel
Switzerland | Europe
)><(((@>....<@)))><(
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2014-10-09 at 11.22.14 AM.png
Type: image/png
Size: 27443 bytes
Desc: Screen Shot 2014-10-09 at 11.22.14 AM.png
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20141014/00c94174/attachment-0001.png>