[adegenet-forum] DAPC, number of retained PCs and number of saved LDFs

Tue Jun 2 14:03:09 CEST 2015

Hi Thibaut

I have a genetic dataset of 125 individuals belonging to 11 different closely related plant species (saved in the @pop slot) and I would like to model the species-genotype relationship using dapc().

Of course one major issue is to find the best number of retained principal components during the PCA step of DAPC. This is my approach to find the best n.pca:

- create 100 permuted training sets of my complete dataset, each containing 50% of the samples (sampling stratified for @pop since I have groups that contain only very few individuals)
- do DAPC with all 100 training sets  and each time predict the species of the validation samples

dapc.train <- dapc(training.set, n.pca = n.pca, n.da = n.pca)
val <- predict(dapc.train, newdata = validation.set)

- look at the prediction successes, calculate mean overall prediction success over the 100 runs that used the identical n.pca
- do the steps above for say n.pca = 1:30
- select the optimal n.pca for my validated model according to the first local prediction success maximum (alternatively, take the global maximum)

I think this is a similar procedure to doing

optim.a.score(dapc(complete.set, n.pca = 30, n.da = 30), smart = F, n.sim = 100, n.da = 30)

but the resulting best n.pca is somewhat larger  if I do it "by hand", and the resulting mean overall prediction successes are much larger than the respecitve mean a-scores.

Question 1)
Given these different results: where lies the difference between the two approaches (doing it "by hand" or using optim.a.score)? Does my approach make any sense?

In addition, I would like to compare the accuracy of different DAPC models using different datasets. I have a cpDNA dataset and a microsatellite dataset and would like to compare DAPC models that contain one,
the other or a combination of both datasets. To do this, I need to have the best n.pca for each case, and use the same procedure as described above. However, I observe that at
n.pca ≥ 10, less than n.pca discriminant functions are saved in the case of the cpDNA dataset. This behaviour is associated with some of the training sets only, and causes problems when I want
to automatize the script for different n.pca. I think this has something to do with the proportion of conserved variance, which reaches >0.98 at n.pca ≥ 10.

Question 2)
Why can't dapc() always save as many discriminant functions as there are available principal components (as indicated in the dapc argument n.da), and why is this is the case for some training sets only?

I sent you the the data and an R script that hopefully shows the problem.

With regards,
Simon

*********************************************
Simon Crameri

phD student
ETH Zurich
Plant Ecological Genetics

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20150602/e9ea312e/attachment.html>