[adegenet-forum] xvalDapc and group prediction accuracy

Fri Jul 17 17:25:31 CEST 2015

Hi Thibaut

I am still working with my tree species whose genotypes I'd like to model using DAPC, and I am still aiming to use the results as a forensic tool to identify species genetically. Therefore, the whole approach needs to be as reliable as possible. I tried xvalDapc() to perform DAPC cross-validation and found an optimal n.pca:

> table(data@pop)

P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11
 11   5   5  16  10  15  34   4   4  11   4

> xval <- xvalDapc(data@tab, pop(data), training.set = 0.5, result = "groupMean", n.pca = 10:20, n.rep = 1000)

> xval$`Mean Successful Assignment by Number of PCs of PCA`[as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)]
       14
0.9953977

> xval$'Number of PCs Achieving Lowest MSE'
[1] "14"

> xval$DAPC$n.pca
[1] 14

It all works fine: the best n.pca remains 14 when xvalDapc() is run repeatedly with the same parameters, and also when training.set is changed to, say, 0.9. I then use the validated model (xval$DAPC) to predict species membership of additional samples:

> predict(xval$DAPC, newdata=new.data)

Again, it's all working perfectly, but what I don't fully understand is this:

1) As it happens, I know the true group membership of the additional samples, so I can assess the prediction accuracy of xval$DAPC. It turns out that 96.8% (group mean!) of the additional samples are correctly predicted by xval$DAPC. Why is this number slightly different from the expected 99.5%? Could it be due to the different group sizes in the full dataset (table(data@pop))?

2) If the full dataset contains groups of very different sizes, some of which are fairly small: would it be more reliable to predict group membership of additional samples using the n.pca determined above and all 1000 training sets (which have approximately equal group sizes) as references, instead of using the full dataset (where group sizes differ) and just one prediction? The resulting 1000 prediction outcomes could then be screened for the group most often assigned to each new sample.
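
To make both points concrete, here is a minimal base-R sketch of the two calculations I have in mind: the group-mean success used in 1), and the "most often assigned group" screening proposed in 2). The function names (score_assignments, consensus_calls) and the toy labels are hypothetical and only for illustration. In practice each column of the prediction matrix would presumably hold the assignments from one predict() call on a DAPC fitted to one training set, and the true labels would be the known species of the new samples.

```r
# Per-group and group-mean assignment success for samples with known labels.
# `truth` and `predicted` are vectors of group labels of equal length.
score_assignments <- function(truth, predicted) {
  correct <- as.character(truth) == as.character(predicted)
  per_group <- tapply(correct, as.character(truth), mean)
  list(per_group  = per_group,          # success rate within each true group
       group_mean = mean(per_group),    # unweighted mean over groups
       overall    = mean(correct))      # proportion of all samples correct
}

# Consensus call per new sample across replicate predictions.
# Rows = new samples, columns = replicates (one per training set).
# Ties are resolved in favour of the first label in sorted order.
consensus_calls <- function(pred_matrix) {
  call  <- apply(pred_matrix, 1, function(x) names(which.max(table(x))))
  share <- apply(pred_matrix, 1, function(x) max(table(x)) / length(x))
  data.frame(call = call, share = share, stringsAsFactors = FALSE)
}

# Hypothetical toy data: 3 new samples, 5 replicate predictions each.
preds <- rbind(
  s1 = c("P01", "P01", "P02", "P01", "P01"),
  s2 = c("P03", "P03", "P03", "P03", "P03"),
  s3 = c("P02", "P03", "P02", "P02", "P03")
)
truth <- c("P01", "P03", "P03")   # assumed known labels for s1..s3

res <- consensus_calls(preds)
acc <- score_assignments(truth, res$call)

print(res)
print(acc)
```

The share column gives the fraction of replicates agreeing with the consensus call, which could serve as a rough per-sample confidence. Note that the group mean weights every group equally, so a single error in a small group lowers it much more than the same error in a large group.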

Any opinions / ideas? Thanks in advance,

Simon

*************
PhD student
ETH Zurich
Plant Ecological Genetics