[adegenet-forum] xvalDapc and group prediction accuracy
simon.crameri at env.ethz.ch
Thu Aug 6 16:43:24 CEST 2015
I'm writing to you because you are the author of xvalDapc. I'm still somewhat confused regarding question 2) of my first post.
You don't need to read it again; let's just consider this:
- I have a genetic dataset of 100 individuals, and I know the true group membership of every individual.
- I'd like to build a cross-validated DAPC "model" (let's call it DAPC model) which can be used to predict group membership of further individuals.
- I run xvalDapc on say 50% of the 100 individuals (the reason I can't take 90% lies in the small size of some groups).
- I get n.pca = 25 as the best n.pca for building the DAPC model, and xvalDapc automatically produces a corresponding DAPC, albeit fitted on 100% of the individuals.
Now comes the tricky question: Can I really use the DAPC produced by xvalDapc for prediction purposes? I still think that it is somewhat problematic to take the full dataset (100 individuals) to build a cross-validated DAPC model when the n.pca used in the PCA step of DAPC was determined from training sets of just 50 individuals. Perhaps this is the reason why you set training.set = 0.9 as a default value, to make this difference as small as possible?
An alternative approach would be to use xvalDapc as "just" a (wonderful!) tool to get an optimal n.pca for your data. But for prediction purposes, I'd suggest building a DAPC model with a training set of, in this case, 50 individuals (from a stratified sampling) instead of all individuals. If you don't want to lose the information in the other 50 individuals, you could even produce, say, 30 permuted training sets in the same way as xvalDapc does, build 30 DAPC models, and predict your further individuals against each of the 30 DAPC models separately, taking the group most often assigned to an additional sample as its predicted group.
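The stratified resampling described above could be sketched roughly as follows in base R. This is only an illustration under stated assumptions: the group labels, the group sizes, and the helper `stratified_sample()` are hypothetical and not part of adegenet, and the DAPC fitting step itself is left out.

```r
set.seed(1)  # only to make the illustration reproducible

# Hypothetical group labels for 100 individuals with unequal group sizes.
grp <- factor(rep(c("A", "B", "C", "D"), times = c(50, 30, 14, 6)))

# Hypothetical helper: draw the same fraction from every group,
# keeping at least one individual per group.
stratified_sample <- function(grp, fraction = 0.5) {
  idx_by_group <- split(seq_along(grp), grp)
  picked <- lapply(idx_by_group, function(idx) {
    n_take <- max(1, round(fraction * length(idx)))
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# 30 permuted training sets, each a vector of row indices.
training_sets <- replicate(30, stratified_sample(grp), simplify = FALSE)

# Group composition of the first training set.
table(grp[training_sets[[1]]])
```

Each element of `training_sets` could then be used to subset the data before fitting one DAPC with the n.pca chosen by cross-validation.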
Do you have any comments on that? I know, it's all very complicated, but wouldn't that be statistically more appropriate?
Thank you in advance,
Date: Tue, 28 Jul 2015 11:52:41 +0000
From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
To: "Crameri Simon" <simon.crameri at env.ethz.ch>,
"<adegenet-forum at lists.r-forge.r-project.org>"
Subject: Re: [adegenet-forum] xvalDapc and group prediction accuracy
See the argument 'result' in xvalDapc. The difference you see is the difference between the mean % of successful prediction averaged over groups (the default, "groupMean") and the overall % of successful prediction. These two quantities diverge increasingly as group sample sizes become more unequal.
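A small base-R calculation may help show how the two summaries diverge. The group sizes are those reported in the original post; the misassignment pattern (two errors in the small group P08) is invented purely for illustration.

```r
# Group sizes as reported in the original post (table(data@pop)).
group_sizes <- c(P01 = 11, P02 = 5, P03 = 5, P04 = 16, P05 = 10, P06 = 15,
                 P07 = 34, P08 = 4, P09 = 4, P10 = 11, P11 = 4)

# Hypothetical outcome: everything is assigned correctly except
# two individuals from the small group P08.
n_correct <- group_sizes
n_correct["P08"] <- group_sizes["P08"] - 2

# Overall success: correct assignments pooled over all individuals.
overall <- sum(n_correct) / sum(group_sizes)

# Group-mean success: per-group success rates averaged with equal weight.
group_mean <- mean(n_correct / group_sizes)

round(100 * c(overall = overall, groupMean = group_mean), 1)
```

Because each group counts equally in the group mean, errors in a small group pull that figure down much more than they pull down the overall rate.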
From: adegenet-forum-bounces at lists.r-forge.r-project.org on behalf of Crameri Simon [simon.crameri at env.ethz.ch]
Sent: 17 July 2015 16:25
To: <adegenet-forum at lists.r-forge.r-project.org>
Subject: [adegenet-forum] xvalDapc and group prediction accuracy
I am still working with my tree species whose genotypes I'd like to model using DAPC, and I am still aiming to use the results as a forensic tool to identify species genetically. Therefore, the whole approach needs to be as reliable as possible. I tried xvalDapc() to perform DAPC cross-validation and found an optimal n.pca:
table(data@pop)
P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11
 11   5   5  16  10  15  34   4   4  11   4
xval <- xvalDapc(data@tab, pop(data), training.set = 0.5, result = "groupMean", n.pca = 10:20, n.rep = 1000)
xval$`Mean Successful Assignment by Number of PCs of PCA`[as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)]
xval$'Number of PCs Achieving Lowest MSE'
It all works fine: the resulting best n.pca is still 14 if xvalDapc() is run multiple times with the same parameters, and even when changing training.set to, say, 0.9. Now I use the validated model (xval$DAPC) to predict species membership of additional samples:
Again, it's all working perfectly, but what I don't fully understand is this:
1) As it happens, I know the true group membership of the additional samples. Therefore I can assess the prediction accuracy of xval$DAPC. It turns out that 96.8% (group mean!) of the additional samples are correctly predicted by xval$DAPC. Why is this number slightly different from the expected 99.5%? Could it be due to the different group sizes present in the full dataset (table(data@pop))?
2) If the full dataset contains groups of very different sizes, some of which are fairly small: would it be more reliable to predict group membership of additional samples using the n.pca determined above and all 1000 training sets (which have approximately equal group sizes) as references, instead of using the full dataset (where group sizes differ) and just one prediction? The resulting 1000 prediction outcomes could be screened for the group most often assigned to each new sample.
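The screening step in 2) amounts to a majority vote over the per-model predictions. A minimal base-R sketch, assuming the predictions have already been collected into a character matrix (the matrix `preds`, its values, and the helper `majority_vote()` are all hypothetical, with 5 models standing in for 1000):

```r
# Hypothetical predictions: rows are new samples, columns are models
# built on different training sets.
preds <- matrix(
  c("P01", "P01", "P07", "P01", "P01",
    "P07", "P07", "P07", "P04", "P07",
    "P08", "P09", "P08", "P09", "P08"),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("new1", "new2", "new3"), paste0("model", 1:5))
)

# Most frequently assigned group for one sample
# (ties go to the first group in table order).
majority_vote <- function(x) names(which.max(table(x)))
consensus <- apply(preds, 1, majority_vote)

# Share of models agreeing with the consensus, a rough measure of support.
support <- rowMeans(preds == consensus)

data.frame(consensus, support)
```

The `support` column gives a quick indication of which new samples are assigned consistently across training sets and which are not.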
Any opinions / ideas? Thanks in advance,
Plant Ecological Genetics