[adegenet-forum] xvalDapc and group prediction accuracy

Wed Aug 19 19:51:44 CEST 2015

Hi Simon,

*First, I am not sure that using 50% is preferable to 90% for the training
set size.*

Think of the plot that xvalDapc produces (the one with the blue dots, with
n.pca on the x-axis and predictive success on the y-axis). Our aim in
cross-validation is just to identify the optimum number of PCs to use in
DAPC. Therefore, we want to see an arc on the plot produced by xvalDapc:
low predictive success at the lower limit of the x-axis (too little
information), low predictive success at the upper limit of the x-axis
(too much noise), and better predictive success in the middle.

When we use *smaller training sets* (e.g. 50%), the result is *reduced
variability* in the predictive success response variable (i.e. within each
n.pca, the black dots will be more likely to fall within a tighter range on
the y-axis) *but lower predictive success* than you would get with larger
training sets.

By contrast, when we use *larger training sets* (e.g. 90%), the result is
*increased variability* (dots in each n.pca column will be more spread out
on the y-axis), *but a better picture of the true maximum and minimum
predictive success possible at each n.pca.*

*In this latter case, we are more likely to see the arc-like shape* we are
hoping for in any optimisation problem. The fact that the dots will be more
spread out on the y-axis (and the arc therefore more blurry) does not
prevent us from identifying the maximum (optimal number of PCs). By
contrast, in the former case, while the dots are more densely packed, there
will be a far greater chance that we will fail to see an arc-like shape and
fail to identify the true optimum number of PCs. You really don't want to
be losing 50% of the available information when building these models.

Even with small and uneven sample sizes, stratified cross-validation as
performed by xvalDapc is still designed to identify the number of PCs that
will give you the highest predictive success. *Whether you are trying to
build a model for explanatory or predictive purposes, I would suggest using
90% for the training set size.*
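
To make the trade-off concrete, here is a minimal base-R sketch of a
stratified split at 90% versus 50%. It is an illustration only, not the
code xvalDapc uses: the helper stratified_train() and its floor-based
rounding rule are assumptions made for this example, and the group sizes
are the ones from the table(pop) output quoted further down.

```r
# Hypothetical sketch of a stratified training/validation split in base R.
# This is NOT the xvalDapc implementation; the rounding rule is an assumption.
set.seed(1)
sizes <- c(P01 = 11, P02 = 5, P03 = 5, P04 = 16, P05 = 10, P06 = 15,
           P07 = 34, P08 = 4, P09 = 4, P10 = 11, P11 = 4)
grp <- factor(rep(names(sizes), times = sizes))

# Hypothetical helper: draw a fixed proportion of each group for training,
# keeping at least one individual per group in the training set.
stratified_train <- function(grp, training.set = 0.9) {
  idx <- split(seq_along(grp), grp)
  unlist(lapply(idx, function(i) {
    k <- max(1, floor(length(i) * training.set))
    i[sample.int(length(i), k)]
  }), use.names = FALSE)
}

train_90 <- stratified_train(grp, 0.9)
train_50 <- stratified_train(grp, 0.5)

# Individuals per group left over for validation under each setting
table(grp[-train_90])
table(grp[-train_50])
```

Under these assumptions the 90% split still leaves at least one validation
individual in every group, while the 50% split discards roughly half of
each group when the model is fitted.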

*Second, I want to thank you for drawing my attention to cases in which
xvalDapc needs to handle small groups with training.set sizes that are
not 90%: Thanks to your input, I've made some changes to xvalDapc that I
think may help you and other users of adegenet!*

The updated version of xvalDapc should handle groups smaller than 10
individuals more intelligently than the current stable version. *Please try
working with the development version of adegenet* (which will become
the stable version, but not until the next release!) by using the following
steps:

## you'll need the package devtools installed, if you don't already have it:
install.packages("devtools")
library(devtools)
## you may need to remove your previous version of adegenet before
## installing and loading the devel version:
install_github("thibautjombart/adegenet")
library(adegenet)

Note that you will only notice the behaviour of xvalDapc change in the
following cases:
- The smallest group in your sample has fewer than 10 individuals *and*
training.set is not set to 0.9.
- More than one n.pca gives the lowest RMSE (the old version would have
chosen the smallest of those n.pca; the new version will choose the largest).
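
The tie-breaking rule in the second case can be sketched in a few lines of
base R. The RMSE values below are made up for illustration, and the variable
names are hypothetical; this shows the selection logic only, not the
internals of xvalDapc.

```r
# Hypothetical RMSE values, one per candidate n.pca (made-up numbers).
rmse <- c("10" = 0.12, "12" = 0.05, "14" = 0.05, "16" = 0.09)

# All candidate n.pca values that share the lowest RMSE
tied <- as.integer(names(rmse)[rmse == min(rmse)])

old_choice <- min(tied)  # old behaviour: smallest n.pca among the ties
new_choice <- max(tied)  # new behaviour: largest n.pca among the ties
```

With these numbers the two rules disagree (12 versus 14); when a single
n.pca has the lowest RMSE, both rules return the same value.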

*Third, it seems to me that what you are proposing in the final paragraph
of your e-mail is effectively what xvalDapc already does (but I may need
more clarification here).*

The 30 separate experiments you propose seem no different from the 30
repetitions in xvalDapc (i.e. argument n.rep = 30).

The only difference I can spot is in the last step. xvalDapc returns the
mean predictive success from each of the 30 runs while (and correct me if I
am wrong here) you propose instead to take, for each individual, the mode
as the "best-guess" predicted group (i.e. the most frequent "group
membership" assignment for each individual). Could you explain a little
further:

(i) How your proposed approach differs from the approach of stratified
cross-validation
(ii) What you are hoping to infer or to do after you have identified the
most frequent predicted group for each individual.
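
For reference, the "mode" step described above could look like the
following minimal base-R sketch. The prediction matrix is made up (three
new individuals, five repetitions) and the object names are hypothetical;
in practice each column would hold the assignments from one DAPC model
built on one permuted training set.

```r
# Hypothetical predictions: rows = new individuals, columns = repetitions.
pred <- rbind(
  ind1 = c("P01", "P01", "P04", "P01", "P01"),
  ind2 = c("P07", "P04", "P07", "P07", "P04"),
  ind3 = c("P10", "P10", "P10", "P10", "P10")
)

# Most frequent assignment per individual. Ties go to the group that sorts
# first, because which.max() returns the first maximum of the table.
modal_group <- apply(pred, 1, function(p) names(which.max(table(p))))

# Share of repetitions that agree with the modal assignment
support <- apply(pred, 1, function(p) max(table(p)) / length(p))
```

The support value gives a rough idea of how consistently each individual is
assigned across repetitions, which a single prediction cannot show.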

Right, I hope my first point offers a little bit of help. Please try
installing the devel version of adegenet as outlined in my second point,
play around with the updated version of xvalDapc, and see what you think.
And if you can give me your further thoughts on my third point, we can go
from there.

All the best,
Caitlin.

On Thu, Aug 6, 2015 at 3:43 PM, Crameri Simon <simon.crameri at env.ethz.ch>
wrote:

> Hi Caitlin
>
> I'm writing to you because you are the author of xvalDapc. I'm still
> somewhat confused regarding question 2) of my first post.
>
> You don't need to read it again, let's just consider this:
>
> - I have a genetic dataset of 100 individuals, and I know the true group
> membership of every individual.
> - I'd like to build a cross-validated DAPC "model" (let's call it DAPC
> model) which can be used to predict group membership of further individuals.
> - I run xvalDapc on say 50% of the 100 individuals (the reason I can't
> take 90% lies in the small size of some groups).
> - I get n.pca = 25 as the best n.pca for building the DAPC model, and
> xvalDapc automatically produces a corresponding DAPC, albeit with 100% of
> the individuals.
>
> Now comes the tricky question: Can I really use the DAPC produced by
> xvalDapc for prediction purposes? I still think that it is somewhat
> problematic to take the full dataset (100 individuals) to build a
> cross-validated DAPC model when the n.pca used in the PCA step of DAPC was
> determined from training sets of just 50 individuals. Perhaps this is the
> reason why you set training.set = 0.9 as a default value, to make this
> difference as small as possible?
>
> An alternative approach would be to use xvalDapc as "just" a (wonderful!)
> tool to get an optimal n.pca for your data. But for prediction purposes,
> I'd suggest building a DAPC model with a training set of, in this case, 50
> individuals (from a stratified sampling) instead of all individuals. If you
> don't want to lose the information of the other 50 individuals, you could
> even produce, say, 30 permuted training sets in the same way as xvalDapc
> does, build 30 DAPC models and predict your further individuals against
> all 30 permuted DAPC models separately, taking the group that was most
> often assigned to an additional sample as the predicted group.
>
> Do you have any comments on that? I know, it's all very complicated, but
> wouldn't that be statistically more appropriate?
>
> Thank you in advance,
> Simon
>
>
>
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 28 Jul 2015 11:52:41 +0000
> From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
> To: "Crameri  Simon" <simon.crameri at env.ethz.ch>,
> "<adegenet-forum at lists.r-forge.r-project.org>"
> <adegenet-forum at lists.r-forge.r-project.org>
> Subject: Re: [adegenet-forum] xvalDapc and group prediction accuracy
> Message-ID:
> <2CB2DA8E426F3541AB1907F98ABA6570ABF58B2D at icexch-m1.ic.ac.uk>
> Content-Type: text/plain; charset="iso-8859-1"
>
>
> Hi there
>
> see the argument 'result' in xvalDapc. The difference you see is the
> difference between the mean % of successful prediction averaged over groups
> (default), and the overall % of successful prediction. These two quantities
> are increasingly different when sample sizes are unequal.
>
> Cheers
> Thibaut
>
>
> ________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [
> adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Crameri
> Simon [simon.crameri at env.ethz.ch]
> Sent: 17 July 2015 16:25
> To: <adegenet-forum at lists.r-forge.r-project.org>
> Subject: [adegenet-forum] xvalDapc and group prediction accuracy
>
> Hi Thibaut
>
> I am still working with my tree species whose genotypes I'd like to model
> using DAPC, and I am still aiming to use the results as a forensic tool to
> identify species genetically. Therefore, the whole approach needs to be as
> reliable as possible. I tried xvalDapc() to perform DAPC cross-validation
> and found an optimal n.pca:
>
> table(data@pop)
>
>
> P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11
>  11   5   5  16  10  15  34   4   4  11   4
>
> xval <- xvalDapc(data@tab, pop(data), training.set = 0.5, result =
> "groupMean", n.pca = 10:20, n.rep = 1000)
>
>
> xval$`Mean Successful Assignment by Number of PCs of PCA`[as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)]
>
>       14
> 0.9953977
>
> xval$'Number of PCs Achieving Lowest MSE'
>
> [1] "14"
>
> xval$DAPC$n.pca
>
> [1] 14
>
>
> It all works fine, the resulting best n.pca is still 14 if xvalDapc() is
> carried out multiple times using the same parameters, and even when
> changing training.set to, say, 0.9. Now I use the validated model
> (xval$DAPC) to predict species membership of additional samples:
>
> predict(xval$DAPC, newdata=new.data)
>
>
> Again, it's all working perfectly, but what I don't fully understand is
> this:
>
> 1) As it happens, I know the true group membership of the additional
> samples. Therefore I can assess the prediction accuracy of xval$DAPC. It
> turns out that 96.8% (group mean!) of the additional samples are correctly
> predicted by xval$DAPC. Why is this number slightly different from the
> expected 99.5%? Could it be due to the different group sizes present in the
> full dataset (table(data@pop))?
>
> 2) If the full dataset contains groups of very different size, some of
> which are fairly small: would it be more reliable to predict group
> membership of additional samples using the above determined n.pca and all
> 1000 training sets (which have approximately equal group size) as a
> reference, instead of using the full dataset (where group sizes differ) and
> just one prediction? The resulting 1000 prediction outcomes could be
> screened for the groups most often assigned to each new sample.
>
>
> Any opinions / ideas? Thanks in advance,
>
> Simon
>
> *************
> PhD student
> ETH Zurich
> Plant Ecological Genetics
>
>
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>