<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">

<div>Hi Caitlin</div>

<div><br>

</div>

<div>I'm writing to you because you are the author of xvalDapc. I'm still somewhat confused regarding question 2) of my first post.</div>

<div><br>

</div>

<div>You don't need to read it again; let's just consider this: </div>

<div><br>

</div>

<div>- I have a genetic dataset of 100 individuals, and I know the true group membership of every individual. </div>

<div>- I'd like to build a cross-validated DAPC "model" (let's call it DAPC model) which can be used to predict group membership of further individuals.</div>

<div>- I run xvalDapc on say 50% of the 100 individuals (the reason I can't take 90% lies in the small size of some groups).</div>

<div>- I get n.pca = 25 as the best n.pca for building the DAPC model, and xvalDapc automatically produces the corresponding DAPC, albeit fitted to 100% of the individuals.</div>

<div><br>

</div>

<div>Now comes the tricky question: Can I really use the DAPC produced by xvalDapc for prediction purposes? I still think that it is somewhat problematic to take the full dataset (100 individuals) to build a cross-validated DAPC model when the n.pca used in

 the PCA step of DAPC was determined from training sets of just 50 individuals. Perhaps this is the reason why you set

<font face="Courier">training.set = 0.9</font> as a default value, to make this difference as small as possible?</div>

<div><br>

</div>

<div>An alternative approach would be to use xvalDapc as "just" a (wonderful!) tool to find the optimal n.pca for your data. For prediction purposes, however, I'd suggest building the DAPC model from a training set of, in this case, 50 individuals (drawn by stratified sampling) instead of all individuals. If you don't want to lose the information in the other 50 individuals, you could even produce, say, 30 permuted training sets in the same way xvalDapc does, build 30 DAPC models, predict your further individuals against each of the 30 models separately, and take the group most often assigned to an additional sample as its predicted group. </div>
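<div><br>
</div>
<div>To make the idea concrete, here is a minimal base-R sketch of the resampling and majority vote. It is only an illustration under stated assumptions: the helper names (stratified_sample, ensemble_predict, fit_centroid, pred_centroid) are hypothetical, the data are simulated, and a simple nearest-centroid classifier stands in for DAPC so that the example runs without adegenet.</div>
<pre>
```r
set.seed(42)

# Draw a stratified training set: the same fraction of every group,
# but never fewer than `min_n` individuals per group.
stratified_sample <- function(grp, frac = 0.5, min_n = 2) {
  idx_by_grp <- split(seq_along(grp), grp, drop = TRUE)
  unlist(lapply(idx_by_grp, function(idx) {
    n <- min(length(idx), max(min_n, round(frac * length(idx))))
    idx[sample.int(length(idx), n)]
  }), use.names = FALSE)
}

# Fit one model per stratified training set, predict the new samples with
# each model, and return the group assigned most often to each new sample.
# Ties are broken in favour of the alphabetically first group.
ensemble_predict <- function(x, grp, newdata, fit, pred,
                             n_models = 30, frac = 0.5) {
  votes <- replicate(n_models, {
    train <- stratified_sample(grp, frac)
    as.character(pred(fit(x[train, , drop = FALSE], grp[train]), newdata))
  })
  votes <- matrix(votes, nrow = nrow(newdata))  # rows: new samples, cols: models
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# Stand-in classifier (NOT DAPC): assign each sample to the nearest group centroid.
fit_centroid <- function(x, grp) {
  do.call(rbind, lapply(split(as.data.frame(x), grp), colMeans))
}
pred_centroid <- function(centroids, newdata) {
  apply(newdata, 1, function(z) {
    d <- colSums((t(centroids) - z)^2)  # squared distance to each centroid
    rownames(centroids)[which.min(d)]
  })
}

# Simulated reference data: three groups of unequal size, two variables.
grp <- rep(c("A", "B", "C"), times = c(20, 6, 4))
centers <- c(A = 0, B = 5, C = 10)
x <- matrix(rnorm(60, mean = rep(centers[grp], 2), sd = 0.5), ncol = 2)
newdata <- rbind(c(0.2, -0.1), c(5.1, 4.8), c(9.7, 10.2))

predicted <- ensemble_predict(x, grp, newdata, fit_centroid, pred_centroid)

# With adegenet (assumption: installed), the stand-ins could be swapped for e.g.
#   fit  <- function(x, grp) dapc(x, factor(grp), n.pca = 25, n.da = 3)  # n.da = 3 is hypothetical
#   pred <- function(m, nd) predict(m, newdata = nd)$assign
```
</pre>
<div>Keeping the classifier as a pair of functions means the same resampling and voting code can be reused with the n.pca that xvalDapc selects.</div>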

<div><br>

</div>

<div>Do you have any comments on that? I know, it's all very complicated, but wouldn't that be statistically more appropriate?</div>

<div><br>

</div>

<div>Thank you in advance,</div>

<div>Simon</div>

<div><br>

</div>


<div>

<div>

<div>

<blockquote type="cite"><br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Tue, 28 Jul 2015 11:52:41 +0000<br>

From: "Jombart, Thibaut" &lt;<a href="mailto:t.jombart@imperial.ac.uk">t.jombart@imperial.ac.uk</a>&gt;<br>

To: "Crameri  Simon" &lt;<a href="mailto:simon.crameri@env.ethz.ch">simon.crameri@env.ethz.ch</a>&gt;,<br>

<span class="Apple-tab-span" style="white-space:pre"></span>"&lt;<a href="mailto:adegenet-forum@lists.r-forge.r-project.org">adegenet-forum@lists.r-forge.r-project.org</a>&gt;"<br>

<span class="Apple-tab-span" style="white-space:pre"></span>&lt;<a href="mailto:adegenet-forum@lists.r-forge.r-project.org">adegenet-forum@lists.r-forge.r-project.org</a>&gt;<br>

Subject: Re: [adegenet-forum] xvalDapc and group prediction accuracy<br>

Message-ID:<br>

<span class="Apple-tab-span" style="white-space:pre"></span>&lt;2CB2DA8E426F3541AB1907F98ABA6570ABF58B2D@<a href="http://icexch-m1.ic.ac.uk">icexch-m1.ic.ac.uk</a>&gt;<br>

Content-Type: text/plain; charset="iso-8859-1"<br>

<br>

<br>

Hi there<br>

<br>

See the argument 'result' in xvalDapc. The difference you see is the difference between the mean % of successful prediction averaged over groups (the default) and the overall % of successful prediction. These two quantities diverge more and more as sample sizes become more unequal.<br>
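<br>
A minimal sketch of the two quantities, using made-up predictions: the group sizes 34 and 4 are borrowed from P07 and P08 in the table(data@pop) output quoted in this thread, but the single misassignment is invented purely for illustration.<br>
<pre>
```r
# True groups: 34 individuals in P07 and 4 in P08 (sizes from table(data@pop)).
truth <- rep(c("P07", "P08"), times = c(34, 4))

# Hypothetical predictions: everything correct except one P08 individual.
pred <- truth
pred[38] <- "P07"

correct <- pred == truth

# result = "overall": proportion of all individuals assigned correctly.
overall <- mean(correct)                           # 37/38, about 0.974

# result = "groupMean": success rate per group, then averaged over groups.
group_mean <- mean(tapply(correct, truth, mean))   # (1 + 0.75)/2 = 0.875
```
</pre>
One error in a small group costs the group mean far more than it costs the overall rate, which is why the two measures drift apart with unequal group sizes.<br>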

<br>

Cheers<br>

Thibaut<br>

<br>

<br>

________________________________<br>

From: <a href="mailto:adegenet-forum-bounces@lists.r-forge.r-project.org">adegenet-forum-bounces@lists.r-forge.r-project.org</a> [<a href="mailto:adegenet-forum-bounces@lists.r-forge.r-project.org">adegenet-forum-bounces@lists.r-forge.r-project.org</a>] on

 behalf of Crameri Simon [<a href="mailto:simon.crameri@env.ethz.ch">simon.crameri@env.ethz.ch</a>]<br>

Sent: 17 July 2015 16:25<br>

To: &lt;<a href="mailto:adegenet-forum@lists.r-forge.r-project.org">adegenet-forum@lists.r-forge.r-project.org</a>&gt;<br>

Subject: [adegenet-forum] xvalDapc and group prediction accuracy<br>

<br>

Hi Thibaut<br>

<br>

I am still working with my tree species whose genotypes I'd like to model using DAPC, and I am still aiming to use the results as a forensic tool to identify species genetically. Therefore, the whole approach needs to be as reliable as possible. I tried xvalDapc()

 to perform DAPC cross-validation and found an optimal n.pca:<br>

<br>

&gt; table(data@pop)<br>

<br>

P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11<br>

11   5   5  16  10  15  34   4   4  11   4<br>

<br>

&gt; xval &lt;- xvalDapc(data@tab, pop(data), training.set = 0.5, result = "groupMean", n.pca = 10:20, n.rep = 1000)<br>

<br>

&gt; xval$`Mean Successful Assignment by Number of PCs of PCA`[as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)]<br>

      14<br>

0.9953977<br>

<br>

&gt; xval$'Number of PCs Achieving Lowest MSE'<br>

[1] "14"<br>

<br>

&gt; xval$DAPC$n.pca<br>

[1] 14<br>

<br>

<br>

It all works fine: the resulting best n.pca is still 14 when xvalDapc() is run multiple times with the same parameters, and even when training.set is changed to, say, 0.9. Now I use the validated model (xval$DAPC) to predict species membership of additional samples:<br>

<br>

&gt; predict(xval$DAPC, newdata=new.data)<br>

<br>

Again, it's all working perfectly, but what I don't fully understand is this:<br>

<br>

1) As it happens, I know the true group membership of the additional samples, so I can assess the prediction accuracy of xval$DAPC. It turns out that 96.8% (group mean!) of the additional samples are correctly predicted by xval$DAPC. Why does this number differ slightly from the expected 99.5%? Could it be due to the different group sizes in the full dataset (table(data@pop))?<br>

<br>

2) If the full dataset contains groups of very different size, some of which are fairly small: would it be more reliable to predict group membership of additional samples using the n.pca determined above and all 1000 training sets (which have approximately equal group size) as a reference, instead of using the full dataset (where group sizes differ) and just one prediction? The resulting 1000 prediction outcomes could then be screened for the group most often assigned to each new sample.<br>

<br>

<br>

Any opinions / ideas? Thanks in advance,<br>

<br>

Simon<br>

<br>

*************<br>

PhD student<br>

ETH Zurich<br>

Plant Ecological Genetics<br>

</blockquote>

</div>

<br>

</div>

</div>

</body>

</html>