<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=koi8-r">

<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>

</head>

<body ocsi="0" fpstyle="1" style="word-wrap:break-word">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">Hello,

<br>

<br>

it looks like you have reinvented xvalDapc.. maybe worth trying it? ;) <br>

<br>

?xvalDapc<br>

<br>

Cheers<br>

Thibaut<br>

<div><br>

<div style="font-family:Tahoma; font-size:13px">

<div class="BodyFragment"><font size="2"><span style="font-size:10pt">

<div class="PlainText">                <br>

==============================<br>

Dr Thibaut Jombart<br>

MRC Centre for Outbreak Analysis and Modelling<br>

Department of Infectious Disease Epidemiology<br>

Imperial College - School of Public Health<br>

Norfolk Place, London W2 1PG, UK<br>

Tel. : 0044 (0)20 7594 3658<br>

http://sites.google.com/site/thibautjombart/<br>

http://sites.google.com/site/therepiproject/<br>

http://adegenet.r-forge.r-project.org/<br>

Twitter: @thibautjombart<br>

<br>

<br>

</div>

</span></font></div>

</div>

</div>

<div style="font-family: Times New Roman; color: #000000; font-size: 16px">

<hr tabindex="-1">

<div style="direction: ltr;" id="divRpF714189"><font face="Tahoma" color="#000000" size="2"><b>From:</b> adegenet-forum-bounces@lists.r-forge.r-project.org [adegenet-forum-bounces@lists.r-forge.r-project.org] on behalf of Crameri Simon [simon.crameri@env.ethz.ch]<br>

<b>Sent:</b> 02 June 2015 13:03<br>

<b>To:</b> adegenet-forum@lists.r-forge.r-project.org<br>

<b>Subject:</b> [adegenet-forum] DAPC, number of retained PCs and number of saved LDFs<br>

</font><br>

</div>

<div></div>

<div>

<div>

<div style="word-wrap:break-word">Hi Thibaut<br>

<div><br>

</div>

<div>I have a genetic dataset of 125 individuals belonging to 11 different closely related plant species (saved in the @pop slot) and I would like to model the species-genotype relationship using dapc().</div>

<div><br>

</div>

<div>Of course one major issue is to find the best number of retained principal components during the PCA step of DAPC. This is my approach to find the best n.pca:</div>

<div><br>

</div>

<div>- create 100 permuted training sets of my complete dataset, each containing 50% of the samples (sampling stratified for @pop since I have groups that contain only very few individuals)</div>

<div>- do DAPC with all 100 training sets  and each time predict the species of the validation samples</div>

<div><br>

</div>

<div><font face="Courier">dapc.train <- dapc(training.set, n.pca = n.pca, n.da = n.pca)</font></div>

<div><font face="Courier">val <- predict(dapc.train, newdata = validation.set)</font></div>

<div><br>

</div>

<div>- look at the prediction successes, calculate mean overall prediction success over the 100 runs that used the identical n.pca</div>

<div>- do the steps above for say n.pca = 1:30</div>

<div>- select the optimal n.pca for my validated model according to the first local prediction success maximum (alternatively, take the global maximum)</div>

<div><br>

</div>

<div>I think this is a similar procedure to doing</div>

<div><br>

</div>

<div><font face="Courier">optim.a.score(dapc(complete.set, n.pca = 30, n.da = 30), smart = F, n.sim = 100, n.da = 30) </font></div>

<div><br>

</div>

<div>but the resulting best n.pca is somewhat larger  if I do it "by hand", and the resulting mean overall prediction successes are much larger than the respecitve mean a-scores.</div>

<div><br>

</div>

<div>Question 1)</div>

<div>Given these different results: where lies the difference between the two approaches (doing it "by hand" or using optim.a.score)? Does my approach make any sense?</div>

<div><br>

</div>

<div><br>

</div>

<div>In addition, I would like to compare the accuracy of different DAPC models using different datasets. I have a cpDNA dataset and a microsatellite dataset and would like to compare DAPC models that contain one, </div>

<div>the other or a combination of both datasets. To do this, I need to have the best n.pca for each case, and use the same procedure as described above. However, I observe that at </div>

<div>n.pca ™ 10, less than n.pca discriminant functions are saved in the case of the cpDNA dataset. This behaviour is associated with some of the training sets only, and causes problems when I want </div>

<div>to automatize the script for different n.pca. I think this has something to do with the proportion of conserved variance, which reaches >0.98 at n.pca ™ 10.</div>

<div><br>

</div>

<div>Question 2)</div>

<div>Why can't dapc() always save as many discriminant functions as there are available principal components (as indicated in the dapc argument n.da), and why is this is the case for some training sets only?</div>

<div><br>

</div>

<div>I sent you the the data and an R script that hopefully shows the problem. </div>

<div><br>

</div>

<div><br>

</div>

<div>With regards,</div>

<div>Simon</div>

</div>

<div style="word-wrap:break-word">

<div>

<div>

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word">

<div style="word-wrap:break-word"><br>

</div>

<div style="word-wrap:break-word">*********************************************</div>

Simon Crameri</div>

<div style="word-wrap:break-word"><br>

</div>

<div style="word-wrap:break-word">phD student</div>

<div style="word-wrap:break-word">ETH Zurich<br>

Plant Ecological Genetics</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

</div>

<br>

</div>

</div>

</div>

</body>

</html>