<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">
Hi Thibaut
<div><br>
</div>
<div>I am still working with my tree species whose genotypes I'd like to model using DAPC, and I am still aiming to use the results as a forensic tool to identify species genetically. Therefore, the whole approach needs to be as reliable as possible. I tried
<font face="Courier">xvalDapc() </font>to perform DAPC cross-validation and found an optimal n.pca:</div>
<div><font face="Courier"><br>
</font></div>
<div>
<div><font face="Courier">> table(data@pop)</font></div>
<div><font face="Courier"><br>
</font></div>
<div><font face="Courier">P01 P02 P03 P04 P05 P06 P07 P08 P09 P10 P11 </font></div>
<div><font face="Courier"> 11 5 5 16 10 15 34 4 4 11 4 </font></div>
</div>
<div><br>
</div>
<div><font face="Courier">> xval <- xvalDapc(data@tab, pop(data), training.set = 0.5, result = "groupMean", n.pca = 10:20, n.rep = 1000)</font></div>
<div><font face="Courier"><br>
</font></div>
<div><font face="Courier">> xval$`Mean Successful Assignment by Number of PCs of PCA`[as.numeric(xval$`Number of PCs Achieving Highest Mean Success`)]</font></div>
<div><font face="Courier">
<div> 14 </div>
<div>0.9953977 </div>
</font></div>
<div><font face="Courier">
<div><br>
</div>
<div>> xval$'Number of PCs Achieving Lowest MSE'</div>
<div>[1] "14"</div>
<div><br>
</div>
</font></div>
<div><font face="Courier">> xval$DAPC$n.pca</font></div>
<div><span style="font-family: Courier; ">[1] 14</span></div>
<div><br>
</div>
<div><br>
</div>
<div>It all works fine: the best n.pca remains 14 when <font face="Courier">xvalDapc()</font> is rerun several times with the same parameters, and even when
<font face="Courier">training.set</font> is changed to, say, 0.9. Now I use the validated model (<font face="Courier">xval$DAPC</font>) to predict species membership of additional samples:</div>
<div><br>
</div>
<div><font face="Courier">> predict(xval$DAPC, newdata=new.data)</font></div>
<div><br>
</div>
<div>Again, it's all working perfectly, but what I don't fully understand is this: </div>
<div><br>
</div>
<div>1) As it happens, I know the true group membership of the additional samples, so I can assess the prediction accuracy of <font face="Courier">xval$DAPC</font>. It turns out that 96.8% (group mean!) of the additional samples are correctly predicted by <font face="Courier">xval$DAPC</font>. Why is this
number slightly lower than the expected 99.5%? Could it be due to the different group sizes present in the full dataset (<font face="Courier">table(data@pop)</font>)?</div>
<div><br>
</div>
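<div>To make the comparison in 1) concrete, here is a minimal sketch of how I compute the group mean for the additional samples. It uses base R only; <font face="Courier">group_mean_success()</font> and the toy labels are made up for illustration, and in practice <font face="Courier">predicted</font> would be the <font face="Courier">assign</font> element returned by <font face="Courier">predict()</font>:</div>
<pre>
```r
# Hypothetical helper (not part of adegenet): per-group proportion of
# correct assignments, then the unweighted mean over groups.
group_mean_success <- function(truth, predicted) {
  truth <- factor(truth)
  correct <- as.character(predicted) == as.character(truth)
  per_group <- tapply(correct, truth, mean)
  list(per_group  = per_group,
       group_mean = mean(per_group),  # every group counts equally
       overall    = mean(correct))    # every sample counts equally
}

# Toy labels, invented for illustration only
truth     <- c("P01", "P01", "P01", "P01", "P02", "P02")
predicted <- c("P01", "P01", "P01", "P01", "P02", "P01")

res <- group_mean_success(truth, predicted)
print(res)
```
</pre>
<div>In a group of only 4 samples, a single misassignment lowers that group's success by 0.25, so the group mean reacts strongly to errors in the small groups.</div>
<div><br>
</div>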
<div>2) If the full dataset contains groups of very different sizes, some of which are fairly small: would it be more reliable to predict group membership of additional samples using the n.pca determined above and all 1000 training sets (which have approximately
equal group size) as a reference, instead of using the full dataset (where group sizes differ) and just one prediction? The resulting 1000 prediction outcomes could then be screened for the group most often assigned to each new sample.</div>
<div><br>
</div>
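<div>A rough sketch of what I have in mind for 2), in base R only. <font face="Courier">balanced_sample()</font> and <font face="Courier">majority_vote()</font> are hypothetical helpers, and the prediction matrix is invented. In the real procedure each column would come from fitting <font face="Courier">dapc()</font> on one training set with n.pca = 14 and calling <font face="Courier">predict()</font> on the new samples:</div>
<pre>
```r
set.seed(1)

# Hypothetical helper: draw up to n_per_group individuals from each group,
# so that training sets have (approximately) equal group sizes.
balanced_sample <- function(grp, n_per_group) {
  idx_by_group <- split(seq_along(grp), grp)
  unlist(lapply(idx_by_group, function(i) {
    i[sample.int(length(i), min(n_per_group, length(i)))]
  }), use.names = FALSE)
}

# Hypothetical helper: pred is a character matrix with one row per new
# sample and one column per replicate. Returns the most frequent group and
# the share of replicates supporting it (ties go to the first group name).
majority_vote <- function(pred) {
  top <- apply(pred, 1, function(p) names(which.max(table(p))))
  support <- vapply(seq_len(nrow(pred)),
                    function(i) mean(pred[i, ] == top[i]), numeric(1))
  data.frame(group = top, support = support, row.names = rownames(pred))
}

# Toy group labels, invented for illustration
grp   <- factor(rep(c("P01", "P02", "P03"), times = c(11, 5, 4)))
train <- balanced_sample(grp, n_per_group = 4)
print(table(grp[train]))

# Invented predictions for two new samples over four replicates
pred <- rbind(new1 = c("P01", "P01", "P02", "P01"),
              new2 = c("P03", "P03", "P03", "P03"))
votes <- majority_vote(pred)
print(votes)
```
</pre>
<div>The support column would show how consistently each new sample is assigned across replicates, which a single prediction from the full dataset cannot show.</div>
<div><br>
</div>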
<div><br>
</div>
<div>Any opinions / ideas? Thanks in advance,</div>
<div><br>
</div>
<div>Simon</div>
<div><br>
</div>
<div>*************</div>
<div>PhD student</div>
<div>ETH Zurich</div>
<div>Plant Ecological Genetics</div>
</body>
</html>