<div dir="ltr"><div class="gmail_default" style="color:rgb(0,0,255)"><font size="1">P.S. I just tried to run find.clusters while retaining only 10 PCs, and got a very strange result: it favours a very large number of groups. This seems to be resolving too much variation, so it seems best to stick with the 40 PCs suggested by xval.</font></div><div class="gmail_default" style="color:rgb(0,0,255)"><font size="1"><br></font></div><div class="gmail_default" style="color:rgb(0,0,255)"><div class="gmail_default"><font size="1">> NumClust <- find.clusters(data_full, max.n.clust=100)</font></div><div class="gmail_default"><font size="1">Choose the number PCs to retain (>=1): 10</font></div><div class="gmail_default"><font size="1">Choose the number of clusters (>=2: 25</font></div><div class="gmail_default"><font size="1">> head(NumClust$Kstat, 30)</font></div><div class="gmail_default"><font size="1"> K=1 K=2 K=3 K=4 K=5 K=6 K=7 K=8 K=9 K=10 K=11 K=12 </font></div><div class="gmail_default"><font size="1">864.7344 810.3223 729.0304 669.8737 619.2427 573.9809 544.6057 481.2244 473.3314 468.2758 434.5868 429.4302 </font></div><div class="gmail_default"><font size="1"> K=13 K=14 K=15 K=16 K=17 K=18 K=19 K=20 K=21 K=22 K=23 K=24 </font></div><div class="gmail_default"><font size="1">424.6336 423.1422 413.2484 414.3202 410.0086 407.0822 411.1878 408.5134 418.5212 413.8698 411.2578 401.9535 </font></div><div class="gmail_default"><font size="1"> K=25 K=26 K=27 K=28 K=29 K=30 </font></div><div class="gmail_default"><font size="1">413.3403 405.8296 417.6782 403.9047 407.2553 406.8078 </font></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Oct 21, 2015 at 10:36 AM, Ella Bowles <span dir="ltr"><<a href="mailto:ebowles@ucalgary.ca" target="_blank">ebowles@ucalgary.ca</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_default" 
style="font-size:large;color:rgb(0,0,255)">Many thanks for this. Couple quick questions in follow-up.</div><div class="gmail_extra"><br><div class="gmail_quote"><span class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div style="direction:ltr;font-family:Tahoma;color:rgb(0,0,0);font-size:10pt"><br>
<br>
#2 if you have clusters defined already this graph may not be very useful; it just compares the previous cluster definition to the one from K-means<br></div></div></blockquote><div><br></div></span><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline">>>I have populations identified using the "pop" option. But I don't have clusters identified per se. If this is the case, does my plot look okay?</div> </div><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255)"><span style="font-family:'Times New Roman',serif;font-size:10pt"> </span></div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255)"><img src="cid:ii_150868359321bbb7" alt="Inline image 1" width="280" height="280"></div><br></div><span class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div style="direction:ltr;font-family:Tahoma;color:rgb(0,0,0);font-size:10pt">
<br>
#3 ?scatter.dapc -> argument 'col', which you are using already<br></div></div></blockquote></span><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline">>>I should have been more clear here. I don't know which population is being represented by which colour, and would ideally like to know this so that I can see how they are being grouped. Is there a function that I can use to ask for this information? Do the numbers that NumClust$grp give me represent the clusters that the individuals are being assigned to? If this is the case, then this question is answered. </div></div><span class=""><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline"><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div style="direction:ltr;font-family:Tahoma;color:rgb(0,0,0);font-size:10pt">
#4 there are K-1 discriminant functions, so '300' will just retain K-1<br>
<br></div></div></blockquote></span><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline">>>is 300 a good number though? I just don't know how to know if I'm making a good choice.</div></div><span class=""><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline"></div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div style="direction:ltr;font-family:Tahoma;color:rgb(0,0,0);font-size:10pt">
#5 if in doubt, use Xval - more advanced and easier to interpret; in your case your data are very well separated in just a few dimensions; 10 PCs should do the trick<br></div></div></blockquote><div><br></div></span><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline">>>So I should use 10 even though xval says 40?</div></div><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline"><br></div></div><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline">Thank you again,</div></div><div><div class="gmail_default" style="font-size:large;color:rgb(0,0,255);display:inline">Ella</div> </div><div><div class="h5"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div style="direction:ltr;font-family:Tahoma;color:rgb(0,0,0);font-size:10pt"><div><div style="font-family:Tahoma;font-size:13px"><div><font size="2"><span style="font-size:10pt"><div>
</div>
</span></font></div>
</div>
</div>
<div style="font-family:'Times New Roman';color:rgb(0,0,0);font-size:16px">
<hr>
<div style="direction:ltr"><font color="#000000" face="Tahoma" size="2"><b>From:</b> <a href="mailto:adegenet-forum-bounces@lists.r-forge.r-project.org" target="_blank">adegenet-forum-bounces@lists.r-forge.r-project.org</a> [<a href="mailto:adegenet-forum-bounces@lists.r-forge.r-project.org" target="_blank">adegenet-forum-bounces@lists.r-forge.r-project.org</a>] on behalf of Ella Bowles [<a href="mailto:ebowles@ucalgary.ca" target="_blank">ebowles@ucalgary.ca</a>]<br>
<b>Sent:</b> 20 October 2015 19:45<br>
<b>To:</b> <a href="mailto:adegenet-forum@lists.r-forge.r-project.org" target="_blank">adegenet-forum@lists.r-forge.r-project.org</a><br>
<b>Subject:</b> Re: [adegenet-forum] a.score versus cross validation and number of discriminant functions to retain<br>
</font><br>
</div><div><div>
<div></div>
<div>
<div dir="ltr">
<div style="font-size:large;color:rgb(0,0,255)">P.S. Also, which function do I use to get numeric values for the percentage of variation explained by the two principal components shown on the scatter plot?</div>
<div style="font-size:large;color:rgb(0,0,255)"><br>
</div>
<div style="font-size:large;color:rgb(0,0,255)">with thanks</div>
</div>
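<div>On the percentage question above, a minimal sketch, assuming the eigenvalues are available as a plain numeric vector. A fitted adegenet dapc object stores the discriminant-analysis eigenvalues in its <code>$eig</code> slot; the numbers below are made up for illustration and are not from this dataset. Note that the axes drawn by scatter() for a DAPC are discriminant functions rather than principal components.</div>
<pre>
```r
# Hypothetical eigenvalues, standing in for dapc1$eig from a fitted DAPC.
eig <- c(850.2, 410.7, 120.3, 60.1, 25.4, 10.9, 4.2, 1.6)

# Percentage of the discriminating variance carried by each axis.
pct <- 100 * eig / sum(eig)

# The first two entries correspond to the two axes drawn by scatter().
print(round(pct[1:2], 1))
```
</pre>
<div>The same arithmetic applied to the PCA eigenvalues would give the share of total variance per retained PC instead.</div>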
<div class="gmail_extra"><br>
<div class="gmail_quote">On Tue, Oct 20, 2015 at 12:40 PM, Ella Bowles <span dir="ltr">
<<a href="mailto:ebowles@ucalgary.ca" target="_blank">ebowles@ucalgary.ca</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div dir="ltr">
<div>
<p class="MsoNormal" style="color:rgb(0,0,255);font-size:large;margin-bottom:0.0001pt">
<span style="font-size:10pt;font-family:'Times New Roman',serif">Hello,</span></p>
<p class="MsoNormal" style="color:rgb(0,0,255);font-size:large;margin-bottom:0.0001pt">
<span style="font-family:'Times New Roman',serif;font-size:10pt"><br>
</span></p>
<p class="MsoNormal" style="color:rgb(0,0,255);font-size:large;margin-bottom:0.0001pt">
<span style="font-family:'Times New Roman',serif;font-size:10pt">I think I have worked my way through a DAPC analysis, and it's pretty neat. I have five questions though. </span><span style="font-family:'Times New Roman',serif;font-size:10pt">By way of background,
I am using a SNP dataset of 4099 SNPs with 11 putative populations (clusters). I've converted a STRUCTURE file to a genind object, and am using that.</span></p>
<p class="MsoNormal" style="color:rgb(0,0,255);font-size:large;margin-bottom:0.0001pt">
<br>
</p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><font color="#0000ff" face="Times New Roman, serif"><span style="font-size:13.3333px">1) Am I correct in understanding that the number of clusters you find should inform the number of colours that you list
for your DAPC plot?</span></font></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-family:'Times New Roman',serif;font-size:10pt;color:rgb(0,0,255)"><br>
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-family:'Times New Roman',serif;font-size:10pt;color:rgb(0,0,255)">2) I'm not quite sure how to interpret the following. How do I know if the fit is good?</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-family:'Times New Roman',serif;font-size:10pt;color:rgb(0,0,255)"> </span></p>
<p class="MsoNormal" style="color:rgb(0,0,255);font-size:large;margin-bottom:0.0001pt">
<img src="cid:ii_150868359321bbb7" alt="Inline image 1" height="280" width="280"><br>
</p>
</div>
<div><br>
</div>
<div>
<div style="font-size:large;color:rgb(0,0,255)">3 and 4) Is there a function that I can use to match the colours to my original populations? I do have this information in the data file that I fed in. And does 300 sound reasonable
for the number of discriminant functions to retain?</div>
<div>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><font color="#0000ff" face="Times New Roman, serif"><span style="font-size:13.3333px">> dapc1 <- dapc(data_full, NumClust$grp)</span></font></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><font color="#0000ff" face="Times New Roman, serif"><span style="font-size:13.3333px">Choose the number PCs to retain (>=1): 40</span></font></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><font color="#0000ff" face="Times New Roman, serif"><span style="font-size:13.3333px">Choose the number discriminant functions to retain (>=1): 300</span></font></p>
</div>
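<div>A discriminant analysis on K groups has at most K - 1 discriminant functions (and no more than the number of retained PCs), so a request for 300 is simply capped. A sketch of that arithmetic, using the 9 clusters mentioned in this message; the helper name <code>n_da_kept</code> is made up for illustration and is not an adegenet function.</div>
<pre>
```r
# Hypothetical helper (not part of adegenet): how many discriminant
# functions can actually be kept for k groups and n_pca retained PCs.
n_da_kept <- function(n_da_requested, k, n_pca) {
  min(n_da_requested, k - 1, n_pca)
}

print(n_da_kept(300, k = 9, n_pca = 40))  # 8
```
</pre>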
<div style="font-size:large;color:rgb(0,0,255)">
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">#making colours for 9 clusters, since optimal k was 9 with the data containing zeros</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">myCol <- c("red", "orange", "yellow", "green", "blue", "purple", "violet", "grey", "brown")</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">scatter(dapc1, scree.da=FALSE, bg="white", pch=20, cell=0, cstar=0, col=myCol, solid=.4, cex=1, clab=0, leg=TRUE, txt.leg=paste("Cluster",
1:9))</span></p>
</div>
<div style="font-size:large;color:rgb(0,0,255)"><img src="cid:ii_1508685356bc662a" alt="Inline image 2" height="280" width="280"></div>
<div style="font-size:large;color:rgb(0,0,255)"></div>
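<div>On question 3, one way to see how inferred clusters line up with the original populations is a plain cross-tabulation. The sketch below uses made-up labels for six individuals; with the objects in this message the two vectors would presumably be pop(data_full) and NumClust$grp.</div>
<pre>
```r
# Made-up labels for six individuals; in practice these would be
# pop(data_full) and NumClust$grp.
original_pop <- factor(c("A", "A", "B", "B", "C", "C"))
cluster      <- factor(c(1, 1, 2, 2, 2, 3))

# Rows: original populations; columns: inferred clusters.
tab <- table(original_pop, cluster)
print(tab)
```
</pre>
<div>Each row then shows how the individuals of one population are spread across clusters, which can be read against the "Cluster 1..9" legend of the scatter plot.</div>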
<div style="font-size:large;color:rgb(0,0,255)">5) I don't really understand the difference between the optim.a.score and the cross-validation analyses. Both seem to determine the best number of PCs to retain. However, they
give very different results. Am I misunderstanding what they are?</div>
<div style="font-size:large;color:rgb(0,0,255)">
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">#for "data_full" dataset</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">dapc2 <- dapc(data_full, n.da=300, n.pca=50)</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">temp <- optim.a.score(dapc2)</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">#graph shows that highest alpha seems to be 8</span></p>
</div>
<div style="font-size:large;color:rgb(0,0,255)"><img src="cid:ii_15086881f2cd8715" alt="Inline image 3" height="280" width="280"></div>
<div style="font-size:large;color:rgb(0,0,255)"><span style="font-family:'Times New Roman',serif;font-size:10pt;color:rgb(34,34,34)">#cross-validation for number of PCs to retain –can only do using data_full (this is called “mat” here),
couldn’t get it to work using data with zeros</span></div>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">mat <- scaleGen(data, NA.method="mean")</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">grp <- pop(data)</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">xval <- xvalDapc(mat, grp, n.pca.max = 100, training.set = 0.9, result = "groupMean", center = TRUE, scale = FALSE, n.pca = NULL, n.rep = 30,
xval.plot = TRUE)</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">xval[2:6]</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<div style="font-size:large;color:rgb(0,0,255)"><span style="font-family:'Times New Roman',serif;font-size:10pt;color:rgb(34,34,34)">#results</span></div>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">$`Median and Confidence Interval for Random Chance`</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> 2.5% 50% 97.5%
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">0.05659207 0.09212947 0.14164194
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">$`Mean Successful Assignment by Number of PCs of PCA`</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> 10 20 30 40 50 60 70 80 90
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">0.8409091 0.8348485 0.8439394 0.8530303 0.8136364 0.8227273 0.8000000 0.8075758 0.8075758
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">$`Number of PCs Achieving Highest Mean Success`</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">[1] "40"</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">$`Root Mean Squared Error by Number of PCs of PCA`</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> 10 20 30 40 50 60 70 80 90
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">0.1702777 0.1770200 0.1649359 0.1607061 0.2007218 0.1864929 0.2138458 0.2051338 0.2074707
</span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif"> </span></p>
<p class="MsoNormal" style="margin-bottom:0.0001pt"><span style="font-size:10pt;font-family:'Times New Roman',serif">$`Number of PCs Achieving Lowest MSE`</span></p>
<div style="font-size:large;color:rgb(0,0,255)"><span style="font-family:'Times New Roman',serif;font-size:10pt;color:rgb(34,34,34)">[1] "40"</span></div>
<div style="font-size:large;color:rgb(0,0,255)"><img src="cid:ii_15086894cc94945b" alt="Inline image 4" height="291" width="343"></div>
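<div>The two "best" values reported above are just the position of the maximum mean success and of the minimum RMSE. A base-R sketch with the numbers copied from this message, which also shows how small the gap between 10 and 40 PCs is; the variable names are made up for illustration.</div>
<pre>
```r
# Values copied from the xvalDapc output above.
n_pcs <- seq(10, 90, by = 10)
mean_success <- c(0.8409091, 0.8348485, 0.8439394, 0.8530303, 0.8136364,
                  0.8227273, 0.8000000, 0.8075758, 0.8075758)
rmse <- c(0.1702777, 0.1770200, 0.1649359, 0.1607061, 0.2007218,
          0.1864929, 0.2138458, 0.2051338, 0.2074707)

best_by_success <- n_pcs[which.max(mean_success)]  # 40
best_by_rmse    <- n_pcs[which.min(rmse)]          # 40

# Drop in mean assignment success when keeping 10 PCs instead of 40
# (about 0.012, i.e. roughly one percentage point).
loss_at_10 <- max(mean_success) - mean_success[n_pcs == 10]
print(c(best_by_success, best_by_rmse, loss_at_10))
```
</pre>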
<br>
</div>
<div>
<div style="font-size:large;color:rgb(0,0,255)">Thank you very much for your time, and sincerely,</div>
<div style="font-size:large;color:rgb(0,0,255)">Ella Bowles</div>
<span><font color="#888888"><br>
</font></span></div>
<span><font color="#888888">-- <br>
<div>
<div dir="ltr">
<div>Ella Bowles<br>
PhD Candidate </div>
<div>Biological Sciences</div>
<div>University of Calgary<br>
<br>
e-mail: <a href="mailto:ebowles@ucalgary.ca" target="_blank">ebowles@ucalgary.ca</a>,
<a href="mailto:bowlese@gmail.com" target="_blank">bowlese@gmail.com</a></div>
<div>website: <a href="http://ellabowlesphd.wordpress.com/" rel="nofollow me" style="color:rgb(59,89,152);font-family:'lucida grande',tahoma,verdana,arial,sans-serif;font-size:11.2px;line-height:17px" target="_blank">http://<span style="display:inline-block"></span>ellabowlesphd.wordpre<span style="display:inline-block"></span>ss.com/</a></div>
</div>
</div>
</font></span></div>
</blockquote>
</div>
</div>
</div>
</div></div></div>
</div>
</div>
</blockquote></div></div></div><div><div class="h5">
</div></div></div></div>
</blockquote></div>
</div>