[adegenet-forum] a.score versus cross validation and number of discriminant functions to retain

Ella Bowles ebowles at ucalgary.ca
Wed Oct 21 18:52:10 CEST 2015


PS: I just tried to run find.clusters while retaining only 10 PCs, and got a
very strange result: it calls a very large number of groups as best. This
seems to resolve too much variation, so it seems best to stick with the 40
PCs suggested by xval.

 NumClust <- find.clusters(data_full, max.n.clust=100)
Choose the number PCs to retain (>=1): 10
Choose the number of clusters (>=2: 25
> head(NumClust$Kstat, 30)
     K=1      K=2      K=3      K=4      K=5      K=6
864.7344 810.3223 729.0304 669.8737 619.2427 573.9809
     K=7      K=8      K=9     K=10     K=11     K=12
544.6057 481.2244 473.3314 468.2758 434.5868 429.4302
    K=13     K=14     K=15     K=16     K=17     K=18
424.6336 423.1422 413.2484 414.3202 410.0086 407.0822
    K=19     K=20     K=21     K=22     K=23     K=24
411.1878 408.5134 418.5212 413.8698 411.2578 401.9535
    K=25     K=26     K=27     K=28     K=29     K=30
413.3403 405.8296 417.6782 403.9047 407.2553 406.8078
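
For reference, the Kstat values above can be inspected directly rather than read off at the prompt. The sketch below retypes the 30 values printed by head(NumClust$Kstat, 30). It assumes find.clusters was run with its default statistic (BIC), since the call shows no stat argument. The 1% cut-off used for the "elbow" is an arbitrary illustration, not an adegenet rule.

```r
# Values retyped from the head(NumClust$Kstat, 30) output above
# (assumed to be BIC, the default statistic of find.clusters)
bic <- c(864.7344, 810.3223, 729.0304, 669.8737, 619.2427, 573.9809,
         544.6057, 481.2244, 473.3314, 468.2758, 434.5868, 429.4302,
         424.6336, 423.1422, 413.2484, 414.3202, 410.0086, 407.0822,
         411.1878, 408.5134, 418.5212, 413.8698, 411.2578, 401.9535,
         413.3403, 405.8296, 417.6782, 403.9047, 407.2553, 406.8078)
names(bic) <- paste0("K=", seq_along(bic))

# K with the lowest value among the 30 shown
best_k <- which.min(bic)

# Decrease in the statistic gained by adding one more cluster
# (element i is the gain from going from K=i to K=i+1)
gain <- -diff(bic)

# Rough elbow: first K after which one more cluster improves the statistic
# by less than 1% of the K=1 value (arbitrary threshold, for illustration)
elbow_k <- unname(which(gain < 0.01 * bic[1])[1])

print(best_k)
print(round(gain, 2))
print(elbow_k)
```

On these numbers the curve is nearly flat beyond roughly K=15, so the position of the minimum alone is probably a weak guide to the number of groups.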

On Wed, Oct 21, 2015 at 10:36 AM, Ella Bowles <ebowles at ucalgary.ca> wrote:

> Many thanks for this. A couple of quick questions in follow-up.
>
>> #2 if you have clusters defined already this graph may not be very
>> useful; it just compares previous cluster definition to Kmean's
>
> I have populations identified using the "pop" option, but I don't have
> clusters identified per se. If this is the case, does my plot look okay?
>
>> [image: Inline image 1]
>
>> #3 ?scatter.dapc -> argument 'col', which you are using already
>
> I should have been clearer here. I don't know which population is
> represented by which colour, and would ideally like to know this so
> that I can see how they are being grouped. Is there a function that I can
> use to ask for this information? Do the numbers that NumClust$grp gives me
> represent the clusters that the individuals are being assigned to? If so,
> then this question is answered.
>
>> #4 there are K-1 discriminant functions, so '300' will just retain K-1
>
> Is 300 a good number though? I just don't know how to tell whether I'm
> making a good choice.
>
>> #5 if in doubt, use Xval - more advanced and easier to interpret; in your
>> case your data are very well separated in just a few dimensions; 10 PCs
>> should do the trick
>
> So I should use 10 even though xval says 40?
>
> Thank you again,
> Ella
>
>
>> ------------------------------
>> *From:* adegenet-forum-bounces at lists.r-forge.r-project.org [
>> adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Ella
>> Bowles [ebowles at ucalgary.ca]
>> *Sent:* 20 October 2015 19:45
>> *To:* adegenet-forum at lists.r-forge.r-project.org
>> *Subject:* Re: [adegenet-forum] a.score versus cross validation and
>> number of discriminant functions to retain
>>
>> PS: Also, which function do I use to get numeric values for the percentage
>> of variation explained by the two principal components that are
>> shown on the scatter plot?
>>
>> with thanks
>>
>> On Tue, Oct 20, 2015 at 12:40 PM, Ella Bowles <ebowles at ucalgary.ca>
>> wrote:
>>
>>> Hello,
>>>
>>>
>>> I think I have worked my way through a DAPC analysis, and it's pretty
>>> neat. I have five questions though. By way of background, I am using a
>>> SNP dataset with 11 putative populations (clusters), containing 4099 SNPs.
>>> I've converted a STRUCTURE file to a genind object, and am using that.
>>>
>>>
>>> 1) Am I correct in understanding that the number of clusters you find
>>> should inform the number of colours that you list for your DAPC plot?
>>>
>>>
>>> 2) I'm not quite sure how to interpret the following. How do I know if
>>> the fit is good?
>>>
>>>
>>>
>>> [image: Inline image 1]
>>>
>>> 3 and 4) Is there a function that I can use to match the colours
>>> to my original populations? I do have this information in the data file
>>> that I fed in. And does 300 sound reasonable for the number of
>>> discriminant functions to retain?
>>>
>>> > dapc1 <- dapc(data_full, NumClust$grp)
>>>
>>> Choose the number PCs to retain (>=1): 40
>>>
>>> Choose the number discriminant functions to retain (>=1): 300
>>>
>>> #making colours for 9 clusters, since optimal k was 9 with the data
>>> containing zeros
>>>
>>> myCol <- c("red", "orange", "yellow", "green", "blue", "purple",
>>> "violet", "grey", "brown")
>>>
>>> scatter(dapc1, scree.da=FALSE, bg="white", pch=20, cell=0, cstar=0,
>>> col=myCol, solid=.4, cex=1, clab=0, leg=TRUE, txt.leg=paste("Cluster", 1:9))
>>> [image: Inline image 2]
>>>
>>> 5) I don't really understand the difference between the optim.a.score
>>> and the cross-validation analyses. Both seem to determine the
>>> best number of PCs to retain. However, they give very different results. Am
>>> I misunderstanding what they are?
>>>
>>> #for "data_full" dataset
>>>
>>> dapc2 <- dapc(data_full, n.da=300, n.pca=50)
>>>
>>>
>>>
>>> temp <- optim.a.score(dapc2)
>>>
>>>
>>>
>>> #graph shows that highest alpha seems to be 8
>>> [image: Inline image 3]
>>> #cross-validation for number of PCs to retain; can only do this using
>>> data_full (this is called “mat” here), couldn’t get it to work using data
>>> with zeros
>>>
>>> mat <- scaleGen(data, NA.method="mean")
>>>
>>> grp <- pop(data)
>>>
>>>
>>>
>>>
>>>
>>> xval <- xvalDapc(mat, grp, n.pca.max = 100, training.set = 0.9, result =
>>> "groupMean", center = TRUE, scale = FALSE, n.pca = NULL, n.rep = 30,
>>> xval.plot = TRUE)
>>>
>>>
>>>
>>> xval[2:6]
>>>
>>>
>>> #results
>>>
>>> $`Median and Confidence Interval for Random Chance`
>>>
>>>       2.5%        50%      97.5%
>>>
>>> 0.05659207 0.09212947 0.14164194
>>>
>>>
>>>
>>> $`Mean Successful Assignment by Number of PCs of PCA`
>>>
>>>        10        20        30        40        50        60        70        80        90
>>> 0.8409091 0.8348485 0.8439394 0.8530303 0.8136364 0.8227273 0.8000000 0.8075758 0.8075758
>>>
>>>
>>>
>>> $`Number of PCs Achieving Highest Mean Success`
>>>
>>> [1] "40"
>>>
>>>
>>>
>>> $`Root Mean Squared Error by Number of PCs of PCA`
>>>
>>>        10        20        30        40        50        60        70        80        90
>>> 0.1702777 0.1770200 0.1649359 0.1607061 0.2007218 0.1864929 0.2138458 0.2051338 0.2074707
>>>
>>>
>>>
>>> $`Number of PCs Achieving Lowest MSE`
>>> [1] "40"
>>> [image: Inline image 4]
>>>
>>> Thank you very much for your time, and sincerely,
>>> Ella Bowles
>>>
>>> --
>>> Ella Bowles
>>> PhD Candidate
>>> Biological Sciences
>>> University of Calgary
>>>
>>> e-mail: ebowles at ucalgary.ca, bowlese at gmail.com
>>> website: http://ellabowlesphd.wordpress.com/
>>>

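On the "10 PCs versus 40 PCs" question, one way to look at it is to put the cross-validation figures quoted above side by side. The sketch below retypes the mean success, RMSE and random-chance interval from the xval output in this thread; the data frame and its column names are only illustrative, not part of what xvalDapc returns.

```r
# Figures retyped from the xval[2:6] output quoted above
n_pca        <- seq(10, 90, by = 10)
mean_success <- c(0.8409091, 0.8348485, 0.8439394, 0.8530303, 0.8136364,
                  0.8227273, 0.8000000, 0.8075758, 0.8075758)
rmse         <- c(0.1702777, 0.1770200, 0.1649359, 0.1607061, 0.2007218,
                  0.1864929, 0.2138458, 0.2051338, 0.2074707)
chance_ci    <- c("2.5%" = 0.05659207, "50%" = 0.09212947, "97.5%" = 0.14164194)

# Illustrative summary table (column names are made up for this sketch)
res <- data.frame(n_pca, mean_success, rmse)
res$gap_to_best <- max(res$mean_success) - res$mean_success

# Number of PCs with the lowest RMSE and with the highest mean success
best_rmse    <- res$n_pca[which.min(res$rmse)]
best_success <- res$n_pca[which.max(res$mean_success)]

# Is every setting above the upper bound of the random-chance interval?
all_above_chance <- all(res$mean_success > chance_ci["97.5%"])

print(res)
print(c(best_rmse = best_rmse, best_success = best_success))
print(all_above_chance)
```

Read this way, 10 PCs and 40 PCs differ by about 0.012 in mean assignment success, which is small next to the RMSE of roughly 0.16 to 0.17 at both settings, so either choice looks defensible on these figures.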

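On matching colours to populations, a plain cross-tabulation of the original population labels against the inferred clusters is one option. The sketch below uses made-up labels in place of pop(data_full) and NumClust$grp so that it runs without the dataset; the variable names and the three-colour palette are hypothetical.

```r
# Hypothetical stand-ins: in the thread these would be
# pop(data_full) and NumClust$grp respectively
pop_labels <- factor(c("A", "A", "A", "B", "B", "B", "C", "C", "C"))
cluster    <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3))

# Rows are original populations, columns are inferred clusters
tab <- table(population = pop_labels, cluster = cluster)
print(tab)

# Cluster holding the most individuals of each population
dominant <- apply(tab, 1, function(r) colnames(tab)[which.max(r)])
print(dominant)

# The colour vector passed to scatter() is matched to clusters in the
# order of the cluster factor levels, so this table acts as a legend key
myCol <- c("red", "orange", "yellow")
key <- data.frame(cluster = levels(cluster), colour = myCol)
print(key)
```

With the real objects, the same table could also be drawn with adegenet's table.value(), as is done in the DAPC tutorial.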

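On the number of discriminant functions and the percentages for the scatter plot axes: the axes of a DAPC scatter plot are discriminant functions rather than principal components, and at most K-1 of them exist. The sketch below assumes the cap is the smaller of K-1 and the number of retained PCs, and uses made-up eigenvalues; with a fitted object the eigenvalues would come from dapc1$eig.

```r
# Settings taken from the thread: 9 clusters, 40 PCs, 300 functions requested
k              <- 9
n_pca          <- 40
n_da_requested <- 300

# Assumed cap: no more than K-1 functions, and no more than the retained PCs
n_da_kept <- min(n_da_requested, k - 1, n_pca)

# Made-up eigenvalues for illustration only; use dapc1$eig for real values
eig <- c(400, 250, 120, 90, 60, 40, 25, 15)

# Percentage of the discriminant (between-group) variation per function
pct <- 100 * eig / sum(eig)
names(pct) <- paste0("DF", seq_along(pct))

print(n_da_kept)
print(round(pct, 1))
print(sum(pct[1:2]))   # share shown on a two-axis scatter plot
```

So asking for 300 is harmless but gives no more than 8 functions for 9 clusters; the choice that matters more is the number of PCs.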
-- 
Ella Bowles
PhD Candidate
Biological Sciences
University of Calgary

e-mail: ebowles at ucalgary.ca, bowlese at gmail.com
website: http://ellabowlesphd.wordpress.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 12190 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/7d3caa88/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 20171 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/7d3caa88/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 46303 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/7d3caa88/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 14492 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/7d3caa88/attachment-0007.png>

