[adegenet-forum] a.score versus cross validation and number of discriminant functions to retain

Wed Oct 21 20:05:18 CEST 2015

PS As a different way of looking at the DAPC plot, would it be possible to
plot the populations according to cluster, as they are in my plot, but
colour code by population (as assigned in the input file)?
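
A minimal sketch of one way to do this, assuming the `grp` argument of `scatter.dapc` is used to override the grouping for colouring only. The `nancycats` dataset shipped with adegenet stands in for `data_full`, and the values of `n.pca`, `n.clust` and `n.da` are placeholders rather than recommendations. The cross-table also shows which original populations fall into which inferred cluster.

```r
library(adegenet)

# nancycats ships with adegenet; it stands in for data_full here (assumption)
data(nancycats)
set.seed(1)  # k-means starting points are random

# Non-interactive clustering; n.pca and n.clust are placeholder values
clust <- find.clusters(nancycats, n.pca = 40, n.clust = 9)
dapc_k <- dapc(nancycats, clust$grp, n.pca = 40, n.da = 2)

# Rows: populations from the input file; columns: inferred clusters
pop_by_cluster <- table(population = pop(nancycats), cluster = clust$grp)
print(pop_by_cluster)

# Same DAPC coordinates (built on the clusters), but points are coloured
# by the original population instead of by cluster
pop_cols <- funky(nPop(nancycats))
scatter(dapc_k, grp = pop(nancycats), col = pop_cols,
        scree.da = FALSE, pch = 20, cell = 0, cstar = 0, clab = 0,
        legend = TRUE)
```

With `cell = 0`, `cstar = 0` and `clab = 0` nothing is drawn per group, so only the point colours change; the positions still come from the cluster-based DAPC.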

On Wed, Oct 21, 2015 at 10:52 AM, Ella Bowles <ebowles at ucalgary.ca> wrote:

> ps I just tried to run find.clusters while retaining only 10 PCs, and got a
> very strange result. It identifies a very large number of groups as best,
> which looks like it is resolving too much variation. It seems best to stick
> with the 40 PCs suggested by xval.
>
>  NumClust <- find.clusters(data_full, max.n.clust=100)
> Choose the number PCs to retain (>=1): 10
> Choose the number of clusters (>=2: 25
> > head(NumClust$Kstat, 30)
>      K=1      K=2      K=3      K=4      K=5      K=6
> 864.7344 810.3223 729.0304 669.8737 619.2427 573.9809
>      K=7      K=8      K=9     K=10     K=11     K=12
> 544.6057 481.2244 473.3314 468.2758 434.5868 429.4302
>     K=13     K=14     K=15     K=16     K=17     K=18
> 424.6336 423.1422 413.2484 414.3202 410.0086 407.0822
>     K=19     K=20     K=21     K=22     K=23     K=24
> 411.1878 408.5134 418.5212 413.8698 411.2578 401.9535
>     K=25     K=26     K=27     K=28     K=29     K=30
> 413.3403 405.8296 417.6782 403.9047 407.2553 406.8078
>
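
A rough sketch of how the Kstat values quoted above could be inspected, with the BIC values typed in by hand from that output. It locates the overall minimum and the first K after which BIC stops decreasing; the variable names are made up for illustration. The overall minimum is not necessarily the best K: when the curve flattens and then fluctuates, an elbow is often the more useful guide.

```r
# BIC values copied from the find.clusters output quoted above
bic <- c(864.7344, 810.3223, 729.0304, 669.8737, 619.2427, 573.9809,
         544.6057, 481.2244, 473.3314, 468.2758, 434.5868, 429.4302,
         424.6336, 423.1422, 413.2484, 414.3202, 410.0086, 407.0822,
         411.1878, 408.5134, 418.5212, 413.8698, 411.2578, 401.9535,
         413.3403, 405.8296, 417.6782, 403.9047, 407.2553, 406.8078)
names(bic) <- paste0("K=", seq_along(bic))

# K with the lowest BIC overall
global_min <- names(bic)[which.min(bic)]

# First K after which BIC goes up again (first local minimum)
first_local_min <- names(bic)[which(diff(bic) > 0)[1]]

# Size of the improvement gained by each additional cluster
bic_drop <- -diff(bic)

print(global_min)
print(first_local_min)
print(round(bic_drop, 1))

# Visual check for an elbow
plot(seq_along(bic), bic, type = "b", xlab = "Number of clusters (K)",
     ylab = "BIC")
```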
> On Wed, Oct 21, 2015 at 10:36 AM, Ella Bowles <ebowles at ucalgary.ca> wrote:
>
>> Many thanks for this. Couple quick questions in follow-up.
>>
>>
>>>
>>> #2 if you have clusters defined already this graph may not be very
>>> useful; it just compares previous cluster definition to Kmean's
>>>
>>
>> >>I have populations identified using the "pop" option. But I don't have
>> clusters identified per se. If this is the case, does my plot look okay?
>>
>> 
>> [image: Inline image 1]
>>
>>
>>> #3 ?scatter.dapc -> argument 'col', which you are using already
>>>
>> >>I should have been more clear here. I don't know which population is
>> being represented by which colour, and would ideally like to know this so
>> that I can see how they are being grouped. Is there a function that I can
>> use to ask for this information? Do the numbers that NumClust$grp give me
>> represent the clusters that the individuals are being assigned to? If this
>> is the case, then this question is answered.
>>
>>> #4 there are K-1 discriminant functions, so '300' will just retain K-1
>>>
>> >>is 300 a good number though? I just don't know how to know if I'm
>> making a good choice.
>> 
>>
>>
>>> #5 if in doubt, use Xval - more advanced and easier to interpret; in
>>> your case your data are very well separated in just a few dimensions; 10
>>> PCs should do the trick
>>>
>>
>> >>So I should use 10 even though xval says 40?
>>
>> Thank you again,
>> Ella
>>
>>
>>> ------------------------------
>>> *From:* adegenet-forum-bounces at lists.r-forge.r-project.org [
>>> adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Ella
>>> Bowles [ebowles at ucalgary.ca]
>>> *Sent:* 20 October 2015 19:45
>>> *To:* adegenet-forum at lists.r-forge.r-project.org
>>> *Subject:* Re: [adegenet-forum] a.score versus cross validation and
>>> number of discriminant functions to retain
>>>
>>> ps Also, which function do I use to get numeric values for the
>>> percentage of variation that is explained by the two principal components
>>> that are shown on the scatter plot?
>>>
>>> with thanks
>>>
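
A small sketch of where such numbers can be read from a fitted `dapc` object. The axes of a DAPC scatter plot are discriminant functions rather than principal components, so two quantities are shown: the share of total variance conserved by the retained PCs (`$var`) and each discriminant function's share of the discriminant eigenvalues (`$eig`). The `nancycats` dataset and the values of `n.pca` and `n.da` are stand-ins, not the settings used in this thread.

```r
library(adegenet)

# Example data shipped with adegenet, standing in for data_full (assumption)
data(nancycats)

# Groups default to pop(nancycats) when none are supplied
dapc_pop <- dapc(nancycats, n.pca = 40, n.da = 2)

# Proportion of the total variance conserved by the retained PCs
pca_var_kept <- dapc_pop$var
print(pca_var_kept)

# Each discriminant function's eigenvalue as a percentage of the total
da_pct <- 100 * dapc_pop$eig / sum(dapc_pop$eig)

# Percentages for the two axes drawn by scatter() by default
print(round(da_pct[1:2], 1))
```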
>>> On Tue, Oct 20, 2015 at 12:40 PM, Ella Bowles <ebowles at ucalgary.ca>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>>
>>>> I think I have worked my way through a DAPC analysis, and it's pretty
>>>> neat. I have five questions though. By way of background, I am using a
>>>> SNP dataset with 11 putative populations (clusters), containing 4099 SNPs.
>>>> I've converted a STRUCTURE file to a genind object, and am using that.
>>>>
>>>>
>>>> 1) Am I correct in understanding that the number of clusters you find
>>>> should inform the number of colours that you list for your DAPC plot?
>>>>
>>>>
>>>> 2) I'm not quite sure how to interpret the following. How do I know if
>>>> the fit is good?
>>>>
>>>>
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> 3 and 4) Is there a function that I can use to match the colours
>>>> to my original populations? I do have this information in the data file
>>>> that I fed in. And does 300 sound reasonable for the number of
>>>> discriminant functions to retain?
>>>>
>>>> > dapc1 <- dapc(data_full, NumClust$grp)
>>>>
>>>> Choose the number PCs to retain (>=1): 40
>>>>
>>>> Choose the number discriminant functions to retain (>=1): 300
>>>>
>>>> #making colours for 9 clusters, since optimal k was 9 with the data
>>>> containing zeros
>>>>
>>>> myCol <- c("red", "orange", "yellow", "green", "blue", "purple",
>>>> "violet", "grey", "brown")
>>>>
>>>> scatter(dapc1, scree.da=FALSE, bg="white", pch=20, cell=0, cstar=0,
>>>> col=myCol, solid=.4, cex=1, clab=0, leg=TRUE, txt.leg=paste("Cluster", 1:9))
>>>> [image: Inline image 2]
>>>> 
>>>> 5) I don't really understand the difference between the optim.a.score
>>>> and the cross-validation analyses. Both seem to determine the best
>>>> number of PCs to retain, yet they give very different results. Am
>>>> I misunderstanding what they are?
>>>>
>>>> #for "data_full" dataset
>>>>
>>>> dapc2 <- dapc(data_full, n.da=300, n.pca=50)
>>>>
>>>>
>>>>
>>>> temp <- optim.a.score(dapc2)
>>>>
>>>>
>>>>
>>>> #graph shows that highest alpha seems to be 8
>>>> [image: Inline image 3]
>>>> #cross-validation for number of PCs to retain - can only do this using
>>>> data_full (this is called "mat" here), couldn't get it to work using data
>>>> with zeros
>>>>
>>>> mat <- scaleGen(data, NA.method="mean")
>>>>
>>>> grp <- pop(data)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> xval <- xvalDapc(mat, grp, n.pca.max = 100, training.set = 0.9, result
>>>> = "groupMean", center = TRUE, scale = FALSE, n.pca = NULL, n.rep = 30,
>>>> xval.plot = TRUE)
>>>>
>>>>
>>>>
>>>> xval[2:6]
>>>>
>>>>
>>>> #results
>>>>
>>>> $`Median and Confidence Interval for Random Chance`
>>>>
>>>>       2.5%        50%      97.5%
>>>>
>>>> 0.05659207 0.09212947 0.14164194
>>>>
>>>>
>>>>
>>>> $`Mean Successful Assignment by Number of PCs of PCA`
>>>>
>>>>        10        20        30        40        50
>>>> 0.8409091 0.8348485 0.8439394 0.8530303 0.8136364
>>>>        60        70        80        90
>>>> 0.8227273 0.8000000 0.8075758 0.8075758
>>>>
>>>>
>>>>
>>>> $`Number of PCs Achieving Highest Mean Success`
>>>>
>>>> [1] "40"
>>>>
>>>>
>>>>
>>>> $`Root Mean Squared Error by Number of PCs of PCA`
>>>>
>>>>        10        20        30        40        50
>>>> 0.1702777 0.1770200 0.1649359 0.1607061 0.2007218
>>>>        60        70        80        90
>>>> 0.1864929 0.2138458 0.2051338 0.2074707
>>>>
>>>>
>>>>
>>>> $`Number of PCs Achieving Lowest MSE`
>>>> [1] "40"
>>>> [image: Inline image 4]
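
A hedged sketch that runs both procedures side by side on the same data, to make the comparison in question 5 concrete. As documented in adegenet, the a-score compares reassignment success for the true groups with that for randomised groups (penalising over-fitting), while cross-validation measures how well held-out individuals are assigned. They optimise different criteria, so different answers are not unusual. The `nancycats` dataset and the small `n.sim` and `n.rep` values are chosen only to keep the run short.

```r
library(adegenet)

# Example data shipped with adegenet, standing in for data_full (assumption)
data(nancycats)
set.seed(1)  # both procedures involve random resampling

# Allele table with missing values replaced by the mean, plus group labels
mat <- tab(nancycats, NA.method = "mean")
grp <- pop(nancycats)

# a-score: start from a DAPC with many PCs, then search for the number of
# PCs that maximises the mean a-score
dapc_full <- dapc(nancycats, n.pca = 50, n.da = nPop(nancycats) - 1)
ascore <- optim.a.score(dapc_full, n.sim = 10, plot = FALSE)
best_by_ascore <- ascore$best

# Cross-validation: train on 90% of individuals, predict the other 10%
xval <- xvalDapc(mat, grp, n.pca.max = 50, training.set = 0.9,
                 n.rep = 5, xval.plot = FALSE)
best_by_xval <- xval$`Number of PCs Achieving Lowest MSE`

print(best_by_ascore)
print(best_by_xval)
```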
>>>>
>>>> Thank you very much for your time, and sincerely,
>>>> Ella Bowles
>>>>
>>>> --
>>>> Ella Bowles
>>>> PhD Candidate
>>>> Biological Sciences
>>>> University of Calgary
>>>>
>>>> e-mail: ebowles at ucalgary.ca, bowlese at gmail.com
>>>> website: http://ellabowlesphd.wordpress.com/
>>>>

-- 
Ella Bowles
PhD Candidate
Biological Sciences
University of Calgary

e-mail: ebowles at ucalgary.ca, bowlese at gmail.com
website: http://ellabowlesphd.wordpress.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 20171 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/561f2570/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 14492 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/561f2570/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 12190 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/561f2570/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 46303 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151021/561f2570/attachment-0007.png>