[adegenet-forum] xcal/optim.a.score consistency

Thibaut Jombart thibautjombart at gmail.com
Mon Oct 3 13:20:20 CEST 2016


Hi Alexandre,

thanks for the figure, it is very useful. Yes, around 15. To fine-tune it,
I would run the analysis for all numbers of PCA axes between 1 and 20, and
increase the number of replicates (30 or more if you can, maybe up to 50).
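For reference, such a fine-tuning run might look like the following. This is only a sketch along the lines suggested above, assuming `str` is the genind object from the original post; the argument names are those of `xvalDapc`:

```r
library(adegenet)

set.seed(1)  # for reproducible resampling across runs

# Re-run cross-validation over a narrow window of PCA axis counts
# around the start of the plateau, with more replicates per count.
xval_fine <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
                      training.set = 0.9,
                      n.pca = 1:20,  # all numbers of PCA axes between 1 and 20
                      n.rep = 50,    # more replicates to stabilise the estimate
                      parallel = "snow", ncpus = 4L)

# The returned list reports, among other things, the number of PCs
# achieving the highest mean successful assignment and the lowest MSE.
xval_fine$`Number of PCs Achieving Lowest MSE`
```

Setting a seed before the run makes the stochastic resampling reproducible, which helps when comparing runs.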

Best
Thibaut


--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College
London
Head of RECON: https://repidemicsconsortium.org
https://sites.google.com/site/thibautjombart/
https://github.com/thibautjombart
Twitter: @TeebzR <https://twitter.com/TeebzR>

On 3 October 2016 at 12:14, Alexandre Lemo <alexandros.lemopoulos at gmail.com>
wrote:

> Dear Dr Jombart,
>
> Thanks a lot for your answer! I think it does help indeed as I understand
> better why there is fluctuation in the xval's results.
>
> When I run xval, my curve often looks like this, with Mean Successful
> Assignment varying between 0.81 and 0.82 once it reaches the plateau.
> [image: inline image 1]
> If I understand correctly, all solutions in the plateau are more or less
> equivalent. The structure of the DAPC should not change drastically whether I
> select 79 or 20 PCs in that case. However, from what I understand it is
> still better to select the fewest PCs possible. That would mean that the
> optimal number of PCs to select would be at the beginning of the plateau (in
> this example, roughly around 15). Is this correct?
>
> Thanks a lot again,
>
> Best,
>
> Alexandre
>
> 2016-10-03 13:59 GMT+03:00 Thibaut Jombart <thibautjombart at gmail.com>:
>
>> Hi Alexandre,
>>
>> I would not trust the automatic selection of the optimal space dimension
>> unless you are looking at simulated data and you need to run the analysis
>> 100s of times. There are 2 questions here:
>>
>> # stability of xvalDapc output
>> As this is a stochastic process, changing results are to be expected. It
>> may be the case that you need to increase the number of replicates for
>> results to stabilise a bit.
>>
>> If you haven't yet, check the tutorials for some guidelines on this, but
>> basically you want to select the smallest number of dimensions that gives
>> the best classification outcome (i.e. the 'elbow' in the curve). If there
>> is no elbow, there may be no structure in the data - check that the %
>> successful re-assignment is better than expected at random. If the %
>>  successful re-assignment plateaus, various numbers of PCs might lead to
>> equivalent solutions, but at the very least the structures should remain
>> stable.
>>
>> # cross validation vs optim.a.score
>> Simple: go with cross validation. The 'a-score' was meant as a crude
>> measure of goodness of fit of DAPC results, but cross-validation makes more
>> sense.
>>
>> Hope this helps
>>
>> Thibaut
>>
>>
>> --
>> Dr Thibaut Jombart
>> Lecturer, Department of Infectious Disease Epidemiology, Imperial
>> College London
>> Head of RECON: https://repidemicsconsortium.org
>> https://sites.google.com/site/thibautjombart/
>> https://github.com/thibautjombart
>> Twitter: @TeebzR <https://twitter.com/TeebzR>
>>
>> On 29 September 2016 at 10:02, Alexandre Lemo <
>> alexandros.lemopoulos at gmail.com> wrote:
>>
>>> Dear Dr. Jombart and *adegenet* users,
>>>
>>> I am trying to run a DAPC on a dataset of 3975 SNPs obtained through RAD
>>> sequencing. There are 11 populations and 306 individuals examined here
>>> (minimum 16 ind/pop). Note that I am not using the find.clusters function.
>>>
>>> My problem is that I can't get any consistency in the number of PCs that
>>> I should use for the DAPC. Every time I run *optim.a.score* or
>>> *xval*, I get different results. I tried changing the training set
>>> (tried 0.7, 0.8 and 0.9) but the optimal number of PCs retained still
>>> changes in each run.
>>>
>>>
>>> Here is an example of my script:
>>>
>>> #str is a genind object
>>>
>>>
>>>
>>> optim_PC <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>>>                      training.set = 0.9, n.pca = 5:100, n.rep = 1000,
>>>                      parallel = "snow", ncpus = 4L)
>>>
>>>
>>>
>>>
>>>
>>>
>>> optim_PC_2 <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>>>                        training.set = 0.9, n.pca = 5:100, n.rep = 1000,
>>>                        parallel = "snow", ncpus = 4L)
>>>
>>> What happens here is that optim_PC will give me an optimal number of PCs
>>> of (e.g.) 76 while optim_PC_2 will give me 16. I tried running this
>>> several times and every time the results are different.
>>>
>>>
>>> I also tried using optim.a.score() :
>>>
>>>
>>>
>>> dapc.str <- dapc(str, var.contrib = TRUE, scale = FALSE,
>>>                  n.pca = 100, n.da = NULL)
>>> optim.a.score(dapc.str)
>>>
>>> Here, the number of PCs retained changes every time I run the function.
>>>
>>>
>>> Does anyone have an idea of why this is happening, or has anyone run into
>>> similar issues? I am quite confused, as results obviously change a lot
>>> depending on how many PCs are used...
>>>
>>> Thanks for your help and for this great adegenet package!
>>>
>>> Best,
>>>
>>> Alexandre
>>>
>>>
>>> _______________________________________________
>>> adegenet-forum mailing list
>>> adegenet-forum at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>>>
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dapcxval.png
Type: image/png
Size: 35757 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20161003/89f6fb63/attachment-0001.png>


More information about the adegenet-forum mailing list