[adegenet-forum] xval/optim.a.score consistency
Thibaut Jombart
thibautjombart at gmail.com
Mon Oct 3 14:25:34 CEST 2016
Hi
Please keep the forum in copy when replying.
Yes, I would keep all of the DA axes for the cross validation. As for the
DAPC itself, keep as many axes as you want / need to look at.
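
A minimal sketch of that workflow, reusing the objects from your script
below (the values are illustrative, not a recommendation):

library(adegenet)

# cross-validation: n.da = 100 simply caps at all available DA axes
# (with 11 populations there are at most 10)
xval <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
                 training.set = 0.9, n.pca = 5:100, n.da = 100)

# final DAPC: the cross-validated number of PCs, but only as many DA
# axes as you want to interpret (e.g. 3)
dapc.str <- dapc(str, n.pca = 15, n.da = 3)
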
Best
Thibaut
--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College
London
Head of RECON: https://repidemicsconsortium.org
https://sites.google.com/site/thibautjombart/
https://github.com/thibautjombart
Twitter: @TeebzR <https://twitter.com/TeebzR>
On 3 October 2016 at 12:31, Alexandre Lemo <alexandros.lemopoulos at gmail.com>
wrote:
> Hi,
>
> Thanks a lot for the help, it has been very useful. I will try exactly
> that.
>
> One last question, if I may ask:
>
> How does n.da influence xval? I ran several replicates and I think there
> was more stability when using a really high number of DA axes. For
> example, xval seemed more consistent when using n.da = 100 (meaning all
> of them).
>
> Also, if I use a given number of DA axes for xval, I should logically use
> the same number for the actual DAPC. Am I right on this one?
>
> Here is the code, for more clarity:
>
> optim_PC <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>                      training.set = 0.9, n.da = 100, n.pca = 5:100,
>                      n.rep = 1000, parallel = "snow", ncpus = 4L)
>
> # So, if 15 is the best number of PCs obtained, then:
>
> dapc.str <- dapc(str, var.contrib = TRUE, scale = FALSE, n.pca = 15,
>                  n.da = 100)
>
> But could I run it with n.da = 3, or would I need to perform xval again
> with n.da = 3 and then perform the DAPC again?
>
> I hope I am clear enough with my question.
>
> Thanks again,
>
> Best,
>
> Alexandre
>
>
>
> 2016-10-03 14:20 GMT+03:00 Thibaut Jombart <thibautjombart at gmail.com>:
>
>> Hi Alexandre,
>>
>> thanks for the figure, it is very useful. Yes, around 15. To fine-tune
>> it, I would run the analysis for all numbers of PCA axes between 1 and 20,
>> and increase the number of replicates (30 or more if you can, maybe up to
>> 50).
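>>
>> A sketch of that fine-tuning run, reusing the objects from your script
>> (the exact values are illustrative):
>>
>> fine <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>>                  training.set = 0.9, n.pca = 1:20, n.rep = 50,
>>                  parallel = "snow", ncpus = 4L)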
>>
>> Best
>> Thibaut
>>
>>
>> --
>> Dr Thibaut Jombart
>> Lecturer, Department of Infectious Disease Epidemiology, Imperial
>> College London
>> Head of RECON: https://repidemicsconsortium.org
>> https://sites.google.com/site/thibautjombart/
>> https://github.com/thibautjombart
>> Twitter: @TeebzR <https://twitter.com/TeebzR>
>>
>> On 3 October 2016 at 12:14, Alexandre Lemo <alexandros.lemopoulos at gmail.com> wrote:
>>
>>> Dear Dr Jombart,
>>>
>>> Thanks a lot for your answer! I think it does help indeed, as I now
>>> better understand why there is fluctuation in xval's results.
>>>
>>> When I run xval, my curve will often look like this, with the Mean
>>> Successful Assignment varying between 0.81 and 0.82 when reaching the
>>> plateau.
>>> [image: xval curve (dapcxval.png)]
>>> If I understand correctly, all solutions in the plateau are more or less
>>> equivalent. The structure of the DAPC should not change drastically
>>> whether I select 79 or 20 PCs in that case. However, from what I
>>> understand it is still better to select the fewest PCs possible. That
>>> would mean that the optimal number of PCs to select would be at the
>>> beginning of the plateau (in this example, roughly around 15). Is this
>>> correct?
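>>>
>>> As an illustrative check of that (PC counts picked from the plateau, and
>>> n.da = 10 since 11 populations allow at most 10 discriminant axes), one
>>> could fit both solutions and compare the group structure by eye:
>>>
>>> d20 <- dapc(str, n.pca = 20, n.da = 10)
>>> d79 <- dapc(str, n.pca = 79, n.da = 10)
>>> scatter(d20)
>>> scatter(d79)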
>>>
>>> Thanks a lot again,
>>>
>>> Best,
>>>
>>> Alexandre
>>>
>>> 2016-10-03 13:59 GMT+03:00 Thibaut Jombart <thibautjombart at gmail.com>:
>>>
>>>> Hi Alexandre,
>>>>
>>>> I would not trust the automatic selection of the optimal space
>>>> dimension unless you are looking at simulated data and need to run the
>>>> analysis hundreds of times. There are two questions here:
>>>>
>>>> # stability of xvalDapc output
>>>> As this is a stochastic process, changing results are to be expected.
>>>> It may be the case that you need to increase the number of replicates for
>>>> results to stabilise a bit.
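>>>>
>>>> For instance, a minimal sketch (values illustrative): set.seed() makes
>>>> a single run reproducible, and a larger n.rep averages out the noise.
>>>>
>>>> set.seed(1)
>>>> xval <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>>>>                  n.pca = 5:100, n.rep = 100)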
>>>>
>>>> If you haven't yet, check the tutorials for some guidelines on this,
>>>> but basically you want to select the smallest number of dimensions that
>>>> gives the best classification outcome (i.e. the 'elbow' in the curve). If
>>>> there is no elbow, there may be no structure in the data - check that the %
>>>> successful re-assignment is better than expected at random. If the %
>>>> successful re-assignment plateaus, various numbers of PCs might lead to
>>>> equivalent solutions, but at the very least the structures should remain
>>>> stable.
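>>>>
>>>> Continuing the sketch above, the list returned by xvalDapc exposes both
>>>> quantities (the element names here are assumptions; verify them with
>>>> str(xval) on your version of adegenet):
>>>>
>>>> # mean % successful re-assignment for each number of PCs retained
>>>> xval$`Mean Successful Assignment by Number of PCs of PCA`
>>>> # what purely random assignment would achieve, for comparison
>>>> xval$`Median and Confidence Interval for Random Chance`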
>>>>
>>>> # cross validation vs optim.a.score
>>>> Simple: go with cross validation. The 'a-score' was meant as a crude
>>>> measure of goodness of fit of DAPC results, but cross-validation makes more
>>>> sense.
>>>>
>>>> Hope this helps
>>>>
>>>> Thibaut
>>>>
>>>>
>>>> --
>>>> Dr Thibaut Jombart
>>>> Lecturer, Department of Infectious Disease Epidemiology, Imperial
>>>> College London
>>>> Head of RECON: https://repidemicsconsortium.org
>>>> https://sites.google.com/site/thibautjombart/
>>>> https://github.com/thibautjombart
>>>> Twitter: @TeebzR <https://twitter.com/TeebzR>
>>>>
>>>> On 29 September 2016 at 10:02, Alexandre Lemo <
>>>> alexandros.lemopoulos at gmail.com> wrote:
>>>>
>>>>> Dear Dr. Jombart and adegenet users,
>>>>>
>>>>> I am trying to run a DAPC on a dataset of 3975 SNPs obtained through
>>>>> RAD sequencing. There are 11 populations and 306 individuals examined
>>>>> here (minimum 16 ind/pop). Note that I am not using the find.clusters
>>>>> function.
>>>>>
>>>>> My problem is that I can't get any consistency in the number of PCs
>>>>> that I should use for the DAPC. Actually, every time I run
>>>>> optim.a.score or xval, I get different results. I tried changing the
>>>>> training set (tried 0.7, 0.8 and 0.9), but the optimal number of PCs
>>>>> retained still changes in each run.
>>>>>
>>>>>
>>>>> Here is an example of my script:
>>>>>
>>>>> # str is a genind object
>>>>>
>>>>> optim_PC <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>>>>>                      training.set = 0.9, n.pca = 5:100, n.rep = 1000,
>>>>>                      parallel = "snow", ncpus = 4L)
>>>>>
>>>>> optim_PC_2 <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
>>>>>                        training.set = 0.9, n.pca = 5:100, n.rep = 1000,
>>>>>                        parallel = "snow", ncpus = 4L)
>>>>>
>>>>> What happens here is that optim_PC will give me an optimal number of
>>>>> PCs of, e.g., 76 while optim_PC_2 will give me 16. I tried running
>>>>> this several times and every time the results are different.
>>>>>
>>>>>
>>>>> I also tried using optim.a.score():
>>>>>
>>>>> dapc.str <- dapc(str, var.contrib = TRUE, scale = FALSE, n.pca = 100,
>>>>>                  n.da = NULL)
>>>>> optim.a.score(dapc.str)
>>>>>
>>>>> Here, the number of PCs will change every time I run the function.
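>>>>>
>>>>> (A minimal sketch of how to at least make a single run reproducible
>>>>> while testing; set.seed() is base R, and increasing the number of
>>>>> simulations, see ?optim.a.score, should also reduce run-to-run noise:)
>>>>>
>>>>> set.seed(42)  # any fixed seed gives a reproducible run
>>>>> optim.a.score(dapc.str)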
>>>>>
>>>>>
>>>>> Does anyone have an idea of why this is happening, or has anyone had
>>>>> similar issues? I am quite confused, as the results obviously change a
>>>>> lot depending on how many PCs are used...
>>>>>
>>>>> Thanks for your help and for this great adegenet package!
>>>>>
>>>>> Best,
>>>>>
>>>>> Alexandre
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> adegenet-forum mailing list
>>>>> adegenet-forum at lists.r-forge.r-project.org
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>>>>>
>>>>
>>>>
>>>
>>
>