[adegenet-forum] snapclust when HW is not expected

Mon Feb 19 09:41:54 CET 2018

Again, it'd be fun to give it a try on an actual case study ;)

If one treated clonal data as independent loci (rather than as a single
one), this would result in fully correlated allele frequencies, but this
shouldn't change the clustering itself. It would change summary statistics
(AIC etc.), but as both the deviance and the number of parameters would be
overestimated, I am not sure how much this would impact the choice of the
'optimal K'.
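
A minimal Python sketch of the point above, under strong simplifying
assumptions that are not taken from snapclust itself: diploid biallelic
genotypes coded as allele counts (0, 1, 2), a fixed hard assignment of
individuals to groups, one free allele frequency per group and locus, and
the hypothetical helpers hw_loglik and aic. Repeating every locus three
times mimics counting fully linked (clonal) loci as if they were independent.

```python
import math

def hw_loglik(genotypes, groups):
    """Hardy-Weinberg log-likelihood, summed over groups, loci and individuals.

    genotypes: one list per individual, holding allele counts (0, 1 or 2).
    groups: one group label per individual (fixed, hard assignment).
    """
    n_loci = len(genotypes[0])
    total = 0.0
    for label in set(groups):
        members = [ind for ind, lab in zip(genotypes, groups) if lab == label]
        for j in range(n_loci):
            # Maximum-likelihood allele frequency within this group.
            p = sum(ind[j] for ind in members) / (2 * len(members))
            for ind in members:
                g = ind[j]
                prob = math.comb(2, g) * p**g * (1 - p)**(2 - g)
                total += math.log(prob)
    return total

def aic(genotypes, groups):
    # Assumption: one free parameter per (group, locus) pair.
    n_params = len(set(groups)) * len(genotypes[0])
    return 2 * n_params - 2 * hw_loglik(genotypes, groups)

# Toy data: six individuals, three loci (made up for illustration).
genotypes = [[0, 1, 2], [0, 0, 2], [1, 0, 1],
             [2, 2, 0], [2, 1, 0], [1, 2, 1]]
one_group = [0, 0, 0, 0, 0, 0]
two_groups = [0, 0, 0, 1, 1, 1]

# Each locus counted three times, as if the copies were independent.
tripled = [ind * 3 for ind in genotypes]

for name, data in [("original", genotypes), ("tripled", tripled)]:
    print(name, "AIC K=1:", aic(data, one_group),
          "AIC K=2:", aic(data, two_groups))
```

In this toy setting both the deviance and the parameter count are multiplied
by three, so the AIC is too, and the ranking of the two groupings does not
change. Real data with only partial linkage need not behave this neatly.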

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College
London
Head of RECON: repidemicsconsortium.org
WHO Consultant - outbreak analysis
https://thibautjombart.netlify.com
Twitter: @TeebzR
+44(0)20 7594 3658

On 16 February 2018 at 22:56, brian knaus <briank.lists at gmail.com> wrote:

> Thank you for a very thoughtful response! I think a summary is that we can
> bend the rules, just try not to break things. And I think that was a
> message expressed by Pritchard's group. They had a paper where they used
> STRUCTURE on Helicobacter pylori. I think an issue, though, is that there
> are many in the biological community who do not understand the methods well
> enough to know if and when they may have gone too far. I appreciate your
> recommendations, but for many of these projects we have a reason to expect
> mixed mating modes, but we do not know how much of any particular mode to
> expect. In fact, the research goal is frequently to infer mating mode. Or
> perhaps which groups of samples may be outcrossing and which are not. I
> suspect that might be a lot to ask for.
>
> I appreciate your insights! And I find it encouraging that you would like
> to see more work on this. Perhaps we'll get to that one day?
> Brian
>
> On Fri, Feb 16, 2018 at 9:24 AM, Thibaut Jombart <thibautjombart at gmail.com> wrote:
>
>> Hi Brian
>>
>> thanks for reposting your question here. I am assuming that by 'DAPC' you
>> mean the K-means clustering presented in the DAPC paper, not the factorial
>> method itself. It is an interesting topic, and there are many possible
>> answers. I'll try to mention a few.
>>
>> snapclust uses HW to compute the likelihood, like most other model-based
>> (likelihood, Bayesian) clustering methods I know of. Similarly, it assumes
>> independence of loci, so that: (global log-likelihood) = sum(log-likelihood
>> of every locus)
>>
>> Deviation from HW and linkage between loci will have the same kind of
>> effect: the computed likelihood will be an approximation of the true,
>> unknown likelihood. How good is the approximation in a particular case? I
>> don't think we know, in general, but I'd like to see such a study
>> published. And then, the next question is: how does it change the
>> clustering solution? Again, more work would be interesting on this topic.
>>
>> I suspect attitudes will vary, pretty much depending on whether one
>> decides to be purist or pragmatic. As an anecdote, while developing various
>> Bayesian or ML methods, I realised several times that the likelihood
>> was 'wrong' (coding error), sometimes even with one full component of the
>> likelihood entirely left out, and the reason I had not flagged it
>> before was that results were still okay. Similarly, a linear regression may
>> still give sensible results despite non-normally distributed residuals.
>> k-means clustering is often used without checking that groups have similar
>> within-group variances. And ML phylogenies from full alignments are
>> commonplace, while the likelihood also assumes independence of loci - see
>> Joe Felsenstein's cheeky comment on that in his pruning algorithm paper.
>>
>> In short: it could be a problem, but we (at least, I) don't know what
>> impact it'll have. I know, disappointing. My 2 cents would be:
>> - fairly evenly distributed LD: snapclust should be fine
>> - a bit of clonality mixed up with some recombination / sexual
>> reproduction: should be worth looking at
>> - full clonality: work on haplotype frequencies / MLST type of markers
>> (see apex package), and then snapclust will be fine
>> - never rely on a single method if you can avoid it; I like using a
>> hierarchical clustering and further exploration using factorial methods
>> (PCA, DAPC) as a complement
>>
>> Please feel free to comment / discuss, everyone. I might put this in a
>> podcast, time allowing.
>>
>> Best
>> Thibaut
>>
>> On 16 February 2018 at 15:57, brian knaus <briank.lists at gmail.com> wrote:
>>
>>> Hi and congrats on your snapclust paper! I was thinking of trying the
>>> method on a couple of projects I'm working on. However, I work with fungi
>>> and fungus-like plant pathogens that exhibit a mixture of reproductive
>>> modes (e.g., selfing, clonality, mitotic reproduction). This means that we
>>> do not necessarily expect Hardy-Weinberg assumptions to be met. Your manual
>>> seems to come out pretty early stating that HW is important. I would guess
>>> that linkage disequilibrium (non-independence of loci) may be an issue
>>> also. So this raises my question: in systems where HW may not be assumed
>>> and where there may be linkage disequilibrium would I be better off using
>>> DAPC than snapclust?
>>>
>>> Thanks!
>>> Brian