[adegenet-forum] snapclust when HW is not expected

Thibaut Jombart thibautjombart at gmail.com
Fri Feb 16 18:24:54 CET 2018


Hi Brian

Thanks for reposting your question here. I am assuming that by 'DAPC' you
mean the K-means clustering presented in the DAPC paper, not the factorial
method itself. It is an interesting topic, and there are many possible
answers. I'll try to mention a few.

snapclust uses HW to compute the likelihood, like most other model-based
(likelihood, Bayesian) clustering methods I know of. Similarly, it assumes
independence of loci, so that: (global log-likelihood) = sum(log-likelihood
of each locus)
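
To make these two assumptions concrete, here is a minimal sketch in Python.
It is not the snapclust code (which lives in the R package adegenet); the
function names and the toy allele frequencies are hypothetical, and it only
covers diploid genotypes with known, non-zero group allele frequencies.

```python
import math

def hw_genotype_prob(genotype, freqs):
    """Hardy-Weinberg probability of one diploid genotype at one locus.

    genotype: tuple of two allele labels, e.g. ("A", "B")
    freqs:    dict mapping allele label -> frequency in the group
    """
    a, b = genotype
    if a == b:
        return freqs[a] ** 2          # homozygote: p^2
    return 2 * freqs[a] * freqs[b]    # heterozygote: 2pq

def log_likelihood(individual, group_freqs):
    """Log-likelihood of an individual belonging to a group.

    Loci are treated as independent, so the per-locus log-probabilities
    are simply summed. A zero allele frequency would make math.log fail;
    this sketch does not handle that case.
    """
    return sum(
        math.log(hw_genotype_prob(genotype, freqs))
        for genotype, freqs in zip(individual, group_freqs)
    )

# Toy example: two loci, both with alleles A and B at frequency 0.5
individual = [("A", "A"), ("A", "B")]
group_freqs = [{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}]
ll = log_likelihood(individual, group_freqs)  # log(0.25) + log(0.5)
```

If genotype frequencies deviate from HW, the per-locus terms are off; if loci
are linked, summing them is off. Either way the result is an approximation
of the true likelihood.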

Deviation from HW and linkage between loci will have the same kind of
effect: the computed likelihood will be an approximation of the true,
unknown likelihood. How good is the approximation in a particular case? I
don't think we know, in general, but I'd like to see such a study
published. And then, the next question is: how does it change the
clustering solution? Again, more work would be interesting on this topic.

I suspect attitudes will vary, pretty much depending on whether one decides
to be purist or pragmatic. As an anecdote, while developing various Bayesian
or ML methods, I realised several times that the likelihood was 'wrong'
(coding error), sometimes with one full component of the likelihood
entirely left out, and the reason I had not spotted it earlier was that the
results were still okay. Similarly, a linear regression may still give
sensible results despite non-normally distributed residuals. k-means
clustering is often used without checking that groups have similar
within-group variances. And ML phylogenies from full alignments are
commonplace, even though the likelihood also assumes independence of loci -
see Joe Felsenstein's cheeky comment on that in his pruning algorithm paper.

In short: it could be a problem, but we (at least, I) don't know what
impact it will have. I know, disappointing. My 2 cents would be:
- fairly evenly distributed LD: snapclust should be fine
- a bit of clonality mixed up with some recombination / sexual
reproduction: should be worth looking at
- full clonality: work on haplotype frequencies / MLST type of markers (see
apex package), and then snapclust will be fine
- never rely on a single method if you can avoid it; I like using a
hierarchical clustering and further exploration using factorial methods
(PCA, DAPC) as a complement

Please feel free to comment / discuss, everyone. I might put this in a
podcast, time allowing.

Best
Thibaut



--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College
London
Head of RECON: repidemicsconsortium.org
WHO Consultant - outbreak analysis
https://thibautjombart.netlify.com
Twitter: @TeebzR
+44(0)20 7594 3658

On 16 February 2018 at 15:57, brian knaus <briank.lists at gmail.com> wrote:

> Hi and congrats on your snapclust paper! I was thinking of trying the
> method on a couple of projects I'm working on. However, I work with fungi
> and fungus-like plant pathogens that exhibit a mixture of reproductive
> modes (e.g., selfing, clonality, mitotic reproduction). This means that we
> do not necessarily expect Hardy-Weinberg assumptions to be met. Your manual
> seems to come out pretty early stating that HW is important. I would guess
> that linkage disequilibrium (non-independence of loci) may be an issue
> also. So this raises my question: in systems where HW may not be assumed
> and where there may be linkage disequilibrium would I be better off using
> DAPC than snapclust?
>
> Thanks!
> Brian