[adegenet-forum] Selecting the number of clusters in find.clusters under ever decreasing BIC values

valeria montano mirainoshojo at gmail.com
Fri Jun 3 12:07:25 CEST 2011


Hi there,

I should say up front that I am quite ignorant about plants: at the best
moment of my life I was able to distinguish gymnosperms from angiosperms,
and I was even proud of that. Anyway, I just wanted to say that, although
chloroplast DNA is maternally transmitted, if your plants are hermaphroditic
with a high rate of self-fertilisation, the cpDNA distribution could be
reliable for inferring population structure. In that case you would not
have the issue of sex-biased dispersal, and autosomal and cpDNA loci
are more or less transmitted jointly (unlike mtDNA in animals), so they
should show a concordant distribution. This could be a good argument for
using cpDNA for the clustering... (but I may be completely wrong).

It's true that you are not lucky at all with the SNP data, which I guess
is quite frustrating, considering how much we rely on autosomal loci.
Probably for species A there is nothing you can do to obtain something
better. For species B, the fact that almost half of the SNPs are present in
only one individual gives the impression of a population in rapid expansion.
Even if 1% is the threshold frequency for considering a SNP a real variant,
it is probable that these mutations will be lost in a few generations, so
including them in the dataset may add a bit of noise. But I have
no idea whether it would change anything in the DAPC.
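
One way to check would be to run the analysis with and without the SNPs
carried by a single individual. A minimal sketch of that filter, in Python
rather than R and on a made-up genotype table (the function name, the
min_carriers parameter and the toy data are all hypothetical, not part of
adegenet):

```python
def drop_singleton_snps(genotypes, min_carriers=2):
    """Keep only SNPs whose variant allele is carried by >= min_carriers individuals.

    genotypes: one list per individual, one entry per SNP, holding the count
    of the variant allele (0 = not carried). This layout is an assumption
    made for illustration.
    Returns (filtered_genotypes, kept_column_indices).
    """
    if not genotypes:
        return [], []
    n_loci = len(genotypes[0])
    # Number of individuals carrying the variant allele at each SNP.
    carriers = [sum(1 for ind in genotypes if ind[j] > 0) for j in range(n_loci)]
    kept = [j for j, c in enumerate(carriers) if c >= min_carriers]
    filtered = [[ind[j] for j in kept] for ind in genotypes]
    return filtered, kept


# Toy table: 4 individuals x 4 SNPs (hypothetical values).
toy = [
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
filtered, kept = drop_singleton_snps(toy)
print(kept)      # indices of the SNPs that survive the filter
print(filtered)  # genotype table restricted to those SNPs
```

Note that with the default threshold this also drops SNPs that nobody
carries, and it only looks at the variant allele, so a SNP where all but one
individual carry the variant is kept.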

Best regards

Valeria

On 2 June 2011 18:58, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:

> Dear Alastair,
>
> thanks for your questions and the extensive background on your problem.
> Some of your questions might be covered by the vignette on DAPC which will
> be part of the next release of adegenet; in the meantime, I uploaded the pdf
> on adegenet website, section "documents". In particular, see section 2.3
> "How many clusters are there really in the data?". It does not exactly
> address all of your problems, but it is probably worth reading. Essentially,
> what it says is that clusters are often not biological reality, but merely
> tools used to describe the data. So the short answer is "there is often no
> 'true' k". Some of the curves you describe just mean BIC cannot be used to
> identify the optimal number of clusters - in particular, I would probably
> consult an exorcist for the normal-looking curve.
>
> Your questions still make sense, of course. Odd shapes of the decrease of
> BIC can occur for several reasons. The possible explanations I can think of
> are:
> a) there are no clearly identifiable clusters in the data.
>
> b) there are clusters to be identified, but not enough information to
> disentangle different values of k. In your case this seems very likely:
> there are few SNPs, and if half of them are specific to one individual they
> are not informative in terms of clusters.
>
> c) the method does not work for these data; k-means is very flexible, and I
> would be surprised to see it fail completely where other clustering methods
> would succeed. However, BIC is by no means the ultimate criterion for choosing
> k. It gave the best results on a set of simulations, it is fairly consistent
> statistically, but that's it. I would not be surprised to find some specific
> cases in which it would give poor results.
>
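
To make points (b) and (c) more concrete, here is a minimal sketch of how a
BIC curve is built from k-means output. It assumes the common form
BIC(k) = n*ln(WSS_k/n) + k*ln(n), where WSS_k is the within-cluster sum of
squares for k clusters; that form is an assumption on my part and is not
stated anywhere in this thread. The WSS values below are invented purely for
illustration.

```python
import math


def kmeans_bic(wss, n, k):
    """BIC for a k-means solution, assuming BIC = n*ln(WSS/n) + k*ln(n)."""
    if wss <= 0:
        # Every individual sits exactly on its cluster centre.
        return float("-inf")
    return n * math.log(wss / n) + k * math.log(n)


n = 100  # individuals per species, as in the original question
# Hypothetical within-cluster sums of squares for a few values of k.
wss_by_k = {1: 400.0, 5: 300.0, 20: 150.0, 30: 20.0, 40: 0.5}
curve = {k: kmeans_bic(w, n, k) for k, w in wss_by_k.items()}
for k in sorted(curve):
    print(k, round(curve[k], 1))
```

Under this form, the penalty k*ln(n) grows only linearly, while the first
term heads towards minus infinity as WSS approaches zero. So if clusters
shrink to groups of nearly identical individuals at large k, a steep drop in
BIC would be expected whether or not those clusters mean anything.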
>
> One possible approach to your problem is to simulate data under a sensible
> model (e.g. island model, IBD, ...) with similar numbers of SNPs and
> individuals to what you have, and see how find.clusters performs. This will
> at least tell you if you can rule out power issues. Easypop (
> http://www.unil.ch/dee/page36926_fr.html) will allow you to simulate data
> easily.
>
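
The power check suggested above can be sketched as follows. This is only a
toy stand-in, written in Python with the standard library: it is not
Easypop and not find.clusters. The simulator simply gives each deme its own
random allele frequencies, k-means is a plain Lloyd implementation run on
raw genotypes (no PCA step), and the BIC form n*ln(WSS/n) + k*ln(n) is
assumed. The 27 SNPs and 100 individuals mirror species A; the 4 demes and
all function names are hypothetical.

```python
import math
import random


def simulate(n_per_deme, n_snps, n_demes, rng):
    """Diploid genotypes (0, 1 or 2 variant copies) from demes with independent allele frequencies."""
    data = []
    for _ in range(n_demes):
        freqs = [rng.random() for _ in range(n_snps)]
        for _ in range(n_per_deme):
            data.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
    return data


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(data, k, rng, n_start=3, n_iter=30):
    """Smallest within-cluster sum of squares over n_start random starts of Lloyd's k-means."""
    best = float("inf")
    for _ in range(n_start):
        centers = [list(p) for p in rng.sample(data, k)]
        for _ in range(n_iter):
            # Assign each individual to its nearest centre.
            groups = [[] for _ in range(k)]
            for p in data:
                j = min(range(k), key=lambda c: sqdist(p, centers[c]))
                groups[j].append(p)
            # Recompute centres; an empty cluster keeps its old centre.
            new_centers = [
                [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
                for i, g in enumerate(groups)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(sqdist(p, c) for c in centers) for p in data)
        best = min(best, wss)
    return best


def bic_curve(data, max_k, rng):
    n = len(data)
    curve = {}
    for k in range(1, max_k + 1):
        wss = kmeans_wss(data, k, rng)
        curve[k] = n * math.log(wss / n) + k * math.log(n) if wss > 0 else float("-inf")
    return curve


rng = random.Random(1)  # fixed seed so the run is repeatable
data = simulate(n_per_deme=25, n_snps=27, n_demes=4, rng=rng)
curve = bic_curve(data, max_k=6, rng=rng)
for k, v in curve.items():
    print(k, round(v, 1))
```

If the curve from data simulated with a known number of demes shows no
clear minimum or elbow near that number, the real data probably cannot
resolve k either. The demes here are far more differentiated than a
realistic island model would give, so this is the easy case.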
> Lastly, if clusters make sense and are congruent across different datasets,
> then you should probably go for a given number of clusters - whatever is
> useful to describe your data - and explain that identifying the optimal
> number of clusters was not possible because of insufficient information.
>
> All the best
>
> Thibaut.
>
> ________________________________________
> From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [
> adegenet-forum-bounces at r-forge.wu-wien.ac.at] on behalf of Alastair Potts
> [potts.a at gmail.com]
> Sent: 02 June 2011 15:39
> To: adegenet-forum at r-forge.wu-wien.ac.at
> Subject: [adegenet-forum] Selecting the number of clusters in find.clusters
> under ever decreasing BIC values
>
> Dear Thibaut and adegenet users,
>
> I was wondering if you could help shed some light on how to select the
> number of clusters if the BIC value is continuously decreasing? I
> suppose a more important question is whether this means that my data
> does not have enough information for the DAPC approach?
>
> A bit of background:
>
> I am using linked nuclear SNPs from two plant species to infer the
> clusters and possible cluster relationships. The one species (A) has 27
> SNPs and the other (B) has 127 SNPs, but many of these (~45%) are unique
> to one individual. I have 100 individuals per species. What is
> interesting about these two plant species is that although they share
> very similar life histories they have very different dispersal
> ecologies. One is dispersed by very territorial bird species (A), and
> the other is dispersed by a range of mega-herbivores (B), the most
> notable of which are elephants. My chloroplast sequence data indicates
> high genetic structuring in the bird-dispersed species (A) and no
> genetic structuring in the elephant dispersed species (B). This makes a
> lot of sense given the historical migration patterns of elephants in my
> study area plus the long-residence time of seeds in the gut and that
> elephants are not great at digesting their food (so seeds pass through
> relatively unscathed).
>
> The problem:
>
> If I use find.clusters and retain the majority of the variance (~95%)
> and set the max.n.clust=50 then I obtain a # of clusters vs BIC curve
> that I am not sure what to do with.
>
> For species B - the curve starts as a plateau that slowly declines with
> increasing # of clusters, but as it gets to about 20 clusters it steeply
> declines. I imagine this is the point where individuals are starting to
> form their own clusters (probably due to the unique SNPs?).
> (1) If I do not expect to find too many clusters in this species (i.e. I
> consider the population panmictic because of its dispersal agent) would
> this pattern be expected if the population is panmictic?
>
> For species A - the curve continuously decreases until round about 30
> clusters and then just free falls into large negative BIC numbers.
>
> (2) I was really hoping that the number of clusters would correspond to
> those we found in the chloroplast data. How should I (or can I) decide
> on the number of clusters?
>
> If I keep almost all of the cumulative variance, then the shape of the # of
> clusters vs BIC curve remains the same for species A, but changes
> dramatically for species B: I get a curve looking like a normal
> distribution - i.e. BIC increases from 1 cluster, peaks
> around 20 clusters, and then decreases to below its value at 1 cluster.
>
> (3) Any ideas what this means? It seems that including just a little bit
> more of the cumulative variation near 100% drastically changes the shape
> of the graph for species B.
>
> I still think that the find.clusters + DAPC methods offer insights into
> my system. For example, when I set the number of clusters to the number
> of populations (or clades) found in the chloroplast data for species A
> then the corresponding clustering of individuals has a very good match
> to the clustering of individuals found in the cpDNA datasets. A similar
> result is achieved if I use the number of clades/clusters found in the
> cpDNA for species B - the assignment of individuals to clusters again
> matches very closely to that of the cpDNA - but there is just no
> geographical association with the clusters.
>
> This looks even more promising when I use the colorplot of the first
> three DAPC axes - the nDNA color clusters match up very nicely
> with the cpDNA.
>
> Oh, and I just wanted to add that I can eyeball the data from both
> species, and kind of see the clinal nDNA clusters that correspond to the
> cpDNA in species A, and also the lack of geographic structuring in the
> nDNA data for species B. So, my gut feeling is that there is a pattern
> there that corresponds to my cpDNA data, but traditional phylogeographic
> methods can't pick it apart. Hence my hope that I can use DAPC which has
> proven very promising, except for the part where I need to select clusters.
>
> I know there are no hard and fast rules to PCA in general - and probably
> more so for DAPC - but any hints or suggestions would be greatly
> appreciated (I am trying to head off reviewers' criticisms for using this
> method and the lack of a clear means to determine the number of clusters).
>
> Cheers
> Alastair Potts
>
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>