[adegenet-forum] Selecting the number of clusters in find.clusters under ever decreasing BIC values

Jombart, Thibaut t.jombart at imperial.ac.uk
Thu Jun 2 18:58:01 CEST 2011

Dear Alastair, 

Thanks for your questions and the extensive background on your problem. Some of your questions might be covered by the DAPC vignette, which will be part of the next release of adegenet; in the meantime, I have uploaded the pdf to the adegenet website, section "documents". In particular, see section 2.3, "How many clusters are there really in the data?". It does not address all of your problems, but it is probably worth reading. Essentially, it says that clusters are often not a biological reality but merely tools used to describe the data. So the short answer is "there is often no 'true' k". Some of the curves you describe just mean that BIC cannot be used to identify the optimal number of clusters for these data; in particular, I would probably consult an exorcist for the normal-looking curve.

Your questions still make sense, of course. Odd shapes in the decrease of BIC can occur for several reasons. The possible explanations I can think of are:

a) there are no clearly identifiable clusters in the data.

b) there are clusters to be identified, but not enough information to discriminate between different values of k. In your case this seems very likely: there are few SNPs, and if nearly half of them are each found in a single individual, those SNPs carry no information about clusters.

c) the method does not work for these data. K-means is very flexible, and I would be surprised to see it fail completely where other clustering methods would succeed. However, BIC is by no means the ultimate criterion for choosing k. It gave the best results on a set of simulations and it is fairly consistent statistically, but that's it. I would not be surprised to find some specific cases in which it gives poor results.
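
To make points a) and b) concrete, here is a minimal sketch of how a BIC curve can keep falling when there is no structure at all. It assumes a BIC of the form n*log(WSS/n) + k*log(n), where WSS is the within-cluster sum of squares; this is one common form for k-means, and whether find.clusters uses exactly this expression is an assumption. The data (twelve evenly spaced points) and the equal-block partition standing in for k-means are hypothetical, chosen only for illustration.

```python
import math

def wss(points, labels):
    """Within-cluster sum of squares for 1-D points under a given labelling."""
    groups = {}
    for x, g in zip(points, labels):
        groups.setdefault(g, []).append(x)
    total = 0.0
    for members in groups.values():
        centre = sum(members) / len(members)
        total += sum((x - centre) ** 2 for x in members)
    return total

def bic(wss_value, n, k):
    """Assumed BIC form: n*log(WSS/n) + k*log(n). Lower is better."""
    return n * math.log(wss_value / n) + k * math.log(n)

# Twelve evenly spaced points: no cluster structure whatsoever.
points = [float(i) for i in range(12)]
n = len(points)

# Stand-in for k-means: cut the sorted points into k equal consecutive blocks.
bic_by_k = {}
for k in (1, 2, 3, 4, 6):
    block = n // k
    labels = [i // block for i in range(n)]
    bic_by_k[k] = bic(wss(points, labels), n, k)

for k, value in bic_by_k.items():
    print(k, round(value, 2))
```

In this toy case the BIC drops at every step, and at k = n the WSS is zero, so the log term is undefined. Under the assumed formula, this is the kind of behaviour that would produce a free fall once clusters approach single individuals.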

One possible approach to your problem is to simulate data under a sensible model (e.g. island model, isolation by distance, ...) with numbers of SNPs and individuals similar to yours, and see how find.clusters performs. This will at least tell you whether you can rule out power issues. Easypop (http://www.unil.ch/dee/page36926_fr.html) will allow you to simulate data easily.
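
As a rough illustration of the kind of simulated input meant here, the sketch below generates diploid SNP genotypes for a few populations of known membership. It is not Easypop and not an island model: the function name, the toy frequency-shift model, and all parameter values are hypothetical, and loci are treated as unlinked, which your SNPs are not.

```python
import random

def simulate_genotypes(n_pops, n_per_pop, n_snps, divergence, seed=1):
    """Return (genotypes, pop_labels) for biallelic SNPs coded 0/1/2.

    Hypothetical toy model: each SNP has an ancestral allele frequency, and
    each population's frequency is that value shifted by a uniform amount in
    [-divergence, +divergence], clipped to [0.05, 0.95].
    """
    rng = random.Random(seed)
    ancestral = [rng.uniform(0.2, 0.8) for _ in range(n_snps)]
    genotypes, labels = [], []
    for pop in range(n_pops):
        freqs = [min(0.95, max(0.05, p + rng.uniform(-divergence, divergence)))
                 for p in ancestral]
        for _ in range(n_per_pop):
            # Two independent allele draws per SNP (diploid, unlinked loci).
            row = [int(rng.random() < f) + int(rng.random() < f) for f in freqs]
            genotypes.append(row)
            labels.append(pop)
    return genotypes, labels

# 100 individuals and 27 SNPs, matching the sample size described for species A.
genotypes, labels = simulate_genotypes(n_pops=4, n_per_pop=25, n_snps=27,
                                       divergence=0.2)

# One comma-separated line per individual, ready to be saved as a table.
csv_lines = [",".join(str(g) for g in row) for row in genotypes]
print(len(csv_lines), "individuals;", len(genotypes[0]), "SNPs")
```

The resulting table can be analysed with the same settings as the real data; if the known number of populations cannot be recovered at this sample size, the BIC curve on the real data is unlikely to be informative either.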

Lastly, if the clusters make sense and are congruent across different datasets, then you should probably settle on a given number of clusters (whatever is useful to describe your data) and explain that identifying the optimal number of clusters was not possible because of insufficient information.

All the best


From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] on behalf of Alastair Potts [potts.a at gmail.com]
Sent: 02 June 2011 15:39
To: adegenet-forum at r-forge.wu-wien.ac.at
Subject: [adegenet-forum] Selecting the number of clusters in find.clusters under ever decreasing BIC values

Dear Thibaut and adegenet users,

I was wondering if you could help shed some light on how to select the
number of clusters if the BIC value is continuously decreasing? I
suppose a more important question is whether this means that my data
do not have enough information for the DAPC approach?

A bit of background:

I am using linked nuclear SNPs from two plant species to infer the
clusters and possible cluster relationships. The one species (A) has 27
SNPs and the other (B) has 127 SNPs, but many of these (~45%) are unique
to one individual. I have 100 individuals per species. What is
interesting about these two plant species is that although they share
very similar life histories they have very different dispersal
ecologies. One is dispersed by very territorial bird species (A), and
the other is dispersed by a range of mega-herbivores (B), the most
notable of which are elephants. My chloroplast sequence data indicates
high genetic structuring in the bird-dispersed species (A) and no
genetic structuring in the elephant-dispersed species (B). This makes a
lot of sense given the historical migration patterns of elephants in my
study area plus the long-residence time of seeds in the gut and that
elephants are not great at digesting their food (so seeds pass through
relatively unscathed).

The problem:

If I use find.clusters, retain the majority of the variance (~95%)
and set max.n.clust=50, then I obtain a BIC vs # of clusters curve
that I am not sure what to do with.

For species B - the curve starts as a plateau that slowly declines with
increasing # of clusters, but as it gets to about 20 clusters it steeply
declines. I imagine this is the point where individuals are starting to
form their own clusters (probably due to the unique SNPs?).
(1) If I do not expect to find too many clusters in this species (i.e. I
consider the population panmictic because of its dispersal agent) would
this pattern be expected if the population is panmictic?

For species A - the curve continuously decreases until round about 30
clusters and then just free falls into large negative BIC numbers.

(2) I was really hoping that the number of clusters would correspond to
those we found in the chloroplast data. How should I (or can I) decide
on the number of clusters?

If I keep almost all of the cumulative variance, then the shape of the
BIC vs # of clusters curve remains the same for species A, but changes
dramatically for species B: I get a curve looking like a normal
distribution - i.e. BIC increases from its value at 1 cluster, peaks at
around 20 clusters and then decreases to below its value at 1 cluster.

(3) Any ideas what this means? It seems that including just a little bit
more of the cumulative variation near 100% drastically changes the shape
of the graph for species B.

I still think that the find.clusters + DAPC methods offer insights into
my system. For example, when I set the number of clusters to the number
of populations (or clades) found in the chloroplast data for species A
then the corresponding clustering of individuals has a very good match
to the clustering of individuals found in the cpDNA datasets. A similar
result is achieved if I use the number of clades/clusters found in the
cpDNA for species B - the assignment of individuals to clusters again
matches very closely to that of the cpDNA - but there is just no
geographical association with the clusters.

This looks even more promising when I use the colorplot of the first
three DAPC axes - the nDNA color clusters match up very nicely
with the cpDNA.

Oh, and I just wanted to add that I can eyeball the data from both
species, and kind of see the clinal nDNA clusters that correspond to the
cpDNA in species A, and also the lack of geographic structuring in the
nDNA data for species B. So, my gut feeling is that there is a pattern
there that corresponds to my cpDNA data, but traditional phylogeographic
methods can't pick it apart. Hence my hope that I can use DAPC which has
proven very promising, except for the part where I need to select clusters.

I know there are no hard and fast rules to PCA in general - and probably
more so for DAPC - but any hints or suggestions would be greatly
appreciated (I am trying to head off reviewers' criticisms for using this
method and the lack of a clear means to determine the number of clusters).

Alastair Potts
