[adegenet-forum] Selecting the number of clusters in find.clusters under ever decreasing BIC values

Thu Jun 2 16:39:37 CEST 2011

Dear Thibaut and adegenet users,

I was wondering if you could shed some light on how to select the 
number of clusters when the BIC value decreases continuously? I 
suppose a more important question is whether this means that my data 
do not contain enough information for the DAPC approach?

A bit of background:

I am using linked nuclear SNPs from two plant species to infer 
clusters and possible relationships among clusters. Species A has 27 
SNPs and species B has 127 SNPs, but many of these (~45%) are unique 
to one individual. I have 100 individuals per species. What is 
interesting about these two plant species is that although they share 
very similar life histories they have very different dispersal 
ecologies. Species A is dispersed by very territorial bird species, and 
species B is dispersed by a range of mega-herbivores, the most 
notable of which are elephants. My chloroplast sequence data indicate 
high genetic structuring in the bird-dispersed species (A) and no 
genetic structuring in the elephant-dispersed species (B). This makes a 
lot of sense given the historical migration patterns of elephants in my 
study area, the long residence time of seeds in the gut, and the fact 
that elephants are not great at digesting their food (so seeds pass 
through relatively unscathed).

The problem:

If I use find.clusters, retain the majority of the variance (~95%), 
and set max.n.clust=50, then I obtain a curve of BIC against the number 
of clusters that I am not sure how to interpret.

For species B - the curve starts as a plateau that slowly declines with 
increasing number of clusters, but at about 20 clusters it declines 
steeply. I imagine this is the point where individuals start to 
form their own clusters (probably due to the unique SNPs?).
(1) I do not expect to find many clusters in this species (i.e. I 
consider the population panmictic because of its dispersal agent). Is 
this the pattern one would expect for a panmictic population?

For species A - the curve decreases continuously until about 30 
clusters and then falls steeply to large negative BIC values.

(2) I was really hoping that the number of clusters would correspond to 
those we found in the chloroplast data. How should I (or can I) decide 
on the number of clusters?

If I keep almost all of the cumulative variance, then the shape of the 
BIC curve remains the same for species A, but changes 
dramatically for species B: I get a bell-shaped curve - i.e. BIC 
increases from 1 cluster, peaks at around 20 clusters, and then 
decreases to below its value at 1 cluster.

(3) Any ideas what this means? It seems that including just a little bit 
more of the cumulative variation near 100% drastically changes the shape 
of the graph for species B.
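
A minimal sketch of the comparison described above, run on simulated 
placeholder genotypes rather than the real data. The object `snps`, the 
helper `n_pcs_for()` and the 99.9% threshold standing in for "almost all" 
of the variance are assumptions made for illustration; only 
find.clusters, the ~95% threshold and max.n.clust=50 come from the 
analysis itself.

```r
library(adegenet)

set.seed(1)
# Placeholder data (assumption): 100 individuals x 127 biallelic SNPs,
# simulated without any population structure.
snps <- as.data.frame(matrix(rbinom(100 * 127, 1, 0.3), nrow = 100))

# Hypothetical helper: smallest number of PCs whose cumulative variance
# reaches `target` (centred, unscaled PCA).
n_pcs_for <- function(dat, target) {
  v <- prcomp(dat)$sdev^2
  which(cumsum(v) / sum(v) >= target)[1]
}

n95  <- n_pcs_for(snps, 0.95)   # "majority of the variance"
n999 <- n_pcs_for(snps, 0.999)  # "almost all of the variance"

# choose.n.clust = FALSE skips the interactive prompt; $Kstat then holds
# the BIC value obtained for each number of clusters tried.
fc95  <- find.clusters(snps, n.pca = n95,  max.n.clust = 50,
                       choose.n.clust = FALSE, criterion = "diffNgroup")
fc999 <- find.clusters(snps, n.pca = n999, max.n.clust = 50,
                       choose.n.clust = FALSE, criterion = "diffNgroup")

# Overlay the two BIC curves to see how the retained variance changes them.
plot(seq_along(fc95$Kstat), fc95$Kstat, type = "b",
     ylim = range(fc95$Kstat, fc999$Kstat),
     xlab = "Number of clusters", ylab = "BIC")
lines(seq_along(fc999$Kstat), fc999$Kstat, type = "b", lty = 2)
legend("topright", legend = c("~95% variance", "~99.9% variance"),
       lty = c(1, 2))
```

Running the same two calls on each species' genotypes would show directly 
whether the change in curve shape tracks the number of retained PCs.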

I still think that the find.clusters + DAPC methods offer insights into 
my system. For example, when I set the number of clusters to the number 
of populations (or clades) found in the chloroplast data for species A, 
the resulting clustering of individuals matches the clustering found in 
the cpDNA dataset very closely. I get a similar 
result if I use the number of clades/clusters found in the 
cpDNA for species B - the assignment of individuals to clusters again 
matches that of the cpDNA very closely - but there is just no 
geographical association with the clusters.
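
The check described above, fixing the number of clusters and comparing 
the resulting groups with the cpDNA clades, can be sketched as follows. 
Everything here is simulated: the three groups, the factor `clade` 
standing in for cpDNA clade membership, and the choice of 20 retained 
PCs are assumptions for illustration, not values from the real analysis.

```r
library(adegenet)

set.seed(2)
# Placeholder data (assumption): 3 groups of 30 individuals, 60 biallelic
# SNPs, each group with its own allele frequencies.
n_per  <- 30
n_loci <- 60
freqs  <- matrix(runif(3 * n_loci, 0.05, 0.95), nrow = 3)
clade  <- factor(rep(c("c1", "c2", "c3"), each = n_per))  # hypothetical cpDNA clades
snps   <- as.data.frame(
  t(sapply(as.integer(clade), function(g) rbinom(n_loci, 1, freqs[g, ])))
)

# Fix the number of clusters to the number of cpDNA clades.
k  <- nlevels(clade)
fc <- find.clusters(snps, n.pca = 20, n.clust = k)

# DAPC on the inferred groups; n.da = k - 1 is the maximum number of
# discriminant functions for k groups.
dapc1 <- dapc(snps, fc$grp, n.pca = 20, n.da = k - 1)

# Cross-tabulate cpDNA clade against nuclear cluster. Cluster labels are
# arbitrary, so a good match shows up as one dominant cell per row.
tab <- table(cpDNA = clade, nDNA = fc$grp)
print(tab)

# Proportion of individuals that DAPC reassigns to their own cluster.
reassigned <- mean(as.character(dapc1$assign) == as.character(fc$grp))
print(reassigned)
```

A cross-tabulation like this gives a number to report alongside the 
visual match, which may help when the BIC curve alone does not justify 
the chosen number of clusters.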

This looks even more promising when I use the colorplot of the first 
three DAPC axes - the nDNA color clusters match up very nicely 
with the cpDNA.

Oh, and I just wanted to add that I can eyeball the data from both 
species, and kind of see the clinal nDNA clusters that correspond to the 
cpDNA in species A, and also the lack of geographic structuring in the 
nDNA data for species B. So, my gut feeling is that there is a pattern 
there that corresponds to my cpDNA data, but traditional phylogeographic 
methods can't pick it apart. Hence my hope that I can use DAPC, which has 
proven very promising, except for the part where I need to select the 
number of clusters.

I know there are no hard and fast rules for PCA in general - and probably 
even fewer for DAPC - but any hints or suggestions would be greatly 
appreciated (I am trying to head off reviewers' criticisms of this 
method and of the lack of a clear means to determine the number of clusters).

Cheers
Alastair Potts