[adegenet-forum] Selecting the number of clusters in find.clusters under ever decreasing BIC values
Alastair Potts
potts.a at gmail.com
Thu Jun 2 16:39:37 CEST 2011
Dear Thibaut and adegenet users,
I was wondering if you could shed some light on how to select the
number of clusters when the BIC value decreases continuously. I
suppose a more important question is whether this means that my data
does not have enough information for the DAPC approach?
A bit of background:
I am using linked nuclear SNPs from two plant species to infer the
clusters and possible cluster relationships. One species (A) has 27
SNPs and the other (B) has 127 SNPs, but many of these (~45%) are unique
to one individual. I have 100 individuals per species. What is
interesting about these two plant species is that although they share
very similar life histories they have very different dispersal
ecologies. One is dispersed by very territorial bird species (A), and
the other is dispersed by a range of mega-herbivores (B), the most
notable of which are elephants. My chloroplast sequence data indicates
high genetic structuring in the bird-dispersed species (A) and no
genetic structuring in the elephant-dispersed species (B). This makes a
lot of sense given the historical migration patterns of elephants in my
study area, plus the long residence time of seeds in the gut and the fact
that elephants are not great at digesting their food (so seeds pass through
relatively unscathed).
The problem:
If I use find.clusters, retain the majority of the variance (~95%),
and set max.n.clust=50, then I obtain a # of clusters vs BIC curve
that I am not sure what to do with.
For species B - the curve starts as a plateau that slowly declines with
increasing # of clusters, but as it gets to about 20 clusters it steeply
declines. I imagine this is the point where individuals are starting to
form their own clusters (probably due to the unique SNPs?).
(1) If I do not expect to find too many clusters in this species (i.e. I
consider the population panmictic because of its dispersal agent) would
this pattern be expected if the population is panmictic?
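
The suspicion above, that the steep decline starts once clusters shrink
towards single individuals, can be illustrated with a toy calculation.
This is only a sketch in Python, not adegenet code and not my data. It
assumes the criterion has the form n*log(WSS/n) + k*log(n), where WSS is
the within-cluster sum of squares; the function names bic_like and
min_wss_1d and the 20 made-up scores are hypothetical.

```python
import math


def bic_like(n, wss, k):
    # Assumed form of the criterion: n*log(WSS/n) + k*log(n).
    return n * math.log(wss / n) + k * math.log(n)


def min_wss_1d(values, k):
    """Smallest within-cluster sum of squares for k clusters of 1-D values.

    In one dimension the best clusters are contiguous runs of the sorted
    values, so a dynamic programme over split points finds the optimum.
    """
    xs = sorted(values)
    n = len(xs)
    s1 = [0.0] * (n + 1)  # prefix sums of x
    s2 = [0.0] * (n + 1)  # prefix sums of x squared
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):
        # Sum of squares of xs[i:j] around its own mean.
        m = j - i
        s = s1[j] - s1[i]
        return max(0.0, (s2[j] - s2[i]) - s * s / m)

    inf = float("inf")
    dp = [[inf] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            dp[m][j] = min(dp[m - 1][i] + cost(i, j) for i in range(m - 1, j))
    return dp[k][n]


# Made-up scores: 5 groups of 4 near-identical individuals each.
values = [g + 0.001 * j for g in range(5) for j in range(4)]
n = len(values)
scores = {k: bic_like(n, min_wss_1d(values, k), k) for k in range(1, 9)}

for k in sorted(scores):
    print(k, round(scores[k], 1))
```

In this toy case the score falls only gradually up to k = 4 and then
plunges at k = 5, when every cluster contains near-identical individuals
and WSS is almost zero; it keeps creeping down after that. If
find.clusters behaves similarly, a steep late decline would say more
about near-duplicate genotypes than about real groups, but that is an
assumption rather than something checked against the package.
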
For species A - the curve decreases continuously until about 30
clusters and then just free-falls into large negative BIC values.
(2) I was really hoping that the number of clusters would correspond to
those we found in the chloroplast data. How should I (or can I) decide
on the number of clusters?
If I keep almost all of the cumulative variance, then the shape of the # of
clusters vs BIC curve remains the same for species A, but changes
dramatically for species B: I get a curve looking like a normal
distribution - i.e. BIC starts increasing from 1 cluster, hits a high point
around 20 clusters and decreases to below its starting value at 1 cluster.
(3) Any ideas what this means? It seems that including just a little bit
more of the cumulative variation near 100% drastically changes the shape
of the graph for species B.
I still think that the find.clusters + DAPC methods offer insights into
my system. For example, when I set the number of clusters to the number
of populations (or clades) found in the chloroplast data for species A
then the corresponding clustering of individuals has a very good match
to the clustering of individuals found in the cpDNA datasets. A similar
result is achieved if I use the number of clades/clusters found in the
cpDNA for species B - the assignment of individuals to clusters again
matches very closely to that of the cpDNA - but there is just no
geographical association with the clusters.
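
One way to put a number on "a very good match" is to cross-tabulate the
two sets of assignments and compute a simple agreement score. The sketch
below is in Python with made-up labels, not my data; the names cross_tab
and rand_index and the label vectors are hypothetical. It uses the plain
Rand index, the fraction of pairs of individuals that both clusterings
treat the same way (together in both, or apart in both).

```python
from collections import Counter
from itertools import combinations


def cross_tab(a, b):
    # Count how many individuals fall in each (cluster in a, cluster in b) cell.
    return Counter(zip(a, b))


def rand_index(a, b):
    # Fraction of pairs of individuals on which the two clusterings agree.
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Made-up assignments for 8 individuals (illustration only).
ndna = [1, 1, 1, 2, 2, 2, 3, 3]                   # e.g. nDNA clusters
cpdna = ["X", "X", "X", "Y", "Y", "Z", "Z", "Z"]  # e.g. cpDNA clades

table = cross_tab(ndna, cpdna)
for cell in sorted(table):
    print(cell, table[cell])
print(rand_index(ndna, cpdna))
```

The index does not depend on how the clusters are named, only on which
individuals are grouped together, so it can compare numbered nDNA
clusters with named cpDNA clades directly. It is not corrected for
chance agreement, so a high value on its own is only a rough indication.
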
This looks even more promising when I use the colorplot of the first
three DAPC discriminant functions - the nDNA color clusters match up very
nicely with the cpDNA.
Oh, and I just wanted to add that I can eyeball the data from both
species, and kind of see the clinal nDNA clusters that correspond to the
cpDNA in species A, and also the lack of geographic structuring in the
nDNA data for species B. So, my gut feeling is that there is a pattern
there that corresponds to my cpDNA data, but traditional phylogeographic
methods can't pick it apart. Hence my hope that I can use DAPC which has
proven very promising, except for the part where I need to select clusters.
I know there are no hard and fast rules to PCA in general - and probably
more so for DAPC - but any hints or suggestions would be greatly
appreciated (I am trying to head off reviewers' criticisms for using this
method and the lack of a clear means to determine the number of clusters).
Cheers
Alastair Potts