[adegenet-forum] find.clusters producing different 'best' solutions in different runs

Jombart, Thibaut t.jombart at imperial.ac.uk
Wed Mar 16 12:37:36 CET 2011


Dear Pip, 

Yes, this is the usual behaviour: K-means is a heuristic algorithm, so with empirical data two runs will generally not return exactly the same solution. One possible issue is that the algorithm did not converge. To help ensure convergence, use the arguments:
 n.iter=1e5, n.start=100

that is, K-means is run 100 times, each with a maximum of 100,000 iterations. I think the default for n.iter is much lower in the current stable release of adegenet (it is 1e5 in the current devel version).
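To illustrate why more random starts help, here is a minimal sketch using base R's kmeans() on simulated binary data. It assumes that n.start and n.iter play the same role as the nstart and iter.max arguments of kmeans(); the simulated matrix X, the six source groups and the helper wss_over_runs() are hypothetical and only stand in for a real dataset.

```r
# Simulated stand-in for a binary presence/absence table
# (297 individuals x 97 alleles, as in the question); not real data.
set.seed(1)
n_ind <- 297
n_var <- 97
group <- sample(1:6, n_ind, replace = TRUE)    # 6 hypothetical source groups
freq  <- matrix(runif(6 * n_var), nrow = 6)    # per-group allele frequencies
X <- matrix(rbinom(n_ind * n_var, 1, freq[group, ]), nrow = n_ind)

# Hypothetical helper: total within-cluster sum of squares obtained
# from repeated, independent calls to kmeans().
wss_over_runs <- function(x, k, n_runs, nstart, iter.max) {
  vapply(seq_len(n_runs), function(i) {
    kmeans(x, centers = k, nstart = nstart, iter.max = iter.max)$tot.withinss
  }, numeric(1))
}

# One random start per run versus the best of 100 random starts per run.
single <- wss_over_runs(X, k = 6, n_runs = 20, nstart = 1,   iter.max = 1e5)
multi  <- wss_over_runs(X, k = 6, n_runs = 20, nstart = 100, iter.max = 1e5)

# Spread of the objective across runs; a narrower range means more
# reproducible solutions.
print(range(single))
print(range(multi))

# The corresponding adegenet call would look like this, where 'x' is a
# hypothetical genind object and n.pca = 40 is taken from the question:
# grp <- find.clusters(x, n.pca = 40, n.iter = 1e5, n.start = 100)
```

Each run with many starts keeps the best of its starts, so the spread of the objective across runs would be expected to shrink, although it need not vanish.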

As for the number of PCs, keeping more of them, or even all of them, is recommended in your case, since there are not many genetic variables in this dataset.

If the number of clusters still varies, this is not a huge issue: it simply means that your data can be modelled by, say, 6 fairly well-defined groups, and that the remaining genetic variation is not arranged into clear-cut clusters.
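One way to check how stable the selected number of clusters is would be to repeat the whole selection several times and tabulate the chosen K. The sketch below does this in base R on simulated data. It assumes a generic BIC-style criterion, n*log(WSS/n) + k*log(n), which may differ from the exact computation in adegenet; the data, the helper choose_k() and the range of K values tried are hypothetical.

```r
# Simulated binary data again (not real data), reduced to 40 PCs
# as in the question.
set.seed(2)
n_ind <- 297
n_var <- 97
group <- sample(1:6, n_ind, replace = TRUE)
freq  <- matrix(runif(6 * n_var), nrow = 6)
X <- matrix(rbinom(n_ind * n_var, 1, freq[group, ]), nrow = n_ind)
pcs <- prcomp(X)$x[, 1:40]

# Hypothetical helper: run K-means for K = 2..k_max and return the K that
# minimises an assumed BIC-style criterion.
choose_k <- function(x, k_max, nstart) {
  n  <- nrow(x)
  ks <- 2:k_max
  wss <- vapply(ks, function(k) {
    kmeans(x, centers = k, nstart = nstart, iter.max = 1e5)$tot.withinss
  }, numeric(1))
  bic <- n * log(wss / n) + ks * log(n)
  ks[which.min(bic)]
}

# Repeat the whole selection 10 times with few and with many random starts.
k_few  <- replicate(10, choose_k(pcs, k_max = 12, nstart = 1))
k_many <- replicate(10, choose_k(pcs, k_max = 12, nstart = 25))

# Frequency of each selected K across the repeated runs.
print(table(k_few))
print(table(k_many))
```

If the table stays spread over several values of K even with many starts, that would be consistent with the interpretation above: a core of well-defined groups plus residual variation without clear-cut structure.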

All the best

Thibaut

________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Pip Griffin [pip.griffin at gmail.com]
Sent: 15 March 2011 03:49
To: adegenet-forum at r-forge.wu-wien.ac.at
Subject: [adegenet-forum] find.clusters producing different 'best' solutions in different runs

Dear Thibaut and Adegenet users,

I have a polyploid dataset coded as binary (PA datatype) containing 297 individuals and 97 'loci' (microsatellite alleles). I've been implementing the find.clusters command, retaining 40 PCA axes to capture >95% of the variance.

The issue is that I get different 'best' solutions for the number of clusters K in different find.clusters runs, with a modal value of 9, but ranging from 6 to 12. Obviously the actual differences in BIC value are pretty small, but even when I designate a 'cut-off' (e.g. the BIC value must decrease by at least 2 for the solution to be 'better' than the previous K), there is variation in the solution.

This variability is even higher when I choose fewer PCA axes to retain (e.g. retaining 80% of the variance), as would be expected, but even when I use 100 PCA axes (>>95% of variance), the value varies between 'runs'.

Has anyone else observed this - and do you have any advice?

Thanks for your help

Pip

