[adegenet-forum] Very different number of clusters in different datasets.

Mon Nov 9 15:43:35 CET 2015

Hi there,

There are quite a few questions here, and I may miss one or two.

In a nutshell:

- It happens that k-means finds clusters where STRUCTURE fails (see original paper); this is not necessarily a sign that find.clusters is wrong. In your case, for the microsatellite data, it looks as if any clusters that do exist are not linked to the geographic locations. It is hard to say more without seeing the outputs or the data.

- The graph of your second analysis (SNPs) shows no structure. k=1 is not nonsensical, it is just a suggestion that there are no clusters in your data.

- xvalDapc has not been implemented (yet) for genlight objects; to convert data into a suitable format try as.matrix(...).

- Cross-validation is to be preferred to the a-score.

- MDS is not a clustering method.

- MDS optimizes the representation of overall diversity, so it may fail to detect group structure.
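
As a rough illustration of the as.matrix(...) workaround mentioned above, here is a minimal sketch. The toy genotypes, the object names (geno, snps, grp, X) and the five hypothetical sites are assumptions made only so the example runs on its own; with real data you would start from your own genlight object and grouping factor. The mean-imputation step is also an assumption on my side, added because the matrix passed to xvalDapc should not contain missing values.

```r
library(adegenet)

set.seed(1)

# Toy stand-in for real SNP data (assumption): 50 individuals x 200 SNPs,
# genotypes coded as 0/1/2 allele copies, with no built-in structure.
geno <- matrix(rbinom(50 * 200, size = 2, prob = 0.3), nrow = 50)
snps <- new("genlight", gen = geno, parallel = FALSE)

# Hypothetical group labels: five sampling sites of 10 individuals each.
grp <- factor(rep(paste0("site", 1:5), each = 10))

# genlight stores genotypes in a compressed form, so first extract a plain
# individuals x loci numeric matrix.
X <- as.matrix(snps)

# Replace any missing genotype by the mean of its locus (assumed strategy;
# a no-op here because the toy data are complete).
locus_means <- colMeans(X, na.rm = TRUE)
na_idx <- which(is.na(X), arr.ind = TRUE)
X[na_idx] <- locus_means[na_idx[, "col"]]

# Cross-validate the number of PCs retained in the DAPC.
xval <- xvalDapc(X, grp, n.pca.max = 30, training.set = 0.9,
                 result = "groupMean", n.rep = 10, xval.plot = FALSE)

# List the components of the result (success rates, MSE, final DAPC, ...).
names(xval)
```

n.pca.max and n.rep are kept small here only to make the toy run quickly; they would normally be larger on a real dataset.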

Cheers
Thibaut

________________________________

From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Peri Bolton [peri.bolton at students.mq.edu.au]
Sent: 30 October 2015 11:40
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Very different number of clusters in different datasets.

Dear adegenet developers and users,

I have a dataset with 50 individuals across 5 sampling locations in a microsatellite dataset, and roughly equivalent numbers of individuals in a SNP dataset with 3839 loci.
I have just been interested in finding whether there is any population structure in my species. However, when I run the different datasets I get different answers, and some of them look strange.

Microsatellite dataset:
Fst, a Mantel test for IBD and STRUCTURE all find zero evidence of structure...

find.clusters says k=4 or 5
Then I run optim.a.score and xvalDapc to find the best number of PCs to retain for a dapc, and I get nice groups in the final answer, with apparently good assignment power back to the original groups.
However, my a-scores for that dapc run are as follows:
        1         2         3         4
0.4905714 0.5570149 0.7075510 0.5962500

Further, when I visualise this as a compoplot there is no evidence that these structures actually represent any kind of geographic structure in the data, as the groups are just randomly dispersed through my individuals.

I have read in forum threads that if there is enough space in the data, find.clusters will find an optimal clustering solution, whether or not it is biologically realistic. I have also read that find.clusters shouldn't find an optimal solution at k=1, because that is meant to be a nonsense solution for a cluster. Indeed this makes sense, because when you use sampling locality as a prior in dapc it all comes out as one big cluster.

HOWEVER, when I run my SNP dataset things get really strange.

I ran essentially all the same procedures and I've come up against a number of hurdles:

1. I can't get the xvalDapc to work on a genlight object. I keep getting an error:

Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class "structure("SNPbin", package = "adegenet")" to a data.frame
In addition: Warning message:
In min(dim(x)) : no non-missing arguments to min; returning Inf

Obviously this is because genlight doesn't store the genetic data in the same way as genind objects do. Is there a workaround for using this function?

So far I have got xvalDapc to work on my genind objects, but I do get a bunch of warning messages such as "49: In if (result == "overall") { ... : the condition has length > 1 and only the first element will be used", but it seems to spit out an output at least....

2. When I run find.clusters my cumulative variance plot is nearly linear... as is my BIC vs. K plot, with the optimal solution being the supposedly nonsensical k=1 (see the attached pdf of the output). Is there something weird with my data? Or is that the genuine signal coming through? When I use other methods such as fastSTRUCTURE and MDS I don't get any indication of structure either. HOWEVER, I don't know how to reconcile the two clustering solutions from the two nuclear data sources.

3. When I run an a.score analysis it is basically a flat line, and although it finds an "optimal" number of PCs to retain, it doesn't seem very reliable to me (see also attached).

I am aware that there are a few problems here, but hopefully the itemisation and the context of my questions will help any good-hearted people out there who are willing to answer.

Sincerely,

Peri

--
Peri Bolton
PhD Candidate, Griffith Lab <http://bio.mq.edu.au/avianbehaviouralecology/>
Department of Biological Sciences
Macquarie University, NSW 2109, Australia