Hi there,<div><br></div><div>I should say up front that I am quite ignorant about plants; at the best moment of my life I was able to distinguish gymnosperms from angiosperms, and I was even proud of that. Anyway, I just wanted to say that, although chloroplast DNA is maternally transmitted, if your plants are hermaphroditic with a high percentage of self-fertilising reproduction, the cpDNA distribution could be reliable for inferring population structure. In that case you wouldn&#39;t have the issue of sex-biased dispersal, and autosomal and cpDNA loci are more or less transmitted jointly (unlike mtDNA in animals), so they should show a concordant distribution. This could be a good argument for using cpDNA for the clustering... (but I may be completely wrong).</div>

<div><br></div><div>It&#39;s true that you are not lucky at all with the SNP data, which I guess is quite frustrating, considering how much we lean on autosomal loci. For species A there is probably nothing you can do to obtain something better. For species B, the fact that almost half of the SNPs are present in only one individual gives the impression of a population in rapid expansion. Even if 1% is the threshold frequency for considering a SNP a real variant, these mutations will probably be lost within a few generations, so including them in the dataset may add a bit of noise. But I have no idea whether it would change anything in the DAPC.</div>
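<div><br></div><div>A minimal sketch of that filtering step, in Python with only the standard library, assuming each individual is coded as a row of 0/1 values per SNP; the function name drop_singletons is hypothetical and not part of adegenet:</div>

<pre>
```python
def drop_singletons(genotypes):
    """Drop SNP columns whose minor allele is carried by at most one individual.

    Assumption: genotypes is a list of rows (individuals), each a list of
    0/1 values (one per SNP). Returns the filtered rows and the kept indices.
    """
    n = len(genotypes)
    keep = []
    for j, column in enumerate(zip(*genotypes)):
        carriers = sum(column)
        minor = min(carriers, n - carriers)  # count of the rarer allele
        if minor > 1:
            keep.append(j)
    return [[row[j] for j in keep] for row in genotypes], keep


# Toy data: 5 individuals, 3 SNPs; the second SNP is private to one individual.
toy = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]
filtered, kept = drop_singletons(toy)
print(kept)
```
</pre>

<div>Running the clustering on both the full and the filtered matrix would show whether the singletons change anything.</div>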

<div><br></div><div>Best regards</div><div><br></div><div>Valeria</div><div><br><div class="gmail_quote">On 2 June 2011 18:58, Jombart, Thibaut <span dir="ltr">&lt;<a href="mailto:t.jombart@imperial.ac.uk" target="_blank">t.jombart@imperial.ac.uk</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Dear Alastair,<br>

<br>

thanks for your questions and the extensive background on your problem. Some of your questions might be covered by the vignette on DAPC, which will be part of the next release of adegenet; in the meantime, I have uploaded the PDF to the adegenet website, section &quot;documents&quot;. In particular, see section 2.3, &quot;How many clusters are there really in the data?&quot;. It does not address all of your problems, but it is probably worth reading. Essentially, what it says is that clusters are often not a biological reality, but merely tools used to describe the data. So the short answer is &quot;there is often no &#39;true&#39; k&quot;. Some of the curves you describe just mean that BIC cannot be used to identify the optimal number of clusters; in particular, I would probably consult an exorcist for the normal-looking curve.<br>


<br>

Your questions still make sense, of course. Odd shapes of the decrease of BIC can occur for several reasons. The possible explanations I can think of are:<br>

a) there are no clearly identifiable clusters in the data.<br>

<br>

b) there are clusters to be identified, but not enough information to disentangle different values of k. In your case this seems very likely: there are few SNPs, and if half of them are specific to one individual they are not informative in terms of clusters.<br>


<br>

c) the method does not work for these data; k-means is very flexible, and I would be surprised to see it fail completely where other clustering methods would succeed. However, BIC is by no means the ultimate criterion for choosing k. It gave the best results on a set of simulations and it is fairly consistent statistically, but that&#39;s it. I would not be surprised to find some specific cases in which it gives poor results.<br>


<br>

<br>

One possible approach to your problem is to simulate data under a sensible model (e.g. island model, IBD, ...) with numbers of SNPs and individuals similar to what you have, and see how find.clusters performs. This will at least tell you whether you can rule out power issues. Easypop (<a href="http://www.unil.ch/dee/page36926_fr.html" target="_blank">http://www.unil.ch/dee/page36926_fr.html</a>) will allow you to simulate data easily.<br>
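<br>

A minimal sketch of such a power check, written in Python with only the standard library rather than with Easypop and find.clusters: it simulates 0/1 SNP genotypes for a few populations with different allele frequencies, runs a plain k-means for a range of k, and prints a BIC-like score for each. All function names are hypothetical, the simulation is far cruder than an island or IBD model, and the score n*log(WSS/n) + k*log(n) is an assumed form, not necessarily what find.clusters computes.<br>

<pre>
```python
import math
import random


def simulate_snps(n_pops, n_per_pop, n_snps, rng):
    """Draw 0/1 genotypes; each population gets its own allele frequencies."""
    data = []
    for _ in range(n_pops):
        freqs = [rng.random() for _ in range(n_snps)]
        for _ in range(n_per_pop):
            data.append([1 if f > rng.random() else 0 for f in freqs])
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wss(data, k, rng, n_iter=50):
    """Plain Lloyd's k-means from one random start.

    Returns the sum of squared distances of each row to its nearest center.
    """
    centers = [list(row) for row in rng.sample(data, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for row in data:
            nearest = min(range(k), key=lambda j: sq_dist(row, centers[j]))
            groups[nearest].append(row)
        for j, members in enumerate(groups):
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return sum(min(sq_dist(row, c) for c in centers) for row in data)


def bic(wss, n, k):
    """BIC-like score (assumed form); lower is better."""
    if wss == 0:
        return float("-inf")  # every individual sits exactly on a center
    return n * math.log(wss / n) + k * math.log(n)


rng = random.Random(1)
# Same order of magnitude as species A: 100 individuals, 27 SNPs.
data = simulate_snps(n_pops=4, n_per_pop=25, n_snps=27, rng=rng)
scores = {k: bic(kmeans_wss(data, k, rng), len(data), k) for k in range(1, 11)}
for k, score in scores.items():
    print(k, round(score, 1))
```
</pre>

If the score shows no clear elbow at the simulated number of populations even when they are well differentiated, a lack of power is a plausible explanation for your curves. Note that a score of this form drops to minus infinity as soon as WSS reaches zero.<br>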


<br>

Lastly, if clusters make sense and are congruent across different datasets, then you should probably go for a given number of clusters - whatever is useful to describe your data - and explain that identifying the optimal number of clusters was not possible because of insufficient information.<br>


<br>

All the best<br>

<br>

Thibaut.<br>

<br>

________________________________________<br>

From: <a href="mailto:adegenet-forum-bounces@r-forge.wu-wien.ac.at" target="_blank">adegenet-forum-bounces@r-forge.wu-wien.ac.at</a> [<a href="mailto:adegenet-forum-bounces@r-forge.wu-wien.ac.at" target="_blank">adegenet-forum-bounces@r-forge.wu-wien.ac.at</a>] on behalf of Alastair Potts [<a href="mailto:potts.a@gmail.com" target="_blank">potts.a@gmail.com</a>]<br>


Sent: 02 June 2011 15:39<br>

To: <a href="mailto:adegenet-forum@r-forge.wu-wien.ac.at" target="_blank">adegenet-forum@r-forge.wu-wien.ac.at</a><br>

Subject: [adegenet-forum] Selecting the number of clusters in find.clusters under ever decreasing BIC values<br>

<div><div></div><div><br>

Dear Thibaut and adegenet users,<br>

<br>

I was wondering if you could help shed some light on how to select the<br>

number of clusters if the BIC value is continuously decreasing? I<br>

suppose a more important question is whether this means that my data<br>

does not have enough information for the DAPC approach?<br>

<br>

A bit of background:<br>

<br>

I am using linked nuclear SNPs from two plant species to infer the<br>

clusters and possible cluster relationships. The one species (A) has 27<br>

SNPs and the other (B) has 127 SNPs, but many of these (~45%) are unique<br>

to one individual. I have 100 individuals per species. What is<br>

interesting about these two plant species is that although they share<br>

very similar life histories they have very different dispersal<br>

ecologies. One is dispersed by very territorial bird species (A), and<br>

the other is dispersed by a range of mega-herbivores (B), the most<br>

notable of which are elephants. My chloroplast sequence data indicates<br>

high genetic structuring in the bird-dispersed species (A) and no<br>

genetic structuring in the elephant-dispersed species (B). This makes a<br>

lot of sense given the historical migration patterns of elephants in my<br>

study area plus the long-residence time of seeds in the gut and that<br>

elephants are not great at digesting their food (so seeds pass through<br>

relatively unscathed).<br>

<br>

The problem:<br>

<br>

If I use find.clusters and retain the majority of the variance (~95%)<br>

and set the max.n.clust=50 then I obtain a # of clusters vs BIC curve<br>

that I am not sure what to do with.<br>

<br>

For species B - the curve starts as a plateau that slowly declines with<br>

increasing # of clusters, but as it gets to about 20 clusters it steeply<br>

declines. I imagine this is the point where individuals are starting to<br>

form their own clusters (probably due to the unique SNPs?).<br>

(1) If I do not expect to find too many clusters in this species (i.e. I<br>

consider the population panmictic because of its dispersal agent) would<br>

this pattern be expected if the population is panmictic?<br>

<br>

For species A - the curve continuously decreases until round about 30<br>

clusters and then just free falls into large negative BIC numbers.<br>

<br>

(2) I was really hoping that the number of clusters would correspond to<br>

those we found in the chloroplast data. How should I (or can I) decide<br>

on the number of clusters?<br>

<br>

If I keep almost all of the cumulative variance, then the shape of the # of<br>

clusters vs BIC curve remains the same for species A, but changes<br>

dramatically for species B : I get a curve looking like a normal<br>

distribution - i.e. BIC starts increasing from 1, hits a high point<br>

around 20 clusters and decreases to below the BIC starting value of 1.<br>

<br>

(3) Any ideas what this means? It seems that including just a little bit<br>

more of the cumulative variation near 100% drastically changes the shape<br>

of the graph for species B.<br>

<br>

I still think that the find.clusters + DAPC methods offer insights into<br>

my system. For example, when I set the number of clusters to the number<br>

of populations (or clades) found in the chloroplast data for species A<br>

then the corresponding clustering of individuals has a very good match<br>

to the clustering of individuals found in the cpDNA datasets. A similar<br>

result is achieved if I use the number of clades/clusters found in the<br>

cpDNA for species B - the assignment of individuals to clusters again<br>

matches very closely to that of the cpDNA - but there is just no<br>

geographical association with the clusters.<br>

<br>

This looks even more promising when I use the colorplot of the first<br>

three DAPC axes - the nDNA color clusters match up very nicely<br>

with the cpDNA.<br>

<br>

Oh, and I just wanted to add that I can eyeball the data from both<br>

species, and kind of see the clinal nDNA clusters that correspond to the<br>

cpDNA in species A, and also the lack of geographic structuring in the<br>

nDNA data for species B. So, my gut feeling is that there is a pattern<br>

there that corresponds to my cpDNA data, but traditional phylogeographic<br>

methods can&#39;t pick it apart. Hence my hope that I can use DAPC which has<br>

proven very promising, except for the part where I need to select clusters.<br>

<br>

I know there are no hard and fast rules to PCA in general - and probably<br>

more so for DAPC - but any hints or suggestions would be greatly<br>

appreciated (I am trying to head off reviewers&#39; criticisms for using this<br>

method and the lack of a clear means to determine the number of clusters).<br>

<br>

Cheers<br>

Alastair Potts<br>

<br>

<br>

_______________________________________________<br>

adegenet-forum mailing list<br>

<a href="mailto:adegenet-forum@lists.r-forge.r-project.org" target="_blank">adegenet-forum@lists.r-forge.r-project.org</a><br>

<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum</a><br>


</div></div></blockquote></div><br></div>