[adegenet-forum] find.clusters and optim.a.score
t.jombart at imperial.ac.uk
Mon Feb 28 19:48:15 CET 2011
It is virtually impossible to check someone's analyses without going through the data and re-analysing them. However, I can comment on what you describe.
The analysis seems to make sense. I would not be too surprised that DAPC finds patterns overlooked by STRUCTURE, since even using very simple simulations we showed this is likely to happen quite often. This does not necessarily imply that the overlooked structures are weak: STRUCTURE can overlook strong patterns as well. There can be many reasons for the existence of genetic clusters, and they do not have to correspond to geographic patterns (e.g. different ancestries for sympatric individuals).

The function optim.a.score is meant to select the optimal number of PCs for assignment purposes only, and is unrelated to the selection of the number of clusters in find.clusters. It is safe to keep as much information as possible for find.clusters.
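To make the order of the steps concrete, here is a minimal sketch of the workflow discussed above. It is only an illustration under stated assumptions: it uses the nancycats example data shipped with adegenet rather than your data, and the values n.pca = 50, n.clust = 4 and n.da = 3 are placeholders chosen so that the code runs without interactive prompts.

```r
library(adegenet)

# Example microsatellite data shipped with adegenet (assumption: stands in
# for the real dataset, which is not available here)
data(nancycats)
x <- nancycats

# Step 1: infer clusters while retaining many PCs. n.clust is fixed only so
# that the sketch runs non-interactively; normally it is read off the BIC curve.
grp <- find.clusters(x, n.pca = 50, n.clust = 4)

# Step 2: first DAPC on the inferred groups, still with many PCs
dapc0 <- dapc(x, pop = grp$grp, n.pca = 50, n.da = 3)

# Step 3: a-score optimisation, used only to choose the number of PCs
# retained for assignment (not the number of clusters)
opt <- optim.a.score(dapc0, plot = FALSE)

# Step 4: final DAPC with the suggested number of PCs
dapc1 <- dapc(x, pop = grp$grp, n.pca = opt$best, n.da = 3)

# Proportion of successful reassignment per group
summary(dapc1)$assign.per.pop
```

Because find.clusters relies on k-means, group labels and sizes can change from run to run; calling set.seed beforehand makes a given run reproducible.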
If you identify 4 clusters with a.scores above 90% in each group, then these are certainly not artefactual, but clear-cut clusters. To quantify how strong the differentiation is, you can also use the pairwise.fst or fstat functions. As for the loading plot, you can use the function loadingplot.
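A minimal sketch of a loading plot of allele contributions is given below. It is a hedged illustration rather than a definitive recipe: it assumes the nancycats example data, four clusters, and 13 retained PCs purely as placeholder values, and it requests allele contributions explicitly with var.contrib = TRUE in case they are not stored by default.

```r
library(adegenet)

# Example data shipped with adegenet (assumption: not the dataset discussed)
data(nancycats)

# Placeholder clustering and DAPC; n.pca, n.clust and n.da are illustrative
grp <- find.clusters(nancycats, n.pca = 50, n.clust = 4)
dapc1 <- dapc(nancycats, pop = grp$grp, n.pca = 13, n.da = 3,
              var.contrib = TRUE)

# Contributions of alleles to the first discriminant function; alleles above
# the threshold (by default the upper quartile of contributions) are labelled
contrib <- loadingplot(dapc1$var.contr, axis = 1, lab.jitter = 1)

# Names of the alleles that exceed the threshold
contrib$var.names
```

Changing axis to 2 or 3 shows the contributions to the other discriminant functions, and the threshold argument controls how many alleles are labelled.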
From: jeff [5jr29 at queensu.ca]
Sent: 28 February 2011 18:08
To: adegenet-forum at r-forge.wu-wien.ac.at
Subject: find.clusters and optim.a.score
I just have a few questions regarding the find.clusters and optim.a.score functions. Basically, I am trying to use DAPC to determine the number of genetic clusters in my dataset, and because I have no real prior assumptions about the number or extent of populations, I am using the find.clusters function. Using an assignment test (in STRUCTURE) I find two relatively strong genetic clusters, but when I use find.clusters, the BIC scores suggest that there are 4 clusters, essentially dividing one of the 2 clusters identified by STRUCTURE into 3. The problem is that these 3 clusters do not map out very well geographically. I have a few ideas about why this might be the case, but I want to make sure I am running the analysis correctly before I dive into this much further.
I think my main problem is how many PCA axes (n.pca) to retain when using the find.clusters function. Because I do not have any prior population delineation, I do not think it makes sense to use optim.a.score to determine this. I have tried a few different values and they give different results, but what I ended up doing was setting this to a high value to capture a large amount of the variation (~95%), which seems to be what was done in the BMC Genetics paper? Once I had the number of clusters (4 in this case), I assigned individuals to the 4 groups (using n.pca = 100 again) and then used the optim.a.score function to determine the optimal number of PCA axes for assigning individuals to these 4 groups. I then reclassified individuals, determined posterior membership probabilities and produced scatter plots. Can anyone comment on whether this is a proper way to proceed, or whether I am missing anything? Based on the geographic distribution of these clusters, my concern is that I am picking up some genetic structure that is very weak and does not really have any biological meaning. However, using the optimal number of PCA axes (13), the classification rate is over 90% for all 4 groups, compared to 30-40% when I randomly shuffle the individuals, so I don't want to discount it. I should probably also mention that I am using 17 microsatellite loci to conduct this analysis.
Lastly, if I am running this analysis correctly, I want to identify the particular loci and alleles that are driving this structure, so I am wondering whether there is any code or example that I could use to produce plots similar to figure 9 in the BMC Genetics paper?
Jeffrey R. Row
Department of Biology, Queen's University
Phone: 613-533-6000 x 75051