[adegenet-forum] A few basic questions
t.jombart at imperial.ac.uk
Thu Apr 4 11:27:00 CEST 2013
Sorry in advance if I missed some of your points. Briefly, a few points that might help clarify the issues:
- the a-score helps decide how many PCs of the PCA to retain (not PCs of the DA, aka discriminant functions); there is currently no tool to decide how many discriminant functions (DF) to retain.
- it is normal that assignment changes when the number of DFs changes, as the space on which assignment is based changes. Think for instance of a very simple situation where each DF differentiates two populations; removing a given DF will erase the discrimination between those two populations.
- about the instability you observe: this is quite possibly a sign of ad hoc discrimination due to a discriminating space that is too big compared to the number of observations. Cross-validation would be the way to go, and should not be too much of a pain to implement. Basically, run DAPC on a random sample of the data, and validate the classification using the remaining individuals. Do this repeatedly with varying numbers of PCs (of the PCA) retained, and pick the number of components that gives the best classification of the validation individuals.
* message to the list * : I will offer a pint to the person who implements this feature in adegenet; it is nothing complicated, but I just don't have time for this at the moment.
- DAPC is good at finding an optimal typology of groups; cluster assignment is merely a by-product, useful but limited. This is where model-based classifiers will be better. I recommend using BAPS, especially on microsat data since it should run quite fast.
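
A minimal sketch of the cross-validation loop described above, under stated assumptions rather than as a feature of adegenet: `xval_sketch` is a hypothetical helper, the 90% training fraction, 10 replicates and candidate numbers of PCs are arbitrary choices, and the `microbov` example data stands in for your own genind object. It assumes a recent adegenet in which `tab()` and `predict()` for dapc objects are available.

```r
library(adegenet)

# Hypothetical helper (not part of adegenet): mean proportion of held-out
# individuals that DAPC assigns back to their true group.
xval_sketch <- function(X, grp, n.pca, n.da, train.frac = 0.9, n.rep = 10) {
  grp <- factor(grp)
  by.grp <- split(seq_along(grp), grp)
  mean(replicate(n.rep, {
    # Stratified sampling: keep train.frac of each group for training
    train <- unlist(lapply(by.grp, function(i) {
      i[sample.int(length(i), max(1, floor(train.frac * length(i))))]
    }), use.names = FALSE)
    fit  <- dapc(X[train, , drop = FALSE], grp[train], n.pca = n.pca, n.da = n.da)
    pred <- predict(fit, newdata = X[-train, , drop = FALSE])
    # Proportion of validation individuals assigned to their true group
    mean(as.character(pred$assign) == as.character(grp[-train]))
  }))
}

set.seed(1)
data(microbov)                             # example data shipped with adegenet
X   <- tab(microbov, NA.method = "mean")   # allele table, NAs replaced by means
grp <- pop(microbov)

n.pca.try <- c(5, 10, 20, 40, 80)          # arbitrary candidate values
success <- sapply(n.pca.try, function(k)
  xval_sketch(X, grp, n.pca = k, n.da = nlevels(grp) - 1))
names(success) <- n.pca.try
success
n.pca.try[which.max(success)]              # candidate with the best validation success
```

The training sample is drawn within each group so that no group disappears from the training set; with very small groups a different scheme (e.g. leave-one-out) may be more sensible.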
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary’s Campus
London W2 1PG
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Thomas Vignaud [thomfromsea at gmail.com]
Sent: 02 April 2013 09:10
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] A few basic questions
I find adegenet (DAPC) to be very useful, yet I don't fully understand all the subtleties.
I'll try to ask a few simple questions here, with associated screenshots. I'll mostly use examples to ask my questions, as I believe it is a very efficient way to do it.
(I'm working with 17 microsats on animals)
I'm sorry if all this sounds newbie; please feel free to redirect me to any .pdf I might have missed.
I believe the two main questions I want to answer with DAPC are :
1 - How different are my clusters? (I know this depends on a lot of things and that I can't compare with other species/genes)
I feel like one way to do it is to check if a few components still find a lot of structure.
Another is, using alpha scores and the whole classic process, to visually see how strongly the individuals are assigned to their cluster.
2 - Are there any genetic sub-clusters in my sample? For example, I have sampled 50 ids in the same location, but maybe there are two (sub)populations here and I sampled 40 from the first one and 10 from the other. I want to see that (i.e. with compoplot), then go back to my data and check whether I can find patterns related to what the genetics tells me.
Now here is my problem: depending on the number of discriminant functions I use, I get totally different results with the same sub-dataset.
And, with the same number of discriminant functions but after adding another (very structured) population to my first sub-dataset, the results for the first sub-dataset are different again.
---> I'm a little lost as to what number of discriminant functions to choose (I understand the alpha score, but sometimes it will tell me "21" when using only "5" gives me the exact same compoplot).
It would not be such a problem if the differences were small, but here is the issue: often all my individuals are 100% in one color, but it's never the same pattern.
In one compoplot I'll have ids 1, 2, 5, 6 that are 100% red, and 3, 4, 7 that are 100% blue.
Then I just redo the analysis, changing the number of discriminant functions, and I get 1, 3, 7 100% red and 2, 4, 5, 6 100% blue.
See attached screenshots A, B and C from the SAME dataset. (I'm trying to use a small number of DFs as I don't like my ids to be 100% in one color; I feel I miss some information.)
---> The same thing happens if I add other populations: the whole pattern changes again. See screenshot D.
So is there any guideline that would give me something a little less absolute than totally different results?
If I want, for example, to note all my outliers (ids that do not belong to their original geographic cluster) and check their characteristics (size, sex, etc.), how am I supposed to do that if the outliers change depending on priors? Especially with more than 700 individuals and 16 geographic clusters.
If I want to account for how different 3 clusters are, and using the optimal alpha score gives me three 100% differentiated clusters, but using a lower number starts to create a mix between two of the clusters: can I just decide to use a lot of different numbers of discriminant functions to explore the dataset? Or is it "wrong"?
Additional information :
my 'exploring' workflow looks like :
> grp <- find.clusters(obj, max.n.clust = 35)
x (depends on what I want to see)
> dapc1 <- dapc(obj, grp$grp)
x (N/3 or 100 if N is large)
x (either alpha score number or smaller because I have a strong structure)
> compoplot(dapc1, grp$grp)
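
For comparing runs, it may help to make the same workflow non-interactive by passing each choice as an argument, so that every result is tied to explicit settings. This is a sketch only: `microbov` (example data shipped with adegenet) stands in for `obj`, and the values given for `n.pca`, `n.clust` and `n.da`, as well as the 0.9 cutoff, are arbitrary placeholders, not recommendations.

```r
library(adegenet)

set.seed(1)            # find.clusters() relies on k-means with random starts
data(microbov)
obj <- microbov        # stand-in for the real genind object

# Answers to the interactive prompts, given as arguments (placeholder values)
grp   <- find.clusters(obj, max.n.clust = 35, n.pca = 100, n.clust = 10)
dapc1 <- dapc(obj, grp$grp, n.pca = 100, n.da = 5)

# a-score: suggests how many PCs of the PCA to retain
ascore <- optim.a.score(dapc1, plot = FALSE)
ascore$best

# Individuals whose highest membership probability is below 0.9
max.post <- apply(dapc1$posterior, 1, max)
head(sort(max.post[max.post < 0.9]))

compoplot(dapc1)
```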
Any input or help is more than welcome.