[adegenet-forum] (no subject)
Jombart, Thibaut
t.jombart at imperial.ac.uk
Wed Feb 2 11:46:09 CET 2011
Hello,
well, this is pretty much what I explained in my last email. Please re-read the second part, on the a-score and on not retaining too many principal components in the preliminary PCA step.
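To see why retaining too many principal components inflates the assignment probabilities, here is a minimal sketch in R. It is not the DAPC implementation: it assumes that a plain PCA (prcomp) followed by MASS::lda is a fair stand-in for the two steps, and it uses simulated pure-noise genotypes with arbitrary group labels. All object names, sizes and the numbers of PCs are made up for the illustration.

```r
# Illustration only: pure-noise "genotypes" with arbitrary group labels,
# so any apparent group structure is an artefact of over-fitting.
library(MASS)  # provides lda()

set.seed(1)
n_ind  <- 90
n_loci <- 300
geno <- matrix(rbinom(n_ind * n_loci, size = 2, prob = 0.5), nrow = n_ind)
grp  <- factor(rep(1:3, each = n_ind / 3))  # labels unrelated to the data

pca <- prcomp(geno)  # preliminary PCA step

# Mean of each individual's highest posterior probability when the
# discriminant analysis is fitted on the first n_pc principal components
# and the same individuals are then re-assigned.
mean_max_posterior <- function(n_pc) {
  scores <- pca$x[, 1:n_pc, drop = FALSE]
  fit    <- lda(scores, grouping = grp)
  post   <- predict(fit, newdata = scores)$posterior
  mean(apply(post, 1, max))
}

few  <- mean_max_posterior(5)
many <- mean_max_posterior(70)
cat("5 PCs :", round(few, 3), "\n")
cat("70 PCs:", round(many, 3), "\n")
```

Since the labels carry no information, the honest answer is a probability near 1/3 for every group. The more PCs are retained relative to the number of individuals, the more confidently the individuals are "assigned" anyway, which is the over-fitting that the a-score is meant to flag.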
Best
Thibaut
________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Jonker, Rudy [Rudy.Jonker at wur.nl]
Sent: 01 February 2011 20:24
To: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] (no subject)
Dear Thibaut,
Thanks for your answer.
When I follow the steps below, I indeed get probabilities for each individual's assignment, which look like this:
               1            2            3
G1  5.181058e-02 9.296156e-01 1.857384e-02
G10 1.257639e-03 9.987406e-01 1.772024e-06
G11 2.103888e-05 9.999700e-01 8.934747e-06
G6  8.517392e-02 9.148260e-01 3.589007e-08
G8  9.994113e-01 4.644952e-04 1.241935e-04
K1  3.915097e-11 1.382041e-08 1.000000e+00
K10 8.885704e-10 4.278880e-03 9.957211e-01
K11 4.094563e-13 7.356615e-07 9.999993e-01
K12 1.442422e-13 4.376972e-09 1.000000e+00
K13 1.532497e-11 1.940046e-05 9.999806e-01
K14 2.983051e-12 1.657649e-07 9.999998e-01
K15 1.283778e-09 1.001622e-05 9.999900e-01
K16 1.705139e-08 1.280568e-04 9.998719e-01
I find these assignment probabilities insanely high. For the number of PCs to retain, I chose the highest possible (350 in my case), which explains 99.9% of the variation (using 374 SNP markers). Is that causing these high probabilities? Or am I just lucky?
Thanks
Rudy
________________________________________
From: Jombart, Thibaut [t.jombart at imperial.ac.uk]
Sent: 01 February 2011 18:05
To: Jonker, Rudy; 'adegenet-forum at lists.r-forge.r-project.org'
Subject: RE: [adegenet-forum] (no subject)
Dear Rudy,
thanks for reposting your question on the forum. K-means clustering does not give probabilities of assignment of individuals to clusters, but discriminant analysis does. After running your DAPC with the clusters defined by find.clusters, these probabilities are stored in $posterior; also see 'summary' for the average proportion of successful re-assignment per group. Note that the result depends on the number of principal components retained in the DAPC step. Please have a look at ?a.score and at the posts on the forum about this topic (search the archives for "a.score"; see the 'contact' section on the website).
Here's a simple example using one of the simulated datasets of the DAPC paper:
###
> data(dapcIllus)
> x=dapcIllus$a
> grp=find.clusters(x, n.pca=20, n.clust
> dapc1 = dapc(x, pop=grp$grp, n.pca=20, n.da=100) # retain all discriminant functions, n.pca is arbitrary
> scatter(dapc1)
> head(dapc1$posterior) # probabilities of assignment of individuals (rows) to groups (columns)
               1            2            3            4         5            6
001 1.117731e-08 3.121147e-03 2.451437e-05 3.424392e-08 0.9968310 2.333178e-05
002 1.906086e-15 4.588542e-14 1.508931e-14 4.006893e-11 1.0000000 1.473177e-13
003 2.700339e-10 8.584223e-13 1.330813e-10 8.821750e-12 1.0000000 6.636315e-11
004 7.611857e-09 8.791814e-08 1.145752e-07 2.186551e-07 0.3800647 6.199349e-01
005 5.454810e-10 1.810788e-13 8.479315e-13 3.848505e-09 0.9999998 1.629225e-07
006 5.531436e-10 4.806872e-12 1.696606e-11 3.232611e-07 0.9999861 1.356142e-05
###
One plot designed to represent this information is the assignplot; e.g. for the first 10 individuals:
###
> assignplot(dapc1,subset=1:10)
###
See ?assignplot.
However, optim.a.score tells us that 20 PCs (n.pca=20) is probably overkill and carries a risk of over-fitting. According to:
> optim.a.score(dapc1, smart=FALSE)
5 PCs should be our best option.
So, we can just re-run the DAPC and then interpret the assignments, e.g.:
> dapc1 = dapc(x, pop=grp$grp, n.pca=5, n.da=100)
> assignplot(dapc1,subset=1:30)
> summary(dapc1)$assign.per.pop # proportion of successful re-assignment per group
> a.score(dapc1, n.sim=50)$pop.score # re-assignment corrected for chance (a-score) per group
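For the kind of figure asked about in this thread (one bar per individual, split by membership probability), a stacked barplot of the posterior matrix is one simple option. The sketch below uses base R only and, so that it runs without adegenet, rebuilds a small matrix from the first six rows of the probabilities Rudy posted on 1 February. The object names and colours are arbitrary; in practice the matrix would be dapc1$posterior.

```r
# Membership probabilities for six individuals, copied from the table
# posted in this thread (rows: individuals, columns: clusters).
post <- rbind(
  G1  = c(5.181058e-02, 9.296156e-01, 1.857384e-02),
  G10 = c(1.257639e-03, 9.987406e-01, 1.772024e-06),
  G11 = c(2.103888e-05, 9.999700e-01, 8.934747e-06),
  G6  = c(8.517392e-02, 9.148260e-01, 3.589007e-08),
  G8  = c(9.994113e-01, 4.644952e-04, 1.241935e-04),
  K1  = c(3.915097e-11, 1.382041e-08, 1.000000e+00)
)
colnames(post) <- c("1", "2", "3")

# Each row should sum to 1, up to rounding in the printed values.
stopifnot(all(abs(rowSums(post) - 1) < 1e-6))

# One stacked bar per individual, one colour per cluster.
barplot(t(post), col = c("steelblue", "orange", "darkgreen"), border = NA,
        las = 2, ylab = "membership probability",
        legend.text = paste("cluster", colnames(post)))

# Most likely cluster for each individual.
assigned <- colnames(post)[apply(post, 1, which.max)]
names(assigned) <- rownames(post)
print(assigned)
```

As far as I know, more recent versions of adegenet also provide compoplot for DAPC objects, which draws this type of plot directly; check the help pages of your installed version.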
Best regards,
Thibaut
________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Jonker, Rudy [Rudy.Jonker at wur.nl]
Sent: 01 February 2011 13:51
To: 'adegenet-forum at lists.r-forge.r-project.org'
Subject: [adegenet-forum] (no subject)
Dear Thibaut,
I am using your program DAPC to define the number of clusters in a dataset of 400 individuals and 374 SNPs. With find.clusters I get an assignment of each individual to a cluster. What I am looking for is the probability of assignment to each of the (for example) 3 groups when k=3. The idea is to make a graph like figure 4 in the attached paper.
Is that possible? I think it should be, because find.clusters must use some statistic when assigning individuals to clusters. When I use $posterior on the cluster file, I get NULL. And with predefined groups, $posterior gives only 0 or 1 for the probabilities of assignment, corresponding to what was predefined.
Thanks in advance,
Rudy Jonker
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum