[adegenet-forum] DAPC and components retaining

Jombart, Thibaut t.jombart at imperial.ac.uk
Mon Nov 8 12:26:27 CET 2010


Hello Vladimir, 

Thanks for this interesting post. Selecting the number of PCs to retain in the prior PCA step is indeed an issue worth looking into. 

Horn’s Parallel Analysis seems interesting. However, a few things should be kept in mind when using PC-selection methods designed for PCA in DAPC.

1) The optimization criterion of Discriminant Analysis (DA) differs from that of PCA: PCA finds axes that maximize the variance of the individuals' scores, while DA maximizes the variance between groups while minimizing the variance within groups. Therefore, the principal components that are 'meaningful' in PCA are not necessarily the meaningful ones for DA. 

2) PC selection in PCA seeks interpretable axes. PC selection in DA does not seek interpretable axes; it just provides the raw material on which DA is performed. DA axes are interpreted, not PCA axes.
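
To make these two points concrete, here is a minimal sketch in base R (plus MASS, which ships with R). The toy data, the variable names and the use of prcomp() and lda() as stand-ins for the two steps of DAPC are assumptions made for illustration; this is not adegenet code. One variable carries most of the variance but no group signal, the other carries little variance but separates the groups.

```r
library(MASS)  # provides lda(); distributed with R

set.seed(1)
n <- 100
grp <- factor(rep(c("A", "B"), each = n))

# x1: large variance, no group difference
# x2: small variance, clear shift between groups
x1 <- rnorm(2 * n, mean = 0, sd = 10)
x2 <- rnorm(2 * n, mean = ifelse(grp == "A", -1, 1), sd = 0.5)
X <- cbind(x1 = x1, x2 = x2)

pca <- prcomp(X, center = TRUE, scale. = FALSE)
da <- lda(X, grouping = grp)

# Share of total variance carried by each PC
var.share <- pca$sdev^2 / sum(pca$sdev^2)

# Share of each PC's variance that lies between groups (R^2 of score ~ group)
between.share <- apply(pca$x, 2, function(s) summary(lm(s ~ grp))$r.squared)

print(round(var.share, 3))      # PC1 dominates the total variance
print(round(between.share, 3))  # but the group signal sits on PC2
print(da$scaling)               # the discriminant function relies mostly on x2
```

A variance-based stopping rule applied to this PCA would favour keeping PC1 only, which would discard essentially all of the information DA needs.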

That said, the number of PCs retained during the prior PCA step can matter, in particular when it comes to examining re-assignment success. The % of individuals correctly re-assigned using the discriminant functions ($assign.per.pop of the summary of a dapc object) gives an idea of the discriminating power of the reduced space, or, conversely, of the degree of admixture between groups. Retaining too many PCs, however, increases the chances of finding ad hoc discriminant functions, which would work very well for the sampled individuals but poorly on new individuals. 
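
The risk of ad hoc discriminant functions can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are pure noise with randomly assigned group labels, and prcomp() followed by MASS::lda() is used as a generic stand-in for the PCA and DA steps (it is not the dapc function itself). The object names and the train/test split are hypothetical.

```r
library(MASS)  # provides lda(); distributed with R

set.seed(42)
n <- 120; p <- 200
X <- matrix(rnorm(n * p), nrow = n)   # pure noise: no real group structure
grp <- factor(sample(rep(c("g1", "g2", "g3"), length.out = n)))
train <- seq_len(n) <= 80             # first 80 individuals are used for fitting

# PCA on the training individuals; project the held-out ones onto the same axes
pca <- prcomp(X[train, ], center = TRUE)
scores.train <- pca$x
scores.test <- predict(pca, X[!train, ])

# Proportion of correct re-assignment when retaining the first k PCs
reassign <- function(k) {
  fit <- lda(scores.train[, 1:k, drop = FALSE], grouping = grp[train])
  pred.train <- predict(fit, scores.train[, 1:k, drop = FALSE])$class
  pred.test <- predict(fit, scores.test[, 1:k, drop = FALSE])$class
  c(n.pca = k,
    train = mean(pred.train == grp[train]),
    test = mean(pred.test == grp[!train]))
}

res <- t(sapply(c(2, 10, 30, 60), reassign))
print(round(res, 2))
```

Since the groups are arbitrary, any re-assignment success well above 1/3 on the fitted individuals reflects overfitting; the held-out individuals give the more honest picture.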

I am currently developing a simple approach for selecting PCs so as to minimize the chances of finding such ad hoc solutions, while still conserving a maximum of discriminating power. You can have a look at the a.score and optim.a.score functions in adegenet. Note, however, that this issue only matters if one is interested in the % of successful re-assignment, or in using the discriminant functions for prediction (e.g. assignment of new individuals to one of the existing clusters).
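
A possible way of calling these two functions is sketched below on the dapcIllus data. The argument values (n.pca = 100, n.clust = 6, n.da = 5, n.sim = 10) are chosen only for illustration, and the names of the returned elements ($mean, $best) are given as I understand them from the help pages; please check ?a.score and ?optim.a.score for the version you have installed.

```r
library(adegenet)

data(dapcIllus)
x <- dapcIllus$a   # simulated island-model data

# Clusters and DAPC, with all choices set explicitly to avoid interactive prompts
grp <- find.clusters(x, n.pca = 100, n.clust = 6)
dapc1 <- dapc(x, pop = grp$grp, n.pca = 100, n.da = 5)

# a-score of the current solution (assumed element: $mean)
as1 <- a.score(dapc1, n.sim = 10)
print(as1$mean)

# Try several numbers of retained PCs (assumed element: $best)
opt <- optim.a.score(dapc1, n.sim = 10, n.da = 5, plot = FALSE)
print(opt$best)
```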

Best regards

Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - Faculty of Medicine
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/

________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of Vladimir Mikryukov [vmikryukov at gmail.com]
Sent: 06 November 2010 09:15
To: adegenet forum
Subject: [adegenet-forum] DAPC and components retaining

Dear Dr. Jombart,
Thank you for the link for the DAPC-paper.
I found it quite interesting.

But the question of a formal criterion for determining the number of PC axes to retain is still open.
Maybe it makes sense to use Horn’s Parallel Analysis (PA; see Glorfeld)?
It’s already implemented in the R package paran by Alexis Dinno, and according to Peres-Neto et al. it performed quite well in comparative tests against other criteria.
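
For readers who have not met parallel analysis, the basic idea can be sketched in base R as follows. This is a simplified illustration on hypothetical toy data, not the algorithm as implemented in paran (which, among other things, works with adjusted eigenvalues): the eigenvalues of the observed correlation matrix are compared with a centile of the eigenvalues obtained from random data of the same size.

```r
set.seed(123)
n <- 200; p <- 10

# Toy data (hypothetical): two blocks of 3 correlated variables plus 4 noise variables
f1 <- rnorm(n); f2 <- rnorm(n)
X <- cbind(sapply(1:3, function(i) f1 + rnorm(n, sd = 0.7)),
           sapply(1:3, function(i) f2 + rnorm(n, sd = 0.7)),
           matrix(rnorm(n * 4), nrow = n))

# Eigenvalues of the observed correlation matrix
obs.eig <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Eigenvalues from uncorrelated random data with the same dimensions
n.iter <- 200
rand.eig <- replicate(n.iter, {
  R <- matrix(rnorm(n * p), nrow = n)
  eigen(cor(R), symmetric = TRUE, only.values = TRUE)$values
})

# 95th centile of the random eigenvalues, rank by rank
threshold <- apply(rand.eig, 1, quantile, probs = 0.95)

# Retain the leading components whose eigenvalue exceeds the threshold
retained <- sum(cumprod(obs.eig > threshold))
print(retained)
```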

But the big disadvantage of PA is its speed: even with 500 iterations it's slow.
And maybe it's too conservative.
Taking an example from your paper: for the island model (a), PA retained only 25 principal components (around 77.5% of the variance). With these, the observed proportions of overall correct assignment and of correct assignment per group are higher than in the example (with 100 PCs retained).

 library(adegenet)
 library(paran)
 data(dapcIllus)
 attach(dapcIllus)

 # Horn's parallel analysis on the allele table of the island-model data
 pa <- paran(a@tab, iterations=500, centile=95, graph=TRUE)
 # clusters and DAPC using the number of PCs retained by PA
 clust.a_PA <- find.clusters(a, n.pca=pa$Retained, n.clust=6)
 dapc.a_PA <- dapc(a, pop=clust.a_PA$grp, n.pca=pa$Retained, n.da=5)

 summary(dapc.a_PA)
 scatter(dapc.a_PA)


Well, I am almost sure that you have already thought about the dimension-reduction problem.
Anyway, the choice of criterion is up to the researcher.

Best regards,
Vladimir


References:

Glorfeld L.W. An Improvement on Horn's Parallel Analysis Methodology for Selecting the Correct Number of Factors to Retain // Educ Psychol Meas. 1995. V. 55. № 3. P. 377-393.

Peres-Neto P.R., Jackson D.A., Somers K.M. How many principal components? stopping rules for determining the number of non-trivial axes revisited // Comput Stat Data An. 2005. V. 49. № 4. P. 974-997.


--
Vladimir Mikryukov
PhD student
Institute of Plant & Animal Ecology UD RAS,
Lab. of Population and Community Ecotoxicology
[8 Marta 202, 620144, Ekaterinburg, Russia]

