Jombart, Thibaut t.jombart at imperial.ac.uk
Fri Feb 18 11:52:30 CET 2011

Dear Tommaso,

I did not know about AWclust before your email, thanks for pointing this out. After reading the paper(s), the short answer is: results are different because the approaches are different.
However, there are some similarities that are worth commenting on, and at least one major pitfall in AWclust that is worth pointing out.

== on the similarities ==
AWclust uses Ward's hierarchical clustering to define clusters, and the Gap statistic to identify the number of clusters. Ward's clustering is actually closely related to K-means, since both aim at finding groups with minimum within-group variance (within-group sum of squares, WSS). However, Ward's method is hierarchical (it merges groups step by step), while K-means is a partitioning method (it assigns individuals directly to a fixed number of groups), so we should probably not expect quantitatively identical results.
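
A minimal base-R sketch of this relationship, on hypothetical simulated data (two well-separated groups; nothing here comes from AWclust or adegenet). It uses hclust with method "ward.D2" and kmeans from the stats package, and compares the WSS of the two partitions. On data this clear-cut both methods should find the same groups; on real data they may well differ.

```r
# Hypothetical data: two well-separated groups in 2 dimensions.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

# Within-group sum of squares (WSS) of a partition 'grp' of the rows of 'x'.
wss <- function(x, grp) {
  sum(sapply(unique(grp), function(g) {
    xg <- x[grp == g, , drop = FALSE]
    sum(sweep(xg, 2, colMeans(xg))^2)  # squared deviations from group centroid
  }))
}

# Ward: build the full hierarchy, then cut it into k = 2 groups.
grp.ward <- cutree(hclust(dist(x), method = "ward.D2"), k = 2)

# K-means: partition directly into k = 2 groups (10 random starts).
grp.km <- kmeans(x, centers = 2, nstart = 10)$cluster

wss.ward <- wss(x, grp.ward)
wss.km <- wss(x, grp.km)

# Cross-tabulation of the two partitions (group labels are arbitrary).
agreement <- table(ward = grp.ward, kmeans = grp.km)
print(agreement)
print(c(ward = wss.ward, kmeans = wss.km))
```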

The second aspect is the number of clusters to be retained. Tibshirani et al.'s Gap statistic relies on the decrease of WSS, and attempts to identify an "elbow" in this decrease. It relies on computing the expected WSS for a given number of clusters under a given distributional assumption, and it is not obvious to me that this assumption is appropriate for genetic data. In adegenet, find.clusters uses the BIC by default. We also make a distributional assumption here (multivariate normal) to compute a likelihood. If this assumption differs from the one made in AWclust, we may see differences in the results. But more importantly, BIC is, in this case, the WSS penalized for the number of clusters. So BIC should be more parsimonious in the number of clusters retained. Empirically, when I developed the method, elbows in the BIC curves were always more marked than in the WSS curves, hence the choice of using BIC by default.
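
To make "WSS penalized for the number of clusters" concrete, here is a rough base-R sketch on hypothetical simulated data (three groups). The BIC form used, n*log(WSS/n) + k*log(n), is an assumption for illustration; check the find.clusters source for the exact expression used in adegenet.

```r
# Hypothetical data: three well-separated groups of 50 points in 2 dimensions.
set.seed(2)
centres <- rbind(c(0, 0), c(8, 0), c(0, 8))
x <- do.call(rbind, lapply(1:3, function(i)
  sweep(matrix(rnorm(100), ncol = 2), 2, centres[i, ], "+")))
n <- nrow(x)
k <- 1:6

# WSS for k = 1 is the total sum of squares; for k >= 2 use K-means.
wss.total <- sum(scale(x, scale = FALSE)^2)
wss <- c(wss.total,
         sapply(2:6, function(kk)
           kmeans(x, centers = kk, nstart = 20)$tot.withinss))

# ASSUMED BIC form: goodness of fit from WSS plus a penalty of log(n) per cluster.
bic <- n * log(wss / n) + k * log(n)

print(data.frame(k = k, WSS = round(wss, 1), BIC = round(bic, 1)))
```

Plotting wss and bic against k side by side is one way to compare how marked the elbow is in each curve.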

== one serious issue ==
To visualize the data, AWclust uses MDS (aka Principal Coordinate Analysis, PCoA) on the 'allele sharing distance' (ASD). This distance is simply the proportion of alleles that differ between pairs of individuals. This distance IS NOT Euclidean, and cannot be used as such in MDS. It needs to be transformed, but transformations can in turn alter the shape of the cloud of points, and it is then no longer clear what we are looking at in the plots. This can be illustrated simply in adegenet and ade4, since propShared computes the complement to 1 of ASD:
####
> data(sim2pop)
> X=1-propShared(sim2pop) # this is the ASD distance matrix
> pco1=dudi.pco(as.dist(X))
Select the number of axes: 3
Warning message:
In dudi.pco(as.dist(X)) : Non euclidean distance # NON-EUCLIDEAN

####
If you look at the barplot displayed by dudi.pco, you'll notice the long tail of negative eigenvalues, typical of non-Euclidean distances. A more appropriate approach would be:
####
> pco2=dudi.pco(cailliez(as.dist(X)), scannf=FALSE) # make the distance Euclidean
####
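
For readers without adegenet or ade4 at hand, the same diagnostic can be sketched in base R with cmdscale from the stats package. The 4-point distance matrix below is a hypothetical toy example, not ASD: three points are mutually 2 apart and a fourth lies 1.1 from each, which satisfies the triangle inequality but is closer than the centre of an equilateral triangle of side 2 could be (about 1.155), so no Euclidean configuration exists. The option add=TRUE applies an additive constant to the distances, which plays the role of cailliez above.

```r
# Hypothetical toy distances: metric, but not Euclidean.
d <- matrix(2, 4, 4)
d[4, 1:3] <- d[1:3, 4] <- 1.1
diag(d) <- 0
d <- as.dist(d)

# Eigenvalues of the PCoA on the raw distances: one is negative.
eig.raw <- cmdscale(d, k = 2, eig = TRUE)$eig

# Same analysis after adding a constant to all off-diagonal distances.
fit <- cmdscale(d, k = 2, eig = TRUE, add = TRUE)
eig.fixed <- fit$eig

print(round(eig.raw, 4))    # contains a negative eigenvalue
print(round(eig.fixed, 4))  # no negative eigenvalue beyond rounding error
print(fit$ac)               # the additive constant that was applied
```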

In any case, such a plot does not aim to display differences between groups (see the critique of PCA in the DAPC paper: BMC Genetics 11:94). This is easy to see here. The dataset 'sim2pop' contains two simulated, fairly well-differentiated populations. This is far from obvious in the above MDS:
####
> s.class(pco2$li, pop(sim2pop), col=c("red","blue"))
####
By comparison, the DAPC gives a very clear-cut separation of the groups in one single dimension (please use the devel version for this plot - improved version with individuals and legend):
####
> dapc1=dapc(sim2pop, n.pca=10, n.da=2)
> scatter(dapc1)
####
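
The underlying reason can be sketched outside adegenet, with loudly stated assumptions: the data below are hypothetical simulated variables (not genotypes), and prcomp followed by MASS::lda is only a stand-in for the PCA-then-discriminant-analysis idea behind DAPC, not the dapc implementation itself. The group difference is placed on a low-variance variable, so the first principal axis (which maximizes total variance) largely misses it, while the discriminant axis (which maximizes between-group relative to within-group variance) picks it up.

```r
library(MASS)  # for lda(); MASS is a recommended package shipped with R

set.seed(3)
n <- 100
grp <- factor(rep(c("A", "B"), each = n))

# Hypothetical variables:
x <- cbind(rnorm(2 * n, sd = 5),                       # large variance, no group signal
           rnorm(2 * n, sd = 0.5) + 2 * (grp == "B"),  # small variance, group shift
           rnorm(2 * n, sd = 0.3))                     # pure noise

# Step 1: PCA, retaining the first two principal components.
pca <- prcomp(x)
scores <- pca$x[, 1:2]

# Step 2: discriminant analysis on the retained components.
fit <- lda(scores, grouping = grp)
ld1 <- as.vector(scores %*% fit$scaling[, 1])

# Separation: absolute difference of group means in pooled-SD units.
separation <- function(z, g) {
  m <- tapply(z, g, mean)
  s <- sqrt(mean(tapply(z, g, var)))
  unname(abs(diff(m)) / s)
}

sep.pc1 <- separation(scores[, 1], grp)
sep.ld1 <- separation(ld1, grp)
print(c(PC1 = sep.pc1, LD1 = sep.ld1))
```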

So, to sum up:
- as far as clustering goes, AWclust's approach is probably worth trying, but it does not necessarily provide the same results as find.clusters, for a number of reasons.
- as far as visualization of the data is concerned, I would strongly recommend DAPC over an MDS/PCoA on a non-Euclidean distance, which in any case does not show differences between groups.

Best regards

Thibaut.

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - Faculty of Medicine
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk

________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Dragani Tommaso [Tommaso.Dragani at istitutotumori.mi.it]
Sent: 18 February 2011 08:37

Hello,

I am testing genetic relatedness in a general population series, and I have noticed that using the AWclust program (Gao X, Starmer JD. AWclust: point-and-click software for non-parametric population structure analysis. BMC Bioinformatics. 2008 Jan 31;9:77.), the number of clusters and the individual grouping in each cluster are slightly different than using adegenet. AWclust carries out a non-parametric analysis; do you think that this may be a reason for the differences? Do you have any experience in comparison of adegenet results with AWclust results?
With many thanks and best regards.

Tommaso

Dr. Tommaso A. Dragani
"Molecular basis of genetic risk, polygenic models"
Fondazione IRCCS Istituto Nazionale Tumori
Via Amadeo 42 - 20133 Milan - Italy
Tel.: +39-0223902642
