[Traminer-users] WeightedCluster: a new library for (sequences) clustering in R

Matthias Studer Matthias.Studer at unige.ch
Wed Mar 6 13:49:49 CET 2013


Dear TraMineR Users,

I am pleased to announce the first official release of the WeightedCluster R library. This library greatly facilitates the clustering of state sequences and, more generally, of weighted data. Its main functionalities include:

  *   Aggregation of identical sequences (in order to save memory and cluster a larger number of sequences).
  *   Computation of several clustering quality measures.
  *   Methods facilitating the choice of the number of groups and of the clustering algorithm, based on cluster quality measures.
  *   Clustering of weighted data using a distance matrix (for instance, using sampling weights or aggregated sequences).
  *   An optimized PAM clustering algorithm.
  *   Graphical representation of hierarchical clusterings of state sequences (you need to install GraphViz, http://www.graphviz.org, before launching R).
The library comes with the "WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R" <http://mephisto.unige.ch/weightedcluster/WeightedCluster.pdf>, also available in French <http://mephisto.unige.ch/weightedcluster/WeightedCluster-fr.pdf>. Besides presenting the library, the manual discusses several important issues that arise when clustering state sequences (or any other objects) in the social sciences, such as cluster validation and the usual sociological assumptions.
A short, easily reproducible script illustrating the functionalities of the library is available on the WeightedCluster website, http://mephisto.unige.ch/weightedcluster/, and is reproduced below.

The library can be installed with the following commands (R version 2.15 or higher is required):
install.packages("WeightedCluster")
library(WeightedCluster)
## To get the manuals, please run:
   vignette("WeightedCluster") ## complete manual in English
   vignette("WeightedCluster-fr") ## complete manual in French
   vignette("WeightedClusterPreview") ## short preview in English

Any comments, suggestions or bug reports are very welcome.

Kind regards,
Matthias Studer

## Loading the library
library(WeightedCluster)

## Loading the mvad dataset
data(mvad)

## Aggregating identical sequences
aggMvad <- wcAggregateCases(mvad[, 17:86])
print(aggMvad)
uniqueMvad <- mvad[aggMvad$aggIndex, 17:86]
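## Note: the aggregation step above can be illustrated on toy data with base R only.
## This is a minimal sketch of the idea, not the implementation used by
## wcAggregateCases(); the data frame 'toy' and its values are hypothetical,
## and the variable names merely mirror the aggIndex, aggWeights and
## disaggIndex components used in this script.

```r
## Toy data: five cases, three distinct sequences (hypothetical values)
toy <- data.frame(t1 = c("a", "b", "a", "c", "b"),
                  t2 = c("a", "b", "a", "c", "b"),
                  stringsAsFactors = FALSE)

## One key per case, built by pasting all columns together
key <- do.call(paste, c(toy, sep = "\r"))

## Row index of the first occurrence of each distinct case
aggIndex <- which(!duplicated(key))

## For each original case, its position among the distinct cases
disaggIndex <- match(key, key[aggIndex])

## Number of original cases represented by each distinct case
aggWeights <- tabulate(disaggIndex, nbins = length(aggIndex))

## The distinct cases, and the full data rebuilt from them
uniqueToy <- toy[aggIndex, ]
rebuilt <- uniqueToy[disaggIndex, ]
```

## Here three distinct cases stand for five original ones, and indexing the
## distinct cases by disaggIndex restores the original order.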

## defining the state sequence object
mvad.seq <- seqdef(uniqueMvad, weights=aggMvad$aggWeights)
## Computing Hamming distances between sequences
diss <- seqdist(mvad.seq, method="HAM")
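## Note: as an illustration of what a Hamming distance measures, here is a
## base-R sketch on toy sequences. It assumes a unit cost for every mismatch;
## the helper hamming() and the sequences s1, s2, s3 are hypothetical and are
## not part of TraMineR.

```r
## Three toy sequences of equal length (hypothetical states)
s1 <- c("school", "school", "job", "job")
s2 <- c("school", "job", "job", "unemp")
s3 <- c("job", "job", "job", "job")

## Hamming distance: number of positions at which two sequences differ
hamming <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a != b)
}

## Pairwise distance matrix between the toy sequences
seqs <- rbind(s1, s2, s3)
n <- nrow(seqs)
toyDiss <- matrix(0, n, n, dimnames = list(rownames(seqs), rownames(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    toyDiss[i, j] <- hamming(seqs[i, ], seqs[j, ])
  }
}
```

## The resulting matrix is symmetric with a zero diagonal, like the 'diss'
## object computed above.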

## Clustering the sequences using "average" hierarchical clustering
## Here, we need to set the weights (members argument) to account for identical sequence aggregation
averageClust <- hclust(as.dist(diss), method="average", members=aggMvad$aggWeights)
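## Note: the role of the 'members' argument can be checked on toy data with
## base R only. In this sketch (hypothetical values), average-linkage
## clustering of the distinct values, weighted by their frequencies, yields
## the same two-group partition as clustering the full data.

```r
## Toy one-dimensional data with repeated values (hypothetical)
x <- c(0, 0, 0, 1, 5, 5, 9)

## Clustering the full data
fullClust <- hclust(dist(x), method = "average")
fullCut <- cutree(fullClust, k = 2)

## Distinct values, their frequencies, and the mapping back to the full data
ux <- unique(x)
disagg <- match(x, ux)
w <- tabulate(disagg, nbins = length(ux))

## Clustering the distinct values only, passing frequencies as 'members'
aggClust <- hclust(dist(ux), method = "average", members = w)
aggCut <- cutree(aggClust, k = 2)

## Partition of the full data recovered from the aggregated solution
backCut <- aggCut[disagg]
```

## Without the 'members' argument, each distinct value would count as a single
## case, and the average linkage could produce a different tree.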

## Representing the hierarchical clustering as a tree
averageTree <- as.seqtree(averageClust, seqdata=mvad.seq, diss=diss, ncluster=6)
## Graphical representation of the tree (Graphviz must be installed before launching R)
seqtreedisplay(averageTree, type="d", border=NA,  showdepth=TRUE)

## Compute several clustering quality measures for partitions in 2, 3, 4, ..., 10 groups.
avgClustQual <- as.clustrange(averageClust, diss, weights=aggMvad$aggWeights, ncluster=10)

## Plot the evolution of the clustering quality according to the number of clusters.
plot(avgClustQual)

## The same, but using normalized values.
plot(avgClustQual, norm="zscore")

## Print the 2 best numbers of groups according to each quality measure
summary(avgClustQual, max.rank=2)

## Compute PAM clustering and cluster quality measures for different numbers of groups (ranging from 2 to 10)
pamClustRange <- wcKMedRange(diss, kvals=2:10, weights=aggMvad$aggWeights)

## Print the 2 best numbers of groups according to each quality measure for the PAM clustering
summary(pamClustRange, max.rank=2)

## According to the ASW (average silhouette width), the best clustering was found using average clustering in 5 groups
seqdplot(mvad.seq, group=avgClustQual$clustering$cluster5, border=NA)
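## Note: to make the ASW criterion more concrete, here is a base-R sketch of
## an unweighted average silhouette width computed from a distance matrix.
## It is only an illustration on hypothetical toy data: it ignores case
## weights and is not the code used by WeightedCluster.

```r
## Toy one-dimensional data and a two-group partition (hypothetical values)
x <- c(0, 1, 2, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d <- as.matrix(dist(x))

## Silhouette of each case: (b - a) / max(a, b), where
##   a = mean distance to the other members of its own group
##   b = smallest mean distance to the members of another group
## (this sketch assumes every group has at least two members)
sil <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the case itself
  other <- cl != cl[i]
  a <- mean(d[i, own])
  b <- min(tapply(d[i, other], cl[other], mean))
  (b - a) / max(a, b)
})

## Average silhouette width of the partition
asw <- mean(sil)
```

## Values close to 1 indicate compact, well-separated groups; values near 0
## or below suggest that cases lie between groups.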

## Clustering was made on distinct sequences
## Recover the clustering solution in the original (full) dataset
uniqueCluster5 <- avgClustQual$clustering$cluster5
mvad$cluster5 <- uniqueCluster5[aggMvad$disaggIndex]

## Compute the association between the clustering and father's unemployment
chisq.test(table(mvad$funemp, mvad$cluster5))


---
Matthias Studer
Institut d'études démographiques et du parcours de vie
et Département des sciences économiques
Uni-Mail, bureau 5205
40, bd du Pont d'Arve
1211 Genève 4
Tel: +41 22 379 82 15
Fax: +41 22 379 82 99
