[NMF-user] NMF v0.17.5: difference between consensus silhouette values

Tue Aug 27 12:08:32 CEST 2013

Thanks Renaud. I'll try controlling the seed.

Gordon

On 2013-08-26, at 11:39 PM, Renaud Gaujoux wrote:

Hi Gordon,

the silhouette values are indeed expected to be different because:
- the two sets of runs are independent, i.e. they start from different initial random seeds (as you mention).
- the number of runs is different, so the consensus matrix is computed on more fits when nrun=200 than when nrun=30.
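
Both effects can be seen without running NMF at all. The following is only a toy sketch, not NMF's own code: `noisy_labels`, `toy_consensus` and `avg_sil` are hypothetical helpers, and randomly perturbed group labels stand in for individual NMF fits. The silhouette step assumes the same average-linkage clustering of 1 - consensus that is used elsewhere in this thread.

```r
library(cluster)  # provides silhouette()

# Hypothetical stand-in for one NMF fit: the true labels with ~20% reassigned at random.
noisy_labels <- function(truth, k, flip = 0.2) {
  idx <- runif(length(truth)) < flip
  truth[idx] <- sample.int(k, sum(idx), replace = TRUE)
  truth
}

# Consensus matrix: average of the sample-by-sample connectivity matrices over all runs.
toy_consensus <- function(truth, k, nrun, seed) {
  set.seed(seed)
  n <- length(truth)
  acc <- matrix(0, n, n)
  for (i in seq_len(nrun)) {
    lab <- noisy_labels(truth, k)
    acc <- acc + outer(lab, lab, "==")
  }
  acc / nrun
}

# Average silhouette width from average-linkage clustering of 1 - consensus.
avg_sil <- function(x, k) {
  d <- as.dist(1 - x)
  hc <- hclust(d, method = "average")
  summary(silhouette(cutree(hc, k = k), d))$avg.width
}

truth <- rep(1:4, each = 15)
a    <- avg_sil(toy_consensus(truth, 4, nrun = 30,  seed = 123), 4)
b    <- avg_sil(toy_consensus(truth, 4, nrun = 30,  seed = 123), 4)  # same seed, same nrun
c200 <- avg_sil(toy_consensus(truth, 4, nrun = 200, seed = 123), 4)  # more runs
d    <- avg_sil(toy_consensus(truth, 4, nrun = 30,  seed = 456), 4)  # different seed

print(c(a = a, b = b, c200 = c200, d = d))
```

Here `a` and `b` are identical because the seed and the number of runs match, while `c200` and `d` are free to differ from them.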

To check this, you could set the random seed to the same value (e.g. 123) in both nmf calls and use the same number of runs (see code below).

Bests,
Renaud

#### REPRODUCING SILHOUETTE WIDTH ####
use.this.k <- 4
estim.r <- nmfEstimateRank( V.matrix, range=4:5, nrun=30, .opt='v', .pbackend=7, seed = 123)
s <- silhouette(estim.r$fit[[1L]], 'consensus')
res <- nmf( V.matrix, use.this.k, nrun=30, .options='tv', .pbackend=7, seed = 123)
s2 <- silhouette(res, 'consensus')
identical(s, s2)

library(cluster)
x <- consensus(res)
hc <- hclust(as.dist(1-x), method='average')
cl <- cutree(hc, k = use.this.k)
sil <- silhouette( cutree(hc, k = use.this.k), as.dist(1-x) )

# samples in consensus silhouettes (in object `s`) are ordered to match the sample order in the consensus heatmap
dr <- as.dendrogram(hc)
o <- order.dendrogram(reorder(dr, rowMeans(consensus(res), na.rm=TRUE)))
identical(setNames(s[, 'sil_width'], NULL), sil[o, 'sil_width'])
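
The reordering only permutes the rows of the silhouette object, so it should not change the average width. A minimal sketch of that point, assuming a random symmetric matrix as a hypothetical stand-in for consensus(res) and an arbitrary k = 3:

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Hypothetical stand-in for a consensus matrix: symmetric, values in [0, 1], unit diagonal.
n <- 12
m <- matrix(runif(n * n), n, n)
x <- (m + t(m)) / 2
diag(x) <- 1

hc  <- hclust(as.dist(1 - x), method = "average")
sil <- silhouette(cutree(hc, k = 3), as.dist(1 - x))

# Dendrogram order, reordered by row means as in the code above.
o <- order.dendrogram(reorder(as.dendrogram(hc), rowMeans(x)))
sil_sorted <- sil[o, ]

# Same set of widths and same average, only the row order differs.
same_values <- isTRUE(all.equal(sort(sil[, "sil_width"]), sort(sil_sorted[, "sil_width"])))
same_mean   <- isTRUE(all.equal(mean(sil[, "sil_width"]), mean(sil_sorted[, "sil_width"])))
print(c(same_values = same_values, same_mean = same_mean))
```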

On 27 August 2013 07:40, Gordon Robertson <grobertson at bcgsc.ca> wrote:
Renaud,

I've been using v0.17.5 on R 3.0.1 since you made it available recently. I'm on a Macbook Pro with OS X 10.7.5. Now, it's being installed on the Linux system at the GSC, so that more people will soon use this version.

Today I realized that the consensus silhouette width that I write out from the 30-run rank survey for, say, 4 groups can differ by a few percent from the average silhouette width that I calculate from a 200-run, 4-group main run. E.g. the consensus silhouette is 0.891 from the rank survey and 0.86 from the main run, for 4 groups and 260 miRNA-seq samples.

As we discussed quite some time ago, I calculate a silhouette width, after an NMF run, with this:
...
# rank survey, now calculates silhouette width
estim.r <- nmfEstimateRank( V.matrix, range=2:12, nrun=30, .opt='v', .pbackend=7 )
write.table( estim.r$measures, "rank.survey.txt", sep="\t", quote=FALSE, row.names=FALSE )

# Now do the main run
use.this.k <- 4
res <- nmf( V.matrix, use.this.k, nrun=200, .options='tvP', .pbackend=7 )
...
library(cluster)
...
x <- consensus(res)
hc <- hclust(as.dist(1-x), method='average')
cl <- cutree(hc, k = use.this.k)
cl.hp <- cl[hp$colInd]
sil <- silhouette( cutree(hc, k = use.this.k), as.dist(1-x) )
write.table( sil, "silhouette.UNsorted.txt", sep="\t" )
pdf(file="consensusmap.silhouette.pdf")
plot(sil)
dev.off()
sil.summary <- summary(sil)
write(sil.summary$avg.width, "silhouette.avg.width.txt")
...

I'd not noticed this difference before. Some differences are very likely expected, given that the rank survey run and the main run are independent, and NMF is stochastic rather than deterministic.

Am I understanding this difference correctly? Or is the silhouette calculated differently in the rank survey?

Thank you,

Gordon
--
Gordon Robertson
Michael Smith Genome Sciences Centre
BC Cancer Agency
Vancouver BC Canada
http://www.bcgsc.ca/

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] RColorBrewer_1.0-5  doParallel_1.0.3    iterators_1.0.6
 [4] foreach_1.4.1       ggplot2_0.9.3.1     NMF_0.17.5
 [7] bigmemory_4.4.3     BH_1.51.0-1         bigmemory.sri_0.1.2
[10] Biobase_2.20.1      BiocGenerics_0.6.0  digest_0.6.3
[13] rngtools_1.2        pkgmaker_0.16       registry_0.2
[16] cluster_1.14.4      edgeR_3.2.4         limma_3.16.6

loaded via a namespace (and not attached):
 [1] codetools_0.2-8  colorspace_1.2-2 compiler_3.0.1   dichromat_2.0-0
 [5] gridBase_0.4-6   gtable_0.1.2     labeling_0.2     MASS_7.3-27
 [9] munsell_0.4.2    plyr_1.8         proto_0.3-10     reshape2_1.2.2
[13] scales_0.2.3     stringr_0.6.2    xtable_1.7-1

--
Renaud Gaujoux, PhD
Computational Biology - University of Cape Town, South Africa
