[Traminer-users] suggestion on storing large dissimilarity matrix?

Zuluaga, Juan zuju0701 at stcloudstate.edu
Tue Jun 1 17:31:56 CEST 2010


Mr. Studer, thanks a million for your very useful answer. 

I should read more about the exploration of clusters via medoids, since it seems useful in general.  

As you said, a more powerful machine (a Mac with 4 GB of memory, running R64) calculated the OM distance matrix -- in 14 seconds. Amazing.

The full.matrix=FALSE option, for some reason, did not work -- it complained about not being able to allocate 815.9 Mb of memory, the same as before, as if the large matrix still had to exist somewhere before becoming a half-sized dist object.

Merci beaucoup, and keep up the good work. 

-j


________________________________________
From: traminer-users-bounces at lists.r-forge.r-project.org [traminer-users-bounces at lists.r-forge.r-project.org] On Behalf Of Matthias Studer [Matthias.Studer at unige.ch]
Sent: Tuesday, June 01, 2010 3:39 AM
To: traminer-users at lists.r-forge.r-project.org
Subject: Re: [Traminer-users] suggestion on storing large dissimilarity matrix?

Dear Juan Zuluaga,

This is indeed a common question and there is no simple answer. The
first thing to try is setting the "full.matrix" argument to FALSE. In
that case, seqdist returns a "dist" object (see the help page on dist),
which stores only the lower triangle of the matrix (since the distances
are symmetric, the full matrix is redundant).

wide.om <- seqdist(wide.seq, method = "OM", indel = 2, sm = couts,
                   with.missing = TRUE, full.matrix = FALSE)


All clustering algorithms from the cluster package accept a "dist"
object as input. If you have a large number of sequences, "pam" should
be much more efficient than "agnes".
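
For instance, a minimal sketch (the choice of k = 2 clusters here is
arbitrary and only for illustration):

library(cluster)

## Both functions accept the "dist" object returned by seqdist() directly
myclust.pam <- pam(wide.om, diss = TRUE, k = 2)                ## k-medoids
myclust.agnes <- agnes(wide.om, diss = TRUE, method = "ward")  ## hierarchical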

However, I suspect that even the "dist" object solution may not be
sufficient, depending on your computer. You may need access to a
computer with more memory, running 64-bit Windows or Linux (see the R
FAQ on this topic). To give you an idea, I have run an analysis of
18'000 sequences on a Linux computer with 8 GB of memory.
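
As a rough idea of the sizes involved (back-of-the-envelope arithmetic,
with distances stored as 8-byte doubles):

n <- 14625
n^2 * 8 / 2^20              ## full n x n matrix: about 1632 Mb
n * (n - 1) / 2 * 8 / 2^20  ## lower triangle ("dist" object): about 815.9 Mb

The second figure matches the 815.9 Mb allocation in your error message.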

Finally, there is another solution. You may consider using a random
sample of your sequences and then assign each sequence to the closest
medoids of each cluster. I would suggest you to use PAM in this case,
since this is more or less what PAM does anyway.

For instance, using the biofam data set and sampling 573 sequences:

library(TraMineR)

data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
                "Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels = biofam.lab)

mysample <- sample(nrow(biofam), 573)

## Select the sequences

sampledseq <- biofam.seq[mysample, ]

## Compute distance on the sample using constant costs

biofam.om <- seqdist(sampledseq, method = "OM", indel = 1, sm = "CONSTANT",
                     with.missing = TRUE, full.matrix = FALSE)

## PAM clustering

library(cluster)
myclustering <- pam(biofam.om, diss = TRUE, k = 2)

## Recover medoids

medoids <- disscenter(biofam.om, medoids.index = "first",
                      group = myclustering$clustering)

## Compute the distance to each medoid using refseq

dist.medoid1 <- seqdist(biofam.seq, method = "OM", indel = 1, sm = "CONSTANT",
                        with.missing = TRUE, refseq = sampledseq[medoids[1], ])

dist.medoid2 <- seqdist(biofam.seq, method = "OM", indel = 1, sm = "CONSTANT",
                        with.missing = TRUE, refseq = sampledseq[medoids[2], ])

## Storing the new cluster solution

newcluster <- numeric(nrow(biofam.seq))
for (i in 1:nrow(biofam.seq)) {
    newcluster[i] <- which.min(c(dist.medoid1[i], dist.medoid2[i]))
}
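
Equivalently, without the loop:

## Vectorized equivalent of the loop above
newcluster <- apply(cbind(dist.medoid1, dist.medoid2), 1, which.min)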

You may want to repeat the whole procedure several times to make sure
that the solution is stable enough (a rough sketch of such a check
follows below). You may also compute the medoids of the new clusters
(using all the cases) and compare them to the medoids obtained from the
sample alone.

biofam.om.cl1 <- seqdist(biofam.seq[newcluster == 1, ], method = "OM",
                         indel = 1, sm = "CONSTANT", with.missing = TRUE)

medoid1 <- disscenter(biofam.om.cl1, medoids.index = "first")
print(biofam.seq[newcluster == 1, ][medoid1, ])
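
For the repetition check mentioned above, here is a rough sketch of how
one could wrap the sample-then-assign procedure in a function and
cross-tabulate two runs (the function name and its defaults are only
illustrative):

## Illustrative sketch only: rerun the sampling, clustering and medoid
## assignment, then compare two runs. A stable solution should put most
## cases on a single diagonal of the table (up to label switching).
assign.from.sample <- function(seqs, n.sample = 573, k = 2) {
    s <- sample(nrow(seqs), n.sample)
    d <- seqdist(seqs[s, ], method = "OM", indel = 1, sm = "CONSTANT",
                 with.missing = TRUE, full.matrix = FALSE)
    cl <- pam(d, diss = TRUE, k = k)
    med <- disscenter(d, medoids.index = "first", group = cl$clustering)
    dist.to.medoids <- sapply(med, function(m)
        seqdist(seqs, method = "OM", indel = 1, sm = "CONSTANT",
                with.missing = TRUE, refseq = seqs[s, ][m, ]))
    apply(dist.to.medoids, 1, which.min)
}

run1 <- assign.from.sample(biofam.seq)
run2 <- assign.from.sample(biofam.seq)
table(run1, run2)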


If you have further questions or if some things remain unclear, please
feel free to ask.

All the best!

Matthias Studer

On 01.06.2010 02:15, Zuluaga, Juan [zuju0701 at stcloudstate.edu] wrote:
> Hello TraMineR people,
> great package, awesome effort.
>
> I realize that this is not really a TraMineR question, but I bet it will be a common question of interest for many TraMineR users.
>
> What would you suggest to be able to store large matrices?
>
>
>> wide.om<- seqdist(wide.seq,method="OM",indel=2,sm=couts,with.missing=TRUE)
>>
>   [>] 14625 sequences with 21 distinct events/states
>   [>] including missing value as additional state
>   [>] 4316 distinct sequences
>   [>] min/max sequence length: 1/22
>   [>] computing distances using OM metric
> Error: cannot allocate vector of size 815.9 Mb
>
> I tried
> wide.om <- big.matrix(seqdist(...))
> from the bigmemory package, but it still produced the same error.

_______________________________________________
Traminer-users mailing list
Traminer-users at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users

