[Traminer-users] suggestion on storing large dissimilarity matrix?

Tue Jun 1 10:39:58 CEST 2010

Dear Juan Zuluaga,

This is indeed a common question, and there is no simple answer. The 
first thing to try is setting the "full.matrix" argument to FALSE. 
In that case, seqdist returns a "dist" object (see the help page for dist). 
That is, it stores only the lower triangle of the matrix (since the 
distances are symmetric, storing the whole matrix is redundant).

wide.om<- seqdist(wide.seq,method="OM",indel=2,sm=couts,with.missing=TRUE, full.matrix=FALSE)
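
To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch. It assumes each distance is stored as an 8-byte double and ignores any temporary copies R may make; n is the number of sequences from your message, and the variable names are only illustrative.

```r
## Rough memory estimate for n sequences (assumes 8-byte doubles)
n <- 14625
bytes_per_double <- 8

## Full n x n matrix versus the lower triangle kept by a "dist" object
full_mib <- n^2 * bytes_per_double / 1024^2
dist_mib <- n * (n - 1) / 2 * bytes_per_double / 1024^2

print(round(c(full = full_mib, dist = dist_mib), 1))
```

Under these assumptions the lower triangle alone needs roughly 816 MiB, which is close to the size reported in your error message, and the full matrix needs about twice that.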

All algorithms from the cluster package accept a "dist" object as input. 
If you have a great number of sequences, "pam" should be much more 
efficient than "agnes".
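
As a small illustration of passing a "dist" object directly, here is a sketch on made-up numeric data (not sequences). The toy data, the choice of k=2 and the Ward method for agnes are assumptions for the example only.

```r
library(cluster)

## Toy data: 50 made-up points in two dimensions
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)

## dist() keeps only the lower triangle of the distance matrix
d <- dist(x)

## Both functions take the "dist" object as is when diss=TRUE
fit.pam <- pam(d, k = 2, diss = TRUE)
fit.agnes <- agnes(d, diss = TRUE, method = "ward")

## Cut the agnes tree into two groups and compare with the PAM solution
agnes.groups <- cutree(as.hclust(fit.agnes), k = 2)
print(table(pam = fit.pam$clustering, agnes = agnes.groups))
```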

However, depending on your computer, I suspect that this will still not 
be sufficient. You may need access to a computer with more memory, 
running 64-bit Windows or Linux (see the R FAQ on this topic). To give 
you an idea, I have run an analysis on 18,000 sequences on a Linux 
computer with 8 GB of memory.

Finally, there is another solution. You may consider clustering a random 
sample of your sequences and then assigning each sequence to the cluster 
whose medoid is closest. I would suggest using PAM in this case, 
since this is more or less what PAM does anyway.

For instance, using the biofam data set and sampling 573 sequences:

library(TraMineR)

data(biofam)
biofam.lab<- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq<- seqdef(biofam, 10:25, labels=biofam.lab)

mysample<- sample(nrow(biofam), 573)

## Select the sequences

sampledseq<- biofam.seq[mysample,]

## Compute distance on the sample using constant costs

biofam.om<- seqdist(sampledseq,method="OM",indel=1,sm="CONSTANT",with.missing=TRUE, full.matrix=FALSE)

## PAM clustering

library(cluster)
myclustering<- pam(biofam.om, diss=TRUE, k=2)

## Recover medoids

medoids<- disscenter(biofam.om, medoids.index="first", group=myclustering$clustering)

## Compute distance to each medoid using refseq

dist.medoid1<- seqdist(biofam.seq,method="OM",indel=1,sm="CONSTANT",with.missing=TRUE, refseq=sampledseq[medoids[1], ])

dist.medoid2<- seqdist(biofam.seq,method="OM",indel=1,sm="CONSTANT",with.missing=TRUE, refseq=sampledseq[medoids[2], ])

## Storing the new cluster solution

newcluster<- numeric(nrow(biofam.seq))
for(i in seq_len(nrow(biofam.seq))){
    ## which.min returns 1 if medoid 1 is closest (or tied), 2 otherwise
    newcluster[i]<- which.min(c(dist.medoid1[i],dist.medoid2[i]))
}
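
If you use more than two clusters, the same assignment step can be written for any number of medoids. The following is only a sketch: assign.to.nearest is a hypothetical helper name, and the toy distance matrix is made up so that the block runs without TraMineR.

```r
## Hypothetical helper: takes an n x k matrix of distances (one column per
## medoid) and returns, for each row, the index of the closest medoid.
## which.min returns the first minimum, so ties go to the lower index.
assign.to.nearest <- function(dist.to.medoids) {
    apply(dist.to.medoids, 1, which.min)
}

## Made-up distances of 4 cases to 2 medoids
toy <- cbind(c(0, 4, 2, 6), c(5, 1, 2, 0))
newcluster.toy <- assign.to.nearest(toy)
print(newcluster.toy)

## With the objects defined above, the matrix could presumably be built as:
## dist.to.medoids <- sapply(medoids, function(m)
##     seqdist(biofam.seq, method="OM", indel=1, sm="CONSTANT",
##             with.missing=TRUE, refseq=sampledseq[m, ]))
```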

You may want to repeat the whole procedure several times to be sure that 
the solution is stable enough. You may also compute the medoids of the 
new clusters (using all the cases) and compare them to those obtained 
from the sample alone.

biofam.om.cl1<- seqdist(biofam.seq[newcluster==1,], method="OM",indel=1,sm="CONSTANT",with.missing=TRUE)

medoid1<- disscenter(biofam.om.cl1, medoids.index="first")
print(biofam.seq[newcluster==1,][medoid1,])
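
To compare the solutions obtained from two repetitions, a simple cross-tabulation is often enough. Here is a minimal sketch for the two-cluster case. The vectors run1 and run2 are made-up stand-ins for two "newcluster" results, and agreement.k2 is a hypothetical helper that allows for the cluster labels being swapped between runs.

```r
## Made-up cluster memberships from two repetitions (labels 1 and 2)
run1 <- c(1, 1, 2, 2, 1, 2)
run2 <- c(2, 2, 1, 1, 2, 2)

## Cross-tabulate the two solutions
print(table(run1, run2))

## Hypothetical helper, valid for two clusters only: share of cases grouped
## the same way, taking the better of the two possible label matchings
agreement.k2 <- function(a, b) {
    same <- mean(a == b)
    max(same, 1 - same)
}

print(agreement.k2(run1, run2))
```

A value close to 1 suggests that the two repetitions give essentially the same partition.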

If you have further questions or if some things remain unclear, please 
feel free to ask.

All the best!

Matthias Studer

On 01.06.2010 02:15, Zuluaga, Juan [zuju0701 at stcloudstate.edu] wrote:
> Hello TraMineR people,
> great package, awesome effort.
>
> I realize that this is not really a TraMineR question, but I bet it will be a common question of interest for many TraMineR users.
>
> What would you suggest to be able to store large matrices?
>
>    
>> wide.om<- seqdist(wide.seq,method="OM",indel=2,sm=couts,with.missing=TRUE)
>>      
>   [>] 14625 sequences with 21 distinct events/states
>   [>] including missing value as additional state
>   [>] 4316 distinct sequences
>   [>] min/max sequence length: 1/22
>   [>] computing distances using OM metric
> Error: cannot allocate vector of size 815.9 Mb
>
> I tried
> wide.om<- big.matrix(seqdist ...))
> from package bigmemory, but still produces the same result.