# [Traminer-users] suggestion on storing large dissimilarity matrix?

Matthias Studer Matthias.Studer at unige.ch
Tue Jun 1 10:39:58 CEST 2010

```
Dear Juan Zuluaga,

This is indeed a common question and there is no simple answer. The
first thing to try is setting the "full.matrix" argument to FALSE.
In that case, seqdist returns a "dist" object (see help on dist).
That is, it only stores the lower triangle of the matrix (since the
distances are symmetric, the upper triangle is redundant).

wide.om<- seqdist(wide.seq,method="OM",indel=2,sm=couts,with.missing=TRUE, full.matrix=FALSE)
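
## Rough size estimate, as a sketch: it assumes 8-byte doubles and ignores
## R's per-object overhead. n is the number of sequences reported in the
## seqdist output quoted below; the variable names are only illustrative.
n <- 14625
full.mb <- n^2 * 8 / 1024^2              # full n x n matrix
dist.mb <- n * (n - 1) / 2 * 8 / 1024^2  # lower triangle only ("dist")
print(round(c(full = full.mb, dist = dist.mb), 1))
## roughly 1632 Mb for the full matrix versus 816 Mb for the "dist" object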

All algorithms from the cluster package accept a "dist" object as input.
If you have a great number of sequences, "pam" should be much more
efficient than "agnes".
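
## For illustration only, on a small built-in data set: mtcars is not
## sequence data, it just stands in for any "dist" object. The toy.* names
## are made up for this sketch.
library(cluster)
toy.dist <- dist(scale(mtcars))
toy.pam <- pam(toy.dist, k=3, diss=TRUE)
toy.agnes <- agnes(toy.dist, diss=TRUE, method="ward")
## Cross-tabulate the two 3-cluster solutions
print(table(toy.pam$clustering, cutree(as.hclust(toy.agnes), k=3)))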

However, I bet that this will not be sufficient on its own, depending
on the memory available. The next option is a computer with more
memory, running 64-bit Windows or Linux (see the R FAQ on this
topic). To give you an idea, I have run an analysis on 18'000 sequences
on a Linux computer with 8GB of memory.

Finally, there is another solution. You may consider clustering a random
sample of your sequences and then assigning every sequence to the cluster
whose medoid is closest. I would suggest using PAM in this case,
since this is more or less what PAM does anyway.

For instance, using the biofam data set and sampling 573 sequences:

library(TraMineR)

data(biofam)
biofam.lab<- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq<- seqdef(biofam, 10:25, labels=biofam.lab)

mysample<- sample(nrow(biofam), 573)

## Select the sequences

sampledseq<- biofam.seq[mysample,]

## Compute distance on the sample using constant costs

biofam.om<- seqdist(sampledseq,method="OM",indel=1,sm="CONSTANT",with.missing=TRUE, full.matrix=FALSE)

## PAM clustering

library(cluster)
myclustering<- pam(biofam.om, diss=TRUE, k=2)

## Recover medoids

medoids<- disscenter(biofam.om, medoids.index="first", group=myclustering$clustering)

## Compute the distance from every sequence to each medoid using refseq

dist.medoid1<- seqdist(biofam.seq,method="OM",indel=1,sm="CONSTANT",with.missing=TRUE, refseq=sampledseq[medoids[1], ])

dist.medoid2<- seqdist(biofam.seq,method="OM",indel=1,sm="CONSTANT",with.missing=TRUE, refseq=sampledseq[medoids[2], ])

## Storing the new cluster solution

newcluster<- numeric(nrow(biofam.seq))
for(i in seq_len(nrow(biofam.seq))){
  ## index (1 or 2) of the closest medoid; ties go to the first
  newcluster[i]<- which.min(c(dist.medoid1[i],dist.medoid2[i]))
}
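
## A possible vectorized variant of the loop above, shown on toy numbers.
## d1 and d2 are made-up stand-ins for dist.medoid1 and dist.medoid2; with
## more medoids, one would simply bind more columns.
d1 <- c(0, 4, 2, 7)
d2 <- c(5, 1, 2, 3)
## max.col on the negated distances picks, for each row, the column with
## the smallest distance; ties go to the first column, as with which.min
toycluster <- max.col(-cbind(d1, d2), ties.method="first")
print(toycluster)
## 1 2 1 2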

You may want to repeat the whole procedure several times to be sure that
the solution is stable enough. You may also compute the medoids of the
new clusters (with all the cases) and compare them to the medoids obtained
from the sample alone.

biofam.om.cl1<- seqdist(biofam.seq[newcluster==1,], method="OM", indel=1, sm="CONSTANT", with.missing=TRUE)

medoid1<- disscenter(biofam.om.cl1, medoids.index="first")
print(biofam.seq[newcluster==1,][medoid1,])
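
## A sketch of how the stability check could look. cluster.from.sample is
## a hypothetical helper written for this example, not a TraMineR function.
## It takes the medoids from pam's id.med component rather than from
## disscenter, and it is self-contained (it reloads biofam).
library(TraMineR)
library(cluster)
data(biofam)
biofam.seq<- seqdef(biofam, 10:25)

cluster.from.sample<- function(seqdata, size, k){
  ## cluster a random sample with PAM
  smp<- seqdata[sample(nrow(seqdata), size), ]
  d<- seqdist(smp, method="OM", indel=1, sm="CONSTANT", full.matrix=FALSE)
  med<- pam(d, k=k, diss=TRUE)$id.med
  ## one column of distances per medoid, for all sequences
  dmat<- sapply(med, function(m)
    seqdist(seqdata, method="OM", indel=1, sm="CONSTANT", refseq=smp[m, ]))
  ## assign each sequence to its closest medoid
  max.col(-dmat, ties.method="first")
}

set.seed(1)
run1<- cluster.from.sample(biofam.seq, 573, 2)
run2<- cluster.from.sample(biofam.seq, 573, 2)
## Cluster labels may be swapped between runs, so look at how concentrated
## the cross-table is rather than at the diagonal only
print(table(run1, run2))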

If you have further questions or if some things remain unclear, please
do not hesitate to ask.

All the best!

Matthias Studer

On 01.06.2010 02:15, Zuluaga, Juan [zuju0701 at stcloudstate.edu] wrote:
> Hello TraMineR people,
> great package, awesome effort.
>
> I realize that this is not really a TraMineR question, but I bet it will be a common question of interest for many TraMineR users.
>
> What would you suggest to be able to store large matrices?
>
>
>> wide.om<- seqdist(wide.seq,method="OM",indel=2,sm=couts,with.missing=TRUE)
>>
>   [>] 14625 sequences with 21 distinct events/states
>   [>] including missing value as additional state
>   [>] 4316 distinct sequences
>   [>] min/max sequence length: 1/22
>   [>] computing distances using OM metric
> Error: cannot allocate vector of size 815.9 Mb
>
> I tried
> wide.om<- big.matrix(seqdist ...))
> from package bigmemory, but still produces the same result.

```