[Traminer-users] suggestion on storing large dissimilarity matrix?

Matthias Studer Matthias.Studer at unige.ch
Wed Jun 2 10:55:37 CEST 2010


Dear Juan,
In fact, when full.matrix=TRUE, seqdist copies the dist object into a 
full matrix (and not the reverse). Perhaps a previously computed distance 
matrix in your R environment was using up all the available memory?
I would be interested to know more about this issue.
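
For what it's worth, here is a quick way to check whether a leftover
object is using up the memory (a minimal sketch; the object name below
is only an example):

## List the objects in the global environment with their sizes
sort(sapply(ls(), function(x) object.size(get(x))))

## Remove a no longer needed distance matrix and release the memory
rm(old.dist.matrix)   ## hypothetical name
gc()
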
All the best
Matthias

On 01.06.2010 17:31, Zuluaga, Juan [zuju0701 at stcloudstate.edu] wrote:
> Mr. Studer, thanks a million for your very useful answer.
>
> I should read more about the exploration of clusters via medoids, since it seems useful in general.
>
> As you said, a more powerful machine (a Mac with 4Gb, running R64) calculated the OM distance matrix -- in 14 seconds. Amazing.
>
> The full.matrix=FALSE option, for some reason, did not work -- it complained about not being able to allocate 815.9 Mb of memory, same as before, as if the large matrix still had to exist somewhere before becoming a half-sized dist object.
>
> Merci beaucoup, and keep up the good work.
>
> -j
>
>
> ________________________________________
> From: traminer-users-bounces at lists.r-forge.r-project.org [traminer-users-bounces at lists.r-forge.r-project.org] On Behalf Of Matthias Studer [Matthias.Studer at unige.ch]
> Sent: Tuesday, June 01, 2010 3:39 AM
> To: traminer-users at lists.r-forge.r-project.org
> Subject: Re: [Traminer-users] suggestion on storing large dissimilarity matrix?
>
> Dear Juan Zuluaga,
>
> This is indeed a common question, and there is no simple answer. The
> first thing to try is setting the "full.matrix" argument to FALSE.
> In that case, seqdist returns a "dist" object (see the help on dist).
> That is, it stores only the lower triangle of the matrix (since the
> distances are symmetric, the full matrix is redundant).
>
> wide.om <- seqdist(wide.seq, method="OM", indel=2, sm=couts, with.missing=TRUE, full.matrix=FALSE)
>
>
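> If you later need the square matrix for a function that does not accept
> a "dist" object, you can expand it afterwards (a sketch; note that the
> full matrix takes roughly twice the memory of the dist object):
>
> wide.om.full <- as.matrix(wide.om)
>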
> All the algorithms in the cluster package accept "dist" objects as input.
> If you have a large number of sequences, "pam" should be much more
> efficient than "agnes".
>
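> For example (a sketch, assuming the wide.om "dist" object from above and
> an arbitrary number of clusters):
>
> library(cluster)
> pam.cl <- pam(wide.om, diss=TRUE, k=4)                ## partitioning around medoids
> agnes.cl <- agnes(wide.om, diss=TRUE, method="ward")  ## hierarchical clustering
>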
> However, depending on your computer, I suspect this alone will not be
> sufficient. You may need access to a computer with more memory, running
> 64-bit Windows or Linux (see the R FAQ on this topic). To give you an
> idea, I have run an analysis of 18'000 sequences on a Linux computer
> with 8GB of memory.
>
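> As a rough estimate (a back-of-the-envelope sketch: a "dist" object
> stores n*(n-1)/2 double values of 8 bytes each), you can compute the
> memory such an object will need before calling seqdist:
>
> n <- 14625                     ## number of sequences
> n * (n - 1) / 2 * 8 / 2^20     ## about 815.9 Mb, as in your error message
>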
> Finally, there is another solution. You may consider clustering a random
> sample of your sequences and then assigning each sequence to the closest
> cluster medoid. I would suggest using PAM in this case, since this is
> more or less what PAM does anyway.
>
> For instance, using the biofam data set and sampling 573 sequences:
>
> library(TraMineR)
>
> data(biofam)
> biofam.lab<- c("Parent", "Left", "Married", "Left+Marr",
> "Child", "Left+Child", "Left+Marr+Child", "Divorced")
> biofam.seq<- seqdef(biofam, 10:25, labels=biofam.lab)
>
> mysample <- sample(nrow(biofam), 573)
>
> ## Select the sequences
>
> sampledseq <- biofam.seq[mysample, ]
>
> ## Compute distance on the sample using constant costs
>
> biofam.om <- seqdist(sampledseq, method="OM", indel=1, sm="CONSTANT", with.missing=TRUE, full.matrix=FALSE)
>
> ## PAM clustering
>
> library(cluster)
> myclustering <- pam(biofam.om, diss=TRUE, k=2)
>
> ## Recover medoids
>
> medoids <- disscenter(biofam.om, medoids.index="first", group=myclustering$clustering)
>
> ## Compute the distance of every sequence to each medoid using refseq
>
> dist.medoid1 <- seqdist(biofam.seq, method="OM", indel=1, sm="CONSTANT", with.missing=TRUE, refseq=sampledseq[medoids[1], ])
>
> dist.medoid2 <- seqdist(biofam.seq, method="OM", indel=1, sm="CONSTANT", with.missing=TRUE, refseq=sampledseq[medoids[2], ])
>
> ## Storing the new cluster solution
>
> newcluster <- numeric(nrow(biofam.seq))
> for (i in 1:nrow(biofam.seq)) {
>         newcluster[i] <- which.min(c(dist.medoid1[i], dist.medoid2[i]))
> }
>
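> Equivalently, without the loop (a sketch, assuming dist.medoid1 and
> dist.medoid2 are the distance vectors computed above):
>
> newcluster <- apply(cbind(dist.medoid1, dist.medoid2), 1, which.min)
>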
> You may want to repeat the whole procedure several times to make sure
> that the solution is stable enough. You may also compute the medoids of
> the new clusters (using all the cases) and compare them with the medoids
> obtained from the sample alone.
>
> biofam.om.cl1 <- seqdist(biofam.seq[newcluster==1, ], method="OM",
>         indel=1, sm="CONSTANT", with.missing=TRUE)
>
> medoid1 <- disscenter(biofam.om.cl1, medoids.index="first")
> print(biofam.seq[newcluster==1, ][medoid1, ])
>
>
> If you have further questions or if anything remains unclear, please
> feel free to ask.
>
> All the best!
>
> Matthias Studer
>
> On 01.06.2010 02:15, Zuluaga, Juan [zuju0701 at stcloudstate.edu] wrote:
>    
>> Hello TraMineR people,
>> great package, awesome effort.
>>
>> I realize that this is not really a TraMineR question, but I bet it is a question of common interest to many TraMineR users.
>>
>> What would you suggest for storing large matrices?
>>
>>
>>      
>>> wide.om <- seqdist(wide.seq, method="OM", indel=2, sm=couts, with.missing=TRUE)
>>>
>>>        
>>    [>] 14625 sequences with 21 distinct events/states
>>    [>] including missing value as additional state
>>    [>] 4316 distinct sequences
>>    [>] min/max sequence length: 1/22
>>    [>] computing distances using OM metric
>> Error: cannot allocate vector of size 815.9 Mb
>>
>> I tried
>> wide.om <- big.matrix(seqdist(...))
>> from the bigmemory package, but it produced the same error.
> _______________________________________________
> Traminer-users mailing list
> Traminer-users at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users



More information about the Traminer-users mailing list