[Traminer-users] Re : Re: longest common substring

Hadrien Commenges hc at parisgeo.cnrs.fr
Tue Mar 20 11:11:31 CET 2012


Yes it works! 
I'll test it with my data (5.000 cases, sequences of max length 8, alphabet of 6) and tell you how much time it takes.
Thank you again,
Hadrien



----- Mail d'origine -----
De: Chris Cameron <cjc73 at cornell.edu>
À: Users questions <traminer-users at r-forge.wu-wien.ac.at>
Envoyé: Tue, 20 Mar 2012 03:44:08 +0100 (CET)
Objet: Re: [Traminer-users] longest common substring

Hi Hadrien -

These functions will work if your sequence alphabet consists of single character codes. Because it is in R instead of C, the computations can take quite a long time. I tested it with sequences of length 12. Longer sequence lengths could take substantially longer. 

The number of comparisons necessary to compute a distance matrix is quite large-- this will only work reasonably for relatively small data sets. The computational complexity is the square of the number of cases, so you can estimate the completion time by calculating the time per comparison and multiplying it by the square of the number of cases. This table will help you guess how long it might take.

Case (Sequence) Count		Time to Compute
				100		7 seconds
			   1000		11 mins
			   2000		50 mins (est)
			   5000		5 hours (est) 
			  10000    24 hours (est)


# Example of how the sequence data can be converted to text strings and sent to lcsubstr_seqdist( )
data.txt = seqconc(seq_data, sep='')
seq_dist2 = lcsubstr_seqdist( data.txt[1:10] ) #test it with a subset and see it if looks right!

#----------------
# code to source into R (or paste)
# Greatest common substring algorithm		
lcsubstr = function(s1, s2) {
	# Based on pseudocode from http://en.wikipedia.org/wiki/Longest_common_substring_problem
	# Code sections from Gabor Grothendieck (http://finzi.psych.upenn.edu/R/Rhelp02a/archive/68013.html)
	# This code provided without warranty. It may not work or could return false values. Use at your own risk.
	# Make sure args are strings 
  a <- as.vector(s1, 'character'); an <- nchar(s1) 
  b <- as.vector(s2, 'character'); bn <- nchar(s2) 
  # If one arg is an empty string, returns the length of the other
  if (nchar(a)==0) return(nchar(b)) 
  if (nchar(b)==0) return(nchar(a))
	# end Gabor code
	z = 0
	long_len = ''
	m = matrix(0, nrow=an, ncol=bn)
	for (i in 1:an) {
		for (j in 1:bn) {
			if (substr(a,i,i)==substr(b,j,j)) {
				if (i==1 | j==1) {
					m[i,j]=1
				} else {
					m[i,j]=m[i-1,j-1]+1		
				}
				if (m[i,j] > z) {
					z = m[i,j]
					long_len = ''
				}
				if (m[i,j] == z) {
					long_len = substr(a,i-z+1,i)
				}
			} else {
				m[i,j]=0
			}
		}
	}
	return(nchar(long_len))
}			
	
	
lcsubstr_seqdist = function( charSeqData ) {
	d_nrow = length(charSeqData)
	dist_matrix = matrix(0, nrow=d_nrow, ncol=d_nrow)
	for (i in 1:d_nrow) {
		for (j in 1:d_nrow){			
			#print(c(i,j))
			dist_matrix[i,j] = lcsubstr( charSeqData[i], charSeqData[j] )
		}
	}
	return(dist_matrix)
}









On Mar 19, 2012, at 11:30 AM, Hadrien Commenges wrote:

> Thank you Chris. I checked the TraMineR Extras package, but there is no function to compute the longest common substring. I don't know if the developers could give me an easy way to modify the seqdist function in order to compute longest common substring. If not, I'm very interested in the ideas you mentioned to do that, although I'm not a good programmer and I don't know anything about C.
> Thanks,
> Hadrien
> 
> 
> 
> ----- Mail d'origine -----
> De: Chris Cameron 
> À: Users questions 
> Envoyé: Sun, 18 Mar 2012 18:48:56 +0100 (CET)
> Objet: Re: [Traminer-users] longest common substring
> 
> Hi Hadrien - 
> 
> I don't see a documented way to compute the greatest common substrings, so a custom function is probably necessary. (Maybe a dev can tell you if there is a hidden flag to indicate substring vs subsequence. The LCS code is implemented in C, so there is not an easy function to modify in the TraMineR package. I have some ideas about how you can make a custom function to do this if you want to talk about it. Before you go that route, it would be worth checking the "TraMineR Extras" package Matthias mentioned on the 16th. 
> 
> Good Luck
> Chris
> 
> 
> 
> 
> On Mar 17, 2012, at 3:02 PM, Hadrien Commenges wrote:
> 
> Hi,
> 
> I've a new question. I'd like to get a dissimilarity matrix based on the longest common substring. The longest prefix (computed with seqdist, method=LCS) doesn't work for me because I'd like to compare substrings anywhere in the sequence. And the notion of subsequence (computed with seqdist, method=LCS) doesn't serve my purpose neither. I'd like to compute something like the LCS method but with the stricter notion of substring. Is it possible?
> 
> Thanks,
> 
> Hadrien
> _______________________________________________
> Traminer-users mailing list
> Traminer-users at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users
> 
> _______________________________________________
> Traminer-users mailing list
> Traminer-users at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users

_______________________________________________
Traminer-users mailing list
Traminer-users at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/traminer-users/attachments/20120320/2692fd9c/attachment-0001.html>


More information about the Traminer-users mailing list