<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Thanks for the feedback - with Matthias's optimization, you should be able to cut the time by almost half. <div><br><div><div>On Mar 21, 2012, at 2:24 PM, Hadrien Commenges wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><span class="Apple-style-span" style="border-collapse: separate; font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-border-horizontal-spacing: 0px; -webkit-border-vertical-spacing: 0px; -webkit-text-decorations-in-effect: none; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; font-size: medium; "><div style="height: 4397px; color: rgb(0, 0, 0); font-size: 12pt; font-family: Arial, Helvetica, sans-serif; "><div style="font-family: Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); ">For information, it took 3 hours to compute the longest common substring with Chris' function (whole matrix). The data I used was a sequence object of 6.000 cases, sequences of max length 8, alphabet of 8.<br>Thanks!<br><br>Hadrien<br><br><hr id="zwchr"><div style="color: rgb(0, 0, 0); font-weight: normal; font-style: normal; text-decoration: none; font-family: Helvetica, Arial, sans-serif; font-size: 12pt; "><b>De:<span class="Apple-converted-space"> </span></b>"Chris Cameron" <<a href="mailto:cjc73@cornell.edu">cjc73@cornell.edu</a>><br><b>À:<span class="Apple-converted-space"> </span></b>"Users questions" <<a href="mailto:traminer-users@r-forge.wu-wien.ac.at">traminer-users@r-forge.wu-wien.ac.at</a>><br><b>Envoyé:<span class="Apple-converted-space"> </span></b>Mercredi 21 Mars 2012 14:49:34<br><b>Objet:<span class="Apple-converted-space"> </span></b>Re: [Traminer-users] Re : Re:  longest common substring<br><br><div>Hello Matthias</div><div><br></div>That is a great optimization! I think it might also be possible to vectorize the inner loop, but the greatest gains would come from implementing it in C like the LCS (subsequence) version, but I am not sure if it is a common enough similarity measure to warrant inclusion. <div><br></div><div>Thanks for the great R package. I am taking a graduate course in sequence analysis and I am trying to do all the exercises in TraMineR rather than STATA. So far, so good.</div><div><br></div><div>Chris</div><div><br></div><div><br><div><div>On Mar 21, 2012, at 8:14 AM, Matthias Studer wrote:</div><br class="Apple-interchange-newline"><blockquote><div>Hi Chris,<br><br>Many thank for contributing!<br><br>Just a small suggestion. If I'm right the longest common substring should be symmetric. Hence, you may divide computation time by two if you compute it only once for each comparison (instead of two). This is done by modifying the second loop in  lcsubstr_seqdist<br>        for (i in 1:d_nrow) {<br>            for (j in i:d_nrow){ ## Change start from i instead of one<br>                d = lcsubstr( charSeqData[i], charSeqData[j] )<br>                dist_matrix[i,j] <- d<br>                dist_matrix[j, i] <- d<br>            }<br>        }<br><br>All the best,<br>Matthias<br><br>Le 20.03.2012 11:11, Hadrien Commenges a écrit :<blockquote cite="mid:876c0495-0cd9-4c49-b422-633afc6b8996@clara">Yes it works!<span class="Apple-converted-space"> </span><br>I'll test it with my data (5.000 cases, sequences of max length 8, alphabet of 6) and tell you how much time it takes.<br>Thank you again,<br>Hadrien<br><br><br><br>----- Mail d'origine -----<br>De: Chris Cameron<span class="Apple-converted-space"> </span><a class="moz-txt-link-rfc2396E" href="mailto:cjc73@cornell.edu" target="_blank"><cjc73@cornell.edu></a><br>À: Users questions<span class="Apple-converted-space"> </span><a class="moz-txt-link-rfc2396E" href="mailto:traminer-users@r-forge.wu-wien.ac.at" target="_blank"><traminer-users@r-forge.wu-wien.ac.at></a><br>Envoyé: Tue, 20 Mar 2012 03:44:08 +0100 (CET)<br>Objet: Re: [Traminer-users] longest common substring<br><br>Hi Hadrien -<br><br>These functions will work if your sequence alphabet consists of single character codes. Because it is in R instead of C, the computations can take quite a long time. I tested it with sequences of length 12. Longer sequence lengths could take substantially longer.<span class="Apple-converted-space"> </span><br><br>The number of comparisons necessary to compute a distance matrix is quite large-- this will only work reasonably for relatively small data sets. The computational complexity is the square of the number of cases, so you can estimate the completion time by calculating the time per comparison and multiplying it by the square of the number of cases. This table will help you guess how long it might take.<br><br>Case (Sequence) Count Time to Compute<br>100 7 seconds<br>1000 11 mins<br>2000 50 mins (est)<br>5000 5 hours (est)<span class="Apple-converted-space"> </span><br>10000 24 hours (est)<br><br><br># Example of how the sequence data can be converted to text strings and sent to lcsubstr_seqdist( )<br>data.txt = seqconc(seq_data, sep='')<br>seq_dist2 = lcsubstr_seqdist( data.txt[1:10] ) #test it with a subset and see it if looks right!<br><br>#----------------<br># code to source into R (or paste)<br># Greatest common substring algorithm<span class="Apple-converted-space"> </span><br>lcsubstr = function(s1, s2) {<br># Based on pseudocode from<a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Longest_common_substring_problem" target="_blank">http://en.wikipedia.org/wiki/Longest_common_substring_problem</a><br># Code sections from Gabor Grothendieck (<a class="moz-txt-link-freetext" href="http://finzi.psych.upenn.edu/R/Rhelp02a/archive/68013.html" target="_blank">http://finzi.psych.upenn.edu/R/Rhelp02a/archive/68013.html</a>)<br># This code provided without warranty. It may not work or could return false values. Use at your own risk.<br># Make sure args are strings<span class="Apple-converted-space"> </span><br>a <- as.vector(s1, 'character'); an <- nchar(s1)<span class="Apple-converted-space"> </span><br>b <- as.vector(s2, 'character'); bn <- nchar(s2)<span class="Apple-converted-space"> </span><br># If one arg is an empty string, returns the length of the other<br>if (nchar(a)==0) return(nchar(b))<span class="Apple-converted-space"> </span><br>if (nchar(b)==0) return(nchar(a))<br># end Gabor code<br>z = 0<br>long_len = ''<br>m = matrix(0, nrow=an, ncol=bn)<br>for (i in 1:an) {<br>for (j in 1:bn) {<br>if (substr(a,i,i)==substr(b,j,j)) {<br>if (i==1 | j==1) {<br>m[i,j]=1<br>} else {<br>m[i,j]=m[i-1,j-1]+1<span class="Apple-converted-space"> </span><br>}<br>if (m[i,j] > z) {<br>z = m[i,j]<br>long_len = ''<br>}<br>if (m[i,j] == z) {<br>long_len = substr(a,i-z+1,i)<br>}<br>} else {<br>m[i,j]=0<br>}<br>}<br>}<br>return(nchar(long_len))<br>}<span class="Apple-converted-space"> </span><br><br><br>lcsubstr_seqdist = function( charSeqData ) {<br>d_nrow = length(charSeqData)<br>dist_matrix = matrix(0, nrow=d_nrow, ncol=d_nrow)<br>for (i in 1:d_nrow) {<br>for (j in 1:d_nrow){<span class="Apple-converted-space"> </span><br>#print(c(i,j))<br>dist_matrix[i,j] = lcsubstr( charSeqData[i], charSeqData[j] )<br>}<br>}<br>return(dist_matrix)<br>}<br><br><br><br><br><br><br><br><br><br>On Mar 19, 2012, at 11:30 AM, Hadrien Commenges wrote:<br><br>> Thank you Chris. I checked the TraMineR Extras package, but there is no function to compute the longest common substring. I don't know if the developers could give me an easy way to modify the seqdist function in order to compute longest common substring. If not, I'm very interested in the ideas you mentioned to do that, although I'm not a good programmer and I don't know anything about C.<br>> Thanks,<br>> Hadrien<br>><span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>> ----- Mail d'origine -----<br>> De: Chris Cameron<span class="Apple-converted-space"> </span><br>> À: Users questions<span class="Apple-converted-space"> </span><br>> Envoyé: Sun, 18 Mar 2012 18:48:56 +0100 (CET)<br>> Objet: Re: [Traminer-users] longest common substring<br>><span class="Apple-converted-space"> </span><br>> Hi Hadrien -<span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>> I don't see a documented way to compute the greatest common substrings, so a custom function is probably necessary. (Maybe a dev can tell you if there is a hidden flag to indicate substring vs subsequence. The LCS code is implemented in C, so there is not an easy function to modify in the TraMineR package. I have some ideas about how you can make a custom function to do this if you want to talk about it. Before you go that route, it would be worth checking the "TraMineR Extras" package Matthias mentioned on the 16th.<span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>> Good Luck<br>> Chris<br>><span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>><span class="Apple-converted-space"> </span><br>> On Mar 17, 2012, at 3:02 PM, Hadrien Commenges wrote:<br>><span class="Apple-converted-space"> </span><br>> Hi,<br>><span class="Apple-converted-space"> </span><br>> I've a new question. I'd like to get a dissimilarity matrix based on the longest common substring. The longest prefix (computed with seqdist, method=LCS) doesn't work for me because I'd like to compare substrings anywhere in the sequence. And the notion of subsequence (computed with seqdist, method=LCS) doesn't serve my purpose neither. I'd like to compute something like the LCS method but with the stricter notion of substring. Is it possible?<br>><span class="Apple-converted-space"> </span><br>> Thanks,<br>><span class="Apple-converted-space"> </span><br>> Hadrien<br>> _______________________________________________<br>> Traminer-users mailing list<br>><span class="Apple-converted-space"> </span><a class="moz-txt-link-abbreviated" href="mailto:Traminer-users@lists.r-forge.r-project.org" target="_blank">Traminer-users@lists.r-forge.r-project.org</a><br>><span class="Apple-converted-space"> </span><a class="moz-txt-link-freetext" href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a><br>><span class="Apple-converted-space"> </span><br>> _______________________________________________<br>> Traminer-users mailing list<br>><span class="Apple-converted-space"> </span><a class="moz-txt-link-abbreviated" href="mailto:Traminer-users@lists.r-forge.r-project.org" target="_blank">Traminer-users@lists.r-forge.r-project.org</a><br>><span class="Apple-converted-space"> </span><a class="moz-txt-link-freetext" href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a><br><br>_______________________________________________<br>Traminer-users mailing list<br><a class="moz-txt-link-abbreviated" href="mailto:Traminer-users@lists.r-forge.r-project.org" target="_blank">Traminer-users@lists.r-forge.r-project.org</a><br><a class="moz-txt-link-freetext" href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a><br><br><br><fieldset class="mimeAttachmentHeader"></fieldset><br><pre>_______________________________________________

Traminer-users mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Traminer-users@lists.r-forge.r-project.org" target="_blank">Traminer-users@lists.r-forge.r-project.org</a>

<a class="moz-txt-link-freetext" href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a>

</pre></blockquote><br></div>_______________________________________________<br>Traminer-users mailing list<br><a href="mailto:Traminer-users@lists.r-forge.r-project.org" target="_blank">Traminer-users@lists.r-forge.r-project.org</a><br><a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a></blockquote></div><br></div><br>_______________________________________________<br>Traminer-users mailing list<br><a href="mailto:Traminer-users@lists.r-forge.r-project.org">Traminer-users@lists.r-forge.r-project.org</a><br><a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a></div><br></div>_______________________________________________<br>Traminer-users mailing list<br><a href="mailto:Traminer-users@lists.r-forge.r-project.org">Traminer-users@lists.r-forge.r-project.org</a><br><a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/traminer-users</a></div></span></blockquote></div><br></div></body></html>