<div dir="ltr">Wow, thank you very much, Andrej!<div><br></div><div>Tal<br><div><br>----------------Contact Details:-------------------------------------------------------<br>Contact me: <a href="mailto:Tal.Galili@gmail.com">Tal.Galili@gmail.com</a> | 972-52-7275845<br>
Read me: <a href="http://www.talgalili.com">www.talgalili.com</a> (Hebrew) | <a href="http://www.biostatistics.co.il">www.biostatistics.co.il</a> (Hebrew) | <a href="http://www.r-statistics.com">www.r-statistics.com</a> (English)<br>
----------------------------------------------------------------------------------------------<br><br><br>
<br><br><div class="gmail_quote">2010/3/17 Andrej Blejec <span dir="ltr"><<a href="mailto:Andrej.Blejec@nib.si">Andrej.Blejec@nib.si</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
A version using regular expressions, with a lot of regexpr() and substr() calls, is attached.<br>
Finally, everything is packed into the splitSeq() function.<br>
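<br>
The attachment is not reproduced here. A minimal base-R sketch along the same lines (one regular expression plus substring()) might look like the following. The function name split_seq_sketch and the field labels are made up for illustration, and this is not the attached splitSeq(); it assumes exactly four stems, that Stem 0 closes with as many < as it opens with >, and that Seq and Str have the same length.<br>
<pre>
```r
# Hypothetical sketch, not the attached splitSeq(): cut seq into the
# template fields described by the bracket string str.
split_seq_sketch <- function(seq, str) {
  stopifnot(nchar(seq) == nchar(str))
  # k = number of leading '>' characters, i.e. the width of Stem 0
  k <- attr(regexpr("^[>]+", str), "match.length")
  if (k < 1) stop("str must start with '>'")
  stem <- "([>]*)([.]*)([<]*)"                   # opening, inside, closing
  pat <- paste0("^([>]{", k, "})([.]*)",         # Stem 0 opening, gap
                stem, "([.]*)", stem, "([.]*)", stem,
                "([.]*)([<]{", k, "})([.]*)$")   # gap, Stem 0 closing, tail
  groups <- regmatches(str, regexec(pat, str))[[1]]
  if (length(groups) == 0) stop("str does not fit the four-stem template")
  lens <- nchar(groups[-1])                      # field widths, zeros allowed
  ends <- cumsum(lens)
  fields <- substring(seq, ends - lens + 1, ends)
  names(fields) <- c("Stem 0 opening", "before Stem 1",
                     "Stem 1 opening", "Stem 1 inside", "Stem 1 closing",
                     "between Stem 1 and 2",
                     "Stem 2 opening", "Stem 2 inside", "Stem 2 closing",
                     "between Stem 2 and 3",
                     "Stem 3 opening", "Stem 3 inside", "Stem 3 closing",
                     "After Stem 3", "Stem 0 closing", "trailing")
  fields
}

# sample input from the original question
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
fields <- split_seq_sketch(Seq, Str)
fields[["Stem 1 inside"]]
```
</pre>
Because the widths of Stem 0 are written into the pattern as exact counts, the long run of < at the end can only be divided one way between the closing of Stem 3 and the closing of Stem 0.<br>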
<br>
Andrej<br>
<br>
--<br>
Andrej Blejec<br>
National Institute of Biology<br>
Vecna pot 111 POB 141<br>
SI-1000 Ljubljana<br>
SLOVENIA<br>
e-mail: <a href="mailto:andrej.blejec@nib.si">andrej.blejec@nib.si</a><br>
URL: <a href="http://ablejec.nib.si" target="_blank">http://ablejec.nib.si</a><br>
tel: + 386 (0)59 232 789<br>
fax: + 386 1 241 29 80<br>
--------------------------<br>
Local Organizer of ICOTS-8<br>
International Conference on Teaching Statistics<br>
<a href="http://icots8.org" target="_blank">http://icots8.org</a><br>
<div><div></div><div class="h5"><br>
<br>
<br>
> -----Original Message-----<br>
> From: <a href="mailto:r-help-bounces@r-project.org">r-help-bounces@r-project.org</a> [mailto:<a href="mailto:r-help-bounces@r-project.org">r-help-bounces@r-project.org</a>] On Behalf Of Gabor Grothendieck<br>
> Sent: Tuesday, March 16, 2010 3:24 PM<br>
> To: Tal Galili<br>
> Cc: <a href="mailto:r-help@r-project.org">r-help@r-project.org</a>; <a href="mailto:seqinr-forum@r-forge.wu-wien.ac.at">seqinr-forum@r-forge.wu-wien.ac.at</a><br>
> Subject: Re: [R] How to parse a string (by a "new" markup) with R ?<br>
><br>
> We show how to use the gsubfn package to parse this.<br>
><br>
> The rules are not entirely clear so we will assume the following:<br>
><br>
> - there is a fixed template for the output which is the same as your<br>
> output but possibly with different character strings filled in. This<br>
> implies, for example, that there are exactly Stem0, Stem1, Stem2 and<br>
> Stem3 and no fewer or more stems.<br>
><br>
> - the sequence always starts with the open of Stem0, at least one dot<br>
> and the open of Stem1. There are no dots prior to the open of Stem0.<br>
> This seems to be implicit in your sample output since there is no zero<br>
> length string in your sample output corresponding to dots prior to<br>
> Stem0.<br>
><br>
> - Stem0 closes with the same number of < as there are > to open it<br>
><br>
> You can modify this yourself to take into account the actual rules<br>
> whatever they are.<br>
><br>
> We first calculate, k, the number of leading >'s using strapply.<br>
><br>
> Then we replace the leading k >'s with }'s and the trailing k <'s with<br>
> {'s giving us Str3:<br>
><br>
><br>
> "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."<br>
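<br>
The marking step can also be tried without gsubfn. The lines below are a base-R sketch of the same idea, assuming R 3.3.0 or later for strrep(); the variable names marked_open and marked are made up for illustration.<br>
<pre>
```r
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# k = number of leading '>' characters (the width of Stem 0)
k <- attr(regexpr("^[>]+", Str), "match.length")

# replace the leading run of '>' with k copies of '}'
marked_open <- sub("^[>]+", strrep("}", k), Str)

# replace the last k '<' (only non-'<' characters may follow) with '{'
marked <- sub(paste0("[<]{", k, "}([^<]*)$"),
              paste0(strrep("{", k), "\\1"),
              marked_open)
marked
```
</pre>
The second pattern is anchored at the end of the string, so only the final k characters of the last run of < are rewritten.<br>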
><br>
> We again use strapply, this time to get the lengths of the runs. Note that<br>
> zero length runs are possible so we cannot, for example, use rle for this.<br>
> For example there is a zero length run of dots between the last < and the<br>
> first {. read.fwf is used to actually parse out the strings using the<br>
> lengths we just calculated.<br>
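<br>
To see why rle falls short, compare it with a pattern made of starred groups on a toy string. This is a base-R sketch that uses regexec() and regmatches() in place of strapply; the string toy and the five-field pattern are invented for the illustration.<br>
<pre>
```r
toy <- "..>><<"   # dots, opening, (no dots), closing, (no dots)

# rle only reports the runs that are actually present
runs <- rle(strsplit(toy, "")[[1]])
runs$lengths      # 2 2 2

# a pattern with one starred group per template field reports every field,
# including the empty ones
pat <- "^([.]*)([>]*)([.]*)([<]*)([.]*)$"
groups <- regmatches(toy, regexec(pat, toy))[[1]][-1]
field_lens <- nchar(groups)
field_lens        # 2 2 0 2 0
```
</pre>
The zeros are what keep each length aligned with its slot in the template.<br>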
><br>
> Finally we fill in the template using relist.<br>
><br>
> # inputs<br>
><br>
> Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"<br>
> Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."<br>
> template <-<br>
> list(<br>
> "Stem 0 opening" = "",<br>
> "before Stem 1" = "",<br>
> "Stem 1" = list(opening = "",<br>
> inside = "",<br>
> closing = ""<br>
> ),<br>
> "between Stem 1 and 2" = "",<br>
> "Stem 2" = list(opening = "",<br>
> inside = "",<br>
> closing = ""<br>
> ),<br>
> "between Stem 2 and 3" = "",<br>
> "Stem 3" = list(opening = "",<br>
> inside = "",<br>
> closing = ""<br>
> ),<br>
> "After Stem 3" = "",<br>
> "Stem 0 closing" = ""<br>
> )<br>
><br>
> # processing<br>
><br>
> # create string made by repeating string s k times followed by more<br>
> reps <- function(s, k, more = "") {<br>
> paste(paste(rep(s, k), collapse = ""), more, sep = "")<br>
> }<br>
><br>
> library(gsubfn)<br>
> k <- nchar(strapply(Str, "^>+", c)[[1]])<br>
> Str2 <- sub("^>+", reps("}", k), Str)<br>
> Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)<br>
><br>
> pat <- "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"<br>
> lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)<br>
> tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))<br>
> closeAllConnections()<br>
> tokens[is.na(tokens)] <- ""<br>
> out <- relist(tokens, template)<br>
> out<br>
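<br>
The last two steps can also be tried in isolation. The sketch below uses substring() as a stand-in for read.fwf (an assumption, not what the code above does) on made-up toy inputs, then fills a small template with relist.<br>
<pre>
```r
# Toy inputs, made up for illustration: field widths 2 + 3 + 0 + 1
toy_seq  <- "ACGTTA"
toy_lens <- c(2, 3, 0, 1)

# cut fixed-width fields; a zero-width field comes out as ""
ends   <- cumsum(toy_lens)
starts <- ends - toy_lens + 1
fields <- substring(toy_seq, starts, ends)

# pour the flat vector into a nested template with the same number of leaves
toy_template <- list(head = "",
                     stem = list(opening = "", inside = ""),
                     tail = "")
toy_out <- utils::relist(fields, toy_template)
toy_out$stem$opening   # "GTT"
```
</pre>
relist only looks at the shape of the template, so the empty strings in it are placeholders that get overwritten in order.<br>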
><br>
><br>
> Here is the str of the output for your sample input:<br>
><br>
> > str(out)<br>
> List of 9<br>
> $ Stem 0 opening : chr "GCCTCGA"<br>
> $ before Stem 1 : chr "TA"<br>
> $ Stem 1 :List of 3<br>
> ..$ opening: chr "GCTC"<br>
> ..$ inside : chr "AGTTGGGA"<br>
> ..$ closing: chr "GAGC"<br>
> $ between Stem 1 and 2: chr "G"<br>
> $ Stem 2 :List of 3<br>
> ..$ opening: chr "TACGA"<br>
> ..$ inside : chr "CTGAAGA"<br>
> ..$ closing: chr "TCGTA"<br>
> $ between Stem 2 and 3: chr "AGGtC"<br>
> $ Stem 3 :List of 3<br>
> ..$ opening: chr "ACCAG"<br>
> ..$ inside : chr "TTCGATC"<br>
> ..$ closing: chr "CTGGT"<br>
> $ After Stem 3 : chr ""<br>
> $ Stem 0 closing : chr "TCGGGGC"<br>
><br>
><br>
><br>
> On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <<a href="mailto:tal.galili@gmail.com">tal.galili@gmail.com</a>><br>
> wrote:<br>
> > Hello all,<br>
> ><br>
> > For some work I am doing on RNA, I want to use R to do string parsing that<br>
> > (I think) is like a simplistic HTML parsing.<br>
> ><br>
> ><br>
> > For example, let's say we have the following two variables:<br>
> ><br>
> > Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"<br>
> > Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."<br>
> ><br>
> > Say that I want to parse "Seq" according to "Str", using the legend here:<br>
> ><br>
> > Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA<br>
> > Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.<br>
> ><br>
> > | | | | | | | || |<br>
> ><br>
> > +-----+ +--------------+ +---------------+ +---------------++-----+<br>
> ><br>
> > | Stem 1 Stem 2 Stem 3 |<br>
> ><br>
> > | |<br>
> ><br>
> > +----------------------------------------------------------------+<br>
> ><br>
> > Stem 0<br>
> ><br>
> > Assume that we always have 4 stems (0 to 3), but that the number of letters<br>
> > before and after each of them can vary.<br>
> ><br>
> > The output should be something like the following list structure:<br>
> ><br>
> ><br>
> > list(<br>
> > "Stem 0 opening" = "GCCTCGA",<br>
> > "before Stem 1" = "TA",<br>
> > "Stem 1" = list(opening = "GCTC",<br>
> > inside = "AGTTGGGA",<br>
> > closing = "GAGC"<br>
> > ),<br>
> > "between Stem 1 and 2" = "G",<br>
> > "Stem 2" = list(opening = "TACGA",<br>
> > inside = "CTGAAGA",<br>
> > closing = "TCGTA"<br>
> > ),<br>
> > "between Stem 2 and 3" = "AGGtC",<br>
> > "Stem 3" = list(opening = "ACCAG",<br>
> > inside = "TTCGATC",<br>
> > closing = "CTGGT"<br>
> > ),<br>
> > "After Stem 3" = "",<br>
> > "Stem 0 closing" = "TCGGGGC"<br>
> > )<br>
> ><br>
> ><br>
> > I don't have any experience with programming a parser, and would like<br>
> > advice on what strategy to use when programming something like this (and<br>
> > any recommended R commands to use).<br>
> ><br>
> ><br>
> > What I was thinking of is to first get rid of the "Stem 0", then go through<br>
> > the inner string with a recursive function (let's call it "separate.stem")<br>
> > that each time will split the string into:<br>
> > 1. before stem<br>
> > 2. opening stem<br>
> > 3. inside stem<br>
> > 4. closing stem<br>
> > 5. after stem<br>
> ><br>
> > Where the "after stem" will then be recursively entered into the same<br>
> > function ("separate.stem")<br>
> ><br>
> > The thing is that I am not sure how to do this coding without using<br>
> > a loop.<br>
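<br>
One way to write such a recursion without an explicit loop is sketched below in base R. The name separate_stem, its arguments and the shape of the returned list are assumptions for illustration; it expects Stem 0 to have been removed already and stems that are not nested inside each other.<br>
<pre>
```r
# Hypothetical recursive splitter: peel one stem off the front of the
# string, then call itself on whatever follows that stem.
separate_stem <- function(seq, str) {
  first <- regmatches(str, regexec("^([.]*)([>]+)", str))[[1]]
  if (length(first) == 0) {
    return(list(after = seq))            # no stem left: the rest is "after"
  }
  n <- nchar(first[3])                   # width of this stem's opening
  pat <- paste0("^([.]*)([>]+)([.]*)([<]{", n, "})")
  m <- regmatches(str, regexec(pat, str))[[1]]
  if (length(m) == 0) stop("stem is not closed by ", n, " '<' characters")
  lens <- nchar(m[-1])
  ends <- cumsum(lens)
  parts <- as.list(substring(seq, ends - lens + 1, ends))
  names(parts) <- c("before", "opening", "inside", "closing")
  used <- nchar(m[1])                    # characters consumed by this stem
  c(list(parts),
    separate_stem(substring(seq, used + 1), substring(str, used + 1)))
}

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# strip Stem 0: the k leading '>' and the last k '<' (plus anything after)
k <- attr(regexpr("^[>]+", Str), "match.length")
last_close <- as.integer(regexpr("[<][^<]*$", Str))   # position of last '<'
stems <- separate_stem(substr(Seq, k + 1, last_close - k),
                       substr(Str, k + 1, last_close - k))
stems[[1]]$inside
```
</pre>
Each call handles the "before", "opening", "inside" and "closing" pieces of one stem, and the recursion ends when no further > is found, at which point the remainder is returned as "after".<br>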
> ><br>
> > Any advice will be most welcome.<br>
> ><br>
> ><br>
> ><br>
> > [[alternative HTML version deleted]]<br>
> ><br>
> > ______________________________________________<br>
> > <a href="mailto:R-help@r-project.org">R-help@r-project.org</a> mailing list<br>
> > <a href="https://stat.ethz.ch/mailman/listinfo/r-help" target="_blank">https://stat.ethz.ch/mailman/listinfo/r-help</a><br>
> > PLEASE do read the posting guide <a href="http://www.R-project.org/posting-guide.html" target="_blank">http://www.R-project.org/posting-guide.html</a><br>
> > and provide commented, minimal, self-contained, reproducible code.<br>
> ><br>
</div></div></blockquote></div><br></div></div></div>