How are you supposed to interpret the string that drives the parsing? Does each stem have the same number of ">" characters in its opening sequence as "<" characters in its closing sequence? That is what it appears to be, judging by the way stem 3 is parsed. You will have to provide a little more insight into how to interpret the symbols. Does the parsing always start with a partial stem 0, as your example shows? Is there a way of making sure you have the right sequences when you start? Is there a chance of an error in the middle of the string that you would have to restart from? How long are the strings that you want to parse? Is each one a self-contained sequence like the one in your example, or do they go on for thousands of characters? Is there always at least one '.' between stems? A full set of rules for how the parsing should be done would be useful. Do you have a BNF grammar for the format?<br>
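<br>
One way to look at the first of these questions for the example in the quoted message is to count the symbols directly. This is only a sketch in R, assuming that ">" opens a stem position, "<" closes one, and "." marks an unpaired position; the variable names (chars, depth, balanced, runs) are illustrative and do not come from the original post.<br>

```r
# Structure string copied from the quoted message.
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

chars <- strsplit(Str, "")[[1]]

# Running nesting depth: +1 for ">", -1 for "<", 0 for ".".
depth <- cumsum((chars == ">") - (chars == "<"))

# Balanced means the depth never goes negative and ends at zero.
balanced <- all(depth >= 0) && depth[length(depth)] == 0
print(balanced)

# Run lengths of consecutive identical symbols, in order of appearance.
runs <- rle(chars)
print(data.frame(symbol = runs$values, length = runs$lengths))
```

In this example the opening runs have lengths 7, 4, 5 and 5, while the closing runs have lengths 4, 5 and 12, so the final run of "<" closes stem 3 and stem 0 together. Run lengths alone therefore do not pair up one-to-one.<br>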
<br>
<div class="gmail_quote">On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <span dir="ltr"><<a href="mailto:tal.galili@gmail.com">tal.galili@gmail.com</a>></span> wrote:<br>
<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">Hello all,<br><br>For some work I am doing on RNA, I want to use R to do string parsing that<br>(I think) is like a simplistic HTML parsing.<br>
<br><br>For example, let's say we have the following two variables:<br><br> Seq <-<br>"GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"<br> Str <-<br>">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."<br>
<br>Say that I want to parse "Seq" according to "Str", by using the legend here:<br>
<pre>
Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |          Stem 1           Stem 2                Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                      Stem 0
</pre>
Assume that we always have 4 stems (0 to 3), but that the number of letters<br>
before and after each of them can vary.<br><br>The output should be something like the following list structure:<br><br><br> list(<br> "Stem 0 opening" = "GCCTCGA",<br> "before Stem 1" = "TA",<br>
"Stem 1" = list(opening = "GCTC",<br> inside = "AGTTGGGA",<br> closing = "GAGC"<br> ),<br> "between Stem 1 and 2" = "G",<br> "Stem 2" = list(opening = "TACGA",<br>
inside = "CTGAAGA",<br> closing = "TCGTA"<br> ),<br> "between Stem 2 and 3" = "AGGtC",<br> "Stem 3" = list(opening = "ACCAG",<br> inside = "TTCGATC",<br>
closing = "CTGGT"<br> ),<br> "After Stem 3" = "",<br> "Stem 0 closing" = "TCGGGGC"<br> )<br><br><br>I don't have any experience with programming a parser, and would like<br>
advice as to what strategy to use when programming something like this (and<br>any recommended R commands to use).<br><br><br>What I was thinking of is to first get rid of the "Stem 0", then go through<br>the inner string with a recursive function (let's call it "seperate.stem")<br>
that each time will split the string into:<br>1. before stem<br>2. opening stem<br>3. inside stem<br>4. closing stem<br>5. after stem<br><br>Where the "after stem" will then be recursively entered into the same<br>
function ("seperate.stem")<br><br>The thing is that I am not sure how to do this coding without using<br>a loop.<br><br>Any advice would be most welcome.<br>
</blockquote></div><br><br clear="all"><br>-- <br>Jim Holtman<br>Cincinnati, OH<br>+1 513 646 9390<br><br>What is the problem that you are trying to solve?<br>
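<br>
A minimal sketch in R of one way to do the split described in the quoted question, under assumptions the thread has not confirmed: ">" and "<" are properly nested, each opening run is closed by an equally long contiguous piece of a closing run, and "." marks unpaired positions. The function name parse_stems and the layout of its return value are hypothetical. It uses an explicit loop with a stack of open runs rather than the recursive "seperate.stem" function proposed in the question.<br>

```r
# Hypothetical helper: split Seq into stems according to Str.
parse_stems <- function(Seq, Str) {
  stopifnot(nchar(Seq) == nchar(Str))
  runs   <- rle(strsplit(Str, "")[[1]])
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1

  open_stack <- list()  # opening runs not yet closed, each as c(start, end)
  stems <- list()
  for (k in seq_along(runs$values)) {
    if (runs$values[k] == ">") {
      # Push the opening run onto the stack.
      open_stack[[length(open_stack) + 1]] <- c(starts[k], ends[k])
    } else if (runs$values[k] == "<") {
      # A closing run may close several opening runs, innermost first.
      pos <- starts[k]
      while (pos <= ends[k]) {
        if (length(open_stack) == 0) stop("unmatched '<' at position ", pos)
        op <- open_stack[[length(open_stack)]]
        open_stack[[length(open_stack)]] <- NULL
        len <- op[2] - op[1] + 1
        if (pos + len - 1 > ends[k]) {
          stop("closing run is shorter than its opening run (assumption violated)")
        }
        stems[[length(stems) + 1]] <- list(
          opening = substr(Seq, op[1], op[2]),
          inside  = substr(Seq, op[2] + 1, pos - 1),
          closing = substr(Seq, pos, pos + len - 1),
          span    = c(op[1], pos + len - 1)
        )
        pos <- pos + len
      }
    }
  }
  if (length(open_stack) > 0) stop("unmatched '>'")
  stems
}

# Example data copied from the quoted message.
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

stems <- parse_stems(Seq, Str)
str(stems)
```

The stems come out in the order in which they are closed, so for this example stems 1 to 3 are the first three elements and the enclosing stem 0 is the last one, with everything between its two arms as its "inside". The unpaired pieces between stems could be read off the "." runs of the same rle() result.<br>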