How are you supposed to interpret the string that drives the parsing?  Does each stem have the same number of &quot;&gt;&quot; characters in its opening sequence as &quot;&lt;&quot; characters in its ending sequence?  That is what it appears to be, judging by the way stem 3 is parsed.  You will have to provide a little more insight into how to interpret the symbols.  Does the parsing always start with a partial stem 0, as your example shows?  Is there a way of making sure you have the right sequences when you start?  Is there a chance of an error in the middle of the string that you would have to restart from?  How long are the strings that you want to parse?  Is each one a self-contained sequence like the one in your example, or do they go on for thousands of characters?  Is there always at least one &#39;.&#39; between stems?  A full set of rules for how the parsing should be done would be useful.  Do you have a BNF grammar for the notation?<br>
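<br>
Pending answers to the questions above, here is a minimal sketch in R of one possible reading of the notation. It assumes, without confirmation, that the brackets are balanced and that each stem's opening run of &quot;&gt;&quot; is closed by exactly the same number of &quot;&lt;&quot;. The function name parse_stems and the fields it returns are hypothetical, chosen only for illustration. It pairs runs with a stack, so adjacent closing runs (such as stem 3 followed directly by stem 0) are split correctly.<br>

```r
# Pair each run of '>' with the '<' characters that close it, using a stack.
# ASSUMPTIONS (not confirmed by the original post): brackets are balanced and
# a stem's closing run has the same length as its opening run.
# parse_stems and its return fields are hypothetical names.
parse_stems <- function(Seq, Str) {
  stopifnot(nchar(Seq) == nchar(Str))

  # Run-length encode the structure string into runs of '>', '<' and '.'.
  runs  <- rle(strsplit(Str, "")[[1]])
  end   <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1

  stack <- list()  # opening runs still waiting for their closing run
  stems <- list()  # completed stems, in the order in which they close

  for (i in seq_along(runs$values)) {
    if (runs$values[i] == ">") {
      # Push the opening run: where it starts and how long it is.
      stack[[length(stack) + 1]] <- c(start = start[i], len = runs$lengths[i])
    } else if (runs$values[i] == "<") {
      # One run of '<' may close several stems in a row.
      pos  <- start[i]
      left <- runs$lengths[i]
      while (left > 0) {
        if (length(stack) == 0) stop("unmatched '<' at position ", pos)
        top <- stack[[length(stack)]]
        stack[[length(stack)]] <- NULL  # pop
        if (top[["len"]] > left) {
          stop("closing run at position ", pos, " is shorter than its opening run")
        }
        open_end <- top[["start"]] + top[["len"]] - 1
        stems[[length(stems) + 1]] <- list(
          opening  = substr(Seq, top[["start"]], open_end),
          inside   = substr(Seq, open_end + 1, pos - 1),
          closing  = substr(Seq, pos, pos + top[["len"]] - 1),
          open_at  = top[["start"]],
          close_at = pos
        )
        pos  <- pos + top[["len"]]
        left <- left - top[["len"]]
      }
    }
    # Runs of '.' need no action here; they end up in 'inside' or between stems.
  }
  if (length(stack) > 0) stop("unmatched '>' run(s) remain")
  stems
}
```

Called on the two strings in your post, this returns four stems in the order they close (stems 1, 2, 3, then 0). The final run of twelve &quot;&lt;&quot; is split into five characters for stem 3 followed by seven for stem 0. The pieces between stems can then be read off from the open_at and close_at positions.<br>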

<br>

<div class="gmail_quote">On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <span dir="ltr">&lt;<a href="mailto:tal.galili@gmail.com">tal.galili@gmail.com</a>&gt;</span> wrote:<br>

<blockquote style="BORDER-LEFT: #ccc 1px solid; MARGIN: 0px 0px 0px 0.8ex; PADDING-LEFT: 1ex" class="gmail_quote">Hello all,<br><br>For some work I am doing on RNA, I want to use R to do string parsing that<br>(I think) is like a simplistic HTML parsing.<br>

<br><br>For example, let&#39;s say we have the following two variables:<br><br>   Seq &lt;-<br>&quot;GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA&quot;<br>   Str &lt;-<br>&quot;&gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.&quot;<br>

<br>Say that I want to parse &quot;Seq&quot; according to &quot;Str&quot;, by using the legend here<br><br>Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA<br>Str: &gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.<br>

<br>    |     |  |              | |               |     |               ||     |<br><br>    +-----+  +--------------+ +---------------+     +---------------++-----+<br><br>       |        Stem 1            Stem 2                 Stem 3         |<br>

<br>       |                                                                |<br><br>       +----------------------------------------------------------------+<br><br>                               Stem 0<br><br>Assume that we always have 4 stems (0 to 3), but that the length of letters<br>

before and after each of them can vary.<br><br>The output should be something like the following list structure:<br><br><br>   list(<br>    &quot;Stem 0 opening&quot; = &quot;GCCTCGA&quot;,<br>    &quot;before Stem 1&quot; = &quot;TA&quot;,<br>

    &quot;Stem 1&quot; = list(opening = &quot;GCTC&quot;,<br>    inside = &quot;AGTTGGGA&quot;,<br>    closing = &quot;GAGC&quot;<br>    ),<br>    &quot;between Stem 1 and 2&quot; = &quot;G&quot;,<br>    &quot;Stem 2&quot; = list(opening = &quot;TACGA&quot;,<br>

    inside = &quot;CTGAAGA&quot;,<br>    closing = &quot;TCGTA&quot;<br>    ),<br>    &quot;between Stem 2 and 3&quot; = &quot;AGGtC&quot;,<br>    &quot;Stem 3&quot; = list(opening = &quot;ACCAG&quot;,<br>    inside = &quot;TTCGATC&quot;,<br>

    closing = &quot;CTGGT&quot;<br>    ),<br>    &quot;After Stem 3&quot; = &quot;&quot;,<br>    &quot;Stem 0 closing&quot; = &quot;TCGGGGC&quot;<br>   )<br><br><br>I don&#39;t have any experience with programming a parser, and would like<br>

advice as to what strategy to use when programming something like this (and<br>any recommended R commands to use).<br><br><br>What I was thinking of is to first get rid of the &quot;Stem 0&quot;, then go through<br>the inner string with a recursive function (let&#39;s call it &quot;seperate.stem&quot;)<br>

that each time will split the string into:<br>1. before stem<br>2. opening stem<br>3. inside stem<br>4. closing stem<br>5. after stem<br><br>Where the &quot;after stem&quot; will then be recursively entered into the same<br>

function (&quot;seperate.stem&quot;)<br><br>The thing is that I am not sure how to try and do this coding without using<br>a loop.<br><br>Any advice will be most welcome.<br>

</blockquote></div><br><br clear="all"><br>-- <br>Jim Holtman<br>Cincinnati, OH<br>+1 513 646 9390<br><br>What is the problem that you are trying to solve?<br>