<div dir="ltr">Hi Jim,<div>Thanks for the questions. Here are my answers:<br><div><br></div><div><b>Q: Does each sequence have the same number of &quot;&gt;&gt;&gt;&gt;&quot; for the opening sequence as it does for &quot;&lt;&lt;&lt;&lt;&quot; on the ending sequence?  </b></div>


<div>A: Yes.</div><div><br></div><div><b>Q: Does the parsing always start with a partial stem 0 as your example shows? </b></div><div>A: No. Sometimes it will start with a few &quot;.&quot; characters.</div><div><br></div><div><b>Q: Is there a way of making sure you have the right sequences when you start? </b></div>


<div>A: I am not sure I understand what you mean.</div><div><br></div><div><b>Q: Is there a chance of error in the middle of the string that you have to restart from?</b></div><div>A: Sadly, yes. In which case, I&#39;ll need to ignore one of the inner stems...</div>
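<div><br></div><div>One possible reading of &quot;making sure you have the right sequences&quot; is a balance check: every &quot;&gt;&quot; in the structure string should be closed by a later &quot;&lt;&quot;. The following is only a minimal sketch under that assumption. The function name is_balanced is hypothetical, and any character other than &quot;&gt;&quot; and &quot;&lt;&quot; is treated like &quot;.&quot;:</div>
<pre>
# Hypothetical helper: TRUE if every &quot;&gt;&quot; is closed by a later &quot;&lt;&quot;.
is_balanced &lt;- function(str) {
  ch &lt;- strsplit(str, &quot;&quot;)[[1]]
  # +1 for &quot;&gt;&quot;, -1 for &quot;&lt;&quot;, 0 for anything else (such as &quot;.&quot;)
  step &lt;- (ch == &quot;&gt;&quot;) - (ch == &quot;&lt;&quot;)
  # The running depth must never go negative and must end at zero
  all(cumsum(step) &gt;= 0) &amp;&amp; sum(step) == 0
}

is_balanced(&quot;&gt;&gt;..&lt;&lt;.&quot;)  # every opening has a closing
is_balanced(&quot;&gt;&gt;..&lt;&quot;)     # one &quot;&gt;&quot; is never closed
</pre>
<div>Strings that fail such a check could be set aside for manual inspection before any further parsing.</div>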


<div><br></div><div><b>Q: How long are these strings that you want to parse? </b></div><div>A: Each string has between 60 and 150 characters (and I have tens of thousands of them...)</div><div><br></div><div><b>Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters? </b></div>


<div>A: Each sequence is self-contained.</div><div><br></div><div><b>Q: Is there always at least one &#39;.&#39; between stems?  </b></div><div>A: No.</div><div><br></div><div><b>Q: A full set of rules as to how the parsing should be done would be useful.</b></div>


<div>A: I agree.  But since I don&#39;t even have a basic idea of how to start coding this, I thought I would first get some help with the beginning, and then try to adapt it to the other cases as they come up before coming back for help.</div>
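<div><br></div><div>A minimal sketch of one possible starting point, not a full solution. It assumes the structure string is balanced and contains at least one stem. It uses a small loop with a stack to pair each &quot;&gt;&quot; with its &quot;&lt;&quot;, so that closings with no &quot;.&quot; between them (such as stem 3 and stem 0 in the example) can still be told apart. The names match_pairs and find_stems are hypothetical:</div>
<pre>
# Hypothetical helper: for each position, the index of its matching bracket
# (NA for positions that are not brackets).
match_pairs &lt;- function(str) {
  ch &lt;- strsplit(str, &quot;&quot;)[[1]]
  partner &lt;- rep(NA_integer_, length(ch))
  stack &lt;- integer(0)
  for (i in seq_along(ch)) {
    if (ch[i] == &quot;&gt;&quot;) {
      stack &lt;- c(stack, i)                 # remember the open position
    } else if (ch[i] == &quot;&lt;&quot;) {
      if (length(stack) == 0) stop(&quot;unmatched closing bracket at position &quot;, i)
      j &lt;- stack[length(stack)]            # most recent unmatched opening
      stack &lt;- stack[-length(stack)]
      partner[i] &lt;- j
      partner[j] &lt;- i
    }
  }
  if (length(stack) &gt; 0) stop(&quot;unmatched opening bracket at position &quot;, stack[1])
  partner
}

# Hypothetical helper: one row per stem, with the start and end positions of
# its opening and closing runs. Assumes at least one stem is present.
find_stems &lt;- function(str) {
  partner &lt;- match_pairs(str)
  open &lt;- which(!is.na(partner) &amp; partner &gt; seq_along(partner))
  # A new stem starts when the openings, or their partners, stop being adjacent
  new_stem &lt;- c(TRUE, diff(open) != 1 | diff(partner[open]) != -1)
  first &lt;- which(new_stem)
  last &lt;- c(first[-1] - 1, length(open))
  data.frame(open_start  = open[first],
             open_end    = open[last],
             close_start = partner[open[last]],
             close_end   = partner[open[first]])
}

Seq &lt;- &quot;GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA&quot;
Str &lt;- &quot;&gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.&quot;

stems &lt;- find_stems(Str)
# substring() is vectorised, so each call returns one piece per stem
opening &lt;- substring(Seq, stems$open_start, stems$open_end)
inside  &lt;- substring(Seq, stems$open_end + 1, stems$close_start - 1)
closing &lt;- substring(Seq, stems$close_start, stems$close_end)
</pre>
<div>For the example strings this gives four rows, with the outer stem 0 first. The pieces before, between and after the stems could be cut out of Seq in the same way, from the gaps between consecutive rows.</div>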


<div><br></div><div><b>Q: Do you have the BNF syntax for parsing?</b></div><div>A: No. Your e-mail is the first time I have come across it (<a href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form">http://en.wikipedia.org/wiki/Backus–Naur_Form</a>).</div>


<div><br></div><div><br></div><div>Thanks for the help,</div><div>Tal</div><div><br></div><div><br></div><div><br>----------------Contact Details:-------------------------------------------------------<br>Contact me: <a href="mailto:Tal.Galili@gmail.com">Tal.Galili@gmail.com</a> |  972-52-7275845<br>


Read me: <a href="http://www.talgalili.com">www.talgalili.com</a> (Hebrew) | <a href="http://www.biostatistics.co.il">www.biostatistics.co.il</a> (Hebrew) | <a href="http://www.r-statistics.com">www.r-statistics.com</a> (English)<br>


----------------------------------------------------------------------------------------------<br><br><br>

<br><br><div class="gmail_quote">On Tue, Mar 16, 2010 at 1:59 PM, jim holtman <span dir="ltr">&lt;<a href="mailto:jholtman@gmail.com">jholtman@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


How are you supposed to interpret the string that is doing the parsing?  Does each sequence have the same number of &quot;&gt;&gt;&gt;&gt;&quot; for the opening sequence as it does for &quot;&lt;&lt;&lt;&lt;&quot; on the ending sequence?  That is what it appears to be, looking at the way stem 3 is parsed.  You will have to provide a little more insight on how to interpret the symbols.  Does the parsing always start with a partial stem 0 as your example shows?  Is there a way of making sure you have the right sequences when you start?  Is there a chance of error in the middle of the string that you have to restart from?  How long are these strings that you want to parse?  Is each one a self-contained sequence like you show in your example, or do they go on for thousands of characters?  Is there always at least one &#39;.&#39; between stems?  A full set of rules as to how the parsing should be done would be useful.  Do you have the BNF syntax for parsing?<br>


<br>

<div class="gmail_quote"><div><div></div><div class="h5">On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <span dir="ltr">&lt;<a href="mailto:tal.galili@gmail.com" target="_blank">tal.galili@gmail.com</a>&gt;</span> wrote:<br>


</div></div><blockquote style="border-left:#ccc 1px solid;margin:0px 0px 0px 0.8ex;padding-left:1ex" class="gmail_quote"><div><div></div><div class="h5">Hello all,<br><br>For some work I am doing on RNA, I want to use R to do string parsing that<br>


(I think) is like simplistic HTML parsing.<br>

<br><br>For example, let&#39;s say we have the following two variables:<br><br>   Seq &lt;-<br>&quot;GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA&quot;<br>   Str &lt;-<br>&quot;&gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.&quot;<br>


<br>Say that I want to parse &quot;Seq&quot; according to &quot;Str&quot;, by using the legend here<br><br>Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA<br>Str: &gt;&gt;&gt;&gt;&gt;&gt;&gt;..&gt;&gt;&gt;&gt;........&lt;&lt;&lt;&lt;.&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;.....&gt;&gt;&gt;&gt;&gt;.......&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;&lt;.<br>


<br>    |     |  |              | |               |     |               ||     |<br><br>    +-----+  +--------------+ +---------------+     +---------------++-----+<br><br>       |        Stem 1            Stem 2                 Stem 3         |<br>


<br>       |                                                                |<br><br>       +----------------------------------------------------------------+<br><br>                               Stem 0<br><br>Assume that we always have 4 stems (0 to 3), but that the length of letters<br>


before and after each of them can vary.<br><br>The output should be something like the following list structure:<br><br><br>   list(<br>    &quot;Stem 0 opening&quot; = &quot;GCCTCGA&quot;,<br>    &quot;before Stem 1&quot; = &quot;TA&quot;,<br>


    &quot;Stem 1&quot; = list(opening = &quot;GCTC&quot;,<br>    inside = &quot;AGTTGGGA&quot;,<br>    closing = &quot;GAGC&quot;<br>    ),<br>    &quot;between Stem 1 and 2&quot; = &quot;G&quot;,<br>    &quot;Stem 2&quot; = list(opening = &quot;TACGA&quot;,<br>


    inside = &quot;CTGAAGA&quot;,<br>    closing = &quot;TCGTA&quot;<br>    ),<br>    &quot;between Stem 2 and 3&quot; = &quot;AGGtC&quot;,<br>    &quot;Stem 3&quot; = list(opening = &quot;ACCAG&quot;,<br>    inside = &quot;TTCGATC&quot;,<br>


    closing = &quot;CTGGT&quot;<br>    ),<br>    &quot;After Stem 3&quot; = &quot;&quot;,<br>    &quot;Stem 0 closing&quot; = &quot;TCGGGGC&quot;<br>   )<br><br><br>I don&#39;t have any experience with programming a parser, and would like<br>


advice as to what strategy to use when programming something like this (and<br>any recommended R commands to use).<br><br><br>What I was thinking of is to first get rid of the &quot;Stem 0&quot;, then go through<br>the inner string with a recursive function (let&#39;s call it &quot;seperate.stem&quot;)<br>


that each time will split the string into:<br>1. before stem<br>2. opening stem<br>3. inside stem<br>4. closing stem<br>5. after stem<br><br>Where the &quot;after stem&quot; will then be recursively entered into the same<br>


function (&quot;seperate.stem&quot;)<br><br>The thing is that I am not sure how to do this coding without using<br>a loop.<br><br>Any advice would be most welcome.<br><br></div></div><div class="im">


<br>

<br>______________________________________________<br><a href="mailto:R-help@r-project.org" target="_blank">R-help@r-project.org</a> mailing list<br><a href="https://stat.ethz.ch/mailman/listinfo/r-help" target="_blank">https://stat.ethz.ch/mailman/listinfo/r-help</a><br>


PLEASE do read the posting guide <a href="http://www.r-project.org/posting-guide.html" target="_blank">http://www.R-project.org/posting-guide.html</a><br>and provide commented, minimal, self-contained, reproducible code.<br>


</div></blockquote></div><font color="#888888"><br><br clear="all"><br>-- <br>Jim Holtman<br>Cincinnati, OH<br>+1 513 646 9390<br><br>What is the problem that you are trying to solve?<br>

</font></blockquote></div><br></div></div></div>