<div dir="ltr">Hi Jim,<div>Thanks for the questions. Here are my answers:<br><div><br></div><div><b>Q: Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence? </b></div>
<div>A: Yes</div><div><br></div><div><b>Q: Does the parsing always start with a partial stem 0 as your example shows? </b></div><div>A: No. Sometimes it will start with a few "."</div><div><br></div><div><b>Q: Is there a way of making sure you have the right sequences when you start? </b></div>
<div>A: I am not sure I understand what you mean.</div><div><br></div><div><b>Q: Is there a chance of error in the middle of the string that you have to restart from?</b></div><div>A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...</div>
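<div>A minimal sketch of how such mid-string errors might be detected before parsing, assuming the structure string uses only ">", "<" and "." as in the example. The name check_structure is hypothetical and not part of any package; it only counts brackets and tracks nesting depth.</div>

```r
# Hypothetical helper (an assumption, not from the thread): report whether a
# structure string has matching numbers of ">" and "<" and never closes a
# stem that was not opened.
check_structure <- function(struct_str) {
  chars <- strsplit(struct_str, "", fixed = TRUE)[[1]]
  # +1 for an opening bracket, -1 for a closing bracket, 0 for "."
  step  <- ifelse(chars == ">", 1L, ifelse(chars == "<", -1L, 0L))
  depth <- cumsum(step)
  list(
    n_open    = sum(chars == ">"),
    n_close   = sum(chars == "<"),
    # Balanced: depth never goes negative and ends at zero
    balanced  = length(depth) > 0 && all(depth >= 0) &&
                depth[length(depth)] == 0,
    # First position where a "<" has no matching ">" (NA if none)
    first_bad = which(depth < 0)[1]
  )
}

# Usage with the structure string from the original message
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
str(check_structure(Str))
```

<div>Strings that fail the check could be set aside for manual inspection rather than parsed automatically.</div>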
<div><br></div><div><b>Q: How long are these strings that you want to parse? </b></div><div>A: Each string has between 60 and 150 characters (and I have tens of thousands of them...)</div><div><br></div><div><b>Q: Is each one a self contained sequence like you show in your example, or do they go on for thousands of characters? </b></div>
<div>A: each sequence is self contained.</div><div><br></div><div><b>Q: Is there always at least one '.' between stems? </b></div><div>A: No.</div><div><br></div><div><b>Q: A full set of rules as to how the parsing should be done would be useful.</b></div>
<div>A: I agree. But since I don't have even a basic idea of how to start coding this, I thought I would first get some help with the beginning, then try to adapt the code to the other cases that come up before coming back for help.</div>
<div><br></div><div><b>Q: Do you have the BNF syntax for parsing?</b></div><div>A: No. Your e-mail is the first time I have come across it (<a href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form">http://en.wikipedia.org/wiki/Backus–Naur_Form</a>).</div>
<div><br></div><div><br></div><div>Thanks for the help,</div><div>Tal</div><div><br></div><div><br></div><div><br>----------------Contact Details:-------------------------------------------------------<br>Contact me: <a href="mailto:Tal.Galili@gmail.com">Tal.Galili@gmail.com</a> | 972-52-7275845<br>
Read me: <a href="http://www.talgalili.com">www.talgalili.com</a> (Hebrew) | <a href="http://www.biostatistics.co.il">www.biostatistics.co.il</a> (Hebrew) | <a href="http://www.r-statistics.com">www.r-statistics.com</a> (English)<br>
----------------------------------------------------------------------------------------------<br><br><br>
<br><br><div class="gmail_quote">On Tue, Mar 16, 2010 at 1:59 PM, jim holtman <span dir="ltr"><<a href="mailto:jholtman@gmail.com">jholtman@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
How are you supposed to interpret the string that is doing the parsing? Does each sequence have the same number of ">>>>" for the opening sequence as it does for "<<<<" on the ending sequence? That is what it appears to be, looking at the way stem 3 is parsed. You will have to provide a little more insight on how to interpret the symbols. Does the parsing always start with a partial stem 0 as your example shows? Is there a way of making sure you have the right sequences when you start? Is there a chance of error in the middle of the string that you have to restart from? How long are these strings that you want to parse? Is each one a self-contained sequence like you show in your example, or do they go on for thousands of characters? Is there always at least one '.' between stems? A full set of rules as to how the parsing should be done would be useful. Do you have the BNF syntax for parsing?<br>
<br>
<div class="gmail_quote"><div><div></div><div class="h5">On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <span dir="ltr"><<a href="mailto:tal.galili@gmail.com" target="_blank">tal.galili@gmail.com</a>></span> wrote:<br>
</div></div><blockquote style="border-left:#ccc 1px solid;margin:0px 0px 0px 0.8ex;padding-left:1ex" class="gmail_quote"><div><div></div><div class="h5">Hello all,<br><br>For some work I am doing on RNA, I want to use R to do string parsing that<br>
(I think) is like a simplistic HTML parsing.<br>
<br><br>For example, let's say we have the following two variables:<br><br> Seq <-<br>"GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"<br> Str <-<br>">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."<br>
<br>Say that I want to parse "Seq" According to "Str", by using the legend here<br><br>Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA<br>Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.<br>
<br> | | | | | | | || |<br><br> +-----+ +--------------+ +---------------+ +---------------++-----+<br><br> | Stem 1 Stem 2 Stem 3 |<br>
<br> | |<br><br> +----------------------------------------------------------------+<br><br> Stem 0<br><br>Assume that we always have 4 stems (0 to 3), but that the length of letters<br>
before and after each of them can vary.<br><br>The output should be something like the following list structure:<br><br><br> list(<br> "Stem 0 opening" = "GCCTCGA",<br> "before Stem 1" = "TA",<br>
"Stem 1" = list(opening = "GCTC",<br> inside = "AGTTGGGA",<br> closing = "GAGC"<br> ),<br> "between Stem 1 and 2" = "G",<br> "Stem 2" = list(opening = "TACGA",<br>
inside = "CTGAAGA",<br> closing = "TCGTA"<br> ),<br> "between Stem 2 and 3" = "AGGtC",<br> "Stem 3" = list(opening = "ACCAG",<br> inside = "TTCGATC",<br>
closing = "CTGGT"<br> ),<br> "After Stem 3" = "",<br> "Stem 0 closing" = "TCGGGGC"<br> )<br><br><br>I don't have any experience with programming a parser, and would like<br>
advice as to what strategy to use when programming something like this (and<br>any recommended R commands to use).<br><br><br>What I was thinking of is to first get rid of the "Stem 0", then go through<br>the inner string with a recursive function (let's call it "seperate.stem")<br>
that each time will split the string into:<br>1. before stem<br>2. opening stem<br>3. inside stem<br>4. closing stem<br>5. after stem<br><br>Where the "after stem" will then be recursively entered into the same<br>
function ("seperate.stem")<br><br>The thing is that I am not sure how to code this without using<br>a loop.<br><br>Any advice would be most welcome.<br><br></div></div><div class="im">
<br>
<br>______________________________________________<br><a href="mailto:R-help@r-project.org" target="_blank">R-help@r-project.org</a> mailing list<br><a href="https://stat.ethz.ch/mailman/listinfo/r-help" target="_blank">https://stat.ethz.ch/mailman/listinfo/r-help</a><br>
PLEASE do read the posting guide <a href="http://www.r-project.org/posting-guide.html" target="_blank">http://www.R-project.org/posting-guide.html</a><br>and provide commented, minimal, self-contained, reproducible code.<br>
</div></blockquote></div><font color="#888888"><br><br clear="all"><br>-- <br>Jim Holtman<br>Cincinnati, OH<br>+1 513 646 9390<br><br>What is the problem that you are trying to solve?<br>
</font></blockquote></div><br></div></div></div>
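<div>The splitting strategy described in the original message (opening, inside, and closing parts for each stem) can be sketched without recursion by first pairing every "<" with its matching ">" using a stack. This is only a sketch under stated assumptions: the structure string is balanced, it has the same length as the sequence, and a stem is taken to be a run of adjacent ">" whose partners are also adjacent. The function name split_stems and the "Stem 0", "Stem 1", ... labels (assigned in order of opening position) are hypothetical choices, not an established API.</div>

```r
# Hypothetical helper (an assumption, not from the thread): split a sequence
# into stems according to a ">"/"<"/"." structure string.
split_stems <- function(seq_str, struct_str) {
  stopifnot(nchar(seq_str) == nchar(struct_str))
  chars   <- strsplit(struct_str, "", fixed = TRUE)[[1]]
  partner <- rep(NA_integer_, length(chars))
  stack   <- integer(0)

  # Pair every "<" with the most recent unmatched ">"
  for (i in seq_along(chars)) {
    if (chars[i] == ">") {
      stack <- c(stack, i)
    } else if (chars[i] == "<") {
      if (length(stack) == 0) stop("unmatched '<' at position ", i)
      j <- stack[length(stack)]
      stack <- stack[-length(stack)]
      partner[i] <- j
      partner[j] <- i
    }
  }
  if (length(stack) > 0) stop("unmatched '>' at position ", stack[1])

  opens <- which(chars == ">")
  if (length(opens) == 0) return(list())

  # A new stem starts whenever the opening positions, or their partners,
  # stop being adjacent
  new_stem <- c(TRUE, diff(opens) != 1 | diff(partner[opens]) != -1)
  groups   <- split(opens, cumsum(new_stem))

  stems <- lapply(groups, function(op) {
    cl <- rev(partner[op])  # closing positions, left to right
    list(
      opening = substr(seq_str, min(op), max(op)),
      inside  = substr(seq_str, max(op) + 1, min(cl) - 1),
      closing = substr(seq_str, min(cl), max(cl))
    )
  })
  names(stems) <- paste("Stem", seq_along(stems) - 1)
  stems
}

# Usage with the example from the original message
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
str(split_stems(Seq, Str))
```

<div>Pairing brackets first means the fused run of "<" at the end of the example is divided between Stem 3 and Stem 0 by their partners rather than by counting characters. The unpaired segments ("before Stem 1", "between Stem 1 and 2", and so on) are not returned here; they could be extracted with substr from the gaps between consecutive stems.</div>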