[Seqinr-forum] [R] How to parse a string (by a "new" markup) with R ?

Tue Mar 16 15:00:18 CET 2010

Hi Jim,
Thanks for the questions, here are my answers:

*Q: Does each sequence have the same number of ">>>>" for the opening
sequence as it does for "<<<<" on the ending sequence?  *
A: Yes
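
Since the counts are expected to match, this can be checked on each string before any parsing is attempted. Below is a minimal sketch in base R; the helper name is_balanced is hypothetical, and Str is the example string from the original post.

```r
# Structure string copied from the example in the original post.
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: TRUE when every ">" is closed by a later "<"
# and no "<" appears before its ">".
is_balanced <- function(str) {
  ch <- strsplit(str, "")[[1]]
  # +1 for an opening symbol, -1 for a closing symbol, 0 for "."
  step <- (ch == ">") - (ch == "<")
  # The running depth must never go negative and must end at zero.
  all(cumsum(step) >= 0) && sum(step) == 0
}

is_balanced(Str)
```

With tens of thousands of strings, the same helper could be applied with vapply(strings, is_balanced, logical(1)) to flag the malformed ones first.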

*Q: Does the parsing always start with a partial stem 0 as your example
shows? *
A: No. Sometimes it will start with a few "."

*Q: Is there a way of making sure you have the right sequences when you
start? *
A: I am not sure I understand what you mean.

*Q: Is there a chance of error in the middle of the string that you have to
restart from?*
A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...

*Q: How long are these strings that you want to parse? *
A: Each string has between 60 and 150 characters (and I have tens
of thousands of them...)

*Q: Is each one a self contained sequence like you show in your example, or
do they go on for thousands of characters? *
A: Each sequence is self-contained.

*Q: Is there always at least one '.' between stems?  *
A: No.
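
Because stems can touch each other, run lengths alone are not enough to separate them. The sketch below uses base R's rle() on the example Str from the original post to show this; the variable names are only for illustration. The last run of "<" covers both the closing of Stem 3 and the closing of Stem 0.

```r
# Structure string copied from the example in the original post.
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Split into single characters and compute runs of identical symbols.
ch   <- strsplit(Str, "")[[1]]
runs <- rle(ch)

# One row per run, with its start and end position in the string.
run_table <- data.frame(
  symbol = runs$values,
  length = runs$lengths,
  end    = cumsum(runs$lengths),
  stringsAsFactors = FALSE
)
run_table$start <- run_table$end - run_table$length + 1

print(run_table)
```

The run table is still handy for extracting the "." stretches between stems, but matching each ">" to its own "<" needs something more than run lengths.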

*Q: A full set of rules as to how the parsing should be done would be
useful.*
A: I agree.  But since I don't have even a basic idea of how to start coding
this, I thought I would first get some help with the beginning, then try to
handle the other cases as they come up before coming back for more help.
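
As one possible starting point, here is a minimal sketch that pairs each ">" with its "<" using a stack (ordinary bracket matching, not a method confirmed anywhere in this thread) and then groups consecutive pairs into stems. The names pair_brackets and find_stems are hypothetical, the code uses a plain loop, and it assumes balanced input: it stops with an error instead of trying to recover from a broken stem.

```r
# Example data copied from the original post.
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: for each position, the index of its matching symbol
# (NA for "."). A stack holds the positions of ">" that are still open.
pair_brackets <- function(str) {
  ch <- strsplit(str, "")[[1]]
  partner <- rep(NA_integer_, length(ch))
  stack <- integer(0)
  for (i in seq_along(ch)) {
    if (ch[i] == ">") {
      stack <- c(stack, i)                       # push
    } else if (ch[i] == "<") {
      if (length(stack) == 0) stop("unmatched '<' at position ", i)
      j <- stack[length(stack)]                  # pop the most recent ">"
      stack <- stack[-length(stack)]
      partner[i] <- j
      partner[j] <- i
    }
  }
  if (length(stack) > 0) stop("unmatched '>' at position ", stack[1])
  partner
}

# Hypothetical helper: one row per stem, in order of its opening position.
find_stems <- function(partner) {
  open <- which(!is.na(partner) & partner > seq_along(partner))
  if (length(open) == 0) {
    return(data.frame(open_start = integer(0), open_end = integer(0),
                      close_start = integer(0), close_end = integer(0)))
  }
  close <- partner[open]
  # A new stem starts when the ">" run or the matching "<" run is interrupted.
  starts <- which(c(TRUE, diff(open) != 1 | diff(close) != -1))
  ends   <- c(starts[-1] - 1, length(open))
  data.frame(open_start  = open[starts], open_end  = open[ends],
             close_start = close[ends],  close_end = close[starts])
}

stems <- find_stems(pair_brackets(Str))
stems$opening <- substring(Seq, stems$open_start, stems$open_end)
stems$inside  <- substring(Seq, stems$open_end + 1, stems$close_start - 1)
stems$closing <- substring(Seq, stems$close_start, stems$close_end)
print(stems)
```

On the example, row 1 corresponds to Stem 0 and rows 2 to 4 give the opening, inside and closing pieces of Stems 1 to 3. The pieces before, between and after the stems can then be cut out with substring() from the gaps between consecutive rows.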

*Q: Do you have the BNF syntax for parsing?*
A: No. Your e-mail is the first time I came across it
(http://en.wikipedia.org/wiki/Backus–Naur_Form).

Thanks for the help,
Tal

----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

On Tue, Mar 16, 2010 at 1:59 PM, jim holtman <jholtman at gmail.com> wrote:

> How are you supposed to interpret the string that is doing the parsing?
> Does each sequence have the same number of ">>>>" for the opening sequence
> as it does for "<<<<" on the ending sequence?  That is what it appears to be,
> looking at the way stem 3 is parsed.  You will have to provide a little more
> insight on how to interpret the symbols.  Does the parsing always start
> with a partial stem 0 as your example shows?  Is there a way of making sure
> you have the right sequences when you start?  Is there a chance of error in
> the middle of the string that you have to restart from?  How long are these
> strings that you want to parse?  Is each one a self contained sequence like
> you show in your example, or do they go on for thousands of characters?  Is
> there always at least one '.' between stems?  A full set of rules as to how
> the parsing should be done would be useful.  Do you have the BNF syntax for
> parsing?
>
> On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
>
>> Hello all,
>>
>> For some work I am doing on RNA, I want to use R to do string parsing that
>> (I think) is like a simplistic HTML parsing.
>>
>>
>> For example, let's say we have the following two variables:
>>
>>    Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
>>    Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
>>
>> Say that I want to parse "Seq" according to "Str", using the legend below:
>>
>> Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
>> Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
>>      |     |  |              | |               |     |               ||     |
>>      +-----+  +--------------+ +---------------+     +---------------++-----+
>>         |        Stem 1            Stem 2                 Stem 3         |
>>         |                                                                |
>>         +----------------------------------------------------------------+
>>                                 Stem 0
>>
>> Assume that we always have 4 stems (0 to 3), but that the number of letters
>> before and after each of them can vary.
>>
>> The output should be something like the following list structure:
>>
>>
>>    list(
>>      "Stem 0 opening" = "GCCTCGA",
>>      "before Stem 1" = "TA",
>>      "Stem 1" = list(opening = "GCTC",
>>                      inside = "AGTTGGGA",
>>                      closing = "GAGC"),
>>      "between Stem 1 and 2" = "G",
>>      "Stem 2" = list(opening = "TACGA",
>>                      inside = "CTGAAGA",
>>                      closing = "TCGTA"),
>>      "between Stem 2 and 3" = "AGGtC",
>>      "Stem 3" = list(opening = "ACCAG",
>>                      inside = "TTCGATC",
>>                      closing = "CTGGT"),
>>      "After Stem 3" = "",
>>      "Stem 0 closing" = "TCGGGGC"
>>    )
>>
>>
>> I don't have any experience with programming a parser, and would like
>> advice on what strategy to use when programming something like this (and
>> any recommended R commands to use).
>>
>>
>> What I was thinking of is to first get rid of "Stem 0", then go through
>> the inner string with a recursive function (let's call it "seperate.stem")
>> that each time splits the string into:
>> 1. before stem
>> 2. opening stem
>> 3. inside stem
>> 4. closing stem
>> 5. after stem
>>
>> Where the "after stem" will then be recursively entered into the same
>> function ("seperate.stem")
>>
>> The thing is that I am not sure how to code this without using a loop.
>>
>> Any advice would be most welcome.
>>
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>