[Seqinr-forum] How to parse a string (by a "new" markup) with R ?

Tal Galili tal.galili at gmail.com
Tue Mar 16 11:10:55 CET 2010


Hello all,

For some work I am doing on RNA, I want to use R to do string parsing that
(I think) resembles a simplified form of HTML parsing.


For example, let's say we have the following two variables:

    Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
    Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

Say that I want to parse "Seq" according to "Str", using the legend below:

Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
     |     |  |              | |               |     |               ||     |
     +-----+  +--------------+ +---------------+     +---------------++-----+
        |        Stem 1            Stem 2                 Stem 3         |
        |                                                                |
        +----------------------------------------------------------------+
                                Stem 0
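
To make the legend concrete, here is a minimal sketch (not a full parser) that run-length encodes "Str" with base R's rle() and cuts "Seq" at the same positions. The name `segments` is only an illustrative choice.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Both strings must have the same length for positions to line up.
stopifnot(nchar(Seq) == nchar(Str))

# Run-length encode the structure string: one run per block of equal symbols.
runs   <- rle(strsplit(Str, "")[[1]])
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# `segments` is a hypothetical name: one row per run, with the matching bases.
segments <- data.frame(
  symbol = runs$values,
  start  = starts,
  end    = ends,
  bases  = substring(Seq, starts, ends),
  stringsAsFactors = FALSE
)
print(segments)
```

Note that runs alone are not enough: in this example the last run of "<" covers both the closing of Stem 3 and the closing of Stem 0, so it still has to be divided using the length of the matching opening run.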

Assume that we always have 4 stems (0 to 3), but that the number of letters
before and after each of them can vary.

The output should be something like the following list structure:


    list(
      "Stem 0 opening" = "GCCTCGA",
      "before Stem 1" = "TA",
      "Stem 1" = list(
        opening = "GCTC",
        inside = "AGTTGGGA",
        closing = "GAGC"
      ),
      "between Stem 1 and 2" = "G",
      "Stem 2" = list(
        opening = "TACGA",
        inside = "CTGAAGA",
        closing = "TCGTA"
      ),
      "between Stem 2 and 3" = "AGGtC",
      "Stem 3" = list(
        opening = "ACCAG",
        inside = "TTCGATC",
        closing = "CTGGT"
      ),
      "After Stem 3" = "",
      "Stem 0 closing" = "TCGGGGC"
    )


I don't have any experience with programming a parser, and would like
advice on what strategy to use when programming something like this (and
on any R functions you would recommend).


What I was thinking of is to first get rid of the "Stem 0", then go through
the inner string with a recursive function (let's call it "seperate.stem")
that each time will split the string into:
1. before stem
2. opening stem
3. inside stem
4. closing stem
5. after stem

The "after stem" part would then be passed recursively to the same
function ("seperate.stem").
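
The strategy described above could be sketched roughly as follows. This is only one possible reading of it, under two assumptions that are not stated in the question: each stem's closing run is as long as its opening run, and Stems 1 to 3 contain no nested stems. The helper names (`n_open0`, `inner_seq`, and so on) are made up for illustration, and the element names differ slightly from the desired list ("before Stem 2" instead of "between Stem 1 and 2").

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Step 1: get rid of Stem 0 (leading ">" run and the matching "<" at the end).
n            <- nchar(Str)
n_open0      <- n - nchar(sub("^>+", "", Str))    # width of the Stem 0 opening
n_trailing   <- n - nchar(sub("\\.*$", "", Str))  # unpaired symbols after Stem 0
close0_end   <- n - n_trailing
close0_start <- close0_end - n_open0 + 1          # assumes closing width == opening width

inner_seq <- substr(Seq, n_open0 + 1, close0_start - 1)
inner_str <- substr(Str, n_open0 + 1, close0_start - 1)

# Step 2: recursively split off "before / opening / inside / closing";
# whatever follows the closing run is passed back into the same function.
seperate.stem <- function(seq, str, k = 1) {
  m <- regexec("^(\\.*)(>+)(\\.*)(<+)", str)[[1]]
  if (m[1] == -1) {                        # base case: no stem left
    out <- list(seq)
    names(out) <- paste("after Stem", k - 1)
    return(out)
  }
  len    <- attr(m, "match.length")[-1]    # widths of the four capture groups
  ends   <- cumsum(len)
  starts <- ends - len + 1
  part   <- substring(seq, starts, ends)
  out <- list(part[1],
              list(opening = part[2], inside = part[3], closing = part[4]))
  names(out) <- paste(c("before Stem", "Stem"), k)
  rest <- ends[4] + 1
  c(out, seperate.stem(substring(seq, rest), substring(str, rest), k + 1))
}

result <- c(
  list("Stem 0 opening" = substr(Seq, 1, n_open0)),
  seperate.stem(inner_seq, inner_str),
  list("Stem 0 closing" = substr(Seq, close0_start, close0_end),
       "after Stem 0"   = substr(Seq, close0_end + 1, n))
)
str(result)
```

The extra "after Stem 0" element holds the final unpaired base, which the desired list above does not include.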

The thing is, I am not sure how to code this without using a loop.
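
One loop-free possibility, sketched here under the assumption that there are always exactly four stems and that the Stem 0 closing run is as long as its opening run, is a single regular expression with one capture group per piece. The variable names (`n0`, `pattern`, `pieces`) are illustrative only, and strrep() needs R 3.3.0 or later.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Width of the Stem 0 opening; its closing is assumed to have the same width.
n0 <- attr(regexpr("^>+", Str), "match.length")

pattern <- paste0(
  "^(>+)",                             # Stem 0 opening
  strrep("(\\.*)(>+)(\\.*)(<+)", 3),   # before, opening, inside, closing (x3)
  "(\\.*)",                            # after Stem 3
  sprintf("(<{%d})", n0),              # Stem 0 closing, exactly n0 symbols
  "(\\.*)$"                            # unpaired symbols after Stem 0
)

m <- regexec(pattern, Str)[[1]]
stopifnot(m[1] != -1)                  # Str must follow the four-stem layout

# The groups tile the whole string, so their widths give the cut points.
len    <- attr(m, "match.length")[-1]
ends   <- cumsum(len)
starts <- ends - len + 1
pieces <- substring(Seq, starts, ends)
names(pieces) <- c(
  "Stem 0 opening",
  "before Stem 1", "Stem 1 opening", "Stem 1 inside", "Stem 1 closing",
  "between Stem 1 and 2", "Stem 2 opening", "Stem 2 inside", "Stem 2 closing",
  "between Stem 2 and 3", "Stem 3 opening", "Stem 3 inside", "Stem 3 closing",
  "after Stem 3", "Stem 0 closing", "after Stem 0"
)
print(pieces)
```

This returns a flat named character vector rather than a nested list. Fixing the width of the Stem 0 closing group is what separates it from the Stem 3 closing when no "." lies between them.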

Any advice would be most welcome.

