[Seqinr-forum] [R] How to parse a string (by a "new" markup) with R ?

jim holtman jholtman at gmail.com
Tue Mar 16 12:59:06 CET 2010


How are you supposed to interpret the string that is doing the parsing?
Does each sequence have the same number of ">" characters in its opening run
as it has "<" characters in its closing run?  That is what it appears to be,
looking at the way stem 3 is parsed.  You will have to provide a little more
insight on how to interpret the symbols.  Does the parsing always start
with a partial stem 0 as your example shows?  Is there a way of making sure
you have the right sequences when you start?  Is there a chance of an error
in the middle of the string, after which you would have to restart?  How long
are the strings that you want to parse?  Is each one a self-contained
sequence like the one in your example, or do they go on for thousands of
characters?  Is there always at least one '.' between stems?  A full set of
rules for how the parsing should be done would be useful.  Do you have a BNF
grammar for the format?
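
On the balance question, a quick sanity check is possible before any real
parsing.  The sketch below is only that: it assumes ">" opens, "<" closes and
"." is neutral, and is_balanced is a made-up helper name, not something from
seqinr or base R.  Str is copied from the quoted message below.

```r
# Structure string copied from the quoted message.
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: TRUE if every "<" closes an earlier ">" and
# nothing is left open at the end of the string.
is_balanced <- function(struct_str) {
  chars <- strsplit(struct_str, "")[[1]]
  # +1 for an opening symbol, -1 for a closing one, 0 for a dot
  step <- c(">" = 1, "." = 0, "<" = -1)[chars]
  if (anyNA(step)) return(FALSE)   # a symbol outside the assumed alphabet
  depth <- cumsum(step)            # nesting depth after each character
  all(depth >= 0) && sum(step) == 0
}

is_balanced(Str)                 # the full example string
is_balanced(substr(Str, 1, 40))  # a truncated string leaves stems open
```

A check like this only says whether the counts agree; it does not say which
closing run belongs to which stem.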

On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:

> Hello all,
>
> For some work I am doing on RNA, I want to use R to do string parsing that
> (I think) is like a simplistic HTML parsing.
>
>
> For example, let's say we have the following two variables:
>
>    Seq <-
> "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
>    Str <-
> ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
>
> Say that I want to parse "Seq" according to "Str", using the legend here:
>
> Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
>      |     |  |              | |               |     |               ||     |
>      +-----+  +--------------+ +---------------+     +---------------++-----+
>         |        Stem 1            Stem 2                 Stem 3         |
>         |                                                                |
>         +----------------------------------------------------------------+
>                                 Stem 0
>
> Assume that we always have 4 stems (0 to 3), but that the number of letters
> before and after each of them can vary.
>
> The output should be something like the following list structure:
>
>
>    list(
>      "Stem 0 opening" = "GCCTCGA",
>      "before Stem 1" = "TA",
>      "Stem 1" = list(opening = "GCTC",
>                      inside = "AGTTGGGA",
>                      closing = "GAGC"),
>      "between Stem 1 and 2" = "G",
>      "Stem 2" = list(opening = "TACGA",
>                      inside = "CTGAAGA",
>                      closing = "TCGTA"),
>      "between Stem 2 and 3" = "AGGtC",
>      "Stem 3" = list(opening = "ACCAG",
>                      inside = "TTCGATC",
>                      closing = "CTGGT"),
>      "After Stem 3" = "",
>      "Stem 0 closing" = "TCGGGGC"
>    )
>
>
> I don't have any experience with programming a parser, and would like
> advice on what strategy to use when programming something like this (and
> any recommended R commands to use).
>
>
> What I was thinking of is to first get rid of the "Stem 0", then go through
> the inner string with a recursive function (let's call it "seperate.stem")
> that each time will split the string into:
> 1. before stem
> 2. opening stem
> 3. inside stem
> 4. closing stem
> 5. after stem
>
> Where the "after stem" will then be recursively entered into the same
> function ("seperate.stem")
>
> The thing is that I am not sure how to code this without using a loop.
>
> Any advice would be most welcome.
>
>
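
If the rule turns out to be that every closing run pairs with the most recent
unclosed opening run (which is what the stem 3 / stem 0 boundary in the
example suggests), the splitting could be sketched as below.  That pairing
rule is an assumption about the format, not a confirmed one, and
split_by_structure is a hypothetical name.  It does use a loop, but over the
runs returned by rle() rather than over single characters, so it stays short.

```r
# Both strings copied from the quoted message above.
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: cut seq_str into pieces following the runs of
# struct_str. A "<" run that closes several stems at once is divided
# between them, innermost stem first (assumed rule).
split_by_structure <- function(seq_str, struct_str) {
  stopifnot(nchar(seq_str) == nchar(struct_str))
  runs <- rle(strsplit(struct_str, "")[[1]])
  open <- integer(0)    # stack: remaining lengths of unclosed ">" runs
  sym  <- character(0)  # symbol of each output piece
  len  <- integer(0)    # length of each output piece
  for (i in seq_along(runs$lengths)) {
    s <- runs$values[i]
    n <- runs$lengths[i]
    if (s == ">") open <- c(open, n)
    if (s == "<") {
      while (n > 0 && length(open) > 0) {
        top <- length(open)
        k <- min(n, open[top])          # how much the innermost stem absorbs
        sym <- c(sym, "<")
        len <- c(len, k)
        n <- n - k
        open[top] <- open[top] - k
        if (open[top] == 0) open <- open[-top]
      }
      if (n > 0) stop("unmatched '<' in structure string")
    } else {
      sym <- c(sym, s)                  # ">" and "." runs are kept whole
      len <- c(len, n)
    }
  }
  if (length(open) > 0) stop("unclosed '>' in structure string")
  end <- cumsum(len)
  data.frame(symbol = sym,
             piece  = substring(seq_str, end - len + 1, end),
             stringsAsFactors = FALSE)
}

pieces <- split_by_structure(Seq, Str)
print(pieces)
```

For the example, rows 13 and 14 come out as "CTGGT" and "TCGGGGC", so the
fused run of twelve "<" is divided between stem 3 and stem 0.  Turning the
flat table into the nested list from the quoted message would be a separate
step.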



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?