[Seqinr-forum] [R] How to parse a string (by a "new" markup) with R ?
jim holtman
jholtman at gmail.com
Tue Mar 16 12:59:06 CET 2010
How are you supposed to interpret the string that is doing the parsing?
Does each sequence have the same number of ">" characters in its opening run
as it has "<" characters in its closing run? That is what it appears to be,
looking at the way stem 3 is parsed. You will have to provide a little more
insight on how to interpret the symbols. Does the parsing always start
with a partial stem 0, as your example shows? Is there a way of making sure
you have the right sequences when you start? Is there a chance of an error in
the middle of the string that you would have to restart from? How long are the
strings that you want to parse? Is each one a self-contained sequence like
the one in your example, or do they go on for thousands of characters? Is
there always at least one '.' between stems? A full set of rules for how
the parsing should be done would be useful. Do you have a BNF grammar for
the notation?
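
Some of these questions can be checked directly in R. The following is only a
minimal sketch, assuming the Str value from the quoted message below; the names
chars, runs, run_table, depth and balanced are made up for illustration. It
tabulates the runs of identical symbols with rle() and tests whether the ">"
and "<" counts agree and whether any "<" appears before a matching ">".

```r
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Split the structure string into single characters.
chars <- strsplit(Str, "")[[1]]

# Run-length encode it: one row per run of identical symbols.
runs <- rle(chars)
run_table <- data.frame(symbol = runs$values,
                        length = runs$lengths,
                        end    = cumsum(runs$lengths),
                        stringsAsFactors = FALSE)
run_table$start <- run_table$end - run_table$length + 1

# Nesting depth after each character: +1 for ">", -1 for "<", 0 for ".".
depth <- cumsum((chars == ">") - (chars == "<"))

# Balanced means equal counts and never closing more than has been opened.
n_open   <- sum(chars == ">")
n_close  <- sum(chars == "<")
balanced <- n_open == n_close && all(depth >= 0)

print(run_table)
print(balanced)
```

A table like this also shows at a glance whether two stems are ever adjacent
without a '.' between them, since a single long run of "<" then has to close
more than one stem.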
On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
> Hello all,
>
> For some work I am doing on RNA, I want to use R to do string parsing that
> (I think) is like a simplistic HTML parsing.
>
>
> For example, let's say we have the following two variables:
>
> Seq <-
> "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> Str <-
> ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
>
> Say that I want to parse "Seq" according to "Str", using the legend here:
>
> Seq:
> GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> Str:
> >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
>
> |     |  |              | |               |     |               ||     |
> +-----+  +--------------+ +---------------+     +---------------++-----+
>    |          Stem 1           Stem 2                Stem 3         |
>    |                                                                |
>    +----------------------------------------------------------------+
>                                  Stem 0
>
> Assume that we always have 4 stems (0 to 3), but that the number of letters
> before and after each of them can vary.
>
> The output should be something like the following list structure:
>
>
> list(
> "Stem 0 opening" = "GCCTCGA",
> "before Stem 1" = "TA",
> "Stem 1" = list(opening = "GCTC",
> inside = "AGTTGGGA",
> closing = "GAGC"
> ),
> "between Stem 1 and 2" = "G",
> "Stem 2" = list(opening = "TACGA",
> inside = "CTGAAGA",
> closing = "TCGTA"
> ),
> "between Stem 2 and 3" = "AGGtC",
> "Stem 3" = list(opening = "ACCAG",
> inside = "TTCGATC",
> closing = "CTGGT"
> ),
> "After Stem 3" = "",
> "Stem 0 closing" = "TCGGGGC"
> )
>
>
> I don't have any experience with programming a parser, and would like
> advice on what strategy to use when programming something like this (and
> any recommended R commands to use).
>
>
> What I was thinking of is to first get rid of the "Stem 0", then go through
> the inner string with a recursive function (let's call it "seperate.stem")
> that each time will split the string into:
> 1. before stem
> 2. opening stem
> 3. inside stem
> 4. closing stem
> 5. after stem
>
> Where the "after stem" will then be recursively entered into the same
> function ("seperate.stem")
>
> The thing is that I am not sure how to try and do this coding without using
> a loop.
>
> Any advice would be most welcome.
>
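
One possible way to approach the splitting described in the quoted message,
offered as a sketch rather than a definitive answer. The function name
split_stems and its column names are hypothetical. It assumes that a stem
closes with as many "<" as it opened with ">" (which has not been confirmed in
this thread) and that an unbroken run of ">" opens a single stem. Instead of
recursion it loops over runs of symbols, not over single characters, and keeps
a stack of the stems that are still open.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical helper: split Seq into segments according to Str.
# Stems are numbered 0, 1, 2, ... in the order in which they open.
split_stems <- function(Seq, Str) {
  stopifnot(nchar(Seq) == nchar(Str))
  runs <- rle(strsplit(Str, "")[[1]])

  kind <- character(0)   # "opening", "closing" or "unpaired"
  stem <- integer(0)     # stem number, NA for unpaired segments
  len  <- integer(0)     # segment length in characters
  stack_id  <- integer(0)  # stems opened but not yet fully closed
  stack_len <- integer(0)  # how many "<" each open stem still needs
  next_id <- 0L

  for (i in seq_along(runs$values)) {
    sym <- runs$values[i]
    n   <- runs$lengths[i]
    if (sym == ">") {
      # A run of ">" opens a new stem and is pushed onto the stack.
      stack_id  <- c(stack_id, next_id)
      stack_len <- c(stack_len, n)
      kind <- c(kind, "opening"); stem <- c(stem, next_id); len <- c(len, n)
      next_id <- next_id + 1L
    } else if (sym == "<") {
      # A run of "<" closes the innermost open stem first; anything left
      # over closes the next stem out (assumed equal-length rule).
      while (n > 0) {
        if (length(stack_id) == 0) stop("unbalanced '<' in Str")
        top <- length(stack_id)
        k <- min(n, stack_len[top])
        kind <- c(kind, "closing"); stem <- c(stem, stack_id[top]); len <- c(len, k)
        stack_len[top] <- stack_len[top] - k
        n <- n - k
        if (stack_len[top] == 0) {
          stack_id  <- stack_id[-top]
          stack_len <- stack_len[-top]
        }
      }
    } else {
      # Any other symbol (here ".") is an unpaired stretch.
      kind <- c(kind, "unpaired"); stem <- c(stem, NA_integer_); len <- c(len, n)
    }
  }
  if (length(stack_id) > 0) stop("unclosed '>' in Str")

  last  <- cumsum(len)
  first <- last - len + 1L
  data.frame(kind, stem, first, last,
             text = substring(Seq, first, last),
             stringsAsFactors = FALSE)
}

res <- split_stems(Seq, Str)
print(res)
```

The result is a flat table in sequence order rather than the nested list shown
in the quoted message; an empty piece such as "After Stem 3" simply has no row.
Regrouping the rows by stem number into that list would be a separate step.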
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?