[Seqinr-forum] [R] How to parse a string (by a "new" markup) with R ?
Tal Galili
tal.galili at gmail.com
Tue Mar 16 15:00:18 CET 2010
Hi Jim,
Thanks for the questions, here are my answers:
*Q: Does each sequence have the same number of ">>>>" for the opening
sequence as it does for "<<<<" on the ending sequence? *
A: Yes
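
Since every ">" is supposed to have a matching "<", a quick sanity check before any parsing could simply compare the two counts. The sketch below is only an illustration; the helper name is_balanced is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE when a structure string contains as many
# opening ">" symbols as closing "<" symbols.
is_balanced <- function(struct_str) {
  chars <- strsplit(struct_str, "")[[1]]   # one element per character
  sum(chars == ">") == sum(chars == "<")
}

Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
is_balanced(Str)
```

Equal counts do not guarantee that the brackets nest correctly, so this is a necessary check rather than a sufficient one.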
*Q: Does the parsing always start with a partial stem 0 as your example
shows? *
A: No. Sometimes it will start with a few "."
*Q: Is there a way of making sure you have the right sequences when you
start? *
A: I am not sure I understand what you mean.
*Q: Is there a chance of error in the middle of the string that you have to
restart from?*
A: Sadly, yes. In which case, I'll need to ignore one of the inner stems...
*Q: How long are these strings that you want to parse? *
A: Each string has between 60 and 150 characters (and I have tens
of thousands of them...)
*Q: Is each one a self contained sequence like you show in your example, or
do they go on for thousands of characters? *
A: each sequence is self contained.
*Q: Is there always at least one '.' between stems? *
A: No.
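
One consequence of stems touching each other can be seen by run-length encoding the structure string from the original post: the closing of Stem 3 and the closing of Stem 0 merge into a single run of "<". The sketch below uses only base R (strsplit and rle) to show this.

```r
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Collapse the string into runs of identical symbols.
runs <- rle(strsplit(Str, "")[[1]])
data.frame(symbol = runs$values, length = runs$lengths)

# The last run of "<" has length 12: 5 symbols closing Stem 3 followed
# directly by 7 symbols closing Stem 0, with no "." in between.
runs$lengths[13]
```

So splitting on runs alone is not enough; the length of each opening run is needed to decide where one closing ends and the next begins.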
*Q: A full set of rules as to how the parsing should be done would be
useful.*
A: I agree. But since I don't have even a basic idea of how to start coding
this, I thought I would first get some help with the beginning, then try to
handle the other cases as they come up before turning back for help.
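
One possible starting point, offered only as a rough sketch: keep a stack of opening runs and, for each closing run, pop opening runs off the stack until the closing run is used up. The function name parse_stems and its argument names are hypothetical. The sketch assumes the brackets are balanced and that every closing run is made of whole opening runs (true for the example, not for strings with errors). It returns the stems in the order they close and does not extract the unpaired stretches between stems.

```r
# Hypothetical sketch: pair each run of "<" with the most recent unmatched
# run(s) of ">" and cut the sequence into opening / inside / closing parts.
parse_stems <- function(seq_str, struct_str) {
  runs   <- rle(strsplit(struct_str, "")[[1]])
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  open   <- list()   # stack of c(start, end) for unmatched ">" runs
  stems  <- list()
  for (k in seq_along(runs$values)) {
    if (runs$values[k] == ">") {
      open[[length(open) + 1]] <- c(starts[k], ends[k])
    } else if (runs$values[k] == "<") {
      pos <- starts[k]
      while (pos <= ends[k]) {
        if (length(open) == 0) stop("unmatched '<' at position ", pos)
        o <- open[[length(open)]]       # most recent opening run
        open[[length(open)]] <- NULL    # pop it from the stack
        len <- o[2] - o[1] + 1
        if (pos + len - 1 > ends[k]) stop("closing run too short at ", pos)
        stems[[length(stems) + 1]] <- list(
          opening = substr(seq_str, o[1], o[2]),
          inside  = substr(seq_str, o[2] + 1, pos - 1),
          closing = substr(seq_str, pos, pos + len - 1)
        )
        pos <- pos + len
      }
    }
  }
  stems
}

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
stems <- parse_stems(Seq, Str)
str(stems[[1]])   # Stem 1; Stem 0 closes last, so it is stems[[4]]
```

For tens of thousands of strings, something like Map(parse_stems, seqs, strs) over two character vectors would apply the same function to each pair; strings that contain errors would need the stop() calls replaced by whatever recovery rule is decided on.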
*Q: Do you have the BNF syntax for parsing?*
A: No. Your e-mail is the first time I have come across it
(http://en.wikipedia.org/wiki/Backus–Naur_Form).
Thanks for the help,
Tal
----------------Contact
Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------
On Tue, Mar 16, 2010 at 1:59 PM, jim holtman <jholtman at gmail.com> wrote:
> How are you supposed to interpret the string that is doing the parsing?
> Does each sequence have the same number of ">>>>" for the opening sequence
> as it does for "<<<<" on the ending sequence? That is what it appears to be,
> looking at the way stem 3 is parsed. You will have to provide a little more
> insight on how to interpret the symbols. Does the parsing always start
> with a partial stem 0 as your example shows? Is there a way of making sure
> you have the right sequences when you start? Is there a chance of error in
> the middle of the string that you have to restart from? How long are these
> strings that you want to parse? Is each one a self contained sequence like
> you show in your example, or do they go on for thousands of characters? Is
> there always at least one '.' between stems? A full set of rules as to how
> the parsing should be done would be useful. Do you have the BNF syntax for
> parsing?
>
> On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
>
>> Hello all,
>>
>> For some work I am doing on RNA, I want to use R to do string parsing that
>> (I think) is like a simplistic HTML parsing.
>>
>>
>> For example, let's say we have the following two variables:
>>
>> Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
>> Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
>>
>> Say that I want to parse "Seq" according to "Str", using the legend below:
>>
>> Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
>> Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
>>      |     |  |              | |               |     |               ||     |
>>      +-----+  +--------------+ +---------------+     +---------------++-----+
>>         |          Stem 1           Stem 2                Stem 3         |
>>         |                                                                |
>>         +----------------------------------------------------------------+
>>                                       Stem 0
>>
>> Assume that we always have 4 stems (0 to 3), but that the number of letters
>> before and after each of them can vary.
>>
>> The output should be something like the following list structure:
>>
>>
>> list(
>>   "Stem 0 opening" = "GCCTCGA",
>>   "before Stem 1" = "TA",
>>   "Stem 1" = list(opening = "GCTC",
>>                   inside = "AGTTGGGA",
>>                   closing = "GAGC"),
>>   "between Stem 1 and 2" = "G",
>>   "Stem 2" = list(opening = "TACGA",
>>                   inside = "CTGAAGA",
>>                   closing = "TCGTA"),
>>   "between Stem 2 and 3" = "AGGtC",
>>   "Stem 3" = list(opening = "ACCAG",
>>                   inside = "TTCGATC",
>>                   closing = "CTGGT"),
>>   "After Stem 3" = "",
>>   "Stem 0 closing" = "TCGGGGC"
>> )
>>
>>
>> I don't have any experience with programming a parser, and would like
>> advice on what strategy to use when programming something like this (and
>> any recommended R commands to use).
>>
>>
>> What I was thinking of is to first get rid of the "Stem 0", then go
>> through
>> the inner string with a recursive function (let's call it "seperate.stem")
>> that each time will split the string into:
>> 1. before stem
>> 2. opening stem
>> 3. inside stem
>> 4. closing stem
>> 5. after stem
>>
>> Where the "after stem" will then be recursively entered into the same
>> function ("seperate.stem")
>>
>> The thing is that I am not sure how to code this without using a loop.
>>
>> Any advice would be most welcome.
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>