[Seqinr-forum] [R] How to parse a string (by a "new" markup) with R ?

Tue Mar 16 23:34:31 CET 2010

A version using regular expressions, with many calls to regexpr() and substr(), is attached.
In the end, everything is wrapped up in a splitSeq() function.
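
Since the list archive scrubs attachments, here is a minimal sketch of the regexpr()/substr() idea. It is not the attached splitSeq(): the name split_by_structure() and its helpers take() and stem() are made up for illustration. It assumes, as Gabor does below, exactly four stems (0 to 3), and additionally that every stem closes with as many < as it opened with >.

```r
Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Hypothetical sketch: walk along Str with regexpr() and cut the
# corresponding letters out of Seq with substr().
split_by_structure <- function(Seq, Str) {
  pos <- 1  # current position in both Str and Seq

  # Take the run matching `pattern` at the current position (at most
  # max_n characters) and return the corresponding piece of Seq.
  take <- function(pattern, max_n = Inf) {
    rest <- substr(Str, pos, nchar(Str))
    n <- attr(regexpr(pattern, rest), "match.length")
    n <- min(max(n, 0), max_n)  # a failed match counts as an empty run
    piece <- substr(Seq, pos, pos + n - 1)
    pos <<- pos + n
    piece
  }

  # One inner stem: opening >'s, dots inside, and the same number of <'s.
  stem <- function() {
    opening <- take("^>*")
    inside  <- take("^[.]*")
    closing <- take("^<*", max_n = nchar(opening))
    list(opening = opening, inside = inside, closing = closing)
  }

  out <- list()
  out[["Stem 0 opening"]]       <- take("^>*")
  out[["before Stem 1"]]        <- take("^[.]*")
  out[["Stem 1"]]               <- stem()
  out[["between Stem 1 and 2"]] <- take("^[.]*")
  out[["Stem 2"]]               <- stem()
  out[["between Stem 2 and 3"]] <- take("^[.]*")
  out[["Stem 3"]]               <- stem()
  out[["After Stem 3"]]         <- take("^[.]*")
  out[["Stem 0 closing"]]       <- take("^<*", max_n = nchar(out[["Stem 0 opening"]]))
  out
}

out <- split_by_structure(Seq, Str)
str(out)
```

Capping each closing run at the length of its opening run is what separates the closing of Stem 3 from the closing of Stem 0 when no dots lie between them; empty runs simply come back as "".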

Andrej

--
Andrej Blejec
National Institute of Biology
Vecna pot 111 POB 141
SI-1000 Ljubljana
SLOVENIA
e-mail: andrej.blejec at nib.si
URL: http://ablejec.nib.si 
tel: + 386 (0)59 232 789
fax: + 386 1 241 29 80
--------------------------
Local Organizer of ICOTS-8
International Conference on Teaching Statistics 
http://icots8.org

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Gabor Grothendieck
> Sent: Tuesday, March 16, 2010 3:24 PM
> To: Tal Galili
> Cc: r-help at r-project.org; seqinr-forum at r-forge.wu-wien.ac.at
> Subject: Re: [R] How to parse a string (by a "new" markup) with R ?
> 
> We show how to use the gsubfn package to parse this.
> 
> The rules are not entirely clear so we will assume the following:
> 
> - there is a fixed template for the output which is the same as your
> output but possibly with different character strings filled in.  This
> implies, for example, that there are exactly Stem0, Stem1, Stem2 and
> Stem3 and no fewer or more stems.
> 
> - the sequence always starts with the open of Stem0, at least one dot
> and the open of Stem1.  There are no dots prior to the open of Stem0.
> This seems to be implicit in your sample output since there is no zero
> length string in your sample output corresponding to dots prior to
> Stem0.
> 
> - Stem0 closes with the same number of < as there are > to open it
> 
> You can modify this yourself to take into account the actual rules
> whatever they are.
> 
> We first calculate k, the number of leading >'s, using strapply.
> 
> Then we replace the leading k >'s with }'s and the trailing k <'s with
> {'s giving us Str3:
> 
> 
> "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."
> 
> We again use strapply, this time to get the lengths of the runs.  Note
> that zero length runs are possible so we cannot, for example, use rle
> for this.  For example there is a zero length run of dots between the
> last < and the first {.  read.fwf is used to actually parse out the
> strings using the lengths we just calculated.
> 
> Finally we fill in the template using relist.
> 
> # inputs
> 
> Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> template <-
>   list(
>     "Stem 0 opening" = "",
>     "before Stem 1" = "",
>     "Stem 1" = list(opening = "",
>     inside = "",
>     closing = ""
>     ),
>     "between Stem 1 and 2" = "",
>     "Stem 2" = list(opening = "",
>     inside = "",
>     closing = ""
>     ),
>     "between Stem 2 and 3" = "",
>     "Stem 3" = list(opening = "",
>     inside = "",
>     closing = ""
>     ),
>     "After Stem 3" = "",
>     "Stem 0 closing" = ""
>    )
> 
> # processing
> 
> # create string made by repeating string s k times followed by more
> reps <- function(s, k, more = "") {
> 	paste(paste(rep(s, k), collapse = ""), more, sep = "")
> }
> 
> library(gsubfn)
> k <- nchar(strapply(Str, "^>+", c)[[1]])
> Str2 <- sub("^>+", reps("}", k), Str)
> Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)
> 
> pat <- "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
> lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)
> tokens <- unlist(read.fwf(textConnection(Seq), lens, as.is = TRUE))
> closeAllConnections()
> tokens[is.na(tokens)] <- ""
> out <- relist(tokens, template)
> out
> 
> 
> Here is the str of the output for your sample input:
> 
> > str(out)
> List of 9
>  $ Stem 0 opening      : chr "GCCTCGA"
>  $ before Stem 1       : chr "TA"
>  $ Stem 1              :List of 3
>   ..$ opening: chr "GCTC"
>   ..$ inside : chr "AGTTGGGA"
>   ..$ closing: chr "GAGC"
>  $ between Stem 1 and 2: chr "G"
>  $ Stem 2              :List of 3
>   ..$ opening: chr "TACGA"
>   ..$ inside : chr "CTGAAGA"
>   ..$ closing: chr "TCGTA"
>  $ between Stem 2 and 3: chr "AGGtC"
>  $ Stem 3              :List of 3
>   ..$ opening: chr "ACCAG"
>   ..$ inside : chr "TTCGATC"
>   ..$ closing: chr "CTGGT"
>  $ After Stem 3        : chr ""
>  $ Stem 0 closing      : chr "TCGGGGC"
> 
> 
> 
> On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
> > Hello all,
> >
> > For some work I am doing on RNA, I want to use R to do string parsing
> > that (I think) is like a simplistic HTML parsing.
> >
> >
> > For example, let's say we have the following two variables:
> >
> >    Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> >    Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> >
> > Say that I want to parse "Seq" according to "Str", using the legend here:
> >
> > Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> > Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
> >      |     |  |              | |               |     |               ||     |
> >      +-----+  +--------------+ +---------------+     +---------------++-----+
> >         |          Stem 1           Stem 2                Stem 3         |
> >         |                                                                |
> >         +----------------------------------------------------------------+
> >                                       Stem 0
> > Assume that we always have 4 stems (0 to 3), but that the number of
> > letters before and after each of them can vary.
> >
> > The output should be something like the following list structure:
> >
> >
> >    list(
> >     "Stem 0 opening" = "GCCTCGA",
> >     "before Stem 1" = "TA",
> >     "Stem 1" = list(opening = "GCTC",
> >     inside = "AGTTGGGA",
> >     closing = "GAGC"
> >     ),
> >     "between Stem 1 and 2" = "G",
> >     "Stem 2" = list(opening = "TACGA",
> >     inside = "CTGAAGA",
> >     closing = "TCGTA"
> >     ),
> >     "between Stem 2 and 3" = "AGGtC",
> >     "Stem 3" = list(opening = "ACCAG",
> >     inside = "TTCGATC",
> >     closing = "CTGGT"
> >     ),
> >     "After Stem 3" = "",
> >     "Stem 0 closing" = "TCGGGGC"
> >    )
> >
> >
> > I don't have any experience with programming a parser, and would like
> > advice on what strategy to use when programming something like this
> > (and any recommended R commands to use).
> >
> >
> > What I was thinking of is to first get rid of the "Stem 0", then go
> > through the inner string with a recursive function (let's call it
> > "seperate.stem") that each time will split the string into:
> > 1. before stem
> > 2. opening stem
> > 3. inside stem
> > 4. closing stem
> > 5. after stem
> >
> > Where the "after stem" will then be recursively entered into the same
> > function ("seperate.stem")
> >
> > The thing is that I am not sure how to do this coding without using
> > a loop.
> >
> > Any advice would be most welcome.
> >
> >
> > ----------------Contact Details:-------------------------------------------------------
> > Contact me: Tal.Galili at gmail.com |  972-52-7275845
> > Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
> > ----------------------------------------------------------------------------------------------
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parsingRNA.pdf
Type: application/octet-stream
Size: 115065 bytes
Desc: parsingRNA.pdf
Url : http://lists.r-forge.r-project.org/pipermail/seqinr-forum/attachments/20100316/2d7c77a7/attachment-0002.obj 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: parsingRNA.R
Type: application/octet-stream
Size: 6847 bytes
Desc: parsingRNA.R
Url : http://lists.r-forge.r-project.org/pipermail/seqinr-forum/attachments/20100316/2d7c77a7/attachment-0003.obj