[Seqinr-forum] [R] How to parse a string (by a "new" markup) with R ?

Tal Galili tal.galili at gmail.com
Wed Mar 17 10:31:44 CET 2010


Wow, Thank you very much Andrej!

Tal

----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------




2010/3/17 Andrej Blejec <Andrej.Blejec at nib.si>

> A version using regular expressions, with a lot of regexpr() and substr()
> calls, is attached.
> Finally, everything is packed into a splitSeq() function.
>
> Andrej
>
> --
> Andrej Blejec
> National Institute of Biology
> Vecna pot 111 POB 141
> SI-1000 Ljubljana
> SLOVENIA
> e-mail: andrej.blejec at nib.si
> URL: http://ablejec.nib.si
> tel: + 386 (0)59 232 789
> fax: + 386 1 241 29 80
> --------------------------
> Local Organizer of ICOTS-8
> International Conference on Teaching Statistics
> http://icots8.org
>
>
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
> > On Behalf Of Gabor Grothendieck
> > Sent: Tuesday, March 16, 2010 3:24 PM
> > To: Tal Galili
> > Cc: r-help at r-project.org; seqinr-forum at r-forge.wu-wien.ac.at
> > Subject: Re: [R] How to parse a string (by a "new" markup) with R ?
> >
> > We show how to use the gsubfn package to parse this.
> >
> > The rules are not entirely clear so we will assume the following:
> >
> > - there is a fixed template for the output which is the same as your
> > output but possibly with different character strings filled in.  This
> > implies, for example, that there are exactly Stem0, Stem1, Stem2 and
> > Stem3 and no fewer or more stems.
> >
> > - the sequence always starts with the open of Stem0, at least one dot
> > and the open of Stem1.  There are no dots prior to the open of Stem0.
> > This seems to be implicit in your sample output since there is no zero
> > length string in your sample output corresponding to dots prior to
> > Stem0.
> >
> > - Stem0 closes with the same number of < as there are > to open it
> >
> > You can modify this yourself to take into account the actual rules
> > whatever they are.
> >
> > We first calculate k, the number of leading >'s, using strapply.
> >
> > Then we replace the leading k >'s with }'s and the trailing k <'s with
> > {'s giving us Str3:
> >
> >
> > "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."
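For readers without gsubfn, the relabelling step described above can be sketched in base R. This is only an illustration under the same assumptions as the post (Stem 0 opens with k >'s and closes with the last k <'s); it uses regexpr() and strrep() (R >= 3.3.0) in place of strapply and the reps() helper.

```r
# Base-R sketch of the relabelling step (no gsubfn needed).
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# k = length of the leading run of ">" (0 if there is none)
k <- attr(regexpr("^>*", Str), "match.length")

# Relabel the first k ">" as "}" ...
Str2 <- sub("^>+", strrep("}", k), Str)

# ... and the last k "<" (those followed only by non-"<" characters) as "{".
close_pat <- paste0("<{", k, "}([^<]*)$")
Str3 <- sub(close_pat, paste0(strrep("{", k), "\\1"), Str2)
Str3
```

The relabelled string has the same length as Str, so positions in it still line up with positions in Seq.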
> >
> > We again use strapply, this time to get the lengths of the runs.  Note that
> > zero-length runs are possible so we cannot, for example, use rle for this.
> > For example there is a zero-length run of dots between the last < and the
> > first {.  read.fwf is used to actually parse out the strings using the
> > lengths we just calculated.
> >
> > Finally we fill in the template using relist.
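The point about zero-length runs can be seen in a small base-R sketch. It is an illustration only: regexec()/regmatches() stand in for strapply, and the braces are written as [}] and [{] because R's default regex engine may reject a bare { in a pattern.

```r
Str3 <- "}}}}}}}..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<{{{{{{{."

# rle() only reports runs that actually occur, so the empty run of dots
# between the last "<" and the first "{" is silently missing.
runs <- rle(strsplit(Str3, "")[[1]])
length(runs$lengths)

# A fixed pattern with one group per expected field keeps empty fields as "".
pat <- paste0("^([}]*)([.]*)",
              strrep("(>*)([.]*)(<*)([.]*)", 3),
              "([{]*)([.]*)$")
groups <- regmatches(Str3, regexec(pat, Str3))[[1]][-1]  # drop the full match
lens <- nchar(groups)
lens
```

Here rle() finds 15 runs while the pattern yields 16 field widths, one of which is 0, so only the second result lines up with a fixed template.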
> >
> > # inputs
> >
> > Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> > Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> > template <-
> >   list(
> >     "Stem 0 opening" = "",
> >     "before Stem 1" = "",
> >     "Stem 1" = list(opening = "",
> >     inside = "",
> >     closing = ""
> >     ),
> >     "between Stem 1 and 2" = "",
> >     "Stem 2" = list(opening = "",
> >     inside = "",
> >     closing = ""
> >     ),
> >     "between Stem 2 and 3" = "",
> >     "Stem 3" = list(opening = "",
> >     inside = "",
> >     closing = ""
> >     ),
> >     "After Stem 3" = "",
> >     "Stem 0 closing" = ""
> >    )
> >
> > # processing
> >
> > # create string made by repeating string s k times followed by more
> > reps <- function(s, k, more = "") {
> >       paste(paste(rep(s, k), collapse = ""), more, sep = "")
> > }
> >
> > library(gsubfn)
> > k <- nchar(strapply(Str, "^>+", c)[[1]])
> > Str2 <- sub("^>+", reps("}", k), Str)
> > Str3 <- sub(reps("<", k, "([^<]*)$"), reps("{", k, "\\1"), Str2)
> >
> > pat <- "^(}*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)(>*)([.]*)(<*)([.]*)({*)([.]*)$"
> > lens <- sapply(strapply(Str3, pat, c)[[1]], nchar)
> > con <- textConnection(Seq)
> > tokens <- unlist(read.fwf(con, lens, as.is = TRUE))
> > close(con)  # close only this connection rather than all open ones
> > tokens[is.na(tokens)] <- ""
> > out <- relist(tokens, template)
> > out
> >
> >
> > Here is the str of the output for your sample input:
> >
> > > str(out)
> > List of 9
> >  $ Stem 0 opening      : chr "GCCTCGA"
> >  $ before Stem 1       : chr "TA"
> >  $ Stem 1              :List of 3
> >   ..$ opening: chr "GCTC"
> >   ..$ inside : chr "AGTTGGGA"
> >   ..$ closing: chr "GAGC"
> >  $ between Stem 1 and 2: chr "G"
> >  $ Stem 2              :List of 3
> >   ..$ opening: chr "TACGA"
> >   ..$ inside : chr "CTGAAGA"
> >   ..$ closing: chr "TCGTA"
> >  $ between Stem 2 and 3: chr "AGGtC"
> >  $ Stem 3              :List of 3
> >   ..$ opening: chr "ACCAG"
> >   ..$ inside : chr "TTCGATC"
> >   ..$ closing: chr "CTGGT"
> >  $ After Stem 3        : chr ""
> >  $ Stem 0 closing      : chr "TCGGGGC"
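As a cross-check on the output above, the same cut can be sketched with substring() and cumulative sums instead of read.fwf. The widths below are copied from the sample (they are the field lengths implied by the output, plus the final unpaired base), so this is a sanity check rather than a replacement for the code in the post.

```r
Seq  <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
lens <- c(7, 2, 4, 8, 4, 1, 5, 7, 5, 5, 5, 7, 5, 0, 7, 1)

ends   <- cumsum(lens)
starts <- ends - lens + 1
tokens <- substring(Seq, starts, ends)   # zero-width fields come back as ""

# Pasting the pieces back together must reproduce the input exactly.
stopifnot(identical(paste(tokens, collapse = ""), Seq))
tokens
```

Zero-width fields need no NA handling here, because substring() returns "" whenever the start is past the end.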
> >
> >
> >
> > On Tue, Mar 16, 2010 at 6:10 AM, Tal Galili <tal.galili at gmail.com> wrote:
> > > Hello all,
> > >
> > > For some work I am doing on RNA, I want to use R to do string parsing that
> > > (I think) is like a simplistic HTML parsing.
> > >
> > >
> > > For example, let's say we have the following two variables:
> > >
> > >    Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
> > >    Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."
> > >
> > > Say that I want to parse "Seq" according to "Str", using the legend here:
> > >
> > > Seq: GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA
> > > Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
> > >      |     |  |              | |               |     |               ||     |
> > >      +-----+  +--------------+ +---------------+     +---------------++-----+
> > >         |        Stem 1            Stem 2                 Stem 3         |
> > >         |                                                                |
> > >         +----------------------------------------------------------------+
> > >                                       Stem 0
> > >
> > > Assume that we always have 4 stems (0 to 3), but that the number of letters
> > > before and after each of them can vary.
> > >
> > > The output should be something like the following list structure:
> > >
> > >
> > >    list(
> > >     "Stem 0 opening" = "GCCTCGA",
> > >     "before Stem 1" = "TA",
> > >     "Stem 1" = list(opening = "GCTC",
> > >     inside = "AGTTGGGA",
> > >     closing = "GAGC"
> > >     ),
> > >     "between Stem 1 and 2" = "G",
> > >     "Stem 2" = list(opening = "TACGA",
> > >     inside = "CTGAAGA",
> > >     closing = "TCGTA"
> > >     ),
> > >     "between Stem 2 and 3" = "AGGtC",
> > >     "Stem 3" = list(opening = "ACCAG",
> > >     inside = "TTCGATC",
> > >     closing = "CTGGT"
> > >     ),
> > >     "After Stem 3" = "",
> > >     "Stem 0 closing" = "TCGGGGC"
> > >    )
> > >
> > >
> > > I don't have any experience with programming a parser, and would like
> > > advice on what strategy to use when programming something like this (and
> > > any recommended R commands to use).
> > >
> > >
> > > What I was thinking of is to first get rid of the "Stem 0", then go through
> > > the inner string with a recursive function (let's call it "seperate.stem")
> > > that each time will split the string into:
> > > 1. before stem
> > > 2. opening stem
> > > 3. inside stem
> > > 4. closing stem
> > > 5. after stem
> > >
> > > Where the "after stem" will then be recursively entered into the same
> > > function ("seperate.stem")
> > >
> > > The thing is that I am not sure how to code this without using a loop.
> > >
> > > Any advice would be most welcome.
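The recursive plan above could be sketched roughly as follows, in base R and without an explicit loop. The name split_stems is hypothetical, and the sketch assumes that inner stems are not nested and that each one looks like dots, a run of >, dots, a run of <.

```r
# Hypothetical helper: peel one stem off the front, then recurse on the rest.
split_stems <- function(seq, str) {
  m <- regmatches(str, regexec("^([.]*)(>+)([.]*)(<+)(.*)$", str))[[1]]
  if (length(m) == 0) {
    # No further stem: whatever is left is the trailing stretch.
    return(list(after = seq))
  }
  widths <- nchar(m[-1])            # before, opening, inside, closing, rest
  ends   <- cumsum(widths)
  parts  <- substring(seq, ends - widths + 1, ends)
  stem <- list(before = parts[1], opening = parts[2],
               inside = parts[3], closing = parts[4])
  # Recurse on the remainder instead of looping.
  c(list(stem), split_stems(parts[5], m[6]))
}

Seq <- "GCCTCGATAGCTCAGTTGGGAGAGCGTACGACTGAAGATCGTAAGGtCACCAGTTCGATCCTGGTTCGGGGCA"
Str <- ">>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<."

# Strip Stem 0: its opening is the leading run of ">", and its closing is
# assumed to be the last k "<" characters.
k <- attr(regexpr("^>*", Str), "match.length")
last_close <- max(gregexpr("<", Str, fixed = TRUE)[[1]])
from <- k + 1
to   <- last_close - k
res <- split_stems(substring(Seq, from, to), substring(Str, from, to))
str(res)
```

Each list element holds one stem together with the letters before it, and the final element holds whatever follows the last stem (empty for this sample).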
> > >
> > >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
>