[datatable-help] datatable-help Digest, Vol 17, Issue 10

Chris Neff caneff at gmail.com
Fri Jul 15 20:29:22 CEST 2011


Just chiming in to say something similar to colwise from plyr would be quite
nice. You could just carry around a vector of variable names, then do
something DT[ ,colwise( f, var_names), by=by_names ].


On 15 July 2011 14:06, Dennis Murphy <djmuser at gmail.com> wrote:

> On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
> > Hi Dennis,
> >
> > I didn't see your post before I sent my latest reply.
> >
> > Nice detective work!
>
> Thanks, Steve. I just followed my nose and the docs, which I have
> conveniently kept in a small binder for such occasions :) Like JV, I
> don't use data.table every day, so some of its idiosyncracies get
> cobwebbed in the hard drive over time. The wiki entries helped a lot.
>
> >
> > For what it's worth, from what I understand your
> > "punchline"/kewpie-prize solution is so much faster because it avoids
> > building the .SD data.table within each group.
>
> That was my deduction from having read the first entry in the wiki. I
> still can't believe I got that thing to work :)
>
> >
> > I'll let Matthew leave a more detailed comment, since he's (obviously)
> > much more intimately familiar w/ the inner voodoo of data.table. But
> > as a last comment -- if the speed differences are so drastic because
> > of the cost of creating the .SD data.table, maybe we should think
> > about taking some "inspiration" from plyr and define a similar
> > `colwise` function -- which would operate across each "column" of
> > supposedly-build .SD object applying a function to each of them w/o
> > actually building an .SD object itself.
>
> Your clairvoyance skills are clearly operating today :)  More
> seriously, this is what I would consider an 'obvious' "big-data"
> problem - I could easily see situations arising in finance and genomic
> applications where a fairly large subset of variables of the same
> type, but not necessarily all of them, need to be summarized in a
> particular way. The colwise() functions would be problematic as well
> in the scenario described in my eariler post, but I haven't tried
> ddply() to verify that assertion so I could be mistaken.
>
> It would be *really* helpful to have a convenient, fast  mechanism in
> data.table that allows one to substitute a (possibly large) vector of
> variable names into a function. Alas, I don't have any bright ideas
> about how to program it. Fortunately, there are some nice functions in
> R to select variable subsets efficiently in data frames (e.g., the
> grep() family of functions, regular expressions, %in% and so on), but
> I don't know how that would translate easily to data.table() since the
> internals are so different.
>
> Looking forward to the team's take on this...
>
> Dennis
>
> >
> > -steve
> >
> > On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <djmuser at gmail.com>
> wrote:
> >> Hi:
> >>
> >> <A bunch snipped because I get the archives in digest form>
> >>
> >> Re Prof. Voelkel's recent posts:
> >>
> >> (1) Quoting does not work well in data.table; this is mentioned in
> >> several of the FAQs. Apropos to this discussion, some of the relevant
> >> ones include 1.2, 1.6 and 2.1; there may be others :)
> >>
> >> (2) Steve's response seems to be the right way to go (although see
> >> below), but I thought I'd up the stakes a little and assume that Prof.
> >> Voelkel has a large number of variables, only a subset of which he may
> >> want summarized in a particular go. To that end, I created the
> >> following toy data frame cum data.table; this is as much for my own
> >> edification as anyone else's (which explains the eventual length of
> >> this post...I got curious :)
> >>
> >> This goes against the advice given in the first example of the
> >> data.table wiki, but if you have, say, 100 variables to select out of
> >> a possible 1000, it doesn't make sense to list them individually as
> >> recommended on the wiki. (But see below...)
> >>
> >> library('data.table')
> >> set.seed(1043)
> >> m <- matrix(rpois(240, 10), nrow = 6)
> >> colnames(m) <- paste('A', 1:40, sep = '')
> >> m <- as.data.frame(m)
> >> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = 'x')
> >> dim(dt2)
> >> # [1]  6 42       ...so far, so good
> >>
> >> # Subset of variables for which sums are desired
> >> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')
> >>
> >> # One approach: use the select = argument of subset() to restrict
> >> # the variables under consideration:
> >> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
> >>     x A1 A4 A10 A15 A31
> >> [1,] 1 18 21  22  22  24
> >> [2,] 2 20 13 27 23 21
> >> [3,] 3 22 15  16  23  15
> >>
> >> # Use the with = FALSE construct of data.table to do the same:
> >> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
> >>     x y A1 A4 A10 A15 A31
> >> [1,] 1 1 11 13  12  11  16
> >> [2,] 1 2  7  8  10  11   8
> >> [3,] 2 1 10  4  16   7  11
> >> [4,] 2 3 10 9 11 16 10
> >> [5,] 3 2 11  8   7  11   7
> >> [6,] 3 3 11  7   9  12   8
> >>
> >> # For this example, it is the same (apart from the key variables) as
> >> dt2[, vars, with = FALSE]
> >>
> >> Not bad for this small example, but what happens in a much larger data
> table?
> >>
> >> To find out, I created a 10000 x 1000 matrix that I converted into a
> >> data table, added two grouping variables of 100 levels each and then
> >> tried both approaches above again. Performance isn't bad when
> >> summarizing over one variable, but there is a definite hit when two
> >> variables are summarized. [It makes some sense since one is grouping
> >> over 10000 level combinations rather than 100, but once again, keep
> >> reading.] Curiously, it makes no difference if there is one key
> >> variable or two, which made me wonder what the preferred approach is
> >> in this circumstance.
> >>
> >> m <- matrix(rpois(10000000, 10), nrow = 10000)
> >> m <- as.data.table(m)
> >> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
> >> setkey(m, 'x')
> >> dim(m)
> >> # [1] 10000  1002
> >>
> >> # Randomly select 150 variables from the 1000
> >> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
> >> length(vars)
> >> # [1] 150
> >> key(m)
> >> # [1] "x"
> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
> >>   user  system elapsed
> >>   0.75    0.00    0.75
> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
> >>   user  system elapsed
> >>   0.64    0.00    0.64
> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
> >>   user  system elapsed
> >>  53.65    0.00   53.85
> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
> >>   user  system elapsed
> >>  44.21    0.01   44.35
> >>
> >> m2 <- data.table(m, key = 'x, y')
> >> rm(m)
> >> key(m2)
> >> # [1] "x" "y"
> >>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
> >>   user  system elapsed
> >>  53.54    0.00   53.73
> >>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
> >>   user  system elapsed
> >>  44.30    0.04   44.60
> >>
> >> The first question in the wiki
> >> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
> >> to use the columns directly rather than to rely on .SD. I wanted to
> >> know how to pass new names to the summaries instead of overwriting the
> >> original variable names. For the fun of it, I tried the following:
> >>
> >> select <- sample(1:1000, 150, replace = FALSE)
> >> vars <- paste('A', select, sep = '')
> >> outvars <- paste('S', select, sep = '')
> >>
> >> # Create a long expression of the form 'list(..., Sn = sum(An), ...)',
> >> # n a subscript from 1 to 150.
> >> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =
> >> ''), sep = '=', collapse = ','),
> >>               ')', sep = '')
> >> u <- m2[, eval(parse(text = expr)), by = 'x']
> >>> dim(u)
> >> # [1] 100 151     seems reasonable...
> >>
> >> This seemed to run rather fast, so I decided to time it:
> >>
> >>> system.time(m2[, eval(parse(text = expr)), by = 'x'])
> >>   user  system elapsed
> >>   0.03    0.00    0.03
> >>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
> >>   user  system elapsed
> >>   1.05    0.00    1.04
> >>
> >> I've got to admit, this is not the approach I would have taken
> >> normally, is certainly not intuitively obvious to me and flouts the
> >> usual advice to avoid the eval(parse(text = )) mantra, but the data
> >> don't lie :)  Please tell me there's a more code-efficient way to do
> >> this (the new variable names included), because my 'solution' was a
> >> complete kludge and accidental kewpie prize.
> >>
> >> Cheers,
> >> Dennis
> >>
> >>> Message: 1
> >>> Date: Thu, 14 Jul 2011 16:36:11 -0400
> >>> From: Joseph Voelkel <jgvcqa at rit.edu>
> >>> Subject: [datatable-help] Skipping some Vi names
> >>> To: "datatable-help at lists.r-forge.r-project.org"
> >>>        <datatable-help at r-forge.wu-wien.ac.at>
> >>> Message-ID:
> >>>        <
> 70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
> >>> Content-Type: text/plain; charset="us-ascii"
> >>>
> >>> I don't use data.table too much (though I probably should use it
> more...).
> >>>
> >>> I was surprised at the results below. It appears that the name V1 gets
> assigned to the first result, but then the keys ("in the background") are
> assigned the next set of Vi names, creating a gap in the names depending on
> the number of keys. I would like to see the Vi names appear in their
> natural, sequential, order. Not a show stopper, but it's annoying. (I have
> over 40 Vi's and it'd be good to have them numbered more rationally.)
> Thanks.
> >>>
> >>>>
> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
> >>>     x V1 V3 V4
> >>> [1,] 1  5 17 29
> >>> [2,] 2  7 19 31
> >>> [3,] 3  9 21 33
> >>>> key(dt)<-c("x","y")
> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
> >>>     x y V1 V4 V5
> >>> [1,] 1 1  1  7 13
> >>> [2,] 1 2  4 10 16
> >>> [3,] 2 1  2  8 14
> >>> [4,] 2 3  5 11 17
> >>> [5,] 3 2  3  9 15
> >>> [6,] 3 3  6 12 18
> >>>
> >>>
> >>>
> >>> Joseph G. Voelkel, Ph.D.
> >>> Professor, Center for Quality and Applied Statistics
> >>> Kate Gleason College of Engineering
> >>> Rochester Institute of Technology
> >>> V 585-475-2231
> >>> F 585-475-5959
> >>> joseph.voelkel at rit.edu
> >>>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >
> >
> >
> > --
> > Steve Lianoglou
> > Graduate Student: Computational Systems Biology
> >  | Memorial Sloan-Kettering Cancer Center
> >  | Weill Medical College of Cornell University
> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20110715/dbf884d9/attachment-0001.htm>


More information about the datatable-help mailing list