[datatable-help] datatable-help Digest, Vol 17, Issue 10

Matthew Dowle mdowle at mdowle.plus.com
Sun Jul 17 13:43:02 CEST 2011


Nice observations. All tests, observations and motivations were correct
as far as I could see, although I didn't rerun myself.

Two things going on :

i) Whenever you use .SD in j, .SD will contain *all* the columns from
the table, regardless of how .SD is used. That's because it's difficult
for data.table to know which columns of .SD the j really uses. Where the
subset appears directly in j it's pretty obvious but where the subset of
columns are held in a variable, and that variable could be the same name
as a column name, it all gets complicated.    But, there is a simple
solution (I think) : we could add a new argument to data.table called
'.SDcols' and you could pass the subset of columns in there; e.g.,

     DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]

Would that be better?

ii) lapply() is the base R lapply, which we know is slow. Recall that
data.table is over 10 times faster than tapply because tapply calls
lapply. Note also that lapply takes a function (closure) whereas
data.table's j is just a body (lambda). The syntax changes for
data.table weren't just for fun, you know ;)  There's a FR on this :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978


However,  doing both (i) and (ii) just makes the syntax easier to access
the speed which is already possible by automatically creating a (long) j
expression. That's how data.table knows which columns are being used (by
inspecting the expression using all.vars(), only subsetting those) and
there isn't any call to lapply so that slow down goes away. Maybe making
helper functions to make that easier is another way to go.
 
Matthew


On Fri, 2011-07-15 at 12:01 -0700, Dennis Murphy wrote:
> And I just posted something that nicely used colwise() in conjunction
> with a vector of variable names on R-help just a few minutes ago. A
> look at the colwise() help page shows that there are several ways to
> input a variable list for use with colwise(): a bquoted,
> comma-separated, unquoted string of variables (e.g., .(A, B, C)), a
> one-sided formula interface or a vector of (quoted) variable names,
> which was my particular concern. I was ready to recant my assertion,
> but you guys are too quick and sharp for me today :) So much good
> stuff going on in the R-related lists the past two days to which I can
> either contribute or learn from...
> 
> Back to the point, an efficient variable selection mechanism in
> conjunction with a processing function that could optionally take a
> user-contributed (anonymous) function would be a welcome feature in
> data.table.
> 
> Cheers,
> Dennis
> 
> On Fri, Jul 15, 2011 at 11:29 AM, Chris Neff <caneff at gmail.com> wrote:
> > Just chiming in to say something similar to colwise from plyr would be quite
> > nice. You could just carry around a vector of variable names, then do
> > something DT[ ,colwise( f, var_names), by=by_names ].
> >
> > On 15 July 2011 14:06, Dennis Murphy <djmuser at gmail.com> wrote:
> >>
> >> On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou
> >> <mailinglist.honeypot at gmail.com> wrote:
> >> > Hi Dennis,
> >> >
> >> > I didn't see your post before I sent my latest reply.
> >> >
> >> > Nice detective work!
> >>
> >> Thanks, Steve. I just followed my nose and the docs, which I have
> >> conveniently kept in a small binder for such occasions :) Like JV, I
> >> don't use data.table every day, so some of its idiosyncracies get
> >> cobwebbed in the hard drive over time. The wiki entries helped a lot.
> >>
> >> >
> >> > For what it's worth, from what I understand your
> >> > "punchline"/kewpie-prize solution is so much faster because it avoids
> >> > building the .SD data.table within each group.
> >>
> >> That was my deduction from having read the first entry in the wiki. I
> >> still can't believe I got that thing to work :)
> >>
> >> >
> >> > I'll let Matthew leave a more detailed comment, since he's (obviously)
> >> > much more intimately familiar w/ the inner voodoo of data.table. But
> >> > as a last comment -- if the speed differences are so drastic because
> >> > of the cost of creating the .SD data.table, maybe we should think
> >> > about taking some "inspiration" from plyr and define a similar
> >> > `colwise` function -- which would operate across each "column" of
> >> > supposedly-build .SD object applying a function to each of them w/o
> >> > actually building an .SD object itself.
> >>
> >> Your clairvoyance skills are clearly operating today :)  More
> >> seriously, this is what I would consider an 'obvious' "big-data"
> >> problem - I could easily see situations arising in finance and genomic
> >> applications where a fairly large subset of variables of the same
> >> type, but not necessarily all of them, need to be summarized in a
> >> particular way. The colwise() functions would be problematic as well
> >> in the scenario described in my eariler post, but I haven't tried
> >> ddply() to verify that assertion so I could be mistaken.
> >>
> >> It would be *really* helpful to have a convenient, fast  mechanism in
> >> data.table that allows one to substitute a (possibly large) vector of
> >> variable names into a function. Alas, I don't have any bright ideas
> >> about how to program it. Fortunately, there are some nice functions in
> >> R to select variable subsets efficiently in data frames (e.g., the
> >> grep() family of functions, regular expressions, %in% and so on), but
> >> I don't know how that would translate easily to data.table() since the
> >> internals are so different.
> >>
> >> Looking forward to the team's take on this...
> >>
> >> Dennis
> >>
> >> >
> >> > -steve
> >> >
> >> > On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <djmuser at gmail.com>
> >> > wrote:
> >> >> Hi:
> >> >>
> >> >> <A bunch snipped because I get the archives in digest form>
> >> >>
> >> >> Re Prof. Voelkel's recent posts:
> >> >>
> >> >> (1) Quoting does not work well in data.table; this is mentioned in
> >> >> several of the FAQs. Apropos to this discussion, some of the relevant
> >> >> ones include 1.2, 1.6 and 2.1; there may be others :)
> >> >>
> >> >> (2) Steve's response seems to be the right way to go (although see
> >> >> below), but I thought I'd up the stakes a little and assume that Prof.
> >> >> Voelkel has a large number of variables, only a subset of which he may
> >> >> want summarized in a particular go. To that end, I created the
> >> >> following toy data frame cum data.table; this is as much for my own
> >> >> edification as anyone else's (which explains the eventual length of
> >> >> this post...I got curious :)
> >> >>
> >> >> This goes against the advice given in the first example of the
> >> >> data.table wiki, but if you have, say, 100 variables to select out of
> >> >> a possible 1000, it doesn't make sense to list them individually as
> >> >> recommended on the wiki. (But see below...)
> >> >>
> >> >> library('data.table')
> >> >> set.seed(1043)
> >> >> m <- matrix(rpois(240, 10), nrow = 6)
> >> >> colnames(m) <- paste('A', 1:40, sep = '')
> >> >> m <- as.data.frame(m)
> >> >> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key =
> >> >> 'x')
> >> >> dim(dt2)
> >> >> # [1]  6 42       ...so far, so good
> >> >>
> >> >> # Subset of variables for which sums are desired
> >> >> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')
> >> >>
> >> >> # One approach: use the select = argument of subset() to restrict
> >> >> # the variables under consideration:
> >> >> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
> >> >>     x A1 A4 A10 A15 A31
> >> >> [1,] 1 18 21  22  22  24
> >> >> [2,] 2 20 13 27 23 21
> >> >> [3,] 3 22 15  16  23  15
> >> >>
> >> >> # Use the with = FALSE construct of data.table to do the same:
> >> >> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
> >> >>     x y A1 A4 A10 A15 A31
> >> >> [1,] 1 1 11 13  12  11  16
> >> >> [2,] 1 2  7  8  10  11   8
> >> >> [3,] 2 1 10  4  16   7  11
> >> >> [4,] 2 3 10 9 11 16 10
> >> >> [5,] 3 2 11  8   7  11   7
> >> >> [6,] 3 3 11  7   9  12   8
> >> >>
> >> >> # For this example, it is the same (apart from the key variables) as
> >> >> dt2[, vars, with = FALSE]
> >> >>
> >> >> Not bad for this small example, but what happens in a much larger data
> >> >> table?
> >> >>
> >> >> To find out, I created a 10000 x 1000 matrix that I converted into a
> >> >> data table, added two grouping variables of 100 levels each and then
> >> >> tried both approaches above again. Performance isn't bad when
> >> >> summarizing over one variable, but there is a definite hit when two
> >> >> variables are summarized. [It makes some sense since one is grouping
> >> >> over 10000 level combinations rather than 100, but once again, keep
> >> >> reading.] Curiously, it makes no difference if there is one key
> >> >> variable or two, which made me wonder what the preferred approach is
> >> >> in this circumstance.
> >> >>
> >> >> m <- matrix(rpois(10000000, 10), nrow = 10000)
> >> >> m <- as.data.table(m)
> >> >> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
> >> >> setkey(m, 'x')
> >> >> dim(m)
> >> >> # [1] 10000  1002
> >> >>
> >> >> # Randomly select 150 variables from the 1000
> >> >> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
> >> >> length(vars)
> >> >> # [1] 150
> >> >> key(m)
> >> >> # [1] "x"
> >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
> >> >>   user  system elapsed
> >> >>   0.75    0.00    0.75
> >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
> >> >>   user  system elapsed
> >> >>   0.64    0.00    0.64
> >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
> >> >>   user  system elapsed
> >> >>  53.65    0.00   53.85
> >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
> >> >>   user  system elapsed
> >> >>  44.21    0.01   44.35
> >> >>
> >> >> m2 <- data.table(m, key = 'x, y')
> >> >> rm(m)
> >> >> key(m2)
> >> >> # [1] "x" "y"
> >> >>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x,
> >> >>> y'])
> >> >>   user  system elapsed
> >> >>  53.54    0.00   53.73
> >> >>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
> >> >>   user  system elapsed
> >> >>  44.30    0.04   44.60
> >> >>
> >> >> The first question in the wiki
> >> >> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
> >> >> to use the columns directly rather than to rely on .SD. I wanted to
> >> >> know how to pass new names to the summaries instead of overwriting the
> >> >> original variable names. For the fun of it, I tried the following:
> >> >>
> >> >> select <- sample(1:1000, 150, replace = FALSE)
> >> >> vars <- paste('A', select, sep = '')
> >> >> outvars <- paste('S', select, sep = '')
> >> >>
> >> >> # Create a long expression of the form 'list(..., Sn = sum(An), ...)',
> >> >> # n a subscript from 1 to 150.
> >> >> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =
> >> >> ''), sep = '=', collapse = ','),
> >> >>               ')', sep = '')
> >> >> u <- m2[, eval(parse(text = expr)), by = 'x']
> >> >>> dim(u)
> >> >> # [1] 100 151     seems reasonable...
> >> >>
> >> >> This seemed to run rather fast, so I decided to time it:
> >> >>
> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x'])
> >> >>   user  system elapsed
> >> >>   0.03    0.00    0.03
> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
> >> >>   user  system elapsed
> >> >>   1.05    0.00    1.04
> >> >>
> >> >> I've got to admit, this is not the approach I would have taken
> >> >> normally, is certainly not intuitively obvious to me and flouts the
> >> >> usual advice to avoid the eval(parse(text = )) mantra, but the data
> >> >> don't lie :)  Please tell me there's a more code-efficient way to do
> >> >> this (the new variable names included), because my 'solution' was a
> >> >> complete kludge and accidental kewpie prize.
> >> >>
> >> >> Cheers,
> >> >> Dennis
> >> >>
> >> >>> Message: 1
> >> >>> Date: Thu, 14 Jul 2011 16:36:11 -0400
> >> >>> From: Joseph Voelkel <jgvcqa at rit.edu>
> >> >>> Subject: [datatable-help] Skipping some Vi names
> >> >>> To: "datatable-help at lists.r-forge.r-project.org"
> >> >>>        <datatable-help at r-forge.wu-wien.ac.at>
> >> >>> Message-ID:
> >> >>>
> >> >>>  <70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
> >> >>> Content-Type: text/plain; charset="us-ascii"
> >> >>>
> >> >>> I don't use data.table too much (though I probably should use it
> >> >>> more...).
> >> >>>
> >> >>> I was surprised at the results below. It appears that the name V1 gets
> >> >>> assigned to the first result, but then the keys ("in the background") are
> >> >>> assigned the next set of Vi names, creating a gap in the names depending on
> >> >>> the number of keys. I would like to see the Vi names appear in their
> >> >>> natural, sequential, order. Not a show stopper, but it's annoying. (I have
> >> >>> over 40 Vi's and it'd be good to have them numbered more rationally.)
> >> >>> Thanks.
> >> >>>
> >> >>>>
> >> >>>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
> >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
> >> >>>     x V1 V3 V4
> >> >>> [1,] 1  5 17 29
> >> >>> [2,] 2  7 19 31
> >> >>> [3,] 3  9 21 33
> >> >>>> key(dt)<-c("x","y")
> >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
> >> >>>     x y V1 V4 V5
> >> >>> [1,] 1 1  1  7 13
> >> >>> [2,] 1 2  4 10 16
> >> >>> [3,] 2 1  2  8 14
> >> >>> [4,] 2 3  5 11 17
> >> >>> [5,] 3 2  3  9 15
> >> >>> [6,] 3 3  6 12 18
> >> >>>
> >> >>>
> >> >>>
> >> >>> Joseph G. Voelkel, Ph.D.
> >> >>> Professor, Center for Quality and Applied Statistics
> >> >>> Kate Gleason College of Engineering
> >> >>> Rochester Institute of Technology
> >> >>> V 585-475-2231
> >> >>> F 585-475-5959
> >> >>> joseph.voelkel at rit.edu
> >> >>>
> >> >> _______________________________________________
> >> >> datatable-help mailing list
> >> >> datatable-help at lists.r-forge.r-project.org
> >> >>
> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Steve Lianoglou
> >> > Graduate Student: Computational Systems Biology
> >> >  | Memorial Sloan-Kettering Cancer Center
> >> >  | Weill Medical College of Cornell University
> >> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >> >
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >>
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
> >
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list