[datatable-help] datatable-help Digest, Vol 17, Issue 10

Sun Jul 17 17:24:39 CEST 2011

Hi,

Just an additional comment about:

On Sun, Jul 17, 2011 at 7:43 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> i) Whenever you use .SD in j, .SD will contain *all* the columns from
> the table, regardless of how .SD is used. That's because it's difficult
> for data.table to know which columns of .SD the j really uses. Where the
> subset appears directly in j it's pretty obvious but where the subset of
> columns are held in a variable, and that variable could be the same name
> as a column name, it all gets complicated.    But, there is a simple
> solution (I think) : we could add a new argument to data.table called
> '.SDcols' and you could pass the subset of columns in there; e.g.,
>
>     DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]
>
> Would that be better?

My comment is that a solution that avoids building the temporary .SD
altogether would be the most advantageous for "these scenarios."

I think we're all on the same page with that, but I just wanted to
make that point explicit.

The reason I say this is that if we only figure out which sub-columns
to use to reconstruct the .SD, I think we will still leave performance
gains on the table. We could get those gains by forgetting about the
tabular structure of .SD and just stuffing the columns into a normal
list-of-things (where the things are the would-be columns of .SD).
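
To make the list-of-things idea concrete, here is a rough base R sketch
(no data.table involved; df, cols, idx and column_list are made-up names
for illustration only). It says nothing about how data.table would do
this internally; it only shows per-group work done on a plain list of
column vectors rather than on a per-group table:

```r
# Toy data in a plain data.frame; all object names here are hypothetical
df <- data.frame(x = c(1, 1, 2, 2), A1 = 1:4, A2 = 5:8)
cols <- c("A1", "A2")

# Row indices belonging to each group of x
idx <- split(seq_len(nrow(df)), df$x)

# For each group, collect the wanted columns as a plain list of vectors
# (no per-group table is built) and apply the function to each element
res <- lapply(idx, function(i) {
  column_list <- lapply(cols, function(nm) df[[nm]][i])
  names(column_list) <- cols
  vapply(column_list, sum, numeric(1))
})

out <- do.call(rbind, res)  # one row per group, one column per variable
print(out)
```

Whether skipping the table construction pays off inside data.table
itself is exactly the open question; the sketch only pins down what I
mean by "list-of-things."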

> ii) lapply() is the base R lapply, which we know is slow. Recall that
> data.table is over 10 times faster than tapply because tapply calls
> lapply. Note also that lapply takes a function (closure) whereas
> data.table's j is just a body (lambda). The syntax changes for
> data.table weren't just for fun, you know ;)  There's a FR on this :
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978

I like that FR -- as long as we can get around the whole .SD thing :-)

Something like Chris's `colwise( f, var_names)` thing is what I have in mind.

Maybe shoehorning all of this into the current `[.data.table` might be
too ... tough?

What if we had a colwise-like function:

colwise(my.data.table, colnames, EXPR, by, ...)

Where everything from the by param onwards would work like the params
in `[.data.table`, but this invocation would run EXPR over each of the
columns listed in `colnames` in your `my.data.table`, using the `by`
groupings as "we expect."

Would this be a helpful way to approach this? That way the
`[.data.table` function isn't overloaded with too much different
functionality. Cramming all of these specialized cases into the same
function might be making it too magical, is all.

Also -- `colwise` could be `colapply` or something similar to avoid
trampling on the function by the same name in plyr.
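
A minimal sketch of what such a helper might look like, assuming the
data.table package is installed. The name `colapply`, its argument names
and the choice to pass the function as a string (rather than an EXPR)
are all my own assumptions, not an existing data.table API. It builds
the long j expression programmatically, so .SD is never referenced:

```r
library(data.table)

# Hypothetical helper, not part of data.table: apply the function named by
# `fun` to each column in `cols`, grouped by `group_cols`, without .SD.
colapply <- function(DT, cols, fun, group_cols) {
  # One unevaluated call per column, e.g. sum(A1)
  calls <- lapply(cols, function(nm) call(fun, as.name(nm)))
  names(calls) <- cols
  # Assemble list(A1 = sum(A1), A2 = sum(A2), ...)
  j <- as.call(c(as.name("list"), calls))
  # Evaluate the assembled expression as j within each group
  DT[, eval(j), by = group_cols]
}

DT <- data.table(x = c(1, 1, 2, 2), A1 = 1:4, A2 = 5:8)
res <- colapply(DT, c("A1", "A2"), "sum", "x")
print(res)
```

Building the call with call() and as.call() avoids pasting strings
together and parsing them, but it is still the same trick of handing
data.table a j that names each column directly.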

-steve

> However,  doing both (i) and (ii) just makes the syntax easier to access
> the speed which is already possible by automatically creating a (long) j
> expression. That's how data.table knows which columns are being used (by
> inspecting the expression using all.vars(), only subsetting those) and
> there isn't any call to lapply so that slow down goes away. Maybe making
> helper functions to make that easier is another way to go.
>
> Matthew
>
>
> On Fri, 2011-07-15 at 12:01 -0700, Dennis Murphy wrote:
>> And I just posted something that nicely used colwise() in conjunction
>> with a vector of variable names on R-help just a few minutes ago. A
>> look at the colwise() help page shows that there are several ways to
>> input a variable list for use with colwise(): a bquoted,
>> comma-separated, unquoted string of variables (e.g., .(A, B, C)), a
>> one-sided formula interface or a vector of (quoted) variable names,
>> which was my particular concern. I was ready to recant my assertion,
>> but you guys are too quick and sharp for me today :) So much good
>> stuff going on in the R-related lists the past two days to which I can
>> either contribute or learn from...
>>
>> Back to the point, an efficient variable selection mechanism in
>> conjunction with a processing function that could optionally take a
>> user-contributed (anonymous) function would be a welcome feature in
>> data.table.
>>
>> Cheers,
>> Dennis
>>
>> On Fri, Jul 15, 2011 at 11:29 AM, Chris Neff <caneff at gmail.com> wrote:
>> > Just chiming in to say something similar to colwise from plyr would be quite
>> > nice. You could just carry around a vector of variable names, then do
>> > something like DT[ ,colwise( f, var_names), by=by_names ].
>> >
>> > On 15 July 2011 14:06, Dennis Murphy <djmuser at gmail.com> wrote:
>> >>
>> >> On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou
>> >> <mailinglist.honeypot at gmail.com> wrote:
>> >> > Hi Dennis,
>> >> >
>> >> > I didn't see your post before I sent my latest reply.
>> >> >
>> >> > Nice detective work!
>> >>
>> >> Thanks, Steve. I just followed my nose and the docs, which I have
>> >> conveniently kept in a small binder for such occasions :) Like JV, I
>> >> don't use data.table every day, so some of its idiosyncrasies get
>> >> cobwebbed in the hard drive over time. The wiki entries helped a lot.
>> >>
>> >> >
>> >> > For what it's worth, from what I understand your
>> >> > "punchline"/kewpie-prize solution is so much faster because it avoids
>> >> > building the .SD data.table within each group.
>> >>
>> >> That was my deduction from having read the first entry in the wiki. I
>> >> still can't believe I got that thing to work :)
>> >>
>> >> >
>> >> > I'll let Matthew leave a more detailed comment, since he's (obviously)
>> >> > much more intimately familiar w/ the inner voodoo of data.table. But
>> >> > as a last comment -- if the speed differences are so drastic because
>> >> > of the cost of creating the .SD data.table, maybe we should think
>> >> > about taking some "inspiration" from plyr and define a similar
>> >> > `colwise` function -- which would operate across each "column" of
>> >> > the supposedly-built .SD object, applying a function to each of them w/o
>> >> > actually building an .SD object itself.
>> >>
>> >> Your clairvoyance skills are clearly operating today :)  More
>> >> seriously, this is what I would consider an 'obvious' "big-data"
>> >> problem - I could easily see situations arising in finance and genomic
>> >> applications where a fairly large subset of variables of the same
>> >> type, but not necessarily all of them, need to be summarized in a
>> >> particular way. The colwise() functions would be problematic as well
>> >> in the scenario described in my earlier post, but I haven't tried
>> >> ddply() to verify that assertion so I could be mistaken.
>> >>
>> >> It would be *really* helpful to have a convenient, fast  mechanism in
>> >> data.table that allows one to substitute a (possibly large) vector of
>> >> variable names into a function. Alas, I don't have any bright ideas
>> >> about how to program it. Fortunately, there are some nice functions in
>> >> R to select variable subsets efficiently in data frames (e.g., the
>> >> grep() family of functions, regular expressions, %in% and so on), but
>> >> I don't know how that would translate easily to data.table() since the
>> >> internals are so different.
>> >>
>> >> Looking forward to the team's take on this...
>> >>
>> >> Dennis
>> >>
>> >> >
>> >> > -steve
>> >> >
>> >> > On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <djmuser at gmail.com>
>> >> > wrote:
>> >> >> Hi:
>> >> >>
>> >> >> <A bunch snipped because I get the archives in digest form>
>> >> >>
>> >> >> Re Prof. Voelkel's recent posts:
>> >> >>
>> >> >> (1) Quoting does not work well in data.table; this is mentioned in
>> >> >> several of the FAQs. Apropos to this discussion, some of the relevant
>> >> >> ones include 1.2, 1.6 and 2.1; there may be others :)
>> >> >>
>> >> >> (2) Steve's response seems to be the right way to go (although see
>> >> >> below), but I thought I'd up the stakes a little and assume that Prof.
>> >> >> Voelkel has a large number of variables, only a subset of which he may
>> >> >> want summarized in a particular go. To that end, I created the
>> >> >> following toy data frame cum data.table; this is as much for my own
>> >> >> edification as anyone else's (which explains the eventual length of
>> >> >> this post...I got curious :)
>> >> >>
>> >> >> This goes against the advice given in the first example of the
>> >> >> data.table wiki, but if you have, say, 100 variables to select out of
>> >> >> a possible 1000, it doesn't make sense to list them individually as
>> >> >> recommended on the wiki. (But see below...)
>> >> >>
>> >> >> library('data.table')
>> >> >> set.seed(1043)
>> >> >> m <- matrix(rpois(240, 10), nrow = 6)
>> >> >> colnames(m) <- paste('A', 1:40, sep = '')
>> >> >> m <- as.data.frame(m)
>> >> >> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key =
>> >> >> 'x')
>> >> >> dim(dt2)
>> >> >> # [1]  6 42       ...so far, so good
>> >> >>
>> >> >> # Subset of variables for which sums are desired
>> >> >> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')
>> >> >>
>> >> >> # One approach: use the select = argument of subset() to restrict
>> >> >> # the variables under consideration:
>> >> >> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
>> >> >>     x A1 A4 A10 A15 A31
>> >> >> [1,] 1 18 21  22  22  24
>> >> >> [2,] 2 20 13  27  23  21
>> >> >> [3,] 3 22 15  16  23  15
>> >> >>
>> >> >> # Use the with = FALSE construct of data.table to do the same:
>> >> >> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
>> >> >>     x y A1 A4 A10 A15 A31
>> >> >> [1,] 1 1 11 13  12  11  16
>> >> >> [2,] 1 2  7  8  10  11   8
>> >> >> [3,] 2 1 10  4  16   7  11
>> >> >> [4,] 2 3 10  9  11  16  10
>> >> >> [5,] 3 2 11  8   7  11   7
>> >> >> [6,] 3 3 11  7   9  12   8
>> >> >>
>> >> >> # For this example, it is the same (apart from the key variables) as
>> >> >> dt2[, vars, with = FALSE]
>> >> >>
>> >> >> Not bad for this small example, but what happens in a much larger data
>> >> >> table?
>> >> >>
>> >> >> To find out, I created a 10000 x 1000 matrix that I converted into a
>> >> >> data table, added two grouping variables of 100 levels each and then
>> >> >> tried both approaches above again. Performance isn't bad when
>> >> >> summarizing over one variable, but there is a definite hit when two
>> >> >> variables are summarized. [It makes some sense since one is grouping
>> >> >> over 10000 level combinations rather than 100, but once again, keep
>> >> >> reading.] Curiously, it makes no difference if there is one key
>> >> >> variable or two, which made me wonder what the preferred approach is
>> >> >> in this circumstance.
>> >> >>
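>> >> >> m <- matrix(rpois(10000000, 10), nrow = 10000)
>> >> >> # Name the columns A1..A1000 so that the 'A' names used below exist
>> >> >> colnames(m) <- paste('A', 1:1000, sep = '')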
>> >> >> m <- as.data.table(m)
>> >> >> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
>> >> >> setkey(m, 'x')
>> >> >> dim(m)
>> >> >> # [1] 10000  1002
>> >> >>
>> >> >> # Randomly select 150 variables from the 1000
>> >> >> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
>> >> >> length(vars)
>> >> >> # [1] 150
>> >> >> key(m)
>> >> >> # [1] "x"
>> >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
>> >> >>   user  system elapsed
>> >> >>   0.75    0.00    0.75
>> >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
>> >> >>   user  system elapsed
>> >> >>   0.64    0.00    0.64
>> >> >>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
>> >> >>   user  system elapsed
>> >> >>  53.65    0.00   53.85
>> >> >>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
>> >> >>   user  system elapsed
>> >> >>  44.21    0.01   44.35
>> >> >>
>> >> >> m2 <- data.table(m, key = 'x, y')
>> >> >> rm(m)
>> >> >> key(m2)
>> >> >> # [1] "x" "y"
>> >> >>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
>> >> >>   user  system elapsed
>> >> >>  53.54    0.00   53.73
>> >> >>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
>> >> >>   user  system elapsed
>> >> >>  44.30    0.04   44.60
>> >> >>
>> >> >> The first question in the wiki
>> >> >> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
>> >> >> to use the columns directly rather than to rely on .SD. I wanted to
>> >> >> know how to pass new names to the summaries instead of overwriting the
>> >> >> original variable names. For the fun of it, I tried the following:
>> >> >>
>> >> >> select <- sample(1:1000, 150, replace = FALSE)
>> >> >> vars <- paste('A', select, sep = '')
>> >> >> outvars <- paste('S', select, sep = '')
>> >> >>
>> >> >> # Create a long expression of the form 'list(..., Sn = sum(An), ...)',
>> >> >> # n a subscript from 1 to 150.
>> >> >> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =
>> >> >> ''), sep = '=', collapse = ','),
>> >> >>               ')', sep = '')
>> >> >> u <- m2[, eval(parse(text = expr)), by = 'x']
>> >> >>> dim(u)
>> >> >> # [1] 100 151     seems reasonable...
>> >> >>
>> >> >> This seemed to run rather fast, so I decided to time it:
>> >> >>
>> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x'])
>> >> >>   user  system elapsed
>> >> >>   0.03    0.00    0.03
>> >> >>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
>> >> >>   user  system elapsed
>> >> >>   1.05    0.00    1.04
>> >> >>
>> >> >> I've got to admit, this is not the approach I would have taken
>> >> >> normally, is certainly not intuitively obvious to me and flouts the
>> >> >> usual advice to avoid the eval(parse(text = )) mantra, but the data
>> >> >> don't lie :)  Please tell me there's a more code-efficient way to do
>> >> >> this (the new variable names included), because my 'solution' was a
>> >> >> complete kludge and accidental kewpie prize.
>> >> >>
>> >> >> Cheers,
>> >> >> Dennis
>> >> >>
>> >> >>> Message: 1
>> >> >>> Date: Thu, 14 Jul 2011 16:36:11 -0400
>> >> >>> From: Joseph Voelkel <jgvcqa at rit.edu>
>> >> >>> Subject: [datatable-help] Skipping some Vi names
>> >> >>> To: "datatable-help at lists.r-forge.r-project.org"
>> >> >>>        <datatable-help at r-forge.wu-wien.ac.at>
>> >> >>> Message-ID:
>> >> >>>
>> >> >>>  <70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
>> >> >>> Content-Type: text/plain; charset="us-ascii"
>> >> >>>
>> >> >>> I don't use data.table too much (though I probably should use it
>> >> >>> more...).
>> >> >>>
>> >> >>> I was surprised at the results below. It appears that the name V1 gets
>> >> >>> assigned to the first result, but then the keys ("in the background") are
>> >> >>> assigned the next set of Vi names, creating a gap in the names depending on
>> >> >>> the number of keys. I would like to see the Vi names appear in their
>> >> >>> natural, sequential, order. Not a show stopper, but it's annoying. (I have
>> >> >>> over 40 Vi's and it'd be good to have them numbered more rationally.)
>> >> >>> Thanks.
>> >> >>>
>> >> >>>>
>> >> >>>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
>> >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
>> >> >>>     x V1 V3 V4
>> >> >>> [1,] 1  5 17 29
>> >> >>> [2,] 2  7 19 31
>> >> >>> [3,] 3  9 21 33
>> >> >>>> key(dt)<-c("x","y")
>> >> >>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
>> >> >>>     x y V1 V4 V5
>> >> >>> [1,] 1 1  1  7 13
>> >> >>> [2,] 1 2  4 10 16
>> >> >>> [3,] 2 1  2  8 14
>> >> >>> [4,] 2 3  5 11 17
>> >> >>> [5,] 3 2  3  9 15
>> >> >>> [6,] 3 3  6 12 18
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Joseph G. Voelkel, Ph.D.
>> >> >>> Professor, Center for Quality and Applied Statistics
>> >> >>> Kate Gleason College of Engineering
>> >> >>> Rochester Institute of Technology
>> >> >>> V 585-475-2231
>> >> >>> F 585-475-5959
>> >> >>> joseph.voelkel at rit.edu
>> >> >>>
>> >> >> _______________________________________________
>> >> >> datatable-help mailing list
>> >> >> datatable-help at lists.r-forge.r-project.org
>> >> >>
>> >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Steve Lianoglou
>> >> > Graduate Student: Computational Systems Biology
>> >> >  | Memorial Sloan-Kettering Cancer Center
>> >> >  | Weill Medical College of Cornell University
>> >> > Contact Info: http://cbio.mskcc.org/~lianos/contact
>> >> >
>> >
>> >
>
>
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact