[datatable-help] datatable-help Digest, Vol 17, Issue 10

Fri Jul 15 17:23:11 CEST 2011

Hi Dennis,

I didn't see your post before I sent my latest reply.

Nice detective work!

For what it's worth, from what I understand your
"punchline"/kewpie-prize solution is so much faster because it avoids
building the .SD data.table within each group.

I'll let Matthew leave a more detailed comment, since he's (obviously)
much more intimately familiar w/ the inner voodoo of data.table. But
as a last comment -- if the speed differences are so drastic because
of the cost of creating the .SD data.table, maybe we should think
about taking some "inspiration" from plyr and define a similar
`colwise` function -- which would operate across each "column" of
supposedly-build .SD object applying a function to each of them w/o
actually building an .SD object itself.

-steve

On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <djmuser at gmail.com> wrote:
> Hi:
>
> <A bunch snipped because I get the archives in digest form>
>
> Re Prof. Voelkel's recent posts:
>
> (1) Quoting does not work well in data.table; this is mentioned in
> several of the FAQs. Apropos to this discussion, some of the relevant
> ones include 1.2, 1.6 and 2.1; there may be others :)
>
> (2) Steve's response seems to be the right way to go (although see
> below), but I thought I'd up the stakes a little and assume that Prof.
> Voelkel has a large number of variables, only a subset of which he may
> want summarized in a particular go. To that end, I created the
> following toy data frame cum data.table; this is as much for my own
> edification as anyone else's (which explains the eventual length of
> this post...I got curious :)
>
> This goes against the advice given in the first example of the
> data.table wiki, but if you have, say, 100 variables to select out of
> a possible 1000, it doesn't make sense to list them individually as
> recommended on the wiki. (But see below...)
>
> library('data.table')
> set.seed(1043)
> m <- matrix(rpois(240, 10), nrow = 6)
> colnames(m) <- paste('A', 1:40, sep = '')
> m <- as.data.frame(m)
> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = 'x')
> dim(dt2)
> # [1]  6 42       ...so far, so good
>
> # Subset of variables for which sums are desired
> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')
>
> # One approach: use the select = argument of subset() to restrict
> # the variables under consideration:
> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
>     x A1 A4 A10 A15 A31
> [1,] 1 18 21  22  22  24
> [2,] 2 20 13 27 23 21
> [3,] 3 22 15  16  23  15
>
> # Use the with = FALSE construct of data.table to do the same:
> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
>     x y A1 A4 A10 A15 A31
> [1,] 1 1 11 13  12  11  16
> [2,] 1 2  7  8  10  11   8
> [3,] 2 1 10  4  16   7  11
> [4,] 2 3 10 9 11 16 10
> [5,] 3 2 11  8   7  11   7
> [6,] 3 3 11  7   9  12   8
>
> # For this example, it is the same (apart from the key variables) as
> dt2[, vars, with = FALSE]
>
> Not bad for this small example, but what happens in a much larger data table?
>
> To find out, I created a 10000 x 1000 matrix that I converted into a
> data table, added two grouping variables of 100 levels each and then
> tried both approaches above again. Performance isn't bad when
> summarizing over one variable, but there is a definite hit when two
> variables are summarized. [It makes some sense since one is grouping
> over 10000 level combinations rather than 100, but once again, keep
> reading.] Curiously, it makes no difference if there is one key
> variable or two, which made me wonder what the preferred approach is
> in this circumstance.
>
> m <- matrix(rpois(10000000, 10), nrow = 10000)
> m <- as.data.table(m)
> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
> setkey(m, 'x')
> dim(m)
> # [1] 10000  1002
>
> # Randomly select 150 variables from the 1000
> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
> length(vars)
> # [1] 150
> key(m)
> # [1] "x"
>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
>   user  system elapsed
>   0.75    0.00    0.75
>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
>   user  system elapsed
>   0.64    0.00    0.64
>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
>   user  system elapsed
>  53.65    0.00   53.85
>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
>   user  system elapsed
>  44.21    0.01   44.35
>
> m2 <- data.table(m, key = 'x, y')
> rm(m)
> key(m2)
> # [1] "x" "y"
>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
>   user  system elapsed
>  53.54    0.00   53.73
>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
>   user  system elapsed
>  44.30    0.04   44.60
>
> The first question in the wiki
> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
> to use the columns directly rather than to rely on .SD. I wanted to
> know how to pass new names to the summaries instead of overwriting the
> original variable names. For the fun of it, I tried the following:
>
> select <- sample(1:1000, 150, replace = FALSE)
> vars <- paste('A', select, sep = '')
> outvars <- paste('S', select, sep = '')
>
> # Create a long expression of the form 'list(..., Sn = sum(An), ...)',
> # n a subscript from 1 to 150.
> expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =
> ''), sep = '=', collapse = ','),
>               ')', sep = '')
> u <- m2[, eval(parse(text = expr)), by = 'x']
>> dim(u)
> # [1] 100 151     seems reasonable...
>
> This seemed to run rather fast, so I decided to time it:
>
>> system.time(m2[, eval(parse(text = expr)), by = 'x'])
>   user  system elapsed
>   0.03    0.00    0.03
>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
>   user  system elapsed
>   1.05    0.00    1.04
>
> I've got to admit, this is not the approach I would have taken
> normally, is certainly not intuitively obvious to me and flouts the
> usual advice to avoid the eval(parse(text = )) mantra, but the data
> don't lie :)  Please tell me there's a more code-efficient way to do
> this (the new variable names included), because my 'solution' was a
> complete kludge and accidental kewpie prize.
>
> Cheers,
> Dennis
>
>> Message: 1
>> Date: Thu, 14 Jul 2011 16:36:11 -0400
>> From: Joseph Voelkel <jgvcqa at rit.edu>
>> Subject: [datatable-help] Skipping some Vi names
>> To: "datatable-help at lists.r-forge.r-project.org"
>>        <datatable-help at r-forge.wu-wien.ac.at>
>> Message-ID:
>>        <70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> I don't use data.table too much (though I probably should use it more...).
>>
>> I was surprised at the results below. It appears that the name V1 gets assigned to the first result, but then the keys ("in the background") are assigned the next set of Vi names, creating a gap in the names depending on the number of keys. I would like to see the Vi names appear in their natural, sequential, order. Not a show stopper, but it's annoying. (I have over 40 Vi's and it'd be good to have them numbered more rationally.) Thanks.
>>
>>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
>>     x V1 V3 V4
>> [1,] 1  5 17 29
>> [2,] 2  7 19 31
>> [3,] 3  9 21 33
>>> key(dt)<-c("x","y")
>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
>>     x y V1 V4 V5
>> [1,] 1 1  1  7 13
>> [2,] 1 2  4 10 16
>> [3,] 2 1  2  8 14
>> [4,] 2 3  5 11 17
>> [5,] 3 2  3  9 15
>> [6,] 3 3  6 12 18
>>
>>
>>
>> Joseph G. Voelkel, Ph.D.
>> Professor, Center for Quality and Applied Statistics
>> Kate Gleason College of Engineering
>> Rochester Institute of Technology
>> V 585-475-2231
>> F 585-475-5959
>> joseph.voelkel at rit.edu
>>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact