[datatable-help] datatable-help Digest, Vol 17, Issue 10.

Fri Jul 15 16:46:04 CEST 2011

Thanks, Dennis!

(Dennis is a former student of mine who clearly knows data.table much better than I do.)

Joe

-----Original Message-----
From: Dennis Murphy [mailto:djmuser at gmail.com] 
Sent: Friday, July 15, 2011 10:34 AM
To: datatable-help at lists.r-forge.r-project.org; Joseph Voelkel
Subject: Re: datatable-help Digest, Vol 17, Issue 10

Hi:

<A bunch snipped because I get the archives in digest form>

Re Prof. Voelkel's recent posts:

(1) Quoting does not work well in data.table; this is mentioned in
several of the FAQs. Apropos to this discussion, some of the relevant
ones include 1.2, 1.6 and 2.1; there may be others :)

(2) Steve's response seems to be the right way to go (although see
below), but I thought I'd up the stakes a little and assume that Prof.
Voelkel has a large number of variables, only a subset of which he may
want summarized in a particular go. To that end, I created the
following toy data frame cum data.table; this is as much for my own
edification as anyone else's (which explains the eventual length of
this post...I got curious :)

This goes against the advice given in the first example of the
data.table wiki, but if you have, say, 100 variables to select out of
a possible 1000, it doesn't make sense to list them individually as
recommended on the wiki. (But see below...)

library('data.table')
set.seed(1043)
m <- matrix(rpois(240, 10), nrow = 6)
colnames(m) <- paste('A', 1:40, sep = '')
m <- as.data.frame(m)
dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = 'x')
dim(dt2)
# [1]  6 42       ...so far, so good

# Subset of variables for which sums are desired
vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')

# One approach: use the select = argument of subset() to restrict
# the variables under consideration:
dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
     x A1 A4 A10 A15 A31
[1,] 1 18 21  22  22  24
[2,] 2 20 13  27  23  21
[3,] 3 22 15  16  23  15

# Use the with = FALSE construct of data.table to do the same:
dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
     x y A1 A4 A10 A15 A31
[1,] 1 1 11 13  12  11  16
[2,] 1 2  7  8  10  11   8
[3,] 2 1 10  4  16   7  11
[4,] 2 3 10  9  11  16  10
[5,] 3 2 11  8   7  11   7
[6,] 3 3 11  7   9  12   8

# For this example, it is the same (apart from the key variables) as
dt2[, vars, with = FALSE]

Not bad for this small example, but what happens in a much larger data table?

To find out, I created a 10000 x 1000 matrix that I converted into a
data table, added two grouping variables of 100 levels each and then
tried both approaches above again. Performance isn't bad when
summarizing over one variable, but there is a definite hit when two
variables are summarized. [It makes some sense since one is grouping
over 10000 level combinations rather than 100, but once again, keep
reading.] Curiously, it makes no difference if there is one key
variable or two, which made me wonder what the preferred approach is
in this circumstance.

m <- matrix(rpois(10000000, 10), nrow = 10000)
m <- as.data.table(m)
m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
setkey(m, 'x')
dim(m)
# [1] 10000  1002

# Randomly select 150 variables from the 1000
vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
length(vars)
# [1] 150
key(m)
# [1] "x"
> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
   user  system elapsed
   0.75    0.00    0.75
> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
   user  system elapsed
   0.64    0.00    0.64
> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
   user  system elapsed
  53.65    0.00   53.85
> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
   user  system elapsed
  44.21    0.01   44.35

m2 <- data.table(m, key = 'x, y')
rm(m)
key(m2)
# [1] "x" "y"
> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
   user  system elapsed
  53.54    0.00   53.73
> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
   user  system elapsed
  44.30    0.04   44.60

The first question in the wiki
(http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
to use the columns directly rather than to rely on .SD. I wanted to
know how to pass new names to the summaries instead of overwriting the
original variable names. For the fun of it, I tried the following:

select <- sample(1:1000, 150, replace = FALSE)
vars <- paste('A', select, sep = '')
outvars <- paste('S', select, sep = '')

# Create a long expression of the form 'list(..., Sn = sum(An), ...)',
# n a subscript from 1 to 150.
expr <- paste('list(', paste(outvars, paste('sum(', vars, ')', sep =
''), sep = '=', collapse = ','),
               ')', sep = '')
u <- m2[, eval(parse(text = expr)), by = 'x']
> dim(u)
# [1] 100 151     seems reasonable...

This seemed to run rather fast, so I decided to time it:

> system.time(m2[, eval(parse(text = expr)), by = 'x'])
   user  system elapsed
   0.03    0.00    0.03
> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
   user  system elapsed
   1.05    0.00    1.04

I've got to admit, this is not the approach I would have taken
normally, is certainly not intuitively obvious to me and flouts the
usual advice to avoid the eval(parse(text = )) mantra, but the data
don't lie :)  Please tell me there's a more code-efficient way to do
this (the new variable names included), because my 'solution' was a
complete kludge and accidental kewpie prize.

Cheers,
Dennis

> Message: 1
> Date: Thu, 14 Jul 2011 16:36:11 -0400
> From: Joseph Voelkel <jgvcqa at rit.edu>
> Subject: [datatable-help] Skipping some Vi names
> To: "datatable-help at lists.r-forge.r-project.org"
>        <datatable-help at r-forge.wu-wien.ac.at>
> Message-ID:
>        <70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
> Content-Type: text/plain; charset="us-ascii"
>
> I don't use data.table too much (though I probably should use it more...).
>
> I was surprised at the results below. It appears that the name V1 gets assigned to the first result, but then the keys ("in the background") are assigned the next set of Vi names, creating a gap in the names depending on the number of keys. I would like to see the Vi names appear in their natural, sequential, order. Not a show stopper, but it's annoying. (I have over 40 Vi's and it'd be good to have them numbered more rationally.) Thanks.
>
>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
>     x V1 V3 V4
> [1,] 1  5 17 29
> [2,] 2  7 19 31
> [3,] 3  9 21 33
>> key(dt)<-c("x","y")
>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
>     x y V1 V4 V5
> [1,] 1 1  1  7 13
> [2,] 1 2  4 10 16
> [3,] 2 1  2  8 14
> [4,] 2 3  5 11 17
> [5,] 3 2  3  9 15
> [6,] 3 3  6 12 18
>
>
>
> Joseph G. Voelkel, Ph.D.
> Professor, Center for Quality and Applied Statistics
> Kate Gleason College of Engineering
> Rochester Institute of Technology
> V 585-475-2231
> F 585-475-5959
> joseph.voelkel at rit.edu
>