[datatable-help] Access to local variables in "j" expressions

Short, Tom TShort at epri.com
Tue May 11 20:23:24 CEST 2010


Johann, see below.

> -----Original Message-----
> From: Johann Hibschman [mailto:jhibschman at gmail.com] 
> Sent: Monday, May 10, 2010 4:16 PM
> To: Short, Tom
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Access to local variables in 
> "j" expressions
> 
> After thinking about this some, I realized I was having 
> data.table do extra work by not pre-calculating as much as 
> possible. After I followed the example of "calc.fake.dt.2", 
> below, I wound up with data.table being about 2.5 times 
> faster than the data.frame version.
> 
> So, yes, data.table is faster, but I have to introduce a few 
> more temporary columns in order to make it work efficiently.
> 
> Here's the example code I ran:
> 
> ## Try fake data experiments, to see if I can duplicate the results.
> require(data.table)  # for as.data.table and setkey, used below
> mk.fake.df <- function (n.groups=10000, n.per.group=70) {
>   data.frame(grp=rep(1:n.groups, each=n.per.group),
>              age=rep(0:(n.per.group-1), n.groups),
>              x=rnorm(n.groups * n.per.group),
>              ## These don't do anything, but only exist to give
>              ## the table a similar size to the real data.
>              y1=rnorm(n.groups * n.per.group),
>              y2=rnorm(n.groups * n.per.group),
>              y3=rnorm(n.groups * n.per.group),
>              y4=rnorm(n.groups * n.per.group)) }
> 
> mk.fake.dt <- function (fake.df) {
>   fake.dt <- as.data.table(fake.df)
>   setkey(fake.dt, grp, age)
>   fake.dt
> }
> 
> ## Cumulative sum lagged by one: element i is the sum of x[1..i-1],
> ## so the current observation is excluded.
> cumsum.lag <- function (x) {
>   x.prev <- c(0, x[-length(x)])
>   cumsum(x.prev)
> }
> 
> ## data.frame version: tapply over groups, one pass per age window
> calc.fake.df <- function (df) {
>   calc.lst <- with(df, within(list(), {
>     sum   <- unlist(tapply(pmax(x, 0), grp, cumsum.lag))
>     sum6  <- unlist(tapply(pmax((age <  6) * x, 0), grp, cumsum.lag))
>     sum12 <- unlist(tapply(pmax((age < 12) * x, 0), grp, cumsum.lag))
>     sum18 <- unlist(tapply(pmax((age < 18) * x, 0), grp, cumsum.lag))
>     sum24 <- unlist(tapply(pmax((age < 24) * x, 0), grp, cumsum.lag))
>     sum36 <- unlist(tapply(pmax((age < 36) * x, 0), grp, cumsum.lag))
>     sum48 <- unlist(tapply(pmax((age < 48) * x, 0), grp, cumsum.lag))
>     sum60 <- unlist(tapply(pmax((age < 60) * x, 0), grp, cumsum.lag))
>   }))
>   calc.lst
> }
> 
> ## data.table version: all the work happens inside j, per group
> calc.fake.dt <- function (dt) {
>   dt[, list(sum  =cumsum.lag(pmax(x, 0)),
>             sum6 =cumsum.lag(pmax((age <  6) * x, 0)),
>             sum12=cumsum.lag(pmax((age < 12) * x, 0)),
>             sum18=cumsum.lag(pmax((age < 18) * x, 0)),
>             sum24=cumsum.lag(pmax((age < 24) * x, 0)),
>             sum36=cumsum.lag(pmax((age < 36) * x, 0)),
>             sum48=cumsum.lag(pmax((age < 48) * x, 0)),
>             sum60=cumsum.lag(pmax((age < 60) * x, 0))),
>      by=grp]
> }
> 
> ## faster data.table version: pre-compute the thresholded columns
> ## once over the whole table, so j only does the per-group cumsum
> calc.fake.dt.2 <- function (dt) {
>   dt$tmp.0  <- pmax(dt$x, 0)
>   dt$tmp.6  <- pmax((dt$age <  6) * dt$x, 0)
>   dt$tmp.12 <- pmax((dt$age < 12) * dt$x, 0)
>   dt$tmp.18 <- pmax((dt$age < 18) * dt$x, 0)
>   dt$tmp.24 <- pmax((dt$age < 24) * dt$x, 0)
>   dt$tmp.36 <- pmax((dt$age < 36) * dt$x, 0)
>   dt$tmp.48 <- pmax((dt$age < 48) * dt$x, 0)
>   dt$tmp.60 <- pmax((dt$age < 60) * dt$x, 0)
>   dt[, list(sum  =cumsum.lag(tmp.0),
>             sum6 =cumsum.lag(tmp.6),
>             sum12=cumsum.lag(tmp.12),
>             sum18=cumsum.lag(tmp.18),
>             sum24=cumsum.lag(tmp.24),
>             sum36=cumsum.lag(tmp.36),
>             sum48=cumsum.lag(tmp.48),
>             sum60=cumsum.lag(tmp.60)),
>      by=grp]
> }
> 

If you do more and more work inside each group, the speed difference
between tapply and data.table decreases. One way to reduce the runtime
further is to run things in parallel; for this case, it's easy to
parallelize across the column calculations. Using Simon Urbanek's
multicore package, I get some speedup, but not much, because of the
overhead of launching the parallel jobs. Here's some code for testing:

calc.fake.dt.lapply <- function (dt) { # serial lapply, for comparison
    ## 6*1000 = 6000 exceeds every age, so the first element is the
    ## uncapped "sum"; the rest are the 6- through 60-month windows.
    lapply(6 * c(1000, 1:4, 6, 8, 10),
           function (critical.age) {
               dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
               dt[, cumsum.lag(tmp), by = grp]$V1
           })
}

require(multicore) # for mclapply
calc.fake.dt.mclapply <- function (dt) { # one forked job per column
    mclapply(6 * c(1000, 1:4, 6, 8, 10),
             function (critical.age) {
                 dt$tmp <- pmax((dt$age < critical.age) * dt$x, 0)
                 dt[, cumsum.lag(tmp), by = grp]$V1
             })
}

# set up the data
df <- mk.fake.df()
dt <- mk.fake.dt(df)

Now, here are some timings on a relatively fast workstation.


> system.time(res.df          <- calc.fake.df(df))
   user  system elapsed
  6.556   0.019   6.578
> system.time(res.dt          <- calc.fake.dt(dt))
   user  system elapsed
  5.744   0.009   5.756
> system.time(res.dt.2        <- calc.fake.dt.2(dt))
   user  system elapsed
  1.234   0.007   1.241
> system.time(res.dt.lapply   <- calc.fake.dt.lapply(dt))
   user  system elapsed
  1.111   0.025   1.137
> system.time(res.dt.mclapply <- calc.fake.dt.mclapply(dt))
   user  system elapsed
  2.751   2.746   0.828

Note the elapsed time on the mclapply run: it drops below the user time
because the column calculations are spread across cores.
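
As a quick sanity check (my addition, not from the original timings),
the result sets should agree. The first element of the lapply/mclapply
results used critical.age = 6000, which never truncates, so it matches
the plain "sum" column:

stopifnot(isTRUE(all.equal(res.df$sum,    res.dt.2$sum)))
stopifnot(isTRUE(all.equal(res.dt.2$sum,  res.dt.lapply[[1]])))
stopifnot(isTRUE(all.equal(res.dt.2$sum6, res.dt.mclapply[[2]])))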

For many other cases, you could "parallelize" by dividing up the
grouping variable. That's not hard with keyed data tables because
they're already sorted.
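
Here's a minimal sketch of that idea (my addition, untested;
calc.fake.dt.split and n.chunks are made-up names, and only the plain
"sum" column is computed to keep it short). Split the key values into
contiguous chunks, run each chunk's grouped calculation on its own
core, and stack the pieces:

calc.fake.dt.split <- function (dt, n.chunks = 4) {
    grps   <- unique(dt$grp)
    ## assign contiguous runs of groups to each chunk
    chunks <- split(grps, cut(seq_along(grps), n.chunks, labels = FALSE))
    pieces <- mclapply(chunks, function (g) {
        sub <- dt[J(g)]    # keyed join subset; stays sorted by grp, age
        sub[, list(sum = cumsum.lag(pmax(x, 0))), by = grp]
    })
    do.call(rbind, pieces) # chunks are in key order, so order is kept
}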

- Tom

