[datatable-help] Access to local variables in "j" expressions

Johann Hibschman jhibschman at gmail.com
Mon May 10 22:16:07 CEST 2010


After thinking about this some more, I realized I was making
data.table do extra work by not pre-calculating as much as possible
before the grouped call. Once I followed the approach of
"calc.fake.dt.2", below, data.table came out about 2.5 times faster
than the data.frame version.

So, yes, data.table is faster, but I have to introduce a few
temporary columns to make it work efficiently.

Here's the example code I ran:

## Try fake data experiments, to see if I can duplicate the results.
mk.fake.df <- function (n.groups=10000, n.per.group=70) {
  data.frame(grp=rep(1:n.groups, each=n.per.group),
             age=rep(0:(n.per.group-1), n.groups),
             x=rnorm(n.groups * n.per.group),
             ## These columns are unused; they exist only to give
             ## the table a similar size to the real data.
             y1=rnorm(n.groups * n.per.group),
             y2=rnorm(n.groups * n.per.group),
             y3=rnorm(n.groups * n.per.group),
             y4=rnorm(n.groups * n.per.group))
}

mk.fake.dt <- function (fake.df) {
  fake.dt <- as.data.table(fake.df)
  setkey(fake.dt, grp, age)
  fake.dt
}

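## Lagged cumulative sum: element i is the sum of x[1], ..., x[i-1],
## so the first element is always 0.
cumsum.lag <- function (x) {
  x.prev <- c(0, x[-length(x)])
  cumsum(x.prev)
}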

calc.fake.df <- function (df) {
  calc.lst <- with(df, within(list(), {
    sum   <- unlist(tapply(pmax(x, 0), grp, cumsum.lag))
    sum6  <- unlist(tapply(pmax((age <  6) * x, 0), grp, cumsum.lag))
    sum12 <- unlist(tapply(pmax((age < 12) * x, 0), grp, cumsum.lag))
    sum18 <- unlist(tapply(pmax((age < 18) * x, 0), grp, cumsum.lag))
    sum24 <- unlist(tapply(pmax((age < 24) * x, 0), grp, cumsum.lag))
    sum36 <- unlist(tapply(pmax((age < 36) * x, 0), grp, cumsum.lag))
    sum48 <- unlist(tapply(pmax((age < 48) * x, 0), grp, cumsum.lag))
    sum60 <- unlist(tapply(pmax((age < 60) * x, 0), grp, cumsum.lag))
  }))
  calc.lst
}

calc.fake.dt <- function (dt) {
  dt[, list(sum  =cumsum.lag(pmax(x, 0)),
            sum6 =cumsum.lag(pmax((age <  6) * x, 0)),
            sum12=cumsum.lag(pmax((age < 12) * x, 0)),
            sum18=cumsum.lag(pmax((age < 18) * x, 0)),
            sum24=cumsum.lag(pmax((age < 24) * x, 0)),
            sum36=cumsum.lag(pmax((age < 36) * x, 0)),
            sum48=cumsum.lag(pmax((age < 48) * x, 0)),
            sum60=cumsum.lag(pmax((age < 60) * x, 0))),
     by=grp]
}

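calc.fake.dt.2 <- function (dt) {
  ## Pre-compute the clamped columns once over the whole table, so the
  ## grouped j expression below only has to call cumsum.lag.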
  dt$tmp.0  <- pmax(dt$x, 0)
  dt$tmp.6  <- pmax((dt$age <  6) * dt$x, 0)
  dt$tmp.12 <- pmax((dt$age < 12) * dt$x, 0)
  dt$tmp.18 <- pmax((dt$age < 18) * dt$x, 0)
  dt$tmp.24 <- pmax((dt$age < 24) * dt$x, 0)
  dt$tmp.36 <- pmax((dt$age < 36) * dt$x, 0)
  dt$tmp.48 <- pmax((dt$age < 48) * dt$x, 0)
  dt$tmp.60 <- pmax((dt$age < 60) * dt$x, 0)
  dt[, list(sum  =cumsum.lag(tmp.0),
            sum6 =cumsum.lag(tmp.6),
            sum12=cumsum.lag(tmp.12),
            sum18=cumsum.lag(tmp.18),
            sum24=cumsum.lag(tmp.24),
            sum36=cumsum.lag(tmp.36),
            sum48=cumsum.lag(tmp.48),
            sum60=cumsum.lag(tmp.60)),
     by=grp]
}
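
A minimal sketch of how the three approaches above could be checked against each other and timed. It is not the set of commands originally run: the smaller table size, the restriction to the "sum6" column, and the variable names (r.df, r.dt1, r.dt2, t.df, ...) are assumptions made so the sketch is short and runs quickly. Timings will differ by machine and data.table version.

```r
library(data.table)

## Lagged cumulative sum, as in the post: element i is the sum of x[1..i-1].
cumsum.lag <- function (x) cumsum(c(0, x[-length(x)]))
stopifnot(identical(cumsum.lag(c(1, 2, 3)), c(0, 1, 3)))

## Assumed sizes, smaller than the 10000 x 70 used in the post.
n.groups <- 1000
n.per.group <- 70
set.seed(1)
df <- data.frame(grp = rep(1:n.groups, each = n.per.group),
                 age = rep(0:(n.per.group - 1), n.groups),
                 x   = rnorm(n.groups * n.per.group))
dt <- as.data.table(df)
setkey(dt, grp, age)

## 1. data.frame version: tapply over the clamped values.
t.df <- system.time(
  r.df <- with(df, unlist(tapply(pmax((age < 6) * x, 0), grp, cumsum.lag),
                          use.names = FALSE)))

## 2. data.table, clamping inside the grouped j expression.
t.dt1 <- system.time(
  r.dt1 <- dt[, list(sum6 = cumsum.lag(pmax((age < 6) * x, 0))), by = grp])

## 3. data.table, clamping once up front, then grouping.
t.dt2 <- system.time({
  dt2 <- dt
  dt2$tmp.6 <- pmax((dt2$age < 6) * dt2$x, 0)
  r.dt2 <- dt2[, list(sum6 = cumsum.lag(tmp.6)), by = grp]
})

## The rows are already ordered by grp and age, so all three results
## line up element by element and should agree.
stopifnot(isTRUE(all.equal(r.df, r.dt1$sum6)),
          isTRUE(all.equal(r.dt1$sum6, r.dt2$sum6)))

## Elapsed seconds for each variant.
c(data.frame = t.df[["elapsed"]],
  dt.inline  = t.dt1[["elapsed"]],
  dt.precalc = t.dt2[["elapsed"]])
```

The equality check matters as much as the timing: it confirms that moving pmax() out of the grouped call changes only where the work is done, not the result.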


On Mon, May 10, 2010 at 12:42 PM, Short, Tom <TShort at epri.com> wrote:
> Johann,
>
> I did some timing tests to compare tapply to data.table and couldn't find a case where tapply was close. See here for the timing code:
>
> n <- 1e7
> groupsizes <- c(100,1000,1e4,1e5,1e6)
> res <- data.frame(groupsize = groupsizes, tapply.runtime = 0, dt.runtime = 0)
> for (i in seq(along=groupsizes)) {
>    df <- data.frame(x = rnorm(n), grp = as.integer(runif(n,1,groupsizes[i])))
>    dt <- as.data.table(df)
>    res$tapply.runtime[i] <- system.time(  with(df, unlist(tapply(x, grp, cumsum)))  )[1]
>    res$dt.runtime[i]     <- system.time(  dt[,list(x=cumsum(x)), by=grp]            )[1]
> }
> res
>
> This gives these results (the runtimes are in seconds):
>
>  groupsize tapply.runtime dt.runtime
> 1     1e+02          49.42       1.89
> 2     1e+03          43.20       2.10
> 3     1e+04          45.82       2.18
> 4     1e+05          59.77       2.49
> 5     1e+06         113.01      22.23
>
> If the grouping variable is negative or has a range of more than 100000, you could get slowdowns. One workaround for this is to convert the grouping variable to a factor if you have fewer than 100000 unique IDs.
>
> - Tom
>
>
>> -----Original Message-----
>> From: Johann Hibschman [mailto:jhibschman at gmail.com]
>> Sent: Monday, May 10, 2010 11:41 AM
>> To: Short, Tom
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Access to local variables in
>> "j" expressions
>>
>> Hi Tom,
>>
>> Thanks for taking the time to look into this so promptly.
>> I've installed that version, and it fixes my original
>> problem. I'll keep testing it and see if I run into any more issues.
>>
>> I didn't get as much of a speed-up as I was hoping for the
>> time-intensive parts of my calculation, so I don't know how
>> much additional time I'll invest in it, but I'll keep
>> experimenting for a few days at least.
>>
>> (I'm comparing, in effect, unlist(tapply(x, GroupID, cumsum))
>> to dt[,list(out=cumsum(x)), by=GroupID]; they take
>> more-or-less the same time for me.)
>>
>> -Johann
>>
>>
>>
>> On Sat, May 8, 2010 at 2:17 PM, Short, Tom <TShort at epri.com> wrote:
>> > I checked in a fix for this bug on R-forge. It should be
>> > available for installation tomorrow as follows (R-forge has
>> > been a little flakey lately):
>> >
>> > install.packages("data.table",repos="http://r-forge.r-project.org")
>> >
>> > If you're able to try it, let me know if it works or causes
>> > other problems.
>> >
>> > - Tom
>> >
>> >
>> >
>> >> -----Original Message-----
>> >> From: datatable-help-bounces at lists.r-forge.r-project.org
>> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> >> On Behalf Of Short, Tom
>> >> Sent: Saturday, May 08, 2010 10:58 AM
>> >> To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
>> >> Subject: Re: [datatable-help] Access to local variables in "j"
>> >> expressions
>> >>
>> >> I've got a fix for this. It'll probably be a couple of days before I
>> >> can get it up to R-forge.
>> >>
>> >> - Tom
>> >>
>> >> > -----Original Message-----
>> >> > From: datatable-help-bounces at lists.r-forge.r-project.org
>> >> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> >> > On Behalf Of Short, Tom
>> >> > Sent: Saturday, May 08, 2010 7:35 AM
>> >> > To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
>> >> > Subject: Re: [datatable-help] Access to local variables in "j"
>> >> > expressions
>> >> >
>> >> > I think it's a bug, Johann. I'll dig deeper. Thanks for reporting it.
>> >> >
>> >> > - Tom
>> >> >
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: datatable-help-bounces at lists.r-forge.r-project.org
>> >> > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> >> > > On Behalf Of Johann Hibschman
>> >> > > Sent: Friday, May 07, 2010 4:53 PM
>> >> > > To: datatable-help at lists.r-forge.r-project.org
>> >> > > Subject: [datatable-help] Access to local variables in "j" expressions
>> >> > >
>> >> > > I'm just taking a look at data.table again, now that 1.4.1 has been
>> >> > > released. I tried the following:
>> >> > >
>> >> > >   dt.test <- data.table(n=c("a","a","b"), x=1:3, key="n")
>> >> > >
>> >> > >   global.sum7 <- function (y) {
>> >> > >    sum(y) + 7
>> >> > >   }
>> >> > >
>> >> > >   test.1 <- function (dt) {
>> >> > >    local.sum7 <- global.sum7
>> >> > >    dt[, list(out=local.sum7(x)), by=n]
>> >> > >   }
>> >> > >
>> >> > >   test.1(dt.test)
>> >> > >
>> >> > > This failed, with 'Error in eval(expr, envir, enclos) :
>> >> > > could not find function "local.sum7"'. Looking at the documentation, I
>> >> > > see:
>> >> > >
>> >> > >      The j expression 'sees' variables in the calling frame and above
>> >> > >      including .GlobalEnv, see the examples. This is base R
>> >> > >      functionality from eval() and with().
>> >> > >
>> >> > > That led me to think that the above would work. Is this a bug, or am I
>> >> > > not understanding something?
>> >> > >
>> >> > > Thanks,
>> >> > > Johann
>> >> > > _______________________________________________
>> >> > > datatable-help mailing list
>> >> > > datatable-help at lists.r-forge.r-project.org
>> >> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >> > >
