[datatable-help] Access to local variables in "j" expressions
Johann Hibschman
jhibschman at gmail.com
Mon May 10 22:16:07 CEST 2010
After thinking about this some more, I realized I was making data.table
do extra work by not pre-calculating as much as possible. Once I
followed the approach in "calc.fake.dt.2", below, data.table came out
about 2.5 times faster than the data.frame version.
So, yes, data.table is faster, but I have to introduce a few more
temporary columns to make it work efficiently.
Here's the example code I ran:
## Try fake data experiments, to see if I can duplicate the results.
library(data.table)

mk.fake.df <- function (n.groups=10000, n.per.group=70) {
    data.frame(grp=rep(1:n.groups, each=n.per.group),
               age=rep(0:(n.per.group-1), n.groups),
               x=rnorm(n.groups * n.per.group),
               ## These don't do anything, but only exist to give
               ## the table a similar size to the real data.
               y1=rnorm(n.groups * n.per.group),
               y2=rnorm(n.groups * n.per.group),
               y3=rnorm(n.groups * n.per.group),
               y4=rnorm(n.groups * n.per.group))
}

mk.fake.dt <- function (fake.df) {
    fake.dt <- as.data.table(fake.df)
    setkey(fake.dt, grp, age)
    fake.dt
}

## Lagged cumulative sum: element i is the sum of x[1..(i-1)].
cumsum.lag <- function (x) {
    x.prev <- c(0, x[-length(x)])
    cumsum(x.prev)
}

calc.fake.df <- function (df) {
    calc.lst <- with(df, within(list(), {
        sum   <- unlist(tapply(pmax(x, 0), grp, cumsum.lag))
        sum6  <- unlist(tapply(pmax((age < 6) * x, 0), grp, cumsum.lag))
        sum12 <- unlist(tapply(pmax((age < 12) * x, 0), grp, cumsum.lag))
        sum18 <- unlist(tapply(pmax((age < 18) * x, 0), grp, cumsum.lag))
        sum24 <- unlist(tapply(pmax((age < 24) * x, 0), grp, cumsum.lag))
        sum36 <- unlist(tapply(pmax((age < 36) * x, 0), grp, cumsum.lag))
        sum48 <- unlist(tapply(pmax((age < 48) * x, 0), grp, cumsum.lag))
        sum60 <- unlist(tapply(pmax((age < 60) * x, 0), grp, cumsum.lag))
    }))
    calc.lst
}

calc.fake.dt <- function (dt) {
    dt[, list(sum  =cumsum.lag(pmax(x, 0)),
              sum6 =cumsum.lag(pmax((age < 6) * x, 0)),
              sum12=cumsum.lag(pmax((age < 12) * x, 0)),
              sum18=cumsum.lag(pmax((age < 18) * x, 0)),
              sum24=cumsum.lag(pmax((age < 24) * x, 0)),
              sum36=cumsum.lag(pmax((age < 36) * x, 0)),
              sum48=cumsum.lag(pmax((age < 48) * x, 0)),
              sum60=cumsum.lag(pmax((age < 60) * x, 0))),
       by=grp]
}

calc.fake.dt.2 <- function (dt) {
    ## Pre-calculate the vectorised pieces once, over the whole table,
    ## so that only cumsum.lag runs per group.
    dt$tmp.0  <- pmax(dt$x, 0)
    dt$tmp.6  <- pmax((dt$age < 6) * dt$x, 0)
    dt$tmp.12 <- pmax((dt$age < 12) * dt$x, 0)
    dt$tmp.18 <- pmax((dt$age < 18) * dt$x, 0)
    dt$tmp.24 <- pmax((dt$age < 24) * dt$x, 0)
    dt$tmp.36 <- pmax((dt$age < 36) * dt$x, 0)
    dt$tmp.48 <- pmax((dt$age < 48) * dt$x, 0)
    dt$tmp.60 <- pmax((dt$age < 60) * dt$x, 0)
    dt[, list(sum  =cumsum.lag(tmp.0),
              sum6 =cumsum.lag(tmp.6),
              sum12=cumsum.lag(tmp.12),
              sum18=cumsum.lag(tmp.18),
              sum24=cumsum.lag(tmp.24),
              sum36=cumsum.lag(tmp.36),
              sum48=cumsum.lag(tmp.48),
              sum60=cumsum.lag(tmp.60)),
       by=grp]
}
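
A smaller, self-contained sketch of the same pre-calculation idea follows. It assumes a current data.table release in which columns can be added by reference with `:=` (an operator that may postdate the version discussed in this thread); the table sizes, the seed, and the names `inline`, `precomp`, and `same.result` are illustrative only. It checks that the inline and pre-calculated forms agree, and makes no claim about timings.

```r
library(data.table)

# Lagged cumulative sum: element i is the sum of x[1..(i-1)].
cumsum.lag <- function (x) {
    cumsum(c(0, x[-length(x)]))
}

# Small fake table; the sizes are arbitrary and chosen for illustration.
set.seed(1)
n.groups <- 200
n.per.group <- 70
dt <- data.table(grp = rep(1:n.groups, each = n.per.group),
                 age = rep(0:(n.per.group - 1), n.groups),
                 x   = rnorm(n.groups * n.per.group))
setkey(dt, grp, age)

# Inline form: pmax() and the age mask are evaluated once per group.
inline <- dt[, list(sum   = cumsum.lag(pmax(x, 0)),
                    sum12 = cumsum.lag(pmax((age < 12) * x, 0))),
             by = grp]

# Pre-calculated form: the vectorised work is done once over the whole
# table (assumes `:=` is available), leaving only cumsum.lag per group.
dt2 <- copy(dt)
dt2[, tmp.0  := pmax(x, 0)]
dt2[, tmp.12 := pmax((age < 12) * x, 0)]
precomp <- dt2[, list(sum   = cumsum.lag(tmp.0),
                      sum12 = cumsum.lag(tmp.12)),
               by = grp]

# Both forms should produce the same table.
same.result <- isTRUE(all.equal(inline, precomp))
```

Wrapping either grouped call in system.time() on a larger table would be one way to see whether the pre-calculated form pays off for a given data set.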
On Mon, May 10, 2010 at 12:42 PM, Short, Tom <TShort at epri.com> wrote:
> Johann,
>
> I did some timing tests to compare tapply to data.table and couldn't find a case where tapply was close. See here for the timing code:
>
> n <- 1e7
> groupsizes <- c(100,1000,1e4,1e5,1e6)
> res <- data.frame(groupsize = groupsizes, tapply.runtime = 0, dt.runtime = 0)
> for (i in seq(along=groupsizes)) {
>     df <- data.frame(x = rnorm(n), grp = as.integer(runif(n,1,groupsizes[i])))
>     dt <- as.data.table(df)
>     res$tapply.runtime[i] <- system.time( with(df, unlist(tapply(x, grp, cumsum))) )[1]
>     res$dt.runtime[i] <- system.time( dt[,list(x=cumsum(x)), by=grp] )[1]
> }
> res
>
> This gives these results (the runtimes are in seconds):
>
>   groupsize tapply.runtime dt.runtime
> 1     1e+02          49.42       1.89
> 2     1e+03          43.20       2.10
> 3     1e+04          45.82       2.18
> 4     1e+05          59.77       2.49
> 5     1e+06         113.01      22.23
>
> If the grouping variable is negative or has a range of more than 100000, you could get slowdowns. One workaround for this is to convert the grouping variable to a factor if you have fewer than 100000 unique IDs.
>
> - Tom
>
>
>> -----Original Message-----
>> From: Johann Hibschman [mailto:jhibschman at gmail.com]
>> Sent: Monday, May 10, 2010 11:41 AM
>> To: Short, Tom
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Access to local variables in
>> "j" expressions
>>
>> Hi Tom,
>>
>> Thanks for taking the time to look into this so promptly.
>> I've installed that version, and it fixes my original
>> problem. I'll keep testing it and see if I run into any more issues.
>>
>> I didn't get as much of a speed-up as I was hoping for the
>> time-intensive parts of my calculation, so I don't know how
>> much additional time I'll invest in it, but I'll keep
>> experimenting for a few days at least.
>>
>> (I'm comparing, in effect, unlist(tapply(x, GroupID, cumsum))
>> to dt[,list(out=cumsum(x)), by=GroupID]; they take
>> more-or-less the same time for me.)
>>
>> -Johann
>>
>>
>>
>> On Sat, May 8, 2010 at 2:17 PM, Short, Tom <TShort at epri.com> wrote:
>> > I checked in a fix for this bug on R-forge. It should be
>> > available for installation tomorrow as follows (R-forge has
>> > been a little flaky lately):
>> >
>> > install.packages("data.table",repos="http://r-forge.r-project.org")
>> >
>> > If you're able to try it, let me know if it works or causes
>> > other problems.
>> >
>> > - Tom
>> >
>> >
>> >
>> >> -----Original Message-----
>> >> From: datatable-help-bounces at lists.r-forge.r-project.org
>> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> >> On Behalf Of Short, Tom
>> >> Sent: Saturday, May 08, 2010 10:58 AM
>> >> To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
>> >> Subject: Re: [datatable-help] Access to local variables in "j"
>> >> expressions
>> >>
>> >> I've got a fix for this. It'll probably be a couple of days before I
>> >> can get it up to R-forge.
>> >>
>> >> - Tom
>> >>
>> >> > -----Original Message-----
>> >> > From: datatable-help-bounces at lists.r-forge.r-project.org
>> >> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> >> > On Behalf Of Short, Tom
>> >> > Sent: Saturday, May 08, 2010 7:35 AM
>> >> > To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
>> >> > Subject: Re: [datatable-help] Access to local variables in "j"
>> >> > expressions
>> >> >
>> >> > I think it's a bug, Johann. I'll dig deeper. Thanks for
>> >> > reporting it.
>> >> >
>> >> > - Tom
>> >> >
>> >> >
>> >> > > -----Original Message-----
>> >> > > From: datatable-help-bounces at lists.r-forge.r-project.org
>> >> > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> >> > > On Behalf Of Johann Hibschman
>> >> > > Sent: Friday, May 07, 2010 4:53 PM
>> >> > > To: datatable-help at lists.r-forge.r-project.org
>> >> > > Subject: [datatable-help] Access to local variables in "j"
>> >> > > expressions
>> >> > >
>> >> > > I'm just taking a look at data.table again, now that 1.4.1 has been
>> >> > > released. I tried the following:
>> >> > >
>> >> > > dt.test <- data.table(n=c("a","a","b"), x=1:3, key="n")
>> >> > >
>> >> > > global.sum7 <- function (y) {
>> >> > >     sum(y) + 7
>> >> > > }
>> >> > >
>> >> > > test.1 <- function (dt) {
>> >> > >     local.sum7 <- global.sum7
>> >> > >     dt[, list(out=local.sum7(x)), by=n]
>> >> > > }
>> >> > >
>> >> > > test.1(dt.test)
>> >> > >
>> >> > > This failed, with 'Error in eval(expr, envir, enclos) :
>> >> > > could not find function "local.sum7"'. Looking at the
>> >> > > documentation, I see:
>> >> > >
>> >> > >     The j expression 'sees' variables in the calling frame and above
>> >> > >     including .GlobalEnv, see the examples. This is base R
>> >> > >     functionality from eval() and with().
>> >> > >
>> >> > > That led me to think that the above would work. Is this a bug, or am I
>> >> > > not understanding something?
>> >> > >
>> >> > > Thanks,
>> >> > > Johann
>> >> > > _______________________________________________
>> >> > > datatable-help mailing list
>> >> > > datatable-help at lists.r-forge.r-project.org
>> >> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >> > >
>> >>
>> >
>>
>
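
For reference, the scoping behaviour the thread started with can be checked with a self-contained version of the original example. This is a sketch assuming a data.table release that includes the fix mentioned above, so that the "j" expression can see variables local to the calling function; the name `res` is introduced here only to hold the result.

```r
library(data.table)

dt.test <- data.table(n = c("a", "a", "b"), x = 1:3, key = "n")

global.sum7 <- function (y) {
    sum(y) + 7
}

test.1 <- function (dt) {
    # local.sum7 exists only in this function's frame; the "j" expression
    # has to look it up in the calling frame for the call to succeed.
    local.sum7 <- global.sum7
    dt[, list(out = local.sum7(x)), by = n]
}

# Group "a" gives 1 + 2 + 7 and group "b" gives 3 + 7.
res <- test.1(dt.test)
```

If the lookup fails, the call stops with the 'could not find function "local.sum7"' error quoted in the original report rather than returning a table.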