[datatable-help] Access to local variables in "j" expressions

Short, Tom TShort at epri.com
Mon May 10 18:42:46 CEST 2010


Johann, 

I ran some timing tests comparing tapply to data.table and couldn't find a case where tapply came close. Here is the timing code:

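library(data.table)

n <- 1e7
# Upper bound for the group IDs, i.e. roughly the number of distinct groups
groupsizes <- c(100, 1000, 1e4, 1e5, 1e6)
res <- data.frame(groupsize = groupsizes, tapply.runtime = 0, dt.runtime = 0)
for (i in seq_along(groupsizes)) {
    # x: values to accumulate; grp: integer group ID in 1..(groupsizes[i] - 1)
    df <- data.frame(x = rnorm(n), grp = as.integer(runif(n, 1, groupsizes[i])))
    dt <- as.data.table(df)
    # [1] keeps the user CPU time, in seconds
    res$tapply.runtime[i] <- system.time(  with(df, unlist(tapply(x, grp, cumsum)))  )[1]
    res$dt.runtime[i]     <- system.time(  dt[, list(x = cumsum(x)), by = grp]       )[1]
}
res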

This gives the following results (runtimes are in seconds; groupsize is the upper bound on the group IDs, so roughly the number of distinct groups):

  groupsize tapply.runtime dt.runtime
1     1e+02          49.42       1.89
2     1e+03          43.20       2.10
3     1e+04          45.82       2.18
4     1e+05          59.77       2.49
5     1e+06         113.01      22.23
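
To put the two runtime columns in proportion, here is a small sketch that re-enters the numbers from the table above and divides them. The column name speedup is my own addition for illustration, not part of the timing script.

```r
# Runtimes copied from the results table above (seconds).
res <- data.frame(
    groupsize      = c(1e2, 1e3, 1e4, 1e5, 1e6),
    tapply.runtime = c(49.42, 43.20, 45.82, 59.77, 113.01),
    dt.runtime     = c(1.89, 2.10, 2.18, 2.49, 22.23)
)

# How many times longer tapply took than data.table for each row.
# 'speedup' is a hypothetical column name used only in this sketch.
res$speedup <- round(res$tapply.runtime / res$dt.runtime, 1)
res
```

On these numbers data.table is roughly 20 to 26 times faster up to 1e5 groups, and about 5 times faster at 1e6 groups.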

If the grouping variable is negative or has a range of more than 100000, you could get slowdowns. One workaround, if you have fewer than 100000 unique IDs, is to convert the grouping variable to a factor.
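
A minimal sketch of that workaround, assuming made-up group IDs that are both negative and widely spread. The data and the column name grp.f are hypothetical, and whether the conversion actually helps will depend on the data.table version, so treat it as something to time rather than a guaranteed gain.

```r
# Assumed example data: integer group IDs that are negative and far apart.
set.seed(1)
n <- 1e5
df <- data.frame(x   = rnorm(n),
                 grp = sample(c(-5e6L, 0L, 5e6L, 1e9L), n, replace = TRUE))

# factor() maps the distinct IDs onto dense integer codes 1..k (here k = 4),
# so the grouping column no longer has a negative or very wide range.
df$grp.f <- factor(df$grp)   # 'grp.f' is a hypothetical column name
nlevels(df$grp.f)

# Group on the factor instead of the raw integer column.
# Guarded so the sketch still runs where data.table is not installed.
if (requireNamespace("data.table", quietly = TRUE)) {
    library(data.table)
    dt  <- as.data.table(df)
    out <- dt[, list(x = cumsum(x)), by = grp.f]
}
```

The original IDs stay available as levels(df$grp.f) if you need to map the results back.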

- Tom
 

> -----Original Message-----
> From: Johann Hibschman [mailto:jhibschman at gmail.com] 
> Sent: Monday, May 10, 2010 11:41 AM
> To: Short, Tom
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Access to local variables in 
> "j" expressions
> 
> Hi Tom,
> 
> Thanks for taking the time to look into this so promptly. 
> I've installed that version, and it fixes my original 
> problem. I'll keep testing it and see if I run into any more issues.
> 
> I didn't get as much of a speed-up as I was hoping for the 
> time-intensive parts of my calculation, so I don't know how 
> much additional time I'll invest in it, but I'll keep 
> experimenting for a few days at least.
> 
> (I'm comparing, in effect, unlist(tapply(x, GroupID, cumsum)) 
> to dt[,list(out=cumsum(x)), by=GroupID]; they take 
> more-or-less the same time for me.)
> 
> -Johann
> 
> 
> 
> On Sat, May 8, 2010 at 2:17 PM, Short, Tom <TShort at epri.com> wrote:
> > I checked in a fix for this bug on R-forge. It should be available for
> > installation tomorrow as follows (R-forge has been a little flakey lately):
> >
> > install.packages("data.table",repos="http://r-forge.r-project.org")
> >
> > If you're able to try it, let me know if it works or causes other problems.
> >
> > - Tom
> >
> >
> >
> >> -----Original Message-----
> >> From: datatable-help-bounces at lists.r-forge.r-project.org
> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> On Behalf Of Short, Tom
> >> Sent: Saturday, May 08, 2010 10:58 AM
> >> To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
> >> Subject: Re: [datatable-help] Access to local variables in "j" 
> >> expressions
> >>
> >> I've got a fix for this. It'll probably be a couple of days before I
> >> can get it up to R-forge.
> >>
> >> - Tom
> >>
> >> > -----Original Message-----
> >> > From: datatable-help-bounces at lists.r-forge.r-project.org
> >> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> > On Behalf Of Short, Tom
> >> > Sent: Saturday, May 08, 2010 7:35 AM
> >> > To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
> >> > Subject: Re: [datatable-help] Access to local variables in "j"
> >> > expressions
> >> >
> >> > I think it's a bug, Johann. I'll dig deeper. Thanks for reporting it.
> >> >
> >> > - Tom
> >> >
> >> >
> >> > > -----Original Message-----
> >> > > From: datatable-help-bounces at lists.r-forge.r-project.org
> >> > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> > > On Behalf Of Johann Hibschman
> >> > > Sent: Friday, May 07, 2010 4:53 PM
> >> > > To: datatable-help at lists.r-forge.r-project.org
> >> > > Subject: [datatable-help] Access to local variables in "j"
> >> > > expressions
> >> > >
> >> > > I'm just taking a look at data.table again, now that 1.4.1 has been
> >> > > released. I tried the following:
> >> > >
> >> > >   dt.test <- data.table(n=c("a","a","b"), x=1:3, key="n")
> >> > >
> >> > >   global.sum7 <- function (y) {
> >> > >    sum(y) + 7
> >> > >   }
> >> > >
> >> > >   test.1 <- function (dt) {
> >> > >    local.sum7 <- global.sum7
> >> > >    dt[, list(out=local.sum7(x)), by=n]
> >> > >   }
> >> > >
> >> > >   test.1(dt.test)
> >> > >
> >> > > This failed, with 'Error in eval(expr, envir, enclos) :
> >> > > could not find function "local.sum7"'. Looking at the documentation, I
> >> > > see:
> >> > >
> >> > >      The j expression 'sees' variables in the calling frame and above
> >> > >      including .GlobalEnv, see the examples. This is base R
> >> > >      functionality from eval() and with().
> >> > >
> >> > > That led me to think that the above would work. Is this a bug, or am I
> >> > > not understanding something?
> >> > >
> >> > > Thanks,
> >> > > Johann
> >> > > _______________________________________________
> >> > > datatable-help mailing list
> >> > > datatable-help at lists.r-forge.r-project.org
> >> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> > >
> >
> 
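
For reference, the scoping behaviour described in the documentation excerpt quoted above can be sketched in base R alone. This is only an illustration of eval() with an enclosing frame, not data.table's actual implementation: eval_j is a hypothetical helper, a plain data.frame stands in for the data.table, and no grouping is done.

```r
# Hypothetical helper: evaluate an unevaluated "j" expression so that names
# are looked up first in the columns of 'data', then in the caller's frame
# and its parents (ending at .GlobalEnv and the search path).
eval_j <- function(data, j, enclos = parent.frame()) {
    force(enclos)   # fix the calling frame before evaluating anything else
    eval(substitute(j), envir = data, enclos = enclos)
}

global.sum7 <- function(y) {
    sum(y) + 7
}

test.1 <- function(df) {
    # local.sum7 exists only inside test.1's frame
    local.sum7 <- global.sum7
    # x comes from the columns of df, local.sum7 from the calling frame
    eval_j(df, local.sum7(x))
}

df.test <- data.frame(n = c("a", "a", "b"), x = 1:3)
test.1(df.test)   # sum(1:3) + 7
```

The point of passing enclos explicitly is that the expression can see both the columns and anything defined in the function that made the call, which is what the example in the original report relies on.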

