[datatable-help] Access to local variables in "j" expressions
Short, Tom
TShort at epri.com
Mon May 10 18:42:46 CEST 2010
Johann,
I did some timing tests to compare tapply to data.table and couldn't find a case where tapply was close. Here is the timing code:
library(data.table)

n <- 1e7
# Upper bound of the integer group ids, i.e. roughly the number of groups
groupsizes <- c(100,1000,1e4,1e5,1e6)
res <- data.frame(groupsize = groupsizes, tapply.runtime = 0, dt.runtime = 0)
for (i in seq(along=groupsizes)) {
    df <- data.frame(x = rnorm(n), grp = as.integer(runif(n,1,groupsizes[i])))
    dt <- as.data.table(df)
    # [1] is the user CPU time reported by system.time()
    res$tapply.runtime[i] <- system.time( with(df, unlist(tapply(x, grp, cumsum))) )[1]
    res$dt.runtime[i] <- system.time( dt[,list(x=cumsum(x)), by=grp] )[1]
}
res
This gives these results (the runtimes are in seconds):
  groupsize tapply.runtime dt.runtime
1     1e+02          49.42       1.89
2     1e+03          43.20       2.10
3     1e+04          45.82       2.18
4     1e+05          59.77       2.49
5     1e+06         113.01      22.23
If the grouping variable is negative or has a range of more than 100000, you could get slowdowns. One workaround for this is to convert the grouping variable to a factor if you have fewer than 100000 unique IDs.
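A minimal sketch of that workaround, assuming a small made-up table: the ids, the seed, and the column name grp.f are hypothetical and chosen only for illustration. It shows the recoding step only; whether it changes the runtime will depend on the data.table version and the data.

```r
library(data.table)

# Hypothetical data: integer ids that are partly negative and span a wide range
set.seed(1)
ids <- sample(c(-5000000L, -3L, 42L, 9000000L), 20, replace = TRUE)
dt <- data.table(x = rnorm(20), grp = ids)

# Recode the ids as a factor; its integer codes run from 1 to the
# number of unique ids, regardless of the original id values
dt$grp.f <- factor(dt$grp)

# Group on the factor column instead of the raw ids
res <- dt[, list(x = cumsum(x)), by = grp.f]
```

The original ids remain available as the factor levels, so they can be recovered with as.integer(as.character(res$grp.f)) if needed.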
- Tom
> -----Original Message-----
> From: Johann Hibschman [mailto:jhibschman at gmail.com]
> Sent: Monday, May 10, 2010 11:41 AM
> To: Short, Tom
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Access to local variables in
> "j" expressions
>
> Hi Tom,
>
> Thanks for taking the time to look into this so promptly.
> I've installed that version, and it fixes my original
> problem. I'll keep testing it and see if I run into any more issues.
>
> I didn't get as much of a speed-up as I was hoping for the
> time-intensive parts of my calculation, so I don't know how
> much additional time I'll invest in it, but I'll keep
> experimenting for a few days at least.
>
> (I'm comparing, in effect, unlist(tapply(x, GroupID, cumsum))
> to dt[,list(out=cumsum(x)), by=GroupID]; they take
> more-or-less the same time for me.)
>
> -Johann
>
>
>
> On Sat, May 8, 2010 at 2:17 PM, Short, Tom <TShort at epri.com> wrote:
> > I checked in a fix for this bug on R-forge. It should be
> > available for installation tomorrow as follows (R-forge has
> > been a little flakey lately):
> >
> > install.packages("data.table",repos="http://r-forge.r-project.org")
> >
> > If you're able to try it, let me know if it works or causes
> > other problems.
> >
> > - Tom
> >
> >
> >
> >> -----Original Message-----
> >> From: datatable-help-bounces at lists.r-forge.r-project.org
> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> On Behalf Of Short, Tom
> >> Sent: Saturday, May 08, 2010 10:58 AM
> >> To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
> >> Subject: Re: [datatable-help] Access to local variables in "j"
> >> expressions
> >>
> >> I've got a fix for this. It'll probably be a couple of days before I
> >> can get it up to R-forge.
> >>
> >> - Tom
> >>
> >> > -----Original Message-----
> >> > From: datatable-help-bounces at lists.r-forge.r-project.org
> >> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> > On Behalf Of Short, Tom
> >> > Sent: Saturday, May 08, 2010 7:35 AM
> >> > To: Johann Hibschman; datatable-help at lists.r-forge.r-project.org
> >> > Subject: Re: [datatable-help] Access to local variables in "j"
> >> > expressions
> >> >
> >> > I think it's a bug, Johann. I'll dig deeper. Thanks for
> >> > reporting it.
> >> >
> >> > - Tom
> >> >
> >> >
> >> > > -----Original Message-----
> >> > > From: datatable-help-bounces at lists.r-forge.r-project.org
> >> > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> > > On Behalf Of Johann Hibschman
> >> > > Sent: Friday, May 07, 2010 4:53 PM
> >> > > To: datatable-help at lists.r-forge.r-project.org
> >> > > Subject: [datatable-help] Access to local variables in "j"
> >> > > expressions
> >> > >
> >> > > I'm just taking a look at data.table again, now that 1.4.1 has been
> >> > > released. I tried the following:
> >> > >
> >> > > dt.test <- data.table(n=c("a","a","b"), x=1:3, key="n")
> >> > >
> >> > > global.sum7 <- function (y) {
> >> > >     sum(y) + 7
> >> > > }
> >> > >
> >> > > test.1 <- function (dt) {
> >> > >     local.sum7 <- global.sum7
> >> > >     dt[, list(out=local.sum7(x)), by=n]
> >> > > }
> >> > >
> >> > > test.1(dt.test)
> >> > >
> >> > > This failed, with 'Error in eval(expr, envir, enclos) :
> >> > > could not find function "local.sum7"'. Looking at the documentation, I
> >> > > see:
> >> > >
> >> > > The j expression 'sees' variables in the calling frame and above
> >> > > including .GlobalEnv, see the examples. This is base R
> >> > > functionality from eval() and with().
> >> > >
> >> > > That led me to think that the above would work. Is this a bug, or am I
> >> > > not understanding something?
> >> > >
> >> > > Thanks,
> >> > > Johann
> >> > > _______________________________________________
> >> > > datatable-help mailing list
> >> > > datatable-help at lists.r-forge.r-project.org
> >> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> > >