[datatable-help] datatable-help Digest, Vol 17, Issue 10
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jul 19 02:49:12 CEST 2011
.SDcols took only a few lines to add, so that's committed as a first step,
since it appears to knock out most of the terrible performance in this
case.
o A new argument .SDcols has been added to [.data.table. This
may be character column names or numeric positions, and
specifies the columns of x included in .SD. This is useful
for speed when applying a function through a subset of
(possibly very many) columns; e.g.,
DT[,lapply(.SD,sum),by="x,y",.SDcols=301:350]
Taking the nice example from Dennis and running on my little netbook:
> DT = data.table(x = rep(1:100, each = 100), y = rep(1:100, 100),
matrix(rpois(10000000, 10), nrow = 10000))
> setkey(DT,x,y)
> dim(DT)
[1] 10000 1002
> vars = paste('V', sample(1:1000, 150, replace = FALSE), sep = '')
> system.time(ans1 <- DT[, lapply(.SD[,vars,with=FALSE], sum),
by='x,y'])
user system elapsed
243.807 0.372 245.141 # awful
> system.time(ans2<-DT[,lapply(.SD,sum),by='x,y',.SDcols=vars])
user system elapsed
12.225 0.000 12.256 # 20 times faster, and the code is concise
> e = parse(text=paste("list(",
paste(paste(vars,"=sum(",vars,")",sep=""),collapse=","),")",sep=""))
> system.time(ans3<-DT[,eval(e),by='x,y'])
user system elapsed
6.368 0.000 6.382 # twice as fast again, but cumbersome
> identical(ans1,ans2)
[1] TRUE
> identical(ans1,ans3)
[1] TRUE
>
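For concreteness, here is what that parse() trick constructs, shown for a
hypothetical two-column subset (the real vars above has 150 columns):

> vars2 = c("V3","V7")
> parse(text=paste("list(",
paste(paste(vars2,"=sum(",vars2,")",sep=""),collapse=","),")",sep=""))
expression(list(V3 = sum(V3), V7 = sum(V7)))

So ans3 evaluates a plain j body, list(V3=sum(V3),V7=sum(V7)), once per
group, with no call to lapply at all.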
I'll need to take another look at vapply in base, at colwise, and at
Steve's suggestions. Basically, we need a way to apply a function over a
list of vectors efficiently, to get that last factor-of-2 speedup without
the parsing shenanigans.
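One idea to note down (a sketch only, untested): now that .SDcols limits
the columns, vapply might stand in for lapply, since the result type is
declared up front:

> # hypothetical ans4; as.list() turns the named numeric vector into j's list
> system.time(ans4 <- DT[,as.list(vapply(.SD,sum,numeric(1))),by='x,y',
.SDcols=vars])

Whether that recovers the remaining factor of 2 would need timing.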
Matthew
On Mon, 2011-07-18 at 09:00 +0100, Matthew Dowle wrote:
> On Sun, 2011-07-17 at 11:24 -0400, Steve Lianoglou wrote:
> > Hi,
> >
> > Just an additional comment about:
> >
> > On Sun, Jul 17, 2011 at 7:43 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >
> > > i) Whenever you use .SD in j, .SD will contain *all* the columns from
> > > the table, regardless of how .SD is used. That's because it's difficult
> > > for data.table to know which columns of .SD the j really uses. Where the
> > > subset appears directly in j it's pretty obvious, but where the subset of
> > > columns is held in a variable, and that variable could have the same name
> > > as a column, it all gets complicated. But there is a simple
> > > solution (I think) : we could add a new argument to data.table called
> > > '.SDcols' and you could pass the subset of columns in there; e.g.,
> > >
> > > DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]
> > >
> > > Would that be better?
> >
> > Which is that I think a solution that avoids building the
> > temporary .SD altogether would be the most advantageous for "these
> > scenarios."
> >
> > I think we're all on the same page with that, but I just wanted to
> > make that point explicit.
> >
> > The reason I say this is that only figuring out which sub-columns to
> > use to reconstruct .SD will still leave performance gains to be had;
> > we could instead forget about the tabular structure of .SD and just
> > stuff the columns into a normal list-of-things (where the things are
> > the would-be columns of .SD).
>
> I think I may have misled in the past about .SD. It is always available,
> whether j uses it or not, but it isn't really created as such. It may
> look as though it is created, in the R code, for the first group only;
> the first group is used specially to make a (usually very good) guess
> about the type of query and to optimise the remaining groups. However,
> at the top of dogroup.c there are comments explaining that .SD points to
> itself. Maybe I should write up what actually happens (at least what I
> think it has been designed to do): basically, .SD is just a symbol for
> the environment that holds the columns used. There is no extra storage
> created for it, and no extra work in populating it for each group.
>
> > DT=data.table(a=1:3,b=1:3,c=1:3,d=1:3)
> > DT
> a b c d
> [1,] 1 1 1 1
> [2,] 2 2 2 2
> [3,] 3 3 3 3
> > DT[,{print(get(".SD"));sum(b)},a]
> # j doesn't use the .SD symbol itself, but .SD is there
> b
> [1,] 1 #.SD includes just the symbols used by j: b
> b
> [1,] 2
> b
> [1,] 3
> a V1
> [1,] 1 1
> [2,] 2 2
> [3,] 3 3
> > DT[,{print(get(".SD"));sum(b*c)},a]
> b c
> [1,] 1 1 # .SD includes b and c now
> b c
> [1,] 2 2
> b c
> [1,] 3 3
> a V1
> [1,] 1 1
> [2,] 2 4
> [3,] 3 9
> > DT[,{print(get(".SD"));.SD;sum(b)},a]
> b c d
> [1,] 1 1 1 # .SD used by j; c and d included wastefully
> b c d
> [1,] 2 2 2
> b c d
> [1,] 3 3 3
> a V1
> [1,] 1 1
> [2,] 2 2
> [3,] 3 3
> >
>
> So I don't think there's a problem with .SD per se, just the two
> problems using it: i) using it in j may mean too many columns are
> included in it wastefully (.SDcols would provide a way to fix that), and
> ii) using lapply on it is slow because lapply is slow.
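>
> For example, a sketch of (ii) on the toy DT above:
>
> DT[,lapply(.SD,sum),a] # lapply called once per group
> DT[,list(b=sum(b),c=sum(c),d=sum(d)),a] # same result, plain j body, no lapply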
>
> >
> > > ii) lapply() is the base R lapply, which we know is slow. Recall that
> > > data.table is over 10 times faster than tapply because tapply calls
> > > lapply. Note also that lapply takes a function (closure) whereas
> > > data.table's j is just a body (lambda). The syntax changes for
> > > data.table weren't just for fun, you know ;) There's a FR on this :
> > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978
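> > >
> > > For example, a sketch of the contrast on the toy DT above:
> > >
> > > tapply(DT$b, DT$a, sum) # sum passed as a closure, via lapply
> > > DT[,sum(b),a] # sum(b) is just a body, evaluated directly per group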
> >
> > I like that FR -- as long as we can get around the whole .SD thing :-)
> >
> > Something like Chris's `colwise( f, var_names)` thing is what I have in mind.
> >
> > Maybe shoehorning all of this into the current `data.table.[` might be
> > too ... tough?
> >
> > What if we had a colwise like function
> >
> > colwise(my.data.table, colnames, EXPR, by, ...)
> >
> > Where everything from the by param onwards would work like the params
> > in `data.table.[`,
> > but this invocation would run EXPR over each of the
> > columns listed in `colnames` in your `my.data.table`, using the `by`
> > groupings as "we expect."
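> >
> > For example (purely hypothetical, nothing is implemented):
> >
> > colwise(DT, vars, sum, by="x,y")
> >
> > would run sum over each column named in vars, using the x,y groupings.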
> >
> > Would this be a helpful way to approach this? That way the
> > `data.table.[` function isn't overloaded with too much different
> > functionality. It might be that cramming all of these specialized
> > cases into the same function is making it too magical, is all.
> >
> > Also -- `colwise` could be `colapply` or something similar to avoid
> > trampling on the function by the same name in plyr.
> >
> > -steve
> >
> > > However, doing both (i) and (ii) just makes it syntactically easier to
> > > access the speed that is already possible by creating a (long) j
> > > expression automatically. That's how data.table knows which columns are
> > > being used (by inspecting the expression with all.vars() and subsetting
> > > only those columns), and there is no call to lapply, so that slowdown
> > > goes away. Maybe making helper functions to make that easier is another
> > > way to go.
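> > >
> > > For example, all.vars(quote(sum(b*c))) returns c("b","c"), so only
> > > those two columns need to be subsetted for each group.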
> > >
> > > Matthew
> > >
>
>