[datatable-help] datatable-help Digest, Vol 17, Issue 10
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jul 19 02:49:12 CEST 2011
.SDcols took only a few lines to add, so that's committed as a first step,
since it appears to knock out most of the terrible performance in this
case.
o A new argument .SDcols has been added to [.data.table. This
may be character column names or numeric positions, and
specifies the columns of x included in .SD. This is useful
for speed when applying a function through a subset of
(possibly very many) columns; e.g.,
DT[,lapply(.SD,sum),by="x,y",.SDcols=301:350]
Taking the nice example from Dennis and running on my little netbook:
> DT = data.table(x = rep(1:100, each = 100), y = rep(1:100, 100),
matrix(rpois(10000000, 10), nrow = 10000))
> setkey(DT,x,y)
> dim(DT)
[1] 10000 1002
> vars = paste('V', sample(1:1000, 150, replace = FALSE), sep = '')
> system.time(ans1 <- DT[, lapply(.SD[,vars,with=FALSE], sum),
by='x,y'])
user system elapsed
243.807 0.372 245.141 # awful
> system.time(ans2<-DT[,lapply(.SD,sum),by='x,y',.SDcols=vars])
user system elapsed
12.225 0.000 12.256 # 20 times faster, and the code is concise
> e = parse(text=paste("list(",
paste(paste(vars,"=sum(",vars,")",sep=""),collapse=","),")",sep=""))
> system.time(ans3<-DT[,eval(e),by='x,y'])
user system elapsed
6.368 0.000 6.382 # twice as fast again, but cumbersome
> identical(ans1,ans2)
[1] TRUE
> identical(ans1,ans3)
[1] TRUE
>
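For concreteness, here is what that parse() trick constructs, shown for a
hypothetical two-column subset (the real vars above has 150 columns):

> vars2 = c("V3","V7")
> parse(text=paste("list(",
paste(paste(vars2,"=sum(",vars2,")",sep=""),collapse=","),")",sep=""))
expression(list(V3 = sum(V3), V7 = sum(V7)))

So ans3 evaluates a plain j body, list(V3=sum(V3),V7=sum(V7)), once per
group, with no call to lapply at all.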
I'll need to take another look at vapply in base, at colwise, and at
Steve's suggestions. Basically, we need a way to apply a function over a
list of vectors efficiently, to get that last factor-of-2 speedup without
the parsing shenanigans.
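One idea to note down (a sketch only, untested): now that .SDcols limits
the columns, vapply might stand in for lapply, since the result type is
declared up front:

> # hypothetical ans4; as.list() turns the named numeric vector into j's list
> system.time(ans4 <- DT[,as.list(vapply(.SD,sum,numeric(1))),by='x,y',
.SDcols=vars])

Whether that recovers the remaining factor of 2 would need timing.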
Matthew
On Mon, 2011-07-18 at 09:00 +0100, Matthew Dowle wrote:
> On Sun, 2011-07-17 at 11:24 -0400, Steve Lianoglou wrote:
> > Hi,
> >
> > Just an additional comment about:
> >
> > On Sun, Jul 17, 2011 at 7:43 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >
> > > i) Whenever you use .SD in j, .SD will contain *all* the columns from
> > > the table, regardless of how .SD is used. That's because it's difficult
> > > for data.table to know which columns of .SD the j really uses. Where the
> > > subset appears directly in j it's pretty obvious, but where the subset of
> > > columns is held in a variable, and that variable could have the same name
> > > as a column, it all gets complicated. But there is a simple
> > > solution (I think) : we could add a new argument to data.table called
> > > '.SDcols' and you could pass the subset of columns in there; e.g.,
> > >
> > > DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]
> > >
> > > Would that be better?
> >
> > Which is that I think a solution that avoids building the
> > temporary .SD altogether would be the most advantageous for "these
> > scenarios."
> >
> > I think we're all on the same page with that, but I just wanted to
> > make that point explicit.
> >
> > The reason I say this is that only figuring out which sub-columns to
> > use to reconstruct .SD will still leave performance gains to be had;
> > we could instead forget about the tabular structure of .SD and just
> > stuff the columns into a normal list-of-things (where the things are
> > the would-be columns of .SD).
>
> I think I may have misled in the past about .SD. It is always available,
> whether j uses it or not, but it isn't really created as such. It may
> look as though it is created, in the R code, for the first group only;
> the first group is used specially to make a (usually very good) guess
> about the type of query and to optimise the remaining groups. However,
> at the top of dogroup.c there are comments explaining that .SD points to
> itself. Maybe I should write up what actually happens (at least what I
> think it has been designed to do): basically, .SD is just a symbol for
> the environment that holds the columns used. There is no extra storage
> created for it, and no extra work in populating it for each group.
>
> > DT=data.table(a=1:3,b=1:3,c=1:3,d=1:3)
> > DT
> a b c d
> [1,] 1 1 1 1
> [2,] 2 2 2 2
> [3,] 3 3 3 3
> > DT[,{print(get(".SD"));sum(b)},a]
> # j doesn't use the .SD symbol itself, but .SD is there
> b
> [1,] 1 #.SD includes just the symbols used by j: b
> b
> [1,] 2
> b
> [1,] 3
> a V1
> [1,] 1 1
> [2,] 2 2
> [3,] 3 3
> > DT[,{print(get(".SD"));sum(b*c)},a]
> b c
> [1,] 1 1 # .SD includes b and c now
> b c
> [1,] 2 2
> b c
> [1,] 3 3
> a V1
> [1,] 1 1
> [2,] 2 4
> [3,] 3 9
> > DT[,{print(get(".SD"));.SD;sum(b)},a]
> b c d
> [1,] 1 1 1 # .SD used by j; c and d included wastefully
> b c d
> [1,] 2 2 2
> b c d
> [1,] 3 3 3
> a V1
> [1,] 1 1
> [2,] 2 2
> [3,] 3 3
> >
>
> So I don't think there's a problem with .SD per se, just the two
> problems using it: i) using it in j may mean too many columns are
> included in it wastefully (.SDcols would provide a way to fix that), and
> ii) using lapply on it is slow because lapply is slow.
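>
> For example, a sketch of (ii) on the toy DT above:
>
> DT[,lapply(.SD,sum),a] # lapply called once per group
> DT[,list(b=sum(b),c=sum(c),d=sum(d)),a] # same result, plain j body, no lapply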
>
> >
> > > ii) lapply() is the base R lapply, which we know is slow. Recall that
> > > data.table is over 10 times faster than tapply because tapply calls
> > > lapply. Note also that lapply takes a function (closure) whereas
> > > data.table's j is just a body (lambda). The syntax changes for
> > > data.table weren't just for fun, you know ;) There's a FR on this :
> > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978
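> > >
> > > For example, a sketch of the contrast on the toy DT above:
> > >
> > > tapply(DT$b, DT$a, sum) # sum passed as a closure, via lapply
> > > DT[,sum(b),a] # sum(b) is just a body, evaluated directly per group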
> >
> > I like that FR -- as long as we can get around the whole .SD thing :-)
> >
> > Something like Chris's `colwise( f, var_names)` thing is what I have in mind.
> >
> > Maybe shoehorning all of this into the current `data.table.[` might be
> > too ... tough?
> >
> > What if we had a colwise like function
> >
> > colwise(my.data.table, colnames, EXPR, by, ...)
> >
> > Where everything from the by param onwards would work like the params
> > in `data.table.[`,
> > but this invocation would run EXPR over each of the
> > columns listed in `colnames` in your `my.data.table`, using the `by`
> > groupings as "we expect."
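> >
> > For example (purely hypothetical, nothing is implemented):
> >
> > colwise(DT, vars, sum, by="x,y")
> >
> > would run sum over each column named in vars, using the x,y groupings.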
> >
> > Would this be a helpful way to approach this? That way the
> > `data.table.[` function isn't overloaded with too much different
> > functionality. It might be that cramming all of these specialized
> > cases into the same function is making it too magical, is all.
> >
> > Also -- `colwise` could be `colapply` or something similar to avoid
> > trampling on the function by the same name in plyr.
> >
> > -steve
> >
> > > However, doing both (i) and (ii) just makes it syntactically easier to
> > > access the speed that is already possible by creating a (long) j
> > > expression automatically. That's how data.table knows which columns are
> > > being used (by inspecting the expression with all.vars() and subsetting
> > > only those columns), and there is no call to lapply, so that slowdown
> > > goes away. Maybe making helper functions to make that easier is another
> > > way to go.
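> > >
> > > For example, all.vars(quote(sum(b*c))) returns c("b","c"), so only
> > > those two columns need to be subsetted for each group.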
> > >
> > > Matthew
> > >
>
>