[datatable-help] datatable-help Digest, Vol 17, Issue 10

Matthew Dowle mdowle at mdowle.plus.com
Mon Jul 18 10:00:53 CEST 2011


On Sun, 2011-07-17 at 11:24 -0400, Steve Lianoglou wrote:
> Hi,
> 
> Just an additional comment about:
> 
> On Sun, Jul 17, 2011 at 7:43 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> 
> > i) Whenever you use .SD in j, .SD will contain *all* the columns from
> > the table, regardless of how .SD is used. That's because it's difficult
> > for data.table to know which columns of .SD the j really uses. Where the
> > subset appears directly in j it's pretty obvious, but where the subset of
> > columns is held in a variable, and that variable could have the same name
> > as a column, it all gets complicated.    But, there is a simple
> > solution (I think) : we could add a new argument to data.table called
> > '.SDcols' and you could pass the subset of columns in there; e.g.,
> >
> >     DT[,lapply(.SD,sum),by="x,y",.SDcols=names(DT)[40:50]]
> >
> > Would that be better?
> 
> Which is: I think a solution that avoids building the temporary .SD
> altogether would be the most advantageous for "these scenarios."
> 
> I think we're all on the same page with that, but I just wanted to
> make that point explicit.
> 
> The reason I say this is that only figuring out which sub-columns to
> use to reconstruct .SD will still leave performance gains to be had;
> we could instead forget about the tabular structure of .SD and just
> stuff the columns into a normal list-of-things (where the things are
> the would-be columns of .SD).

I think I may have misled in the past about .SD. It is always
available, whether j uses it or not, but it isn't really created at
all. It may look as though it is created in the R code, but that
happens for the first group only. The first group is used specially to
make a (usually very good) guess about the type of query, and to
optimise the remaining groups. However, at the top of dogroup.c there
are comments that .SD points to itself. Maybe I should write up what
actually happens (at least what I think it's been designed to do): .SD
is basically just a symbol for the environment that holds the columns
used. There is no extra storage created for it, and no extra work in
populating it for each group.

> DT=data.table(a=1:3,b=1:3,c=1:3,d=1:3)
> DT
     a b c d
[1,] 1 1 1 1
[2,] 2 2 2 2
[3,] 3 3 3 3
> DT[,{print(get(".SD"));sum(b)},a]
  # j doesn't use .SD symbol but .SD is there
     b   
[1,] 1   #.SD includes just the symbols used by j: b
     b
[1,] 2
     b
[1,] 3
     a V1
[1,] 1  1
[2,] 2  2
[3,] 3  3
> DT[,{print(get(".SD"));sum(b*c)},a]
     b c
[1,] 1 1  # .SD includes b and c now
     b c
[1,] 2 2
     b c
[1,] 3 3
     a V1
[1,] 1  1
[2,] 2  4
[3,] 3  9
> DT[,{print(get(".SD"));.SD;sum(b)},a]
     b c d
[1,] 1 1 1  # .SD used by j; c and d included wastefully
     b c d
[1,] 2 2 2
     b c d
[1,] 3 3 3
     a V1
[1,] 1  1
[2,] 2  2
[3,] 3  3
> 

So I don't think there's a problem with .SD per se, just two
problems in using it: i) using it in j may mean too many columns are
included in it wastefully (.SDcols would provide a way to fix that), and
ii) using lapply on it is slow because lapply is slow.
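To make (i) concrete, here is a small sketch of how the proposed .SDcols
argument could look in use (purely illustrative, assuming it were
implemented as described above; the argument doesn't exist yet at the
time of writing):

```r
library(data.table)
DT <- data.table(x = rep(1:2, each = 3), v1 = 1:6, v2 = 6:1)

# Restrict .SD to just v1 and v2, so the grouping step never has to
# copy x (or any other unused column) into each group's .SD:
DT[, lapply(.SD, sum), by = "x", .SDcols = c("v1", "v2")]
```

This keeps the existing `[` syntax and only adds one argument, which is
the appeal of the proposal.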

> 
> > ii) lapply() is the base R lapply, which we know is slow. Recall that
> > data.table is over 10 times faster than tapply because tapply calls
> > lapply. Note also that lapply takes a function (closure) whereas
> > data.table's j is just a body (lambda). The syntax changes for
> > data.table weren't just for fun, you know ;)  There's a FR on this :
> > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1303&group_id=240&atid=978
> 
> I like that FR -- as long as we can get around the whole .SD thing :-)
> 
> Something like Chris's `colwise( f, var_names)` thing is what I have in mind.
> 
> Maybe shoehorning all of this into the current `data.table.[` might be
> too ... tough?
> 
> What if we had a colwise like function
> 
> colwise(my.data.table, colnames, EXPR, by, ...)
> 
> Where everything from the by param onwards would work like the params
> in `data.table.[`, 
> but this invocation would run EXPR over each of the
> columns listed in `colnames` in your `my.data.table`, using the `by`
> groupings as "we expect."
> 
> Would this be a helpful way to approach this? That way the
> `data.table.[` function isn't overloaded with too much different
> functionality. It might be that cramming all of these specialized
> cases into the same function makes it too magical, is all.
> 
> Also -- `colwise` could be `colapply` or something similar to avoid
> trampling on the function by the same name in plyr.
> 
> -steve
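Steve's proposal could be sketched as a thin wrapper over the existing
`[` (a sketch only: `colapply` is a hypothetical name, and it leans on
the .SDcols idea proposed earlier in this thread rather than any
machinery that exists today):

```r
library(data.table)

# Hypothetical helper (name and signature illustrative only): run FUN
# over each of the named columns, grouped by 'by', by delegating to
# data.table's existing `[` with the proposed .SDcols restriction.
colapply <- function(dt, colnames, FUN, by) {
  dt[, lapply(.SD, FUN), by = by, .SDcols = colnames]
}

DT <- data.table(x = rep(1:2, each = 2), a = 1:4, b = 4:1)
colapply(DT, c("a", "b"), sum, "x")
```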
> 
> > However, doing both (i) and (ii) just makes for easier syntax to access
> > the speed that is already possible by automatically creating a (long) j
> > expression. That's how data.table knows which columns are being used (by
> > inspecting the expression using all.vars() and only subsetting those), and
> > there isn't any call to lapply, so that slowdown goes away. Maybe making
> > helper functions to make that easier is another way to go.
> >
> > Matthew
> >
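The (long) j expression mentioned above can be sketched by hand, just to
illustrate the idea (this is not data.table's internal code): build
`list(a = sum(a), b = sum(b))` as an unevaluated call, so the per-group
work is plain sum() calls with no lapply() overhead.

```r
library(data.table)
DT <- data.table(x = rep(1:2, each = 2), a = 1:4, b = 4:1)

# Construct the call list(a = sum(a), b = sum(b)) programmatically
# from a vector of column names, then evaluate it as j:
cols <- c("a", "b")
jexp <- as.call(c(quote(list),
                  sapply(cols, function(nm) call("sum", as.name(nm)),
                         simplify = FALSE)))
DT[, eval(jexp), by = x]
```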
