[datatable-help] adding names to j columns is costly

Thu Sep 12 01:50:02 CEST 2013

I don't remember you asking this before!

How many rows does delays.dt have, and how many groups?

 > because setting them in aggregation is expensive:

I'm not sure this example is proof of that.  On the contrary, the verbose
output reports j as 'list(sum(count))', without the count= name, which
shows that names are dropped before grouping commences (they are
reinstated after grouping). That is the correct behaviour.  All I can
think is that the list() wrapper itself is adding overhead. That might
show up as this 38% difference if there is a very large number of groups
(lots of calls to j). In the case of a single aggregate, the list()
wrapper could be optimized away.  That would be a nice improvement I
hadn't thought of before.

Does this theory fit with your experience?   If my guess is correct and
you instead compare two queries where j has list() in both, e.g.,
list(sum(count),max(count))    -vs- list(s=sum(count), m=max(count))
then I don't think you'll see a speed difference.
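
A rough sketch of that comparison follows, in case it helps to try it.
The real delays.dt is not shown in this thread, so the table below is a
synthetic stand-in (the sizes, the seed, and the result names are all
assumptions); it only tries to mimic "many groups, many calls to j".
Timings will depend on the data.table version, since the optimization
of j has changed in later releases.

```r
library(data.table)

# Synthetic stand-in for delays.dt (assumed shape: 1e6 rows, up to 1e5 groups).
set.seed(1)
n <- 1e6
delays.dt <- data.table(delay = sample(1e5, n, replace = TRUE),
                        count = sample(100L, n, replace = TRUE))
setkey(delays.dt, delay)

# Bare j versus the same aggregate wrapped in list().
t.bare <- system.time(res.bare <- delays.dt[, sum(count), by = "delay"])
t.list <- system.time(res.list <- delays.dt[, list(count = sum(count)), by = "delay"])

# list() in both queries: unnamed versus named columns.
t.unnamed <- system.time(
  res.unnamed <- delays.dt[, list(sum(count), max(count)), by = "delay"])
t.named <- system.time(
  res.named <- delays.dt[, list(s = sum(count), m = max(count)), by = "delay"])

# The results should differ only in their column names.
stopifnot(identical(res.bare$V1, res.list$count),
          identical(res.unnamed$V1, res.named$s),
          identical(res.unnamed$V2, res.named$m))

# Elapsed seconds for each query.
print(c(bare    = t.bare[["elapsed"]],
        list    = t.list[["elapsed"]],
        unnamed = t.unnamed[["elapsed"]],
        named   = t.named[["elapsed"]]))
```

Setting options(datatable.verbose = TRUE) first prints the same kind of
trace as in the quoted message below, which shows what j looks like
after optimization.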

On 11/09/13 22:35, Sam Steingold wrote:
> I find myself using setnames(...,"V1","...") very often because setting
> them in aggregation is expensive:
>
> --8<---------------cut here---------------start------------->8---
>> delays.short <- delays.dt[,sum(count),by="delay"]
> Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0
> Detected that j uses these columns: count
> Optimization is on but j left unchanged as 'sum(count)'
> Starting dogroups ... done dogroups in 8.612 secs
>> delays.short <- delays.dt[,list(count=sum(count)),by="delay"]
> Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0
> Detected that j uses these columns: count
> Optimization is on but j left unchanged as 'list(sum(count))'
> Starting dogroups ... done dogroups in 11.918 secs
> --8<---------------cut here---------------end--------------->8---
>
> 38% difference is a lot (3 seconds is not a big deal, but this is just a
> toy dataset).
>
> ISTR that I have asked this question before - is this still (data.table
> 1.8.10) the state of the art, or am I doing something stupid?
>
> Thanks!
>