[datatable-help] adding names to j columns is costly

Sam Steingold sds at gnu.org
Thu Sep 12 05:54:21 CEST 2013


> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-09-12 00:50:02 +0100]:
>
> How many rows does delays.dt have and how many groups?

--8<---------------cut here---------------start------------->8---
> nrow(delays.dt)
[1] 18772831
> nrow(delays.short)
[1] 14893103
--8<---------------cut here---------------end--------------->8---

(delays.short has one row per delay value, so that is about 14.9M groups.)

>> because setting them in aggregation is expensive:
>
> I'm not sure this example is proof of that.  On the contrary, the output
> shows that names are being dropped before grouping commences (they are
> reinstated after grouping), as is correct behaviour.  All I can think is
> that the list() wrapper itself is adding overhead. That might show up as
> this 38% difference if there are a very large number of groups (lots of
> calls to j). In the case of a single aggregate, the list() wrapper could
> be optimized away.  This would be a nice improvement I didn't think of
> before.

Yes, I would love to be able to drop the extra setnames() call.
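
A minimal sketch of the workaround I mean, on a hypothetical toy table
(toy.dt and its values are made up for illustration; this is not the
real delays.dt):

```r
library(data.table)

# Hypothetical toy stand-in for delays.dt (the real table has ~18.8M rows)
toy.dt <- data.table(delay = c(1L, 1L, 2L, 2L, 3L),
                     count = c(10L, 20L, 5L, 5L, 7L))

# Unnamed single aggregate: the result column gets the default name V1
toy.short <- toy.dt[, sum(count), by = "delay"]

# Rename by reference afterwards instead of naming the column inside j
setnames(toy.short, "V1", "count")
```

The rename itself is cheap; the annoyance is only the extra line after
every aggregation.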

> Does this theory fit with your experience?

Looks like it.

> If my guess is correct, and you instead compare two queries where j has
> list() in both, e.g.
> list(sum(count),max(count))    -vs- list(s=sum(count), m=max(count))
> then I don't think you'll see a speed difference.

--8<---------------cut here---------------start------------->8---
> delays.short <- delays.dt[,list(sum(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.91secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count))'
Starting dogroups ... done dogroups in 11.497 secs
> delays.short <- delays.dt[,list(s=sum(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.91secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count))'
Starting dogroups ... done dogroups in 11.535 secs
> delays.short <- delays.dt[,list(s=sum(count),m=max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.948secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 18.931 secs
> delays.short <- delays.dt[,list(sum(count),max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.968secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 17.872 secs
> delays.short <- delays.dt[,list(sum(count),max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 1.004secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 18.971 secs
> delays.short <- delays.dt[,list(s=sum(count),m=max(count)),by="delay"]
Finding groups (bysameorder=TRUE) ... done in 0.946secs. bysameorder=TRUE and o__ is length 0
Detected that j uses these columns: count 
Optimization is on but j left unchanged as 'list(sum(count), max(count))'
Starting dogroups ... done dogroups in 18.799 secs
--8<---------------cut here---------------end--------------->8---


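So with list() in both queries, naming the columns costs nothing measurable:
about 11.5 secs either way for one aggregate, and 17.9 to 19.0 secs for two,
named or not.

Thanks for your kind help!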

>
> On 11/09/13 22:35, Sam Steingold wrote:
>> I find myself using setnames(...,"V1","...") very often because setting
>> them in aggregation is expensive:
>>
>> --8<---------------cut here---------------start------------->8---
>>> delays.short <- delays.dt[,sum(count),by="delay"]
>> Finding groups (bysameorder=TRUE) ... done in 1.262secs. bysameorder=TRUE and o__ is length 0
>> Detected that j uses these columns: count
>> Optimization is on but j left unchanged as 'sum(count)'
>> Starting dogroups ... done dogroups in 8.612 secs
>>> delays.short <- delays.dt[,list(count=sum(count)),by="delay"]
>> Finding groups (bysameorder=TRUE) ... done in 1.051secs. bysameorder=TRUE and o__ is length 0
>> Detected that j uses these columns: count
>> Optimization is on but j left unchanged as 'list(sum(count))'
>> Starting dogroups ... done dogroups in 11.918 secs
>> --8<---------------cut here---------------end--------------->8---
>>
>> 38% difference is a lot (3 seconds is not a big deal, but this is just a
>> toy dataset).
>>
>> ISTR that I have asked this question before - is this still (data.table
>> 1.8.10) the state of the art, or am I doing something stupid?
>>
>> Thanks!
>>

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 13.04 (raring) X 11.0.11303000