[datatable-help] Performance observation

Matthew Dowle mdowle at mdowle.plus.com
Tue May 28 20:26:29 CEST 2013


 

Here's a nice benchmark that's just been posted on Stack Overflow, showing
the speedup from set() when looped:

http://stackoverflow.com/a/16797392/403310


On 28.05.2013 19:11, Matthew Dowle wrote: 

> Hi, 
> 
> Yes, this is expected, because `[.data.table` is a function call with
> associated overhead. You don't want to loop calls to it. Consider all the
> arguments to `[.data.table` and all the checks that must be done for the
> existence and type of those arguments on each call. The idea is to give
> `[.data.table` meaty calls which it can chew on. It doesn't like tiny
> tasks one at a time.
>
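
To make the "meaty call" point concrete, here is a minimal sketch, assuming
the data.table package is installed. The names idx, looped and single are
illustrative only and do not come from the thread; the table is kept small
so the looped version finishes quickly.

```r
library(data.table)

dt  <- data.table(a = rnorm(1000))
idx <- 1:100   # hypothetical set of rows to look up

# Looped: one `[.data.table` call per row, so the per-call argument
# checking is paid 100 times.
looped <- numeric(length(idx))
for (k in seq_along(idx)) looped[k] <- dt[idx[k], a]

# One meaty call: the same rows requested in a single `[.data.table` call.
single <- dt[idx, a]
```

Both forms should return the same values; only the number of calls into
`[.data.table` differs.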

> `[[`, on the other hand, is an R primitive. It's part of the language.
> You can do only very limited things with `[[`, but for this case (looking
> up a single column by name or position in a loop) it's the best tool for
> the job. I use `[[` on data.table quite a lot.
> 
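
A short sketch of the `[[` approach, assuming the data.table package is
installed. The variable names (col_a, total) are illustrative, and pulling
the column out once before the loop is a small variation on the
dt[["a"]][i] form timed in the original question, not something the thread
prescribes.

```r
library(data.table)

dt <- data.table(a = rnorm(1000))

# `[[` looks up a single column by name or by position and returns the
# plain vector.
by_name <- dt[["a"]]
by_pos  <- dt[[1L]]

# Index the plain vector inside the loop; no `[.data.table` call is made.
col_a <- dt[["a"]]
total <- 0
for (k in 1:100) total <- total + col_a[k]
```
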
> This is also the very reason for set()'s existence: ?set says it's a
> 'loopable :=' because of the `[.data.table` overhead.
> 
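
A minimal sketch of set() used inside a loop, assuming the data.table
package is installed. The column names a and b and the doubling rule are
made up for illustration; see ?set for the authoritative description.

```r
library(data.table)

dt <- data.table(a = 1:5, b = rep(0, 5))

# set(x, i, j, value) assigns by reference, one cell per iteration here,
# without going through `[.data.table` on each pass.
for (k in seq_len(nrow(dt))) {
  set(dt, i = k, j = "b", value = dt[["a"]][k] * 2)
}

# The `:=` form of the same update would be dt[k, b := a * 2], which
# calls `[.data.table` once per iteration.
```
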
> There's a feature request to detect when `[.data.table` is being looped,
> though:
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978
>
> which would be more helpful of data.table, so that at least it told you,
> rather than you having to stumble across it.
> 
> Hope that helps,
>
> Matthew
> 
> On 28.05.2013 18:37, Alexandre Sieira wrote: 
> 
>> I was working on some code today and encountered a scenario where the
>> performance behavior of data.table surprised me a little. Is this
>> expected?
>> 
>> > dt = data.table(a=rnorm(1000000))
>>
>> > system.time( for(i in 1:100000) j = dt[i, a] )
>>
>> usuário sistema decorrido
>>  78.064   0.426    78.034
>>
>> > system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
>>
>> usuário sistema decorrido
>>  27.814   0.154    27.810
>>
>> > system.time( for(i in 1:100000) j = dt[["a"]][i] )
>>
>> usuário sistema decorrido
>>   1.227   0.006     1.225
>>
>> (Sorry about the output in Portuguese; the columns are user, system and
>> elapsed time.)
>>
>> Not knowing anything about how data.table is implemented internally, I
>> would have assumed that the three syntaxes for accessing the data.table
>> would perform similarly, or differ only slightly.
>> 
>> -- 
>> Alexandre Sieira
>> CISA, CISSP, ISO 27001 Lead Auditor
>>
>> "The truth is rarely pure and never simple."
>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I

 