[datatable-help] Performance observation
Matthew Dowle
mdowle at mdowle.plus.com
Tue May 28 20:26:29 CEST 2013
Here's a nice benchmark, just posted on Stack Overflow, showing the
speedup from set() when looped:
http://stackoverflow.com/a/16797392/403310
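
As a rough illustration of the pattern (a minimal sketch, not the linked
benchmark; the table, the column name and the sizes are made up), compare a
loop that goes through `[.data.table` with := on every iteration against
the same loop using set():

```r
library(data.table)

# Small illustrative table; names and sizes are assumptions.
dt <- data.table(a = rnorm(1000))
n  <- 1000L

# Pattern to avoid: one full `[.data.table` call per iteration.
dt1 <- copy(dt)
t_loop <- system.time(
  for (i in seq_len(n)) dt1[i, a := 0]
)

# Loopable alternative: set() assigns by reference without going
# through `[.data.table`.
dt2 <- copy(dt)
t_set <- system.time(
  for (i in seq_len(n)) set(dt2, i = i, j = "a", value = 0)
)

# Both loops leave the tables in the same state; only the timings differ.
same_result <- identical(dt1$a, dt2$a)
```

Printing t_loop and t_set side by side should show the difference on your
own machine; the actual numbers depend on the data.table version and the
hardware.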
On 28.05.2013 19:11, Matthew Dowle wrote:
> Hi,
>
> Yes, this is expected, because `[.data.table` is a function call with
> associated overhead. You don't want to loop calls to it. Consider all the
> arguments to `[.data.table` and all the checks that must be done for
> existence and type of arguments on each call. The idea is to give
> `[.data.table` meaty calls which it can chew on. It doesn't like tiny
> tasks one at a time.
>
> `[[`, on the other hand, is an R primitive. It's part of the language.
> You can do only very limited things with `[[`, but in this case (looking
> up a single column by name or position in a loop) it's the best tool for
> the job. I use `[[` on data.table quite a lot.
>
> This is also the very reason for set()'s existence: ?set says it's a
> 'loopable :=' because of the `[.data.table` overhead.
>
> There's a feature request to detect when `[.data.table` is being
> looped, though:
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978
>
> That would make data.table more helpful: at least it would tell you,
> rather than leaving you to stumble across it.
>
> Hope that helps,
>
> Matthew
>
> On 28.05.2013 18:37, Alexandre Sieira wrote:
>
>> I was working on some code today and encountered a scenario where the
>> performance behavior of data.table surprised me a little. Is this
>> expected?
>>
>> > dt = data.table(a=rnorm(1000000))
>>
>> > system.time( for(i in 1:100000) j = dt[i, a] )
>>    user  system elapsed
>>  78.064   0.426  78.034
>>
>> > system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
>>    user  system elapsed
>>  27.814   0.154  27.810
>>
>> > system.time( for(i in 1:100000) j = dt[["a"]][i] )
>>    user  system elapsed
>>   1.227   0.006   1.225
>>
>> Not knowing anything about how data.table is implemented internally, I
>> would have assumed that the three syntaxes for accessing the data.table
>> would perform similarly, or differ only slightly.
>>
>> --
>> Alexandre Sieira
>> CISA, CISSP, ISO 27001 Lead Auditor
>>
>> "The truth is rarely pure and never simple."
>> Oscar Wilde, The Importance of Being Earnest, 1895, Act I
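
For the three lookups in the question above, here is a minimal sketch of
the two ways around the per-call overhead, assuming a small made-up table
(the names idx, slow, fast and bulk are only for illustration): either pull
the column out once with `[[` and index the plain vector, or hand
`[.data.table` one large request instead of many tiny ones.

```r
library(data.table)

# Illustrative data; sizes are kept small so this runs quickly.
dt  <- data.table(a = rnorm(1000))
idx <- 1:100

# Element by element, as in the question: one `[.data.table` call per row.
slow <- numeric(length(idx))
for (k in seq_along(idx)) slow[k] <- dt[idx[k], a]

# Extract the column once with `[[`, then index the ordinary vector.
col  <- dt[["a"]]
fast <- col[idx]

# Or give `[.data.table` a single "meaty" call covering all the rows.
bulk <- dt[idx, a]

# All three approaches return the same values.
all_equal <- identical(slow, fast) && identical(fast, bulk)
```

The loop is kept only to show that the results agree; in real code either
of the last two forms replaces it.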