[datatable-help] Performance observation

Alexandre Sieira alexandre.sieira at gmail.com
Tue May 28 20:25:57 CEST 2013


Thank you very much. The documentation on := and set() is really clear on this; thanks for pointing that out.

-- 
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
On 28 May 2013 at 15:11:04, Matthew Dowle (mdowle at mdowle.plus.com) wrote:
 
Hi,
Yes this is expected because `[.data.table` is a function call with associated overhead.  You don't want to loop calls to it.  Consider all the arguments to `[.data.table` and all the checks that must be done for existence and type of arguments on each call.  The idea is to give [.data.table meaty calls which it can chew on. It doesn't like tiny tasks one at a time.
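To illustrate the "meaty call" point, here is a minimal sketch that contrasts many single-row calls with one call over all the rows. The table size and the names `idx`, `looped` and `batched` are assumptions chosen for illustration (smaller than in the original post so it runs quickly); timings are not shown because they vary by machine.

```r
library(data.table)

# Assumed toy data, much smaller than the 1e6 rows in the original post.
dt <- data.table(a = rnorm(10000))
idx <- 1:1000

# Many tiny calls: every iteration pays the full `[.data.table` overhead.
looped <- numeric(length(idx))
for (i in idx) looped[i] <- dt[i, a]

# One meaty call: the argument checks are done once for all rows.
batched <- dt[idx, a]

# Both approaches return the same values.
identical(looped, batched)
```

Wrapping either form in system.time() should show the difference, though the exact ratio depends on the data.table version.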
`[[` on the other hand is an R primitive. It's part of the language.  You can do very limited things with `[[` but in this case (looking up a single column by name or position) in a loop, that's best for the job.   I use `[[` on data.table quite a lot.
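As a rough sketch of that pattern, the column can be pulled out once with `[[` and the resulting plain vector indexed inside the loop. The running sum and the names `col` and `total` are hypothetical, only there to give the loop something to do.

```r
library(data.table)

dt <- data.table(a = rnorm(10000))

# Extract the column once with `[[`; this returns an ordinary numeric vector.
col <- dt[["a"]]

# Index the plain vector in the loop, so `[.data.table` is never called.
total <- 0
for (i in 1:1000) total <- total + col[i]

# The same quantity from a single vectorised expression, for comparison.
all.equal(total, sum(col[1:1000]))
```

`dt[[1]]` would look the column up by position instead of by name.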
This is also the very reason for set()'s existence:  ?set says it's a 'loopable :=' because of the `[.data.table` overhead.
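A small sketch of set() used that way, assuming a made-up second column `b` that is filled row by row. The equivalent looped := form is left as a comment for comparison.

```r
library(data.table)

# Assumed toy table; column b is a placeholder to be filled in the loop.
dt <- data.table(a = rnorm(1000), b = 0)

# Looping := would go through `[.data.table` on every iteration:
# for (i in 1:1000) dt[i, b := a * 2]

# set() makes the same by-reference assignment without that overhead.
for (i in 1:1000) {
  set(dt, i = i, j = "b", value = dt[["a"]][i] * 2)
}

# Check that every row of b was updated.
all.equal(dt[["b"]], dt[["a"]] * 2)
```

When the whole column can be computed at once, a single `dt[, b := a * 2]` avoids the loop altogether.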
There's a feature request to have data.table detect when `[.data.table` is being looped, though:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2028&group_id=240&atid=978
That would be more helpful, since data.table would at least tell you about it rather than leave you to stumble across it.
Hope that helps,
Matthew
 
On 28.05.2013 18:37, Alexandre Sieira wrote:
I was working on some code today and ran into a scenario where data.table's performance surprised me a little. Is this expected?
 
 
> dt = data.table(a=rnorm(1000000))
 
 
> system.time( for(i in 1:100000) j = dt[i, a] )
  usuário   sistema decorrido 
   78.064     0.426    78.034 
 
 
> system.time( for(i in 1:100000) j = dt[i, "a", with=F] )
  usuário   sistema decorrido 
   27.814     0.154    27.810
 
> system.time( for(i in 1:100000) j = dt[["a"]][i] )
  usuário   sistema decorrido 
    1.227     0.006     1.225 
(sorry about the output in Portuguese; the columns are user, system and elapsed)
Not knowing anything about how data.table is implemented internally, I would have assumed that the three syntaxes for accessing the data.table would perform similarly, or differ only slightly.