[datatable-help] Response to dplyr baseball vignette benchmarks

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Jan 22 21:20:57 CET 2014


Chris,

You're 100% right. That's what we've discussed with Hadley as well. For this data, we decided to stick with this approach, since we weren't lagging behind "dplyr".
This is also why I made the point that "However, when benchmarking one should be benchmarking the equivalent of an operation in each tool, not how one thinks the design should be."
This is so that the next time we benchmark, we can do it the data.table way and dplyr way and not dplyr's data.table way.


Arun
From: Chris Neff
Reply: Chris Neff caneff at gmail.com
Date: January 22, 2014 at 9:17:49 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Subject:  Re: [datatable-help] Response to dplyr baseball vignette benchmarks  
When you do use larger data sets where it will matter, I think more strongly highlighting the in-place vs. copying differences will be key. Yes, you should compare things as closely as possible when doing standard benchmarking, but I think mimicking dplyr's copying sells data.table a bit short. You show this a bit in the mutate example, but even in the arrange example the copy is slowing things down. It is so small here that it doesn't make much of a difference, but with 10m rows the copying becomes a large, noticeable gap, much like the difference between data.table's setnames and the standard data.frame idiom names<-.
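To make the in-place vs. copying distinction concrete, here is a minimal R sketch (my own illustration, not from the benchmark post) contrasting the copying names<- idiom on a data.frame with data.table's by-reference setnames() and :=. tracemem() is used only to surface when base R actually duplicates the object:

```r
library(data.table)

df <- data.frame(x = 1:5, y = letters[1:5])
dt <- as.data.table(df)

# names<- on a data.frame replaces the names attribute via a copy;
# tracemem() prints a message whenever df is duplicated
tracemem(df)
names(df)[1] <- "id"

# setnames() updates the data.table by reference: no copy of the
# (potentially many-GB) table, regardless of the number of rows
setnames(dt, "x", "id")

# likewise, := adds or modifies a column in place, whereas a
# mutate-style call returns a modified copy of the whole table
dt[, z := id * 2]
```

At 5 rows the copy is invisible; at 10m rows it dominates the timing, which is the point being made about the benchmarks.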




On Wed, Jan 22, 2014 at 3:09 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
Chris,

Thanks. Yes, that's the plan (the last line in the link). Once the next version of data.table is out on CRAN, the benchmarks should follow.

Arun
From: Chris Neff
Reply: Chris Neff caneff at gmail.com
Date: January 22, 2014 at 9:07:34 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Subject:  Re: [datatable-help] Response to dplyr baseball vignette benchmarks
Thank you for responding to this so fast to get out ahead of the misleading aspects.

As another comparison, it would definitely be constructive to also use a data set larger than 10 MB, something in the 1m+ row range perhaps.


On Wed, Jan 22, 2014 at 2:54 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
Hello,

Matthew and I have redone the benchmarks and posted a response to the dplyr's 
baseball vignette benchmark here: http://arunsrinivasan.github.io/dplyr_benchmark/

Have a look and let us know what you think!

Arun

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


