[datatable-help] datatable roll="next" takes 150 times longer than findInterval

Gabor Grothendieck ggrothendieck at gmail.com
Wed Feb 5 16:22:32 CET 2014


Another benchmark was posted with larger data and longer run times,
but this time data.table stopped with an error.  See:

http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855

On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle <mdowle at mdowle.plus.com> wrote:
> Gabor,
>
> With that said about it being a micro benchmark,  by-without-by might be at
> play in GG2(X,Y) here; i.e. running j for each row of i, where it could run
> once.  I remember you and others quite rightly said by-without-by should be
> explicit ... still got to make that change.  A similar speed issue came up
> recently somewhere else as well which the change in default should help.
>
> Matt
>
>
> On 02/02/14 18:57, Matt Dowle wrote:
>
>
> But this is at the *micro* second level ?!!
>
> I confirm those results on my slow netbook but remember these are **micro**
> seconds i.e. 71,000 here is less than 0.1 of a second.
>
>> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y))
> Unit: microseconds
>          expr       min        lq      median          uq       max neval
>  flodel(X, Y)   330.798   369.369    402.7935    455.3225  17996.26   100
>     GG1(X, Y) 14287.380 14370.038  14466.5990  16010.5440 121082.77   100
>     GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62   100
>
> To put it in some perspective :
>
>> system.time(GG2(X,Y))
>    user  system elapsed
>   0.072   0.000   0.072
>> system.time(GG2(X,Y))
>    user  system elapsed
>   0.080   0.000   0.079
>> system.time(GG2(X,Y))
>    user  system elapsed
>   0.072   0.000   0.072
>
> Where those times are in seconds.   So the task in question here,  takes
> 0.07 seconds ?!
>
> The 150x longer figure is actually (using figures from the S.O. answer)
> 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds
> (0.000168 seconds).  0.024 seconds / 0.000168 = "150 times".   If you
> rounded to milliseconds you could say data.table is infinitely slower  (24ms
> / 0ms = Inf).
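
The arithmetic above can be checked directly in R. This is only a sketch: the two timings are the figures quoted from the S.O. answer, and the variable names are made up for illustration.

```r
# Timings quoted above from the S.O. answer, in microseconds.
dt_us   <- 24695   # data.table solution
base_us <- 168     # findInterval() solution

# The ratio is unit-free: roughly 147, i.e. the "150 times" figure.
ratio <- dt_us / base_us
print(ratio)

# The absolute gap is what a user would wait for: about 0.0245 seconds.
gap_seconds <- (dt_us - base_us) / 1e6
print(gap_seconds)

# Truncated to whole milliseconds the ratio degenerates to 24 / 0 = Inf.
ms_ratio <- floor(dt_us / 1000) / floor(base_us / 1000)
print(ms_ratio)
```

The ratio looks dramatic while the absolute difference stays under a fortieth of a second, which is the point being made about micro benchmarks.
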
>
> I can believe there's scope for improvement, sure,  but not from this
> benchmark. The vectors need to be *much* bigger and replications need to be
> *much* smaller, say 3.   The task being timed needs to take a meaningful
> amount of time (say 5 seconds) *for a single run*.
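
A benchmark along those lines might look like the sketch below. It is not the code from the thread: `task()` is a hypothetical stand-in for whichever solution is being timed, and `n` is an arbitrary size that would need raising until a single run takes a meaningful number of seconds on the machine in question.

```r
# Hypothetical input size; increase until one run takes seconds, not microseconds.
set.seed(1)
n <- 1e6
x <- sort(runif(n))   # sorted breakpoints, as findInterval() requires
y <- runif(n)         # values to locate among the breakpoints

# Stand-in for the function under test (an assumption, not GG1/GG2/flodel).
task <- function(x, y) findInterval(y, x)

# Three single-run timings in seconds, rather than 100 micro-replications.
timings <- replicate(3, system.time(task(x, y))[["elapsed"]])
print(timings)
```

With only three runs the spread between timings is visible at a glance, and each number is a wall-clock cost a user would actually notice.
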
>
> Matt
>
>
> On 02/02/14 12:27, Gabor Grothendieck wrote:
>
> The benchmark at the bottom of this post shows a problem where a data.table
> roll="next" took nearly 150x longer than a base findInterval() solution.
> (The data.table solution is easier to write though.) This suggests an area
> for possible speed improvement.
>
> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

