[datatable-help] datatable roll="next" takes 150 times longer than findInterval

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Feb 5 16:42:10 CET 2014


Seems like the "by-without-by" is what's slowing things down:

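require(data.table)
# X and Y are the logical vectors from the Stack Overflow benchmark linked below
dtx <- data.table(x=which(X), key="x")
dty <- data.table(y=which(Y), key="y")
# copy the key columns so the positions from both tables are available after the join
dtx[, x1 := x]
dty[, y1 := y]
# do the rolling join first, then evaluate j once on the joined result
system.time(ans <- dty[dtx, roll="nearest"][, abs(x1-y1)])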
   user  system elapsed
  1.321   0.076   1.396
system.time(ans2 <- flodel(x,y))
   user  system elapsed
  0.936   0.044   0.977

identical(ans, ans2) # [1] TRUE
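
For anyone without the Stack Overflow code to hand, here is a minimal sketch of what a findInterval()-based nearest-distance computation could look like in base R. The helper name nearest_dist and the toy vectors are hypothetical, made up for illustration; this is not flodel's actual function from the linked answer.

```r
# Hypothetical helper (not flodel's code): for every TRUE position in X,
# the distance to the nearest TRUE position in Y, using base findInterval().
nearest_dist <- function(X, Y) {
  xs <- which(X)
  # Pad with -Inf/Inf so every x has a left and a right neighbour.
  ys <- c(-Inf, which(Y), Inf)
  # idx[i] is the index of the last element of ys that is <= xs[i].
  idx <- findInterval(xs, ys)
  # Distance to the nearer of the two neighbours.
  pmin(xs - ys[idx], ys[idx + 1L] - xs)
}

# Toy data (assumed for illustration): X is TRUE at 1, 5, 10; Y at 2, 8.
X <- c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
Y <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
d <- nearest_dist(X, Y)
print(d)  # 1 3 2
```

Because of the -Inf/Inf padding this sketch returns doubles, so all.equal() is a safer comparison than identical() when checking it against an integer result such as the data.table one.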


On Wed, Feb 5, 2014 at 4:32 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:

> Just tested. It works fine (on 1.8.11), but takes 16 seconds as opposed to
> Flodel's, which takes 1.4 seconds on my laptop. identical() also returned
> TRUE. Will see where the delay is coming from.
>
>
> On Wed, Feb 5, 2014 at 4:22 PM, Gabor Grothendieck <
> ggrothendieck at gmail.com> wrote:
>
>> There was another benchmark posted with larger data and longer times,
>> but this time data.table stopped with an error.  See:
>>
>>
>> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855
>>
>> On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle <mdowle at mdowle.plus.com>
>> wrote:
>> > Gabor,
>> >
>> > With that said about it being a micro benchmark, by-without-by might be at
>> > play in GG2(X,Y) here; i.e. running j for each row of i, where it could run
>> > once.  I remember you and others quite rightly said by-without-by should be
>> > explicit ... still got to make that change.  A similar speed issue came up
>> > recently somewhere else as well which the change in default should help.
>> >
>> > Matt
>> >
>> >
>> > On 02/02/14 18:57, Matt Dowle wrote:
>> >
>> >
>> > But this is at the *micro* second level ?!!
>> >
>> > I confirm those results on my slow netbook but remember these are
>> > **micro** seconds i.e. 71,000 here is less than 0.1 of a second.
>> >
>> >> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y))
>> > Unit: microseconds
>> >          expr       min        lq      median          uq       max neval
>> >  flodel(X, Y)   330.798   369.369    402.7935    455.3225  17996.26   100
>> >     GG1(X, Y) 14287.380 14370.038  14466.5990  16010.5440 121082.77   100
>> >     GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62   100
>> >
>> > To put it in some perspective :
>> >
>> >> system.time(GG2(X,Y))
>> >    user  system elapsed
>> >   0.072   0.000   0.072
>> >> system.time(GG2(X,Y))
>> >    user  system elapsed
>> >   0.080   0.000   0.079
>> >> system.time(GG2(X,Y))
>> >    user  system elapsed
>> >   0.072   0.000   0.072
>> >
>> > Where those times are in seconds.   So the task in question here,  takes
>> > 0.07 seconds ?!
>> >
>> > The 150x longer figure is actually (using figures from the S.O. answer)
>> > 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds
>> > (0.000168 seconds).  0.024 seconds / 0.000168 = "150 times".   If you
>> > rounded to milliseconds you could say data.table is infinitely slower
>> > (24ms / 0ms = Inf).
>> >
>> > I can believe there's scope for improvement, sure,  but not from this
>> > benchmark. The vectors need to be *much* bigger and the number of
>> > replications needs to be *much* smaller, say 3.   The task being timed
>> > needs to take a meaningful amount of time (say 5 seconds) *for a single run*.
>> >
>> > Matt
>> >
>> >
>> > On 02/02/14 12:27, Gabor Grothendieck wrote:
>> >
>> > The benchmark at the bottom of this post shows a problem where a data.table
>> > roll="next" took nearly 150x longer than a base findInterval() solution.
>> > (The data.table solution is easier to write though.) This suggests an area
>> > for possible speed improvement.
>> >
>> >
>> > http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855
>> >
>> > --
>> > Statistics & Software Consulting
>> > GKX Group, GKX Associates Inc.
>> > tel: 1-877-GKX-GROUP
>> > email: ggrothendieck at gmail.com
>> >
>> >
>> > _______________________________________________
>> > datatable-help mailing list
>> > datatable-help at lists.r-forge.r-project.org
>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> >
>> >
>> >
>>
>>
>>
>>
>
>