[datatable-help] datatable roll="next" takes 150 times longer than findInterval

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Feb 5 17:12:03 CET 2014


Have edited here now:
http://stackoverflow.com/a/21500855/559784


On Wed, Feb 5, 2014 at 4:42 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:

> Seems like the "by-without-by" is what's slowing things down:
>
> require(data.table)
> dtx <- data.table(x=which(X), key="x")
> dty <- data.table(y=which(Y), key="y")
> dtx[, x1 := x]
> dty[, y1 := y]
> system.time(ans <- dty[dtx, roll="nearest"][, abs(x1-y1)])
>    user  system elapsed
>   1.321   0.076   1.396
> system.time(ans2 <- flodel(x,y))
>    user  system elapsed
>   0.936   0.044   0.977
>
> identical(ans, ans2) # [1] TRUE
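
A minimal sketch of the difference described above, on toy vectors (the X and Y here are made up for illustration, not the S.O. data). It assumes data.table 1.9.4 or later, where per-row evaluation of j has to be requested explicitly with by=.EACHI; on 1.8.11, as used in this thread, putting j inside the join triggered it implicitly.

```r
library(data.table)

# Toy inputs, assumed for illustration only
X <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
Y <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)

dtx <- data.table(x = which(X), key = "x")
dty <- data.table(y = which(Y), key = "y")
dtx[, x1 := x]
dty[, y1 := y]

# j evaluated once per row of i (what by-without-by used to do implicitly)
per_row <- dty[dtx, abs(x1 - y1), roll = "nearest", by = .EACHI]

# Join once, then evaluate j a single time on the joined result
ans <- dty[dtx, roll = "nearest"][, abs(x1 - y1)]

print(per_row$V1)
print(ans)
```

Both forms should give the same distances; the second avoids the per-row overhead of evaluating j, which is what matters once dtx has many rows.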
>
>
> On Wed, Feb 5, 2014 at 4:32 PM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>> Just tested. Works just fine (on 1.8.11). Takes 16 seconds as opposed to
>> Flodel's which takes 1.4 seconds on my laptop. Also identical returned TRUE.
>> Will see where's the delay coming from.
>>
>>
>> On Wed, Feb 5, 2014 at 4:22 PM, Gabor Grothendieck <
>> ggrothendieck at gmail.com> wrote:
>>
>>> There was another benchmark posted with larger data and longer times
>>> but this time data.table stopped with an error.  See:
>>>
>>>
>>> http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855
>>>
>>> On Mon, Feb 3, 2014 at 6:46 AM, Matt Dowle <mdowle at mdowle.plus.com>
>>> wrote:
>>> > Gabor,
>>> >
>>> > With that said about it being a micro benchmark,  by-without-by might be at
>>> > play in GG2(X,Y) here; i.e. running j for each row of i, where it could run
>>> > once.  I remember you and others quite rightly said by-without-by should be
>>> > explicit ... still got to make that change.  A similar speed issue came up
>>> > recently somewhere else as well which the change in default should help.
>>> >
>>> > Matt
>>> >
>>> >
>>> > On 02/02/14 18:57, Matt Dowle wrote:
>>> >
>>> >
>>> > But this is at the *micro* second level ?!!
>>> >
>>> > I confirm those results on my slow netbook but remember these are **micro**
>>> > seconds i.e. 71,000 here is less than 0.1 of a second.
>>> >
>>> >> microbenchmark(flodel(X,Y), GG1(X,Y), GG2(X,Y))
>>> > Unit: microseconds
>>> >          expr       min        lq      median          uq       max neval
>>> >  flodel(X, Y)   330.798   369.369    402.7935    455.3225  17996.26   100
>>> >     GG1(X, Y) 14287.380 14370.038  14466.5990  16010.5440 121082.77   100
>>> >     GG2(X, Y) 71164.270 85751.437 107951.3415 161676.5720 366003.62   100
>>> >
>>> > To put it in some perspective :
>>> >
>>> >> system.time(GG2(X,Y))
>>> >    user  system elapsed
>>> >   0.072   0.000   0.072
>>> >> system.time(GG2(X,Y))
>>> >    user  system elapsed
>>> >   0.080   0.000   0.079
>>> >> system.time(GG2(X,Y))
>>> >    user  system elapsed
>>> >   0.072   0.000   0.072
>>> >
>>> > Where those times are in seconds.   So the task in question here, takes
>>> > 0.07 seconds ?!
>>> >
>>> > The 150x longer figure is actually (using figures from the S.O. answer)
>>> > 24695 microseconds (i.e. 0.024 seconds) divided by 168 microseconds
>>> > (0.000168 seconds).  0.024 seconds / 0.000168 = "150 times".   If you
>>> > rounded to milliseconds you could say data.table is infinitely slower (24ms
>>> > / 0ms = Inf).
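
The arithmetic quoted above can be checked in a couple of lines. This is only a sketch; the variable names are illustrative and the two figures are the ones cited from the S.O. answer.

```r
# Timings quoted from the S.O. answer, in microseconds
dt_us   <- 24695   # data.table roll="next"
base_us <- 168     # base findInterval()

ratio <- dt_us / base_us
print(round(ratio))        # the "nearly 150x" figure

# The same timings expressed in seconds
print(dt_us / 1e6)
print(base_us / 1e6)

# Rounded to whole milliseconds the ratio degenerates to 24 / 0
print(round(dt_us / 1e3) / round(base_us / 1e3))
```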
>>> >
>>> > I can believe there's scope for improvement, sure,  but not from this
>>> > benchmark. The vectors need to be *much* bigger and replications need to be
>>> > *much* smaller, say 3.   The task being timed needs to take a meaningful
>>> > amount of time (say 5 seconds) *for a single run*.
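
One way to set up the kind of benchmark suggested above, as a rough sketch. The helper nearest_dist() is hypothetical (it is not flodel() or any function from the S.O. post), and the vector size n is an assumption to be raised until a single run takes seconds rather than microseconds.

```r
# Hypothetical helper: for each TRUE position in X, the distance to the
# nearest TRUE position in Y, using base findInterval().
nearest_dist <- function(X, Y) {
  x <- which(X)
  y <- which(Y)                   # which() returns sorted positions
  idx <- findInterval(x, y)       # number of y values <= each x
  left  <- c(-Inf, y)[idx + 1L]   # closest y at or below x, -Inf if none
  right <- c(y, Inf)[idx + 1L]    # closest y above x, Inf if none
  pmin(x - left, right - x)
}

set.seed(1)
n <- 1e6                          # assumed size; increase for a meaningful run time
X <- sample(c(TRUE, FALSE), n, replace = TRUE)
Y <- sample(c(TRUE, FALSE), n, replace = TRUE)

# Three single-run timings instead of 100 micro-replications
for (i in 1:3) print(system.time(nearest_dist(X, Y)))
```

The same loop can wrap the data.table version, so both approaches are compared on runs long enough that fixed per-call overhead stops dominating.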
>>> >
>>> > Matt
>>> >
>>> >
>>> > On 02/02/14 12:27, Gabor Grothendieck wrote:
>>> >
>>> > The benchmark at the bottom of this post shows a problem where a data.table
>>> > roll="next" took nearly 150x longer than a base findInterval() solution.
>>> > (The data.table solution is easier to write though.) This suggests an area
>>> > for possible speed improvement.
>>> >
>>> >
>>> > http://stackoverflow.com/questions/21499742/fast-minimum-distance-interval-between-elements-of-2-logical-vectors-take-2/21500855#21500855
>>> >
>>> > --
>>> > Statistics & Software Consulting
>>> > GKX Group, GKX Associates Inc.
>>> > tel: 1-877-GKX-GROUP
>>> > email: ggrothendieck at gmail.com
>>> >
>>> >
>>> > _______________________________________________
>>> > datatable-help mailing list
>>> > datatable-help at lists.r-forge.r-project.org
>>> >
>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> >
>>> >
>>> >
>>>
>>>
>>>
>>
>>
>