[datatable-help] numeric rounding for 'order'

Arunkumar Srinivasan aragorn168b at gmail.com
Mon Apr 11 12:42:08 CEST 2016


Hi Frederick, the reason this was implemented is to avoid issues like the following (copied from ?setNumericRounding), which IIRC I pointed out to you before:
library(data.table)
DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
DT
setNumericRounding(0)   # turn off rounding
DT[.(0.4)]   # works
DT[.(0.6)]   # no match, confusing since 0.6 is clearly there in DT
setNumericRounding(2)   # round off the last 2 bytes again
DT[.(0.6)]   # works
So while a numeric rounding of ‘0’ solves your issue, the problem still persists in other cases (like the one shown above).
Also, you seem to be suggesting this *only* for order(). Why? Why not ‘setorder()’ or ‘setkey()’ as well?
FYI, speed is/was never really the issue; it is just a (positive) side-effect.
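
The failed lookup in the example above can be reproduced without data.table. A minimal base R sketch, assuming only that seq() builds each element as from + k * by (all.equal() is used here purely to show that the two values agree within a tolerance; it is not what data.table does internally):

```r
# Element 4 of the sequence is computed as 0 + 3 * 0.2
x <- seq(0, 1, by = 0.2)

x[3] == 0.4                      # TRUE: 2 * 0.2 lands on the same double as the literal 0.4
x[4] == 0.6                      # FALSE: 3 * 0.2 is not the same double as the literal 0.6
sprintf("%.17g", c(x[4], 0.6))   # shows the two distinct stored values
isTRUE(all.equal(x[4], 0.6))     # TRUE: equal within all.equal's default tolerance
```

So an exact comparison finds 0.4 but misses 0.6, even though both print identically at default precision.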

I see two options:

1. Identify clearly, if possible, when rounding is needed and set it appropriately, so that we run into this issue very rarely, i.e., ad-hoc numeric rounding.
2. If that is not possible, then rounding the last two bytes doesn’t really solve *most* rounding issues (which was its original purpose) compared with no rounding at all. In that case there’s no need for setNumericRounding, and we can attribute the inconsistencies to floating point representation inaccuracies.

Having had my share of experiences with floating point issues, my guess would be the latter. It would perhaps be better to continue this on the GitHub project
page; could you please file an issue there with a minimal example of *your* problem?
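
Under the second option, matching stays exact and any cleanup is left to the caller. A minimal base R sketch of what that could look like, using match() as a stand-in for an exact keyed lookup (the 10-digit precision is an arbitrary assumption for illustration, not something data.table prescribes):

```r
x <- seq(0, 1, by = 0.2)

# Exact lookup on the raw values misses, because x[4] is not the same double as 0.6
match(0.6, x)              # NA

# Rounding the column to a chosen precision first makes the lookup succeed
x_clean <- round(x, 10)    # 10 digits is an assumed, application-specific precision
match(0.6, x_clean)        # 4
```

The trade-off is that the precision becomes an explicit, per-column decision made by the user rather than a global default.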

-- 
Arun

On 7 April 2016 at 22:14:08, frederik at ofb.net (frederik at ofb.net) wrote:

Sorry, I forgot to Cc the list for this.

Arunkumar, do you have an answer? You said:

> If you’ve a better idea, please let us know and we would definitely be
> willing to implement that.

and I said

> My "better idea" at this point is, if speed is not an issue, then
> 'order' could use a numeric rounding of zero.

(see below)

Thank you,

Frederick



----- Forwarded message from frederik at ofb.net -----

Date: Wed, 27 Jan 2016 15:52:25 -0800
From: frederik at ofb.net
To: Arunkumar Srinivasan <aragorn168b at gmail.com>
Subject: Re: [datatable-help] sorting on a floating point column

Thanks Arun for your reply. The '?order' page says:

Columns of ‘numeric’ types (i.e., ‘double’) have their last two
bytes rounded off while computing order, by default, to avoid any
unexpected behaviour due to limitations in representing floating
point numbers precisely. Have a look at ‘setNumericRounding’ to
learn more.

But I'm not sure what unexpected behavior this avoids. It seems like
it *causes* unexpected behavior (even if I'm the first to comment in
two years)... And '?setNumericRounding' says:

Computers cannot represent some floating point numbers (such as
0.6) precisely, using base 2. This leads to unexpected behaviour
when joining or grouping columns of type 'numeric';

So it sounds like the cases where you benefit from numeric rounding
are "joining or grouping", not in sorting. My "better idea" at this
point is, if speed is not an issue, then 'order' could use a numeric
rounding of zero. Alternatively, I would expand upon the '?order'
documentation to clarify that the reason for rounding is, for example,
speed - and not the elimination of "unexpected behavior".
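
The kind of unexpected ordering described above can be sketched in base R. The helper below is hypothetical and only a crude emulation: it zeroes the last two bytes of each double (truncation, not true rounding) and is not data.table's actual implementation.

```r
# Crude emulation: zero the 16 least-significant bits of each double.
# This truncates rather than rounds, and is NOT data.table's actual code.
drop_last_two_bytes <- function(x) {
  bytes <- matrix(writeBin(x, raw(), size = 8, endian = "little"), nrow = 8)
  bytes[1:2, ] <- as.raw(0)   # in little-endian layout, bytes 1-2 are the lowest
  readBin(as.vector(bytes), what = "double", n = length(x), size = 8, endian = "little")
}

x <- c(1 + 2e-13, 1 + 1e-13)    # differ only within the last two bytes

order(x)                        # 2 1: exact comparison sorts them correctly
order(drop_last_two_bytes(x))   # 1 2: treated as ties, so input order is kept
```

Applying the second ordering to x leaves the larger value first, so the result is not monotonic under exact comparison.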

Thank you,

Frederick

On Thu, Jan 28, 2016 at 12:10:37AM +0100, Arunkumar Srinivasan wrote:
> > Why do you want a minimal test case, when setNumericRounding explains
> > that the behavior I reported is intentional?
>
> Because you refer to a post that’s quite a few years old, and data.table moved on from ‘tolerance’ quite some time ago. It therefore wasn’t clear to me what the exact issue was: whether you were using an older version, or a newer one without knowing that the problem wasn’t due to tolerance.
>
> > I now see that this is also documented in the data.table::order page.
> > So I guess it is already "documented visibly".
>
> Glad you got to read that.
>
> > And setNumericRounding explains that it is slightly faster to ignore
> > the last two bytes, requiring fewer radix sort passes.
>
> That’s not the reason for the function, though; the reason is explained in `?setNumericRounding`, with examples at the bottom of that page.
>
> > I wanted to share my experience that this behavior is confusing.
>
> With floating point numbers, there are always limitations. I find the cases in the examples under ?setNumericRounding confusing as well (they would return wrong results if we did not round). We try to reduce confusion by handling the most obvious cases, or so we think. If you’ve a better idea, please let us know and we would definitely be willing to implement that.
> -- 
> Arun
>  
> On 28 January 2016 at 00:03:19, frederik at ofb.net (frederik at ofb.net) wrote:
>  
> data.table 1.9.6  
>  
> What's surprising is that sorting a list of floats wouldn't do the  
> obvious thing, and sort them exactly. Is it surprising that this would  
> be surprising?  
>  
> Why do you want a minimal test case, when setNumericRounding explains  
> that the behavior I reported is intentional?  
>  
> I now see that this is also documented in the data.table::order page.  
> So I guess it is already "documented visibly".  
>  
> And setNumericRounding explains that it is slightly faster to ignore  
> the last two bytes, requiring fewer radix sort passes.  
>  
> I wanted to share my experience that this behavior is confusing. Thank  
> you at least for pointing me to your documentation.  
>  
> Frederick  
>  
> On Wed, Jan 27, 2016 at 10:13:44PM +0100, Arunkumar Srinivasan wrote:  
> > > This is following up on a thread from a couple years ago:
> > > http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001689.html
> >
> > Things have changed A LOT! I suggest you keep up to date by reading the README about bug fixes and features on the GitHub project page: https://github.com/Rdatatable/data.table
> >
> > > I ran into this problem myself, it took a bit of time to debug because it is so surprising.
> >
> > What’s surprising? Reproducible example, please, along with the data.table package version and the R version.
> > Without that, my best guess is for you to look at `?setNumericRounding`.
> >  
> > --   
> > Arun  
> >  
> > On 27 January 2016 at 21:40:23, frederik at ofb.net (frederik at ofb.net) wrote:  
> >  
> > This is following up on a thread from a couple years ago:  
> >  
> > http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001689.html  
> >  
> > I ran into this problem myself, it took a bit of time to debug because  
> > it is so surprising.  
> >  
> > In my case, I was using order() to sort a list of floats.  
> >  
> > I expected the result to be monotonic but it wasn't!  
> >  
> > Then I found out that the problem was due to 'order' being part of the  
> > data.table library. By using base::order, I was able to get correct  
> > behavior.  
> >  
> > I don't understand why improperly ordering floating point data helps  
> > the data.table library accomplish anything, whether it is looking up  
> > keys or what.  
> >  
> > Also, it must be much slower to compare floats with a tolerance, than  
> > to just compare them. I seem to recall that floats were designed so  
> > that normal comparison is quite fast.  
> >  
> > Please fix this bug, or at least document it more visibly.  
> >  
> > Thank you,  
> >  
> > Frederick Eaton  
> > _______________________________________________  
> > datatable-help mailing list  
> > datatable-help at lists.r-forge.r-project.org  
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  


----- End forwarded message -----