[datatable-help] sorting on floating point column

Matthew Dowle mdowle at mdowle.plus.com
Tue Apr 30 16:22:54 CEST 2013


 

Maybe it doesn't actually need to sort within machine tolerance. If
the sort were exact, it would be faster, that's for sure. But at the
time, I remember thinking that it should preserve the order of rows
within a group of values that are equal within machine tolerance (e.g.
3.99999999, 4.00000001, 3.99999999 should all be considered 4.0, and the
order of those 3 rows maintained). But maybe sorting them to 3.99999999,
3.99999999, 4.00000001 is ok, as it's just the join that should be within
machine tolerance?
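
To make the two candidate behaviours concrete, here is a minimal base R
sketch. It is not data.table's actual algorithm: the grouping rule (start a
new group when the gap between neighbouring sorted values exceeds tol) and
the sample values (chosen 2e-9 apart so they sit well inside the tolerance)
are assumptions made purely for illustration.

```r
# Hypothetical values: three near-4 doubles plus one clearly smaller value.
x   <- c(3.999999999, 4.000000001, 3.999999999, 2.5)
tol <- sqrt(.Machine$double.eps)   # about 1.49e-08

# Exact sort: ties keep their original order, but 4.000000001 moves last.
exact_ord <- order(x)
sorted    <- x[exact_ord]

# Assumed grouping rule: a new group starts when the gap to the previous
# sorted value is larger than tol.
grp <- cumsum(c(TRUE, diff(sorted) > tol))

# Tolerant sort: order by group, then by original row number, so rows
# inside one tolerance group keep the order they had in the input.
tol_ord <- exact_ord[order(grp, exact_ord)]

print(x[exact_ord])  # 2.5, then both 3.999999999, then 4.000000001
print(x[tol_ord])    # 2.5, then the near-4 rows in their input order
```

The only difference between the two results is the order of rows inside the
near-4 group, which is exactly the point in question.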

Interested in how fast order(y) is, though, compared to data.table's
sorting of doubles.
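
One rough way to measure that, as a sketch only: the vector size and the
data below are made up, system.time() gives a single noisy measurement, and
the data.table part runs only if the package is installed.

```r
set.seed(1)
n <- 1e5
y <- runif(n) / 1e7   # hypothetical small doubles, similar in scale to the example

# Base R: exact ordering of the doubles.
t_base <- system.time(o <- order(y))
print(t_base)

# data.table: time setkey() on the same column, if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt   <- data.table::data.table(x = seq_len(n), y = y)
  t_dt <- system.time(data.table::setkey(dt, y))
  print(t_dt)
}
```

Repeating each timing several times and on larger n would give a fairer
comparison than a single run.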

Matthew 

On 30.04.2013 15:16, Arunkumar Srinivasan wrote:

> Matthew,
> I see. I didn't think about tolerance. That said,
>   dt[with(dt, order(y)), ]
> seems to do the job correctly (as it would on a data.frame). I'm glad
> that I don't have to convert to a data.frame to do the ordering. I am not
> keying by this column. Unless one needs this column for keying, I don't
> think a tolerance option is essential, although having one would
> certainly be nice.
>
> Arun
>
> On Tuesday, April 30, 2013 at 4:09 PM, Matthew Dowle wrote:
>
>> Hi,
>>
>> data.table sorts doubles within machine tolerance:
>>
>> > sqrt(.Machine$double.eps)
>> [1] 1.490116e-08
>>
>> i.e. numbers closer than this are considered equal.
>>
>> Otherwise we wouldn't be able to do things like DT[.(3.14)].
>>
>> I had a quick look; see the arguments of data.table:::ordernumtol, which
>> takes "tol", but there is no option provided (yet) to change it. Do we
>> need one?
>>
>> The examples section of one of the help pages has an example that
>> generates a series of numbers very close together using pi. Note that
>> your numbers are both close together and very close to 0.
>>
>> Matthew
>>
>> On 30.04.2013 14:52, Arunkumar Srinivasan wrote: 
>> 
>>> Hi there,
>>> I just saw something strange when I was sorting a column of p-values.
>>> I searched the data.table bug tracker for the words "sort" and
>>> "floating point" and there were no hits for this case. There is a bug
>>> for sorting on an "integer 64" column, though.
>>> So, here's a reproducible example. I'd be glad to file a bug if it is
>>> one, and to be corrected if I am doing something wrong.
>>>
>>> set.seed(45)
>>> dt <- data.table(x=sample(50),
>>>                  y=sample(c(seq(0, 1, length.out=1000),
>>>                             7000000:7000100), 50)/1e7)
>>> head(dt)
>>>     x            y
>>> 1: 32 5.395395e-08
>>> 2: 16 6.956957e-08
>>> 3: 12 2.142142e-08
>>> 4: 18 5.855856e-08
>>> 5: 17 6.216216e-08
>>> 6: 14 5.025025e-08
>>> setkey(dt, "y") # sort by column y
>>> head(dt, 10)
>>>      x            y
>>>  1: 47 1.401401e-09
>>>  2: 12 2.142142e-08
>>>  3: 24 1.391391e-08
>>>  4: 43 9.809810e-09 <~~~ obviously out of order
>>>  5:  1 2.932933e-08
>>>  6: 48 2.562563e-08
>>>  7: 49 1.891892e-08
>>>  8: 40 2.182182e-08
>>>  9:  9 7.307307e-09 <~~~ obviously out of order
>>> 10: 45 2.482482e-08
>>>
>>> Best,
>>> Arun

 