[datatable-help] using a UPC as identifier broken in 1.9.2 (related to 'tolerance of precision' NEWS item)

Matt Dowle mdowle at mdowle.plus.com
Wed Mar 5 13:24:58 CET 2014


On 04/03/14 23:49, James Sams wrote:
> I suspect there are plenty of data.table users that use UPCs and other 
> large integer-like doubles as identifiers in their data. Storing UPCs 
> as character data takes up an order of magnitude more space compared 
> to a double; not really an acceptable alternative for a 1.5 billion 
> row table, i.e. 10 GiB of RAM just for UPCs as doubles (*crosses 
> fingers for long vector support*).
>
> However, the newest data.table breaks that (see example below). The 
> developers are aware of this, but I guess speed for imprecise numbers 
> is a higher priority than proper results for people using data with 
> large IDs.
I knew of such ids, but I hadn't fully connected that numeric was 
currently being used for them, relying on the old tolerance value.  In 
my mind, such ids are what we've been working on integer64 for, which 
is what the sweeping changes to sorting have been leading up to.  The 
new radix sort for integer can now be applied to integer64, which 
seems the right type for UPCs.  Yike is having a look at that.  I'll 
see if I can quickly add the option to do full 8-byte radix passes 
(it isn't just a single number somewhere, otherwise the option would 
have been trivial).

Matt

>
> In any case, I thought people should be more aware of this, and maybe 
> someone would have a suggested workaround. I'm currently stuck at SVN 
> r1129 because I was hitting some crashing bugs in 1.8.10.
>
> For the interested, you can track the feature request at:
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978 
>
>
> The relevant NEWS item:
>> Numeric data is still joined and grouped within tolerance as before, but instead of
>> the tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as
>> base::all.equal's default), the significand is now rounded to the last 2 bytes,
>> apx 11 s.f.  This is more appropriate for large (1.23e20) and small (1.23e-20)
>> numerics and is faster via a simple bit twiddle.  A few functions provided a
>> 'tolerance' argument, but this wasn't being passed through, so it has been removed.
>> We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.
>
>
> library(data.table)
> DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
>                          314775802940, 314775803490, 314775803491,
>                          314775815510, 314775815511, 314933000171,
>                          314933000172), d=rnorm(10), key='upc')
>
> DT[, list(length=length(d)), keyby=upc]
>
> Output with 1.9.2 is:
> > DT[, list(length=length(d)), keyby=upc]
>             upc length
> 1: 301426027592      2
> 2: 314775802939      2
> 3: 314775803490      2
> 4: 314775815510      2
> 5: 314933000171      2
>
> Instead of:
> > DT[, list(length=length(d)), keyby=upc]
>              upc length
>  1: 301426027592      1
>  2: 301426027593      1
>  3: 314775802939      1
>  4: 314775802940      1
>  5: 314775803490      1
>  6: 314775803491      1
>  7: 314775815510      1
>  8: 314775815511      1
>  9: 314933000171      1
> 10: 314933000172      1
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help 
>
>
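[Editor's note: to see why adjacent UPCs collapse under the new tolerance,
here is a rough sketch in Python of rounding a double's significand to drop
its last 2 bytes.  This is only an approximation of the behaviour described
in the NEWS item, not data.table's actual C bit twiddle.  Dropping 16 of the
52 mantissa bits leaves 36 bits, apx 11 significant digits; near 3e11 the
exponent is 38, so the remaining granularity is 2^(38-36) = 4, and 12-digit
UPCs differing by 1 become equal.]

```python
import struct

def round_last_2_bytes(x):
    """Round a double's significand, dropping its last 2 bytes (16 bits).
    An approximation of the rounding described in the NEWS item."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    # Round to nearest by adding half the dropped range, then mask low 16 bits.
    bits = ((bits + 0x8000) & ~0xFFFF) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

# Two UPCs that data.table 1.9.2 groups together:
a, b = 301426027592.0, 301426027593.0
print(round_last_2_bytes(a) == round_last_2_bytes(b))  # → True
# Clearly distinct magnitudes remain distinct:
print(round_last_2_bytes(a) == round_last_2_bytes(314775802939.0))  # → False
```

This is also why integer64 (or the planned 0-byte-rounding option) is the
safer representation for such ids: no significand bits are discarded.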
