[datatable-help] using a UPC as identifier broken in 1.9.2 (related to 'tolerance of precision' NEWS item)

James Sams sams.james at gmail.com
Wed Mar 5 00:49:50 CET 2014


I suspect there are plenty of data.table users who use UPCs and other 
large integer-like doubles as identifiers in their data. Storing UPCs as 
character data takes up an order of magnitude more space compared to a 
double; not really an acceptable alternative for a 1.5 billion row 
table, i.e. 10 GiB of RAM just for UPCs as doubles (*crosses fingers for 
long vector support*).

However, data.table 1.9.2 breaks this: adjacent UPCs are grouped as if 
they were the same value (see the example below). The developers are 
aware of this, but I guess speed for imprecise numbers is a higher 
priority than proper results for people using data with large IDs.

In any case, I thought people should be more aware of this, and maybe 
someone would have a suggested workaround. I'm currently stuck at SVN 
r1129 because I was hitting some crashing bugs in 1.8.10.

For the interested, you can track the feature request at:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978

The relevant NEWS item:
> Numeric data is still joined and grouped within tolerance as before but instead of tolerance
>       being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default)
>       the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate
>       for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle.
>       A few functions provided a 'tolerance' argument but this wasn't being passed through so has
>       been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.


library(data.table)
DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
                          314775802940, 314775803490, 314775803491,
                          314775815510, 314775815511, 314933000171,
                          314933000172), d=rnorm(10), key='upc')

DT[, list(length=length(d)), keyby=upc]

Output with 1.9.2 is:
 > DT[, list(length=length(d)), keyby=upc]
            upc length
1: 301426027592      2
2: 314775802939      2
3: 314775803490      2
4: 314775815510      2
5: 314933000171      2

Instead of:
 > DT[, list(length=length(d)), keyby=upc]
             upc length
 1: 301426027592      1
 2: 301426027593      1
 3: 314775802939      1
 4: 314775802940      1
 5: 314775803490      1
 6: 314775803491      1
 7: 314775815510      1
 8: 314775815511      1
 9: 314933000171      1
10: 314933000172      1
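
The collapsed groups are consistent with the rounding the NEWS item 
describes. A double has 52 explicit significand bits; dropping the last 
2 bytes leaves 36, so for values between 2^38 and 2^39 (which covers all 
of the UPCs above) the spacing between distinguishable values becomes 4. 
The base R sketch below models that rounding arithmetically. It is only 
an illustration, not data.table's actual bit-level code: round_2_bytes 
is a hypothetical helper, and rounding half up is an assumption chosen 
because it reproduces the 1.9.2 grouping shown above.

```r
# Hypothetical helper (not part of data.table): keep only the top 36
# explicit significand bits of a positive double, i.e. drop the last
# 2 bytes (16 of 52 bits), rounding half up (assumed).
round_2_bytes <- function(x) {
  # Gap between distinguishable values once 16 bits are dropped
  spacing <- 2^(floor(log2(x)) - 36)
  floor(x / spacing + 0.5) * spacing
}

upc <- c(301426027592, 301426027593, 314775802939, 314775802940,
         314775803490, 314775803491, 314775815510, 314775815511,
         314933000171, 314933000172)

rounded <- round_2_bytes(upc)

length(unique(upc))      # 10 distinct UPCs going in
length(unique(rounded))  # 5 distinct values after rounding
```

Under this model each consecutive pair of UPCs lands on the same 
multiple of 4, which is why ten IDs come back as five groups of two.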


