[datatable-help] using a UPC as identifier broken in 1.9.2 (related to 'tolerance of precision' NEWS item)
James Sams
sams.james at gmail.com
Wed Mar 5 00:49:50 CET 2014
I suspect there are plenty of data.table users who use UPCs and other
large integer-like doubles as identifiers in their data. Storing UPCs as
character data takes up an order of magnitude more space than storing
them as doubles, which is not really an acceptable alternative for a
1.5 billion row table: the UPCs alone take roughly 11 GiB of RAM as
doubles (*crosses fingers for long vector support*).
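For anyone who wants to check the memory arithmetic, here is a minimal sketch in base R. The sample vector and its size are made up for illustration. The character-to-double ratio it reports depends on how many distinct strings there are, so treat it as a rough guide only.

```r
# 1.5 billion doubles at 8 bytes each, ignoring the small vector header
n_rows <- 1.5e9
gib_double <- n_rows * 8 / 2^30
gib_double                      # about 11.2 GiB

# Hypothetical sample of 12-digit, UPC-like identifiers (all distinct)
upc_dbl <- 3e11 + seq_len(1e5)
upc_chr <- sprintf("%.0f", upc_dbl)

size_dbl <- as.numeric(object.size(upc_dbl))
size_chr <- as.numeric(object.size(upc_chr))
size_chr / size_dbl             # how many times larger the character copy is
```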
However, the newest data.table (1.9.2) breaks this: distinct 12-digit
UPCs that differ only in the last digit are now grouped together (see
example below). The developers are aware of this, but I guess speed for
imprecise numbers is a higher priority than proper results for people
using data with large IDs.
In any case, I thought people should be more aware of this, and maybe
someone would have a suggested workaround. I'm currently stuck at SVN
r1129 because I was hitting some crashing bugs in 1.8.10.
For the interested, you can track the feature request at:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5369&group_id=240&atid=978
The relevant NEWS item:
> Numeric data is still joined and grouped within tolerance as before but instead of tolerance
> being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the
> significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate
> for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle.
> A few functions provided a 'tolerance' argument but this wasn't being passed through so has
> been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.
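To see why 12-digit identifiers are affected, here is a rough emulation of that rounding in plain R. This is my own sketch, not data.table's code. The function name `round_significand` is hypothetical, and the round-half-up rule is an assumption. Dropping the last 16 bits of the 52-bit significand leaves about 36 bits, i.e. 36 * log10(2), or about 10.8 decimal digits, which matches the "apx 11 s.f." in the NEWS item. For values between 2^38 and 2^39, such as these UPCs, the spacing between distinguishable values becomes 4.

```r
# Approximate emulation (an assumption, not data.table's actual code):
# discard the last `drop_bits` bits of the 52-bit significand, rounding half up.
round_significand <- function(x, drop_bits = 16) {
  e <- floor(log2(abs(x)))           # binary exponent of x
  step <- 2^(e - 52 + drop_bits)     # spacing between values after rounding
  floor(x / step + 0.5) * step
}

upc <- c(301426027592, 301426027593, 314775802939,
         314775802940, 314775803490, 314775803491,
         314775815510, 314775815511, 314933000171,
         314933000172)

rounded <- round_significand(upc)
format(rounded, scientific = FALSE)
length(unique(upc))       # 10 distinct identifiers
length(unique(rounded))   # 5 distinct values under this emulation
```

Under these assumptions each adjacent pair of UPCs collapses to a single value, so ten identifiers become five.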
library(data.table)
DT <- data.table(upc = c(301426027592, 301426027593, 314775802939,
                         314775802940, 314775803490, 314775803491,
                         314775815510, 314775815511, 314933000171,
                         314933000172), d=rnorm(10), key='upc')
DT[, list(length=length(d)), keyby=upc]
Output with 1.9.2 is:
> DT[, list(length=length(d)), keyby=upc]
            upc length
1: 301426027592      2
2: 314775802939      2
3: 314775803490      2
4: 314775815510      2
5: 314933000171      2
Instead of:
> DT[, list(length=length(d)), keyby=upc]
             upc length
 1: 301426027592      1
 2: 301426027593      1
 3: 314775802939      1
 4: 314775802940      1
 5: 314775803490      1
 6: 314775803491      1
 7: 314775815510      1
 8: 314775815511      1
 9: 314933000171      1
10: 314933000172      1
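One possible stopgap, offered only as a sketch that I have not verified against 1.9.2: split each UPC into two integer columns and group on both. The column names `upc_hi` and `upc_lo` are made up for illustration. Each half of a 12-digit UPC fits in a 32-bit integer, and integer grouping involves no floating-point tolerance. Two integer columns also take the same 8 bytes per row as one double.

```r
library(data.table)

upc <- c(301426027592, 301426027593, 314775802939,
         314775802940, 314775803490, 314775803491,
         314775815510, 314775815511, 314933000171,
         314933000172)

# Split each 12-digit UPC into a high part and a low part (six digits each)
DT <- data.table(upc_hi = as.integer(upc %/% 1e6),
                 upc_lo = as.integer(upc %% 1e6),
                 d = rnorm(10))

# Group on the pair of integer columns instead of the double
res <- DT[, list(length = length(d)), keyby = list(upc_hi, upc_lo)]

# Rebuild the original identifier for display only
res[, upc := upc_hi * 1e6 + upc_lo]
res
```

The obvious cost is that every join and grouping has to name both columns.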