[datatable-help] New package bit64 over package int64
Matthew Dowle
mdowle at mdowle.plus.com
Wed Feb 22 18:33:43 CET 2012
I saw the announcement but didn't look yet. Yes it is v. interesting. Is
bit64 a single atomic type? I seem to remember the problem (for
data.table) was that package int64 stores it as two integer vectors.
That's a 'problem' because data.table only supports atomic columns
currently. If bit64 is an atomic type can support it much more easily. The
atomic restriction is buried in data.table's C code. Changing that (e.g.
also to allow type complex) might require some sponsorship ;) Aside:
list() columns are atomic (vector of pointers) so that's why they're
supported.
We're talking about storing large ints here aren't we, we're not talking
about longer vectors than 2^31, right? That could only be R itself
changing R_len_t from 'int' to 'size_t' or whatever, iiuc.
Btw, I have to say I don't know why Jens has a problem with Google
sponsoring int64. Google paid Romain for his time, and allowed it to be
released (they weren't obliged to release it at all). What's the problem
with that? int64's license is pure GPL afaics. Have I missed something?
Matthew
> FYI, I remembered we raised an idea awhile ago for data.table leveraging
> on
> int64 package. Now we seem to have a better alternative? If we are indeed
> considering to support int64 package, maybe you should take a look at this
> (though 90% chance you already did :p).
>
> [Copy from R-packages Digest, Vol 101, Issue 3]
>
> Package 'bit64' provides fast serializable S3 atomic 64bit (signed)
> integers that can be used in vectors, matrices, arrays and data.frames.
> Methods are available for coercion from and to logicals, integers,
> doubles, characters as well as many elementwise and summary functions.
>
> Package 'bit64' has the following advantages over package 'int64' (which
> was sponsored by Google):
> - true atomic vectors usable with length, dim, names etc.
> - only S3, not S4 class system used to dispatch methods
> - less RAM consumption by factor 7 (under 64 bit OS)
> - faster operations by factor 4 to 2000 (under 64 bit OS)
> - no slow-down of R's garbage collection (as caused by the pure
> existence of 'int64' objects)
> - pure GPL, no copyrights from transnational commercial company
>
> While the advantage of the atomic S3 design over the complicated S4
> object design is obvious, it is less obvious that an external package is
> the best way to enrich R with 64bit integers. An external package will
> not give us literals such as 1LL or directly allow us to address larger
> vectors than possible with base R. But it allows us to properly address
> larger vectors in other packages such as 'ff' or 'bigmemory' and it
> allows us to properly work with large surrogate keys from external
> databases. An external package realizing just one data type also makes a
> perfect test bed to play with innovative performance enhancements.
> Performance tuned sorting and hashing are planned for the next release,
> which will give us fast versions of sort, order, merge, duplicated,
> unique, and table - for 64bit integers.
>
> For those who still hope that R's 'integer' will be 64bit some day, here
> is my key learning: migrating R's 'integer' from 32 to 64 bit would be
> RAM expensive. It would most likely require to also migrate R's 'double'
> from 64 to 128 bit - in order to again have a data type to which we can
> lossless coerce. The assumption that 'integer' is a proper subset of
> 'double' is scattered over R's semantics. We all expect that binary and
> n-ary functions such as '+' and 'c' do return 'double' and do not
> destroy information. With solely extending 64bit integers but not 128bit
> doubles, we have semantic changes potentially disappointing such
> expectations: integer64+double returns integer64 and does kill decimals.
> I did my best to make operations involving integer64 consistent and
> numerically stable - please consult the documentation at ?bit64 for
> details.
>
> Since this package is 'at risk' to create a lot of dependencies from
> other packages, I'd appreciate serious beta-testing and also
> code-review, ideally from the R-Core team. Please check the
> 'Limitations' sections at the help page and the numerics involving "long
> double" in C. If the conclusion is that this should be better done in
> Base R - I happly donate the code and drop this package. If we have to
> go with an external package for 64bit integers, it would be great if
> this work could convince the Rcpp team including Romain about the
> advantages of this approach. Shouldn't we join forces here?
>
> Best regards
>
> Jens Oehlschl?gel
> Munich, 21.2.2012
>
More information about the datatable-help
mailing list