[datatable-help] New package bit64 over package int64

Branson Owen branson.owen at gmail.com
Fri Feb 24 20:05:42 CET 2012


Hello Matthew,

2012/2/22 Matthew Dowle <mdowle at mdowle.plus.com>
>
> I saw the announcement but didn't look yet. Yes it is v. interesting. Is
> bit64 a single atomic type? I seem to remember the problem (for
> data.table) was that package int64 stores it as two integer vectors.
> That's a 'problem' because data.table only supports atomic columns
> currently. If bit64 is an atomic type can support it much more easily. The
> atomic restriction is buried in data.table's C code. Changing that (e.g.
> also to allow type complex) might require some sponsorship ;)  Aside:
> list() columns are atomic (vector of pointers) so that's why they're
> supported.
>
>
Interesting — it looks like bit64 is indeed a single atomic type. Below are
excerpts from the manual.

*Our creator function integer64 takes an argument length, creates an atomic
double vector of this length, attaches an S3 class attribute 'integer64' to
it, and that's it. We simply rely on S3 method dispatch and interpret those
64bit elements as 'long long int'.*

*If we introduce 64 bit integers not natively in Base R but as an external
package, we should at least strive to make them as 'basic' as possible.
Like the other atomic types in Base R, we model data type 'integer64' as a
contiguous atomic vector in memory, and we use the more basic S3 class
system, not S4. Like package int64 we want our 'integer64' to be
serializeable, therefore we also use an existing data type as the basis.
Again the choice is obvious: R has only one 64 bit data type: doubles. By
using doubles, integer64 inherits some functionality such as is.atomic,
length, length<-, names, names<-, dim, dim<-, dimnames, dimnames<-.*

So, bit64 is internally a 64-bit double that pretends to be an integer. But
will it cause a problem to use it as a key column, since double is
currently not supported as a key type?

> We're talking about storing large ints here aren't we, we're not talking
> about longer vectors than 2^31, right?  That could only be R itself
> changing R_len_t from 'int' to 'size_t' or whatever, iiuc.
>

Oh, no! I thought 64-bit R had already broken the 2^31 length limitation.
So 64-bit R only breaks the 4 GB memory barrier? I have a very unreliable
human memory ... :) In that case the length limit seems to greatly reduce
the advantage of using big integers, unless we use the ff or bigmemory
packages, which the author promises to support. I'm not sure how hard it
would be for data.table to join that combination, but I remember you were
indeed considering supporting the ff package, right?

So the most ideal scenario would be: data.table + ff + bit64?

I also highly recommend reading "*Limitations inherited from Base R, Core
team, can you change this*" on page 7 of the manual to anyone interested
in more details.

Other interesting excerpts:

*vector size of atomic vectors is still limited to .Machine$integer.max.
However, external memory extending packages such as ff or bigmemory can
extend their address space now with integer64.*

*However, internally we deviate from the strict paradigm in order to boost
performance. Our C functions do not create new return values, instead we
pass-in the memory to be returned as an argument. This gives us the freedom
to apply the C-function to new or old vectors, which helps to avoid
unnecessary memory allocation, unnecessary copying and unnecessary garbage
collection. Within our R functions we also deviate from conventional R
programming by not using attr<- and attributes<- because they always do new
memory allocation and copying. If we want to set attributes of return
values that we have freshly created, we instead use functions setattr and
setattributes from package bit. If you want to see both tricks at work,
see method integer64.*

*The fact that we introduce 64 bit long long integers – without introducing
128-bit long doubles – creates some subtle challenges: Unlike 32 bit
integers, the integer64 are no longer a proper subset of double. If an
integer64 meets a double, it is not trivial what type to return. Switching
to integer64 limits our ability to represent very large numbers, switching
to double limits our ability to distinguish x from x+1. Since the latter is
the purpose of introducing 64 bit integers, we usually return integer64
from functions involving integer64, for example in c, cbind and rbind.*

Thanks for the interesting discussion, Matthew :) I very much look forward
to possible new features.

