[datatable-help] New package bit64 over package int64

Matthew Dowle mdowle at mdowle.plus.com
Sat Feb 25 03:48:09 CET 2012


Ok, quite a bit to get through here. Comments inline below. Thanks for
quoting passages from its documentation; that helps me reply quickly.
I'm relying on those quotes being correct without looking myself, so I
may be missing context. So, with that said ...

On Fri, 2012-02-24 at 13:05 -0600, Branson Owen wrote:
> Hello Matthew, 
> 
> 2012/2/22 Matthew Dowle <mdowle at mdowle.plus.com>
>         I saw the announcement but didn't look yet. Yes it is v.
>         interesting. Is
>         bit64 a single atomic type? I seem to remember the problem
>         (for
>         data.table) was that package int64 stores it as two integer
>         vectors.
>         That's a 'problem' because data.table only supports atomic
>         columns
>         currently. If bit64 is an atomic type we can support it much more
>         easily. The
>         atomic restriction is buried in data.table's C code. Changing
>         that (e.g.
>         also to allow type complex) might require some sponsorship ;)
>          Aside:
>         list() columns are atomic (vector of pointers) so that's why
>         they're
>         supported.
>         
> 
> 
> Interesting, looks like bit64 is indeed a single atomic type. Below is
> the excerpt from the manual. 
> 
> 
> Our creator function integer64 takes an argument length, creates an
> atomic double vector of this length, attaches an S3 class attribute
> ’integer64’ to it, and that’s it. We simply rely on S3 method dispatch
> and interpret those 64bit elements as ’long long int’.
> 
> 
> If we introduce 64 bit integers not natively in Base R but as an
> external package, we should at least strive to make
> them as ’basic’ as possible. Like the other atomic types in Base R, we
> model data type ’integer64’ as a contiguous atomic vector in memory,
> and we use the more basic S3 class system, not S4. Like package int64
> we want our ’integer64’ to be serializeable, therefore we also use an
> existing data type as the
> basis. Again the choice is obvious: R has only one 64 bit data type:
> doubles. By using doubles,
> integer64 inherits some functionality such as is.atomic, length,
> length<-, names, names<-, dim, dim<-, dimnames, dimnames<-.

Yes, this choice seems reasonable. But strictly speaking, double isn't
the only 64bit type. What about pointers? Pointers include strings and
lists; they are both 64bit atomic vectors on 64bit machines and could be
used to store integer64, too. Does bit64 work on 32bit machines,
though? On *32bit machines*, yes, double is the only 64bit type. So if
bit64 works on 32bit too then it makes sense to say that.

Something similar (but not the same) has been suggested by r-core in
section 11.1 of the R Internals manual, with the words "it is often
overlooked". However, bit64 does seem to take that concept further.

> So, bit64 is internally a 64-bit double that pretends to be an
> integer. But will it cause a problem to use it as a key column, because
> currently double is not supported for key?

Great, that is promising. The main difficulty in 'allowing' different
column types is sorting them efficiently. By efficiently we mean as
fast, or close to as fast, as radix sorting (actually a counting sort)
of integers.  If there is a way to sort bit64 then it should be fine.
I'm not quite clear if bit64 is for 64bit machines only or not. But that
can be switched without too much difficulty.
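
In case "counting sort" is unfamiliar, here's a toy R version over a small
known range of non-negative integers. data.table's real sort is C code on
integer keys, but the idea is the same.

    counting_sort <- function(x, max_val) {
      # assumes x contains integers in 0..max_val
      counts <- tabulate(x + 1L, nbins = max_val + 1L)   # one bucket per value
      rep.int(0:max_val, counts)                         # emit each value 'count' times
    }
    counting_sort(c(3L, 1L, 0L, 3L, 2L), max_val = 3L)   # 0 1 2 3 3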

> 
>         We're talking about storing large ints here aren't we, we're
>         not talking
>         about longer vectors than 2^31, right?  That could only be R
>         itself
>         changing R_len_t from 'int' to 'size_t' or whatever, iiuc.
> 
> 
> ?... Oh, no! I thought 64-bit R already broke the limitation of length
> 2^31. 

Nope, 64bit R is still limited to 2^31 vector length. What is freed in
64bit R is that you can have many more 2^31-length vectors in memory at
once. So a data.table can be 2 billion rows and as many columns as can
fit in RAM. Remember a 2 billion (2^31) numeric vector is 2^31 * 8 /
1024^3 = 16GB. That's quite a bit for a single vector! Let's say hardware
limitations are 128GB of RAM currently (at reasonable cost). With just
8 columns and 2 billion rows, your RAM is full anyway with no room for
copies, let alone the OS itself. In practice the vector length
limitation rarely bites. Note that data.tables can be much larger than
the largest possible matrix because they consist of many vectors, not
just the single vector that a matrix is. That seems often overlooked when
data.frame is compared to matrix. One reason set() has been added in
data.table v1.8.0 is to demonstrate that there's nothing fundamentally
wrong with data.frame being a list of vectors (in fact that's a very
good choice).
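
That arithmetic as a quick sanity check in R:

    2^31 * 8 / 1024^3    # 16GB for one full-length numeric vector
    16 * 8               # 128GB: eight such columns fill the RAM in that example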

Section 11.1 ("64-bit types") of R-ints goes deeper into this topic, and
the background, and has probably been there for a very long time.

> So 64-bit R only breaks the 4GB memory barrier? I have very unreliable
> human memory ... :) In this case, it seems to greatly limit the
> advantage of using big integers. 

Big integers are still useful. Think encodings, keys and hashes. Or
distributions of integers with a range greater than 2^31, I guess.
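
For example (a sketch; assumes bit64 is installed and uses its documented
as.integer64() conversion from character, so a large ID isn't rounded on
the way in):

    .Machine$integer.max                # 2147483647, R's 32bit integer limit
    .Machine$integer.max + 1L           # NA, with an integer overflow warning

    library(bit64)
    id <- as.integer64("123456789123456789")  # e.g. a database key: too big for a
    id + 1L                                   # 32bit integer, not exact as a double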

> Unless we use the ff or bigmemory package, as the author promises to
> support. Not sure how hard it would be for data.table to join the
> combination? But I remember you were indeed considering supporting the
> ff package, right? 

Yes in the past data.table did depend on ff and had tests using it.
There was little interest in that at that time however (no user asked
for it, and several said they had no need), and the internals of ff
seemed difficult to work with anyway (I did try). Would like to try
again, though.

Most data.table users have plentiful RAM, and with cloud computing and
virtual RAM, a lot can be done with, say, 128GB RAM even with the 2^31
single vector limitation. 

> 
> So the most ideal scenario would be: data.table + ff + bit64? 
> 
> 
> I also highly recommend to read "Limitations inherited from Base R,
> Core team, can you change this" on manual page.7 for anyone who is
> interested in more details. 

Ok, will do. Hopefully it builds upon section 11.1 of R-ints.

> 
> 
> Other interesting Excerpt:
> 
> 
> vector size of atomic vectors is still limited to .Machine$integer.max.
> However, external memory extending packages such as ff or bigmemory can
> extend their address space now with integer64.

Interesting. I wonder how/why ff's / bigmemory's address space is
currently limited then. I thought they were bound by disk space.
Whatever is pulled in from those disk storage schemes would still be
bound by the 2^31 limitation, iiuc.

> However, internally we deviate from the strict paradigm in order
> to boost performance. Our C functions do not create new return values,
> instead we pass-in the memory to be returned as an argument. This
> gives us the freedom to apply the C-function to new or old vectors,
> which helps to avoid unnecessary memory allocation, unnecessary
> copying and unnecessary garbage collection. Within our R
> functions we also deviate from conventional R programming by not using
> attr<- and attributes<- because they always do new memory allocation
> and copying.
> If we want to set attributes of return values that we have freshly
> created, we instead use functions
> setattr and setattributes from package bit. If you want to see both
> tricks at work, see method integer64.

data.table also has setattr(), added recently, after package 'bit'. I
was aware that package 'bit' had this function and considered depending
on or referencing package 'bit'. However we (I) decided not to. There is
nothing to the function. There is no code to it. It is merely a wrapper
to R's own setAttrib(). The function is literally an entry point to R's
API that exists already with a lower case a instead of an uppercase A.
It does depart from conventional R programming, that's true. But we
depart because we don't follow the advice to duplicate NAMED objects in
.Call. *That's* really the departure, and that's why data.table
exports copy() which is merely a wrapper to R's duplicate().
[ duplicate() was too long and too close to duplicated() ]
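
To make the setattr() point concrete, a small sketch (assumes data.table
v1.8.0, or package 'bit', is loaded so setattr is available):

    library(data.table)
    x <- as.numeric(1:5)
    attr(x, "units") <- "kg"    # conventional R: value semantics, may duplicate x first
    y <- as.numeric(1:5)
    setattr(y, "units", "kg")   # by reference: attaches the attribute without copying y
    attributes(y)               # $units [1] "kg"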


> 
> The fact that we introduce 64 bit long long integers – without
> introducing 128-bit long doubles – creates some subtle challenges:
> Unlike 32 bit integers, the integer64 are no longer a proper subset of
> double. If an integer64 meets a double, it is not trivial what type to
> return. Switching to integer64 limits our ability to represent very
> large numbers; switching to double limits our ability to distinguish x
> from x+1. Since the latter is the purpose of introducing 64 bit
> integers, we usually return integer64 from functions involving
> integer64, for example in c, cbind and rbind.
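
A quick illustration of that trade-off (a sketch; assumes bit64 is
installed, with the integer64 arithmetic methods its manual describes):

    x <- 2^53
    x + 1 == x                              # TRUE: a double can no longer tell x from x+1

    library(bit64)
    y <- as.integer64("9007199254740993")   # 2^53 + 1, parsed exactly from character
    y - 1L                                  # still distinguishable in 64bit integer arithmetic
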
> 
> 
> Thanks for interesting discussion, Matthew :) Very much look forward
> to possible new features. 
> 
> 



