Hello Matthew, <br><br><div class="gmail_quote">2012/2/22 Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>></span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I saw the announcement but didn't look yet. Yes it is v. interesting. Is<br>
bit64 a single atomic type? I seem to remember the problem (for<br>
data.table) was that package int64 stores it as two integer vectors.<br>
That's a 'problem' because data.table only supports atomic columns<br>
currently. If bit64 is an atomic type can support it much more easily. The<br>
atomic restriction is buried in data.table's C code. Changing that (e.g.<br>
also to allow type complex) might require some sponsorship ;) Aside:<br>
list() columns are atomic (vector of pointers) so that's why they're<br>
supported.<br>
<br></blockquote><div><br></div><div>Interesting: it looks like bit64 is indeed a single atomic type. Below is an excerpt from the manual. </div><div><br></div><div><div><div><i><b>Our creator function integer64 takes an argument length, creates an atomic double vector of this length, attaches an S3 class attribute ’integer64’ to it, and that’s it. We simply rely on S3 method dispatch and interpret those 64bit elements as ’long long int’.</b></i></div>
</div></div><div><br></div><div><div><i>If we introduce 64 bit integers not natively in Base R but as an external package, we should at least strive to make</i></div><div><i>them as ’basic’ as possible. </i><i><b>Like the other atomic types in Base R, we model data type ’integer64’ as a contiguous </b></i><i><b>atomic vector in memory</b>, and we use the more basic S3 class system, not S4. Like package int64</i></div>
<div><i>we want our ’integer64’ to be serializeable, therefore we also <b>use an existing data type as the</b></i></div><div><i><b>basis</b>. Again the choice is obvious: R has only one 64 bit data type: doubles. <b>By using doubles</b>,</i></div>
<div><i>integer64 inherits some functionality such as is.atomic, length, length<-, names, names<-, dim, dim<-, dimnames and dimnames<-.</i></div></div><div><br></div><div>So bit64 is internally a 64-bit double that pretends to be an integer. But will it cause a problem to use it as a key column, since double is currently not supported as a key type?</div>
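The excerpt above can be sketched in a few lines of R. This is only an illustration of the R-level representation the manual describes, not bit64's actual source: the real package reinterprets the 8 bytes per element as 'long long int' in C, which plain R cannot do.

```r
# Sketch: mimic how bit64 represents integer64, per the manual excerpt.
# An atomic double vector with an S3 class attribute attached -- that's it.
x <- double(3)            # contiguous atomic storage, 8 bytes per element
class(x) <- "integer64"   # S3 class attribute; C code reinterprets the bits
is.atomic(x)              # TRUE -- still a single atomic vector
length(x)                 # 3   -- inherited from the underlying double
```

Because the column is one atomic vector (unlike int64's pair of integer vectors), it fits data.table's atomic-columns-only restriction.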
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
We're talking about storing large ints here aren't we, we're not talking<br>
about longer vectors than 2^31, right? That could only be R itself<br>
changing R_len_t from 'int' to 'size_t' or whatever, iiuc.<br></blockquote><div><br></div><div>Oh, no! I thought 64-bit R had already broken the 2^31 length limit. So 64-bit R only breaks the 4 GB memory barrier? My human memory is very unreliable ... :) In that case the advantage of using big integers seems greatly limited, unless we use the ff or bigmemory packages, which the author promises to support. I'm not sure how hard it would be for data.table to join that combination, but I remember you were indeed considering supporting ff, right? </div>
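The cap being discussed is visible directly from R, since R_len_t is a 32-bit int regardless of whether R itself is a 64-bit build:

```r
# Atomic vector length is capped at .Machine$integer.max elements,
# i.e. 2^31 - 1, even on 64-bit R (which only lifts the memory limit).
.Machine$integer.max               # 2147483647
.Machine$integer.max == 2^31 - 1   # TRUE
```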
<div><br></div><div>So the ideal scenario would be: data.table + ff + bit64? </div><div><br></div><div><div>I also highly recommend reading "<b>Limitations inherited from Base R, Core team, can you change this</b>" on page 7 of the manual to anyone interested in more details. </div>
</div><div><br></div><div>Another interesting excerpt:</div><div><br></div><div><div><i><b>vector size of atomic vectors is still limited to .Machine$integer.max. However, external memory extending packages such as ff or bigmemory can extend their address space now with integer64.</b></i></div>
</div><div><br></div><div><div><i>However, internally we deviate from the strict paradigm in order to <b>boost performance</b>. Our C functions do not create new return values, instead we pass-in the memory to be returned as an argument. This gives us the freedom to apply the C-function to new or old vectors, which helps to <b>avoid unnecessary memory allocation</b>, unnecessary copying and unnecessary garbage collection. Within our R functions we also deviate from conventional R programming <b>by not using attr<- and attributes<- because they always do new memory allocation and copying</b>.</i></div>
<div><i>If we want to set attributes of return values that we have freshly created, we instead use functions</i></div><div><i>setattr and setattributes from package bit. If you want to see both tricks at work, see method integer64.</i></div>
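A small sketch of the trick the excerpt describes, assuming package 'bit' is installed. setattr(x, which, value) sets the attribute by reference, whereas the usual attr<- replacement function may allocate and copy the whole vector first:

```r
# Sketch of copy-avoiding attribute setting via bit::setattr.
# Assumes package 'bit' is available; it provides setattr(x, which, value).
library(bit)

x <- double(2)
setattr(x, "class", "integer64")   # attribute set in place, no copy
class(x)                           # "integer64"
# The conventional route, class(x) <- "integer64" or attr(x, ...) <-,
# can trigger a duplication of x before the attribute is attached.
```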
</div><div><br></div><div><div><i>The fact that we introduce 64 bit long long integers – <b>without introducing 128-bit long doubles</b> – creates some subtle challenges: Unlike 32 bit integers, the integer64 are no longer a proper subset of double. If an integer64 meets a double, it is not trivial what type to return. Switching to integer64 limits our ability to represent very large</i></div>
<div><i>numbers, switching to integer64 limits our ability to distinguish x from x+1. Since the latter is the purpose of introducing 64 bit integers, <b>we usually return integer64 from functions involving integer64</b>, for example in c, cbind and rbind.</i></div>
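The "distinguish x from x+1" point is easy to demonstrate with plain doubles, which only have 53 bits of mantissa:

```r
# Doubles run out of exact integer precision at 2^53:
x <- 2^53
x == x + 1   # TRUE -- the double representation cannot tell them apart
# This is exactly the gap integer64 fills: 64-bit integers keep x and
# x+1 distinct all the way up to 2^63 - 1 (assuming bit64 is installed,
# via as.integer64()).
```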
</div><div><br></div><div>Thanks for the interesting discussion, Matthew :) I very much look forward to the possible new features. </div><div><br></div></div>