[datatable-help] R vector size limits and merging

J R fe292a at gmail.com
Thu Jan 10 11:37:41 CET 2013


While investigating the following SO question

http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql

the asker ran into a segfault during a merge.

I tried to reproduce it based on his description of the data (a 4
million row table and a 1 million row table, merged on two columns,
one with 20-some unique strings and one with "+" or "-").
The following setup code:

library(data.table)
set.seed(456)
X <- data.table(chr    = sample(LETTERS, 4e6, replace=TRUE),
                strand = sample(c("+","-"), 4e6, replace=TRUE),
                tags   = as.integer(runif(4e6) * 100),
                start  = as.integer(runif(4e6) * 60000),
                end    = as.integer(runif(4e6) * 60000))
Y <- data.table(chr    = sample(LETTERS, 1e6, replace=TRUE),
                strand = sample(c("+","-"), 1e6, replace=TRUE),
                tags   = as.integer(runif(1e6) * 5),
                start  = as.integer(runif(1e6) * 60000),
                end    = as.integer(runif(1e6) * 60000))
setkey(X, chr, strand)
setkey(Y, chr, strand)

gives the following errors:

> merge(X,Y)
Error in vecseq(f__, len__) : negative length vectors are not allowed
> Y[X]
Error in vecseq(f__, len__) : negative length vectors are not allowed

This is data.table 1.8.7 on Windows x64.  Some poking around in
debug(data.table:::`[.data.table`) suggests that sum(len__) >
.Machine$integer.max after the binary merge, so the errors above
appear to come from these lines in vecseq.c:

for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];  /* reslen is a C int: the sum can wrap past INT_MAX */
ans = PROTECT(allocVector(INTSXP, reslen));               /* a wrapped, negative reslen triggers the error above */
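
As an R-level illustration of that theory (not the actual C path),
the same 32-bit limit is visible in R's own integer arithmetic; the
difference is that R returns NA with a warning where C silently wraps
to a negative value:

.Machine$integer.max    # 2147483647
x <- as.integer(2^30)
x + x                   # NA, with an "NAs produced by integer overflow" warning
# In C the sum wraps to a negative number instead, which then reaches
# allocVector() and produces "negative length vectors are not allowed".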

Does that mean a dataset of this size and structure is bumping up
against R's vector size limits for this type of merge?
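
A back-of-the-envelope count seems to support that.  The join
produces one output row per matching (X row, Y row) pair within each
(chr, strand) group, so the result length is the sum of the per-group
products (plain base R below, not data.table internals):

gx <- table(paste(X$chr, X$strand))   # group sizes in X
gy <- table(paste(Y$chr, Y$strand))   # group sizes in Y
sum(as.numeric(gx[names(gy)]) * as.numeric(gy))
# roughly 7.7e10, vastly more than .Machine$integer.max (2147483647)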

