[datatable-help] R vector size limits and merging

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 10 12:33:01 CET 2013


Hi,

Fantastic. Thanks so much for this - same for me, yes.

It's similar to a huge Cartesian join where the result
would have more than 2^31 rows. data.table should
trap that gracefully and give an error like this:

"i's key is non unique; i.e., each duplicated key value
of i will join to the same group in x over and over.
The result will be huge. Are you sure?"
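
For a concrete sense of "huge", here is a minimal sketch (nothing data.table does
internally) that estimates the size of Y[X] up front, using the X and Y from the
setup quoted below. The names nX, nY, sizes and expected_rows are just
illustrative; expected_rows should be the same quantity as the sum(len__)
mentioned further down:

nX <- X[, .N, by = key(X)]   # rows of X per (chr, strand) group
nY <- Y[, .N, by = key(Y)]   # rows of Y per (chr, strand) group
sizes <- merge(nX, nY, by = c("chr", "strand"))            # N.x and N.y per group
expected_rows <- sizes[, sum(as.numeric(N.x) * as.numeric(N.y))]
expected_rows                          # roughly 7.7e10 with this data
expected_rows > .Machine$integer.max   # TRUE, far more rows than a 32-bit index can address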

Filed as a bug here:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975

I'll make it a graceful error, if I understood correctly.

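Roughly, the check could look something like the sketch below (idea only, not the
actual patch; len__ is the internal vector of per-row match counts that the quoted
message mentions). Summing in double precision keeps the check itself from
overflowing:

if (sum(as.numeric(len__)) > .Machine$integer.max)
    stop("i's key is non unique; i.e., each duplicated key value of i will join",
         " to the same group in x over and over. The result will be huge.",
         " Are you sure?")
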
Thanks!
Matthew


On 10.01.2013 10:37, J R wrote:
> While investigating the following SO question
>
> 
> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>
> the asker ran into a segfault during a merge.
>
> I tried to reproduce it based on his description of his data (a 4
> million row table and a 1 million row table, merging on two columns,
> one with 20-some unique strings and one with "+" or "-").
>
> The following setup code:
>
> library(data.table)  # needed for data.table() and setkey()
> set.seed(456)
> X <- data.table(chr    = sample(LETTERS, 4e6, replace = TRUE),
>                 strand = sample(c("+", "-"), 4e6, replace = TRUE),
>                 tags   = as.integer(runif(4e6) * 100),
>                 start  = as.integer(runif(4e6) * 60000),
>                 end    = as.integer(runif(4e6) * 60000))
> Y <- data.table(chr    = sample(LETTERS, 1e6, replace = TRUE),
>                 strand = sample(c("+", "-"), 1e6, replace = TRUE),
>                 tags   = as.integer(runif(1e6) * 5),
>                 start  = as.integer(runif(1e6) * 60000),
>                 end    = as.integer(runif(1e6) * 60000))
> setkey(X, chr, strand)
> setkey(Y, chr, strand)
>
> Gives the following errors:
>
>> merge(X,Y)
> Error in vecseq(f__, len__) : negative length vectors are not allowed
>> Y[X]
> Error in vecseq(f__, len__) : negative length vectors are not allowed
>
> In data.table 1.8.7 on Windows x64.  Poking around in
> debug(data.table:::`[.data.table`) makes it seem like sum(len__) >
> .Machine$integer.max after the binary merge, which suggests the
> above errors come from these lines in vecseq.c:
>
> for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
> ans = PROTECT(allocVector(INTSXP, reslen));
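> /* note added for readers, not in vecseq.c: reslen is (presumably) a plain
>    32-bit int, so once this total passes .Machine$integer.max it wraps to a
>    negative value, which allocVector() then rejects with the error above */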
>
> Does that mean a dataset of this size and structure is bumping up
> against R's vector size limits for this type of merge?
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

