[datatable-help] Advance warning

J R fe292a at gmail.com
Wed Jan 16 20:57:54 CET 2013


I have one little nitpick that may be important as you write the
documentation.  In revision 796, the global option doesn't have the
second period in it:

$datatable.allowcartesian
[1] FALSE
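
A quick way to check which spelling an installed build actually
registers (just a sketch; run it after loading data.table):

grep("cartesian", names(options()), value = TRUE)
# [1] "datatable.allowcartesian"    <- what I see in 796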


On Tue, Jan 15, 2013 at 3:00 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Thanks to the bug report below and S.O. question, 'allow.cartesian' is now
> in 1.8.7.
> Please shout if anyone spots any issues with this.
>
> =====
> New argument 'allow.cartesian' (default FALSE) added to X[Y] and
> merge(X,Y), #2464. Prevents large allocations due to misspecified
> joins; e.g., duplicate key values in Y joining to the same group in X
> over and over again. The word 'cartesian' is used loosely for when
> more than max(nrow(X),nrow(Y)) rows would be returned. The error
> message is verbose and includes advice. Thanks to a question by Nick
> Clark :
>
> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
> help from user1935457, and a detailed reproducible crash report from JR.
> If the new option affects existing code you can set :
>   options(datatable.allow.cartesian=TRUE)
> to restore the previous behaviour until you have time to address it.
> =====
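>
> For instance (a toy sketch of my own, not from the NEWS entry), the
> check trips like this :
>
>   library(data.table)
>   X <- data.table(id = c("a","a","b"), v = 1:3, key = "id")
>   Y <- data.table(id = rep("a", 3L), w = 4:6, key = "id")
>   X[Y]                          # 6 rows > max(nrow(X),nrow(Y)) = 3 : errors with advice
>   X[Y, allow.cartesian = TRUE]  # opts in to the large result explicitly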
>
>
>
> On 10.01.2013 11:33, Matthew Dowle wrote:
>>
>> Hi,
>>
>> Fantastic. Thanks so much for this - same for me, yes.
>>
>> It's similar to a huge cartesian join where the result
>> would have more than 2^31 rows. data.table should
>> be trapping that gracefully and giving an error
>> like this:
>>
>> "i's key is non unique; i.e., each duplicated key value
>> of i will join to the same group in x over and over.
>> The result will be huge. Are you sure?"
>>
>> Filed as bug here :
>>
>>
>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975
>>
>> Will make it a graceful error, if I understood correctly?
>>
>> Thanks!
>> Matthew
>>
>>
>> On 10.01.2013 10:37, J R wrote:
>>>
>>> While investigating the following SO question
>>>
>>>
>>>
>>> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>>>
>>> the asker ran into a segfault during a merge.
>>>
>>> I tried to reproduce it based on his description of his data (a 4
>>> million row table and a 1 million row table, merging on two columns,
>>> one with 20-some unique strings and one with "+" or "-").
>>>
>>> The following setup code:
>>>
>>> library(data.table)
>>> set.seed(456)
>>> X <- data.table(chr    = sample(LETTERS, 4e6, replace=TRUE),
>>>                 strand = sample(c("+","-"), 4e6, replace=TRUE),
>>>                 tags   = as.integer(runif(4e6) * 100),
>>>                 start  = as.integer(runif(4e6) * 60000),
>>>                 end    = as.integer(runif(4e6) * 60000))
>>> Y <- data.table(chr    = sample(LETTERS, 1e6, replace=TRUE),
>>>                 strand = sample(c("+","-"), 1e6, replace=TRUE),
>>>                 tags   = as.integer(runif(1e6) * 5),
>>>                 start  = as.integer(runif(1e6) * 60000),
>>>                 end    = as.integer(runif(1e6) * 60000))
>>> setkey(X, chr, strand)
>>> setkey(Y, chr, strand)
>>>
>>> gives the following errors:
>>>
>>>> merge(X,Y)
>>>
>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>>
>>>> Y[X]
>>>
>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>
>>> This is with data.table 1.8.7 on Windows x64.  Some poking around in
>>> debug(data.table:::`[.data.table`) suggests that sum(len__) >
>>> .Machine$integer.max after the binary merge, so the errors above
>>> presumably come from these lines in vecseq.c:
>>>
>>> for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
>>> ans = PROTECT(allocVector(INTSXP, reslen));
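>>>
>>> If reslen is a plain C int there (my assumption from the snippet), the
>>> running sum wraps negative once it passes INT_MAX, which would explain
>>> the allocVector error. R's own integer arithmetic hits the same wall:
>>>
>>> sum(c(2147483647L, 1L))   # NA, with an integer overflow warning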
>>>
>>> Does that mean a dataset of this size and structure is bumping up
>>> against R's vector size limits for this type of merge?
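>>>
>>> A back-of-envelope check (a sketch, using the setup above) suggests so:
>>>
>>> cx <- X[, .N, keyby = .(chr, strand)]       # rows per key group in X
>>> cy <- Y[, .N, keyby = .(chr, strand)]       # rows per key group in Y
>>> m <- cx[cy]                                 # align group counts on the key
>>> sum(as.numeric(m$N) * as.numeric(m$i.N))    # ~7.7e10 expected rows
>>> .Machine$integer.max                        # 2147483647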
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>>
>>>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

