[datatable-help] Advance warning

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 17 01:58:26 CET 2013


It was pretty clear, so I just changed it (commit 797) to:
datatable.allow.cartesian

So, where an option corresponds to an argument name, the option is
"datatable.<exact.arg.name>".

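To illustrate the naming rule, here is a minimal sketch in base R. The
helper dt_option_name() is hypothetical (it is not part of data.table); it
only shows how an argument name such as allow.cartesian maps to its global
option name. options() and getOption() are standard base R.

```r
# Hypothetical helper (not part of data.table): build the option name
# from an argument name using the "datatable.<exact.arg.name>" rule.
dt_option_name <- function(arg) paste0("datatable.", arg)

# Set the option for the session, remembering the previous value.
old <- options(datatable.allow.cartesian = TRUE)

# Look the option up by the name derived from the argument.
opt_name  <- dt_option_name("allow.cartesian")
opt_value <- getOption(opt_name, default = FALSE)

# Restore whatever was set before.
options(old)
```

Saving the result of options() and restoring it afterwards keeps the change
local, which may be preferable to leaving the option switched on globally.
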
On 17.01.2013 00:21, Matthew Dowle wrote:
> Thanks. Commit 796 corrected that typo in NEWS just after I sent the
> email below.  So the global option was intended to be
> $datatable.allowcartesian as coded and documented. I grep'd code and
> man pages just in case and seems ok.
>
> Rightly or wrongly, for the global options we dropped the dot in
> data.table to give "datatable", followed by a dot, followed by the
> argument/option name with its dots dropped. That's consistent across
> the 9 global options ..... oh no, it isn't .. other than print.nrows
> and print.topn.  Darn it, those slipped through.
>
> All the options :
>
> datatable.verbose            = FALSE
> datatable.dfdispatchwarn     = TRUE
> datatable.alloccol           = quote(max(100,2*ncol(DT)))
> datatable.nomatch            = NA_integer_
> datatable.optimize           = Inf
> datatable.print.nrows        = 100L
> datatable.print.topn         = 5L
> datatable.warnredundantby    = TRUE
> datatable.allowcartesian     = FALSE
>
> Could we get away with dropping the 2nd dots in print.nrows and
> print.topn I wonder?
>
> Or, it could be "datatable.allow.cartesian" if that's what most
> people would expect then?  warnredundantby and dfdispatchwarn aren't
> argument names to functions, and datatable.alloccol actually
> corresponds to n of alloc.col.
>
> So datatable.allow.cartesian then?
>
>
>
> On 16.01.2013 19:57, J R wrote:
>> I have one little nitpick that may be important as you write
>> documentation.  In 796, the global option doesn't have the second
>> period in it:
>>
>> $datatable.allowcartesian
>> [1] FALSE
>>
>>
>> On Tue, Jan 15, 2013 at 3:00 PM, Matthew Dowle
>> <mdowle at mdowle.plus.com> wrote:
>>>
>>> Thanks to the bug report below and S.O. question, 'allow.cartesian'
>>> is now in 1.8.7.
>>> Please shout if anyone spots any issues with this.
>>>
>>> =====
>>> New argument 'allow.cartesian' (default FALSE) added to X[Y] and
>>> merge(X,Y), #2464.
>>> Prevents large allocations due to misspecified joins; e.g., duplicate
>>> key values in Y joining to the same group in X over and over again.
>>> The word 'cartesian' is used loosely for when more than
>>> max(nrow(X),nrow(Y)) rows would be returned. The error message is
>>> verbose and includes advice. Thanks to a question by Nick Clark :
>>>
>>> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>>>
>>> help from user1935457 and a detailed reproducible crash report from
>>> JR.
>>> If the new option affects existing code you can set :
>>>   options(datatable.allow.cartesian=TRUE)
>>> to restore the previous behaviour until you have time to address it.
>>> =====
>>>
>>>
>>>
>>> On 10.01.2013 11:33, Matthew Dowle wrote:
>>>>
>>>> Hi,
>>>>
>>>> Fantastic. Thanks so much for this - same for me, yes.
>>>>
>>>> It's similar to a huge cartesian join where the result
>>>> would have more than 2^31 rows. data.table should
>>>> be trapping that gracefully and giving an error
>>>> like this:
>>>>
>>>> "i's key is non unique; i.e., each duplicated key value
>>>> of i will join to the same group in x over and over.
>>>> The result will be huge. Are you sure?"
>>>>
>>>> Filed as bug here :
>>>>
>>>>
>>>> 
>>>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975
>>>>
>>>> Will make it a graceful error, if I understood correctly?
>>>>
>>>> Thanks!
>>>> Matthew
>>>>
>>>>
>>>> On 10.01.2013 10:37, J R wrote:
>>>>>
>>>>> While investigating the following SO question
>>>>>
>>>>>
>>>>>
>>>>> 
>>>>> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>>>>>
>>>>> the asker ran into a segfault during a merge.
>>>>>
>>>>> I tried to reproduce it based on his description of his data (a 4
>>>>> million row table and a 1 million row table, merging on two columns,
>>>>> one with 20-some unique strings and one with "+" or "-").
>>>>>
>>>>> The following setup code:
>>>>>
>>>>> library(data.table)
>>>>> set.seed(456)
>>>>> X <- data.table(chr    = sample(LETTERS, 4e6, replace=TRUE),
>>>>>                 strand = sample(c("+","-"), 4e6, replace=TRUE),
>>>>>                 tags   = as.integer(runif(4e6) * 100),
>>>>>                 start  = as.integer(runif(4e6) * 60000),
>>>>>                 end    = as.integer(runif(4e6) * 60000))
>>>>> Y <- data.table(chr    = sample(LETTERS, 1e6, replace=TRUE),
>>>>>                 strand = sample(c("+","-"), 1e6, replace=TRUE),
>>>>>                 tags   = as.integer(runif(1e6) * 5),
>>>>>                 start  = as.integer(runif(1e6) * 60000),
>>>>>                 end    = as.integer(runif(1e6) * 60000))
>>>>> setkey(X, chr, strand)   # key both tables on the join columns
>>>>> setkey(Y, chr, strand)
>>>>>
>>>>> Gives the following errors:
>>>>>
>>>>> > merge(X,Y)
>>>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>>> > Y[X]
>>>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>>>
>>>>> In data.table 1.8.7 on Windows x64.  Doing some poking around in
>>>>> debug(data.table:::`[.data.table`) makes it seem like
>>>>> sum(len__) > .Machine$integer.max after the binary merge, so the
>>>>> above errors might come from these lines in vecseq.c:
>>>>>
>>>>> for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
>>>>> ans = PROTECT(allocVector(INTSXP, reslen));
>>>>>
>>>>> Does that mean a dataset of this size and structure is bumping up
>>>>> against R's vector size limits for this type of merge?
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>>
>>>>> 
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
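
The size argument made in the thread can be checked without running the
join. The following is a rough sketch in base R only, under stated
assumptions: the tables are scaled down by a factor of 100 so it runs
quickly, only the two key columns are simulated, and all variable names are
illustrative. For each key shared by both tables, a join returns (rows in X)
times (rows in Y), so the total is the sum of those products.

```r
# Base R only; 100x smaller than the tables in the report.
set.seed(456)
n_x <- 4e4
n_y <- 1e4

# Simulate just the join key (chr, strand) for each table.
x_key <- paste(sample(LETTERS, n_x, replace = TRUE),
               sample(c("+", "-"), n_x, replace = TRUE))
y_key <- paste(sample(LETTERS, n_y, replace = TRUE),
               sample(c("+", "-"), n_y, replace = TRUE))

# Rows per key group in each table.
x_counts <- table(x_key)
y_counts <- table(y_key)

# Each shared key contributes (rows in X) * (rows in Y) result rows.
# as.numeric() keeps the sum in double precision so it cannot overflow
# the integer range the way the C loop quoted above can.
shared <- intersect(names(x_counts), names(y_counts))
result_rows <- sum(as.numeric(x_counts[shared]) * as.numeric(y_counts[shared]))

# 'cartesian' in the loose sense used in NEWS: more rows than the larger input.
exceeds_cartesian_threshold <- result_rows > max(n_x, n_y)

# Scaling both tables up by 100 (to 4e6 and 1e6 rows) scales the result
# by roughly 100^2; compare that against R's integer limit.
full_size_rows <- result_rows * 100^2
exceeds_int_max <- full_size_rows > .Machine$integer.max
```

With only 52 distinct keys, every group in one table matches a large group
in the other, which is why the estimated row count grows with the product of
the table sizes rather than with their sum.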


