[datatable-help] Advance warning
Matthew Dowle
mdowle at mdowle.plus.com
Thu Jan 17 01:21:47 CET 2013
Thanks. Commit 796 corrected that typo in NEWS just after I sent the
email below. So the global option was intended to be
$datatable.allowcartesian, as coded and documented. I grep'd the code
and man pages just in case and it seems ok.

Rightly or wrongly, the convention for the global options was
"datatable" (dot dropped), followed by a dot, followed by the
argument/option name with its dots dropped. That's consistent across
the 9 global options ..... oh no, it isn't, other than print.nrows and
print.topn. Darn it, those slipped through.
All the options :
datatable.verbose = FALSE
datatable.dfdispatchwarn = TRUE
datatable.alloccol = quote(max(100,2*ncol(DT)))
datatable.nomatch = NA_integer_
datatable.optimize = Inf
datatable.print.nrows = 100L
datatable.print.topn = 5L
datatable.warnredundantby = TRUE
datatable.allowcartesian = FALSE
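
For reference, these are all plain R options, read with getOption() and
set with options() in the usual way; an illustrative sketch only, with
the defaults as listed above :

  library(data.table)
  grep("^datatable\\.", names(options()), value=TRUE)  # list the ones currently set
  getOption("datatable.print.nrows")                   # 100L by default
  options(datatable.verbose=TRUE)                      # turn on verbose messages
  options(datatable.nomatch=0L)                        # change the default nomatch for joins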
Could we get away with dropping the second dot in datatable.print.nrows
and datatable.print.topn, I wonder?
Or it could be "datatable.allow.cartesian", if that's what most people
would expect. warnredundantby and dfdispatchwarn aren't argument names
to functions, and datatable.alloccol actually corresponds to the n
argument of alloc.col.
So datatable.allow.cartesian then?
On 16.01.2013 19:57, J R wrote:
> I have one little nitpick that may be important as you write
> documentation. In 796, the global option doesn't have the second
> period in it:
>
> $datatable.allowcartesian
> [1] FALSE
>
>
> On Tue, Jan 15, 2013 at 3:00 PM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>>
>> Thanks to the bug report below and the S.O. question,
>> 'allow.cartesian' is now in 1.8.7.
>> Please shout if anyone spots any issues with this.
>>
>> =====
>> New argument 'allow.cartesian' (default FALSE) added to X[Y] and
>> merge(X,Y), #2464.
>> Prevents large allocations due to misspecified joins; e.g., duplicate
>> key values in Y joining to the same group in X over and over again.
>> The word 'cartesian' is used loosely for when more than
>> max(nrow(X),nrow(Y)) rows would be returned. The error message is
>> verbose and includes advice. Thanks to a question by Nick Clark :
>> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>> help from user1935457 and a detailed reproducible crash report from JR.
>> If the new option affects existing code you can set :
>>   options(datatable.allow.cartesian=TRUE)
>> to restore the previous behaviour until you have time to address it.
>> =====
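>>
>> For illustration (untested sketch with made-up toy tables, not part
>> of the NEWS entry), usage looks roughly like :
>>
>>   library(data.table)
>>   X <- data.table(id = rep(1:2, each=3), xv = 1:6, key="id")
>>   Y <- data.table(id = rep(1:2, each=2), yv = 1:4, key="id")
>>   # X[Y] would return 12 rows, more than max(nrow(X),nrow(Y)) = 6,
>>   # so by default it now errors and points at allow.cartesian
>>   X[Y, allow.cartesian=TRUE]
>>   merge(X, Y, allow.cartesian=TRUE)
>>   options(datatable.allow.cartesian=TRUE)   # or globally, as above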
>>
>>
>>
>> On 10.01.2013 11:33, Matthew Dowle wrote:
>>>
>>> Hi,
>>>
>>> Fantastic. Thanks so much for this - same for me, yes.
>>>
>>> It's similar to a huge cartesian join where the result
>>> would have more than 2^31 rows. data.table should
>>> be trapping that gracefully and giving an error
>>> like this:
>>>
>>> "i's key is non unique; i.e., each duplicated key value
>>> of i will join to the same group in x over and over.
>>> The result will be huge. Are you sure?"
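>>>
>>> On a toy scale (illustration only, not the real data) the blow-up
>>> looks like :
>>>
>>>   library(data.table)
>>>   x <- data.table(k = rep(1:2, each=4), v = 1:8, key="k")
>>>   i <- data.table(k = rep(1:2, each=3), key="k")
>>>   # each of i's 6 rows matches all 4 rows of its group in x, so
>>>   # x[i] returns 24 rows (and once the new check is in, even this
>>>   # small case would itself ask for allow.cartesian=TRUE)
>>>   nrow(x[i])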
>>>
>>> Filed as bug here :
>>>
>>>
>>>
>>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975
>>>
>>> Will make it a graceful error, if I understood correctly?
>>>
>>> Thanks!
>>> Matthew
>>>
>>>
>>> On 10.01.2013 10:37, J R wrote:
>>>>
>>>> While investigating the following SO question
>>>>
>>>>
>>>>
>>>>
>>>> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>>>>
>>>> the asker ran into a segfault during a merge.
>>>>
>>>> I tried to reproduce it based on his description of his data (a 4
>>>> million row table and a 1 million row table, merging on two
>>>> columns,
>>>> one with 20-some unique strings and one with "+" or "-").
>>>>
>>>> The following setup code:
>>>>
>>>> library(data.table)                    # data.table 1.8.7 here
>>>> set.seed(456)
>>>> X <- data.table(chr    = sample(LETTERS, 4e6, replace=TRUE),
>>>>                 strand = sample(c("+","-"), 4e6, replace=TRUE),
>>>>                 tags   = as.integer(runif(4e6) * 100),
>>>>                 start  = as.integer(runif(4e6) * 60000),
>>>>                 end    = as.integer(runif(4e6) * 60000))
>>>> Y <- data.table(chr    = sample(LETTERS, 1e6, replace=TRUE),
>>>>                 strand = sample(c("+","-"), 1e6, replace=TRUE),
>>>>                 tags   = as.integer(runif(1e6) * 5),
>>>>                 start  = as.integer(runif(1e6) * 60000),
>>>>                 end    = as.integer(runif(1e6) * 60000))
>>>> setkey(X, chr, strand)
>>>> setkey(Y, chr, strand)
>>>>
>>>> Gives the following errors:
>>>>
>>>>> merge(X,Y)
>>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>>> Y[X]
>>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>>
>>>> In data.table 1.8.7 on Windows x64. Doing some poking around in
>>>> debug(data.table:::`[.data.table`) makes it seem like
>>>> sum(len__) > .Machine$integer.max after the binary merge, which
>>>> suggests the above errors come from these lines in vecseq.c:
>>>>
>>>> for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
>>>> ans = PROTECT(allocVector(INTSXP, reslen));
>>>>
>>>> Does that mean a dataset of this size and structure is bumping up
>>>> against R's vector size limits for this type of merge?
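>>>>
>>>> A rough back-of-the-envelope count of the expected result rows
>>>> (from the setup above; only approximating what vecseq computes, and
>>>> the names nX, nY, est are just mine) suggests so :
>>>>
>>>>   nX <- X[, list(nx=.N), by=list(chr, strand)]  # rows per key group in X
>>>>   nY <- Y[, list(ny=.N), by=list(chr, strand)]  # and in Y
>>>>   setkey(nX, chr, strand)
>>>>   setkey(nY, chr, strand)
>>>>   est <- nX[nY]                                 # 52 key groups
>>>>   sum(as.numeric(est$nx) * as.numeric(est$ny))  # ~7.7e10 result rows
>>>>   .Machine$integer.max                          # 2147483647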