[datatable-help] Ran into infinite loop when setting key/sorting columns with NA
Branson Owen
branson.owen at gmail.com
Thu Jul 29 08:00:58 CEST 2010
Your answer also explains another question I had.
I can't select or join on NA like this.
> DT = data.table(A = c(NA, "A"), Z = 1:10, key = "A")
> DT[J(NA)]
Error in `[.data.table`(DT, J(NA)) :
unsorted column V1 of i is not storage.mode integer.
> DT[CJ(NA)]
Error hint: the i expression sees the column variables. Column names
(variables) will mask variables in the calling frame. Check for any
conflicts.
Error in `[.data.table`(DT, CJ(NA)) :
Error in setkey(JDT) : All keyed columns must be storage mode integer
> DT[NA]
A Z
[1,] <NA> NA
[2,] <NA> NA
[3,] <NA> NA
[4,] <NA> NA
[5,] <NA> NA
[6,] <NA> NA
[7,] <NA> NA
[8,] <NA> NA
[9,] <NA> NA
[10,] <NA> NA
The only way is probably to use the NOT-match feature you provide, like
> DT[-DT[J("A"), mult = "all", which = TRUE]]
If there is a more convenient way to select NA keys, please let me know.
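One possibly simpler route is a logical subset on is.na() in i. This is a minimal sketch, assuming a vector scan (rather than a keyed join) is acceptable and that the i expression can see column A, as the error hint above indicates. rep() only makes the recycling of the example data explicit, and the names na_rows and ok_rows are made up for illustration.

```r
library(data.table)

# Same shape as the example above: key column A holds NA and "A".
DT <- data.table(A = rep(c(NA, "A"), 5), Z = 1:10, key = "A")

# Logical subset instead of a join: keep the rows whose key is NA.
na_rows <- DT[is.na(A)]

# Complement: keep the rows whose key is present.
ok_rows <- DT[!is.na(A)]
```

This scans the whole column instead of using the key, so it is likely slower than a join on large tables, but it should be fine for quickly browsing bad data.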
If you believe that NAs in key columns are unusual, it makes sense that
the join with NA doesn't work. Actually, I just recalled that in SQL,
NULL (the counterpart of NA) is not even allowed in primary key columns.
You are right. I wish I had recalled this earlier. I guess I just got
really used to data.frame, which never complains about NA values.
However, if you change your mind and support the DT[J(NA)] syntax, please
let us know. It would be handy for quickly browsing and filtering out
outliers or bad data.
Again, thank you very much for your kind help and wise answer. I don't
think I have any more questions at the moment. I will leave most feature
requests alone since I see you already have a long wish list, except
for this one: getting around the memory limit with support from other
packages, since fast operations are naturally in greater demand on big data.
As far as I know, there are three notable groups having a significant
impact on this issue: the ff package, the bigmemory package, and the
Revolution R community, which seems to have finished its own package for
big data. They were looking for volunteers who have big data to test it
with the biglm package.
I do remember seeing some posts about ff package support, but I am
not sure how mature it is. If it's mature enough, I think it's well
worth advertising: super fast data.table on unlimited object size
with very cool and concise syntax. Just my two cents. :-)
Best regards,
2010/7/29 Matthew Dowle <mdowle at mdowle.plus.com>:
> It's not usual to have NAs in key columns but they shouldn't cause
> problems either. Do you actually have data in non-key columns on rows
> where there are NA in the key columns? That seems odd but possible I
> suppose. If all non-key columns are NA for those rows, we normally
> remove the rows.
>
> We'll need exact error messages and code and data to investigate
> further, at least I will unless anyone else has seen this before. If you
> can't reproduce then that's OK, just post what you can.
>
> There is an infinite loop problem when a data.table is created in v1.4
> and saved, then used with v1.5. Make sure the class of the data.table is
> c("data.table","data.frame"); if it's just "data.table" in v1.5 then a
> loop can sometimes occur. But that isn't related to NA in the key afaik.
>
> Thanks.
>
> On Wed, 2010-07-28 at 17:58 -0500, Branson Owen wrote:
>> ** Should I avoid using NA in data.table forever? Any comments are
>> highly appreciated. **
>>
>> I am sorry that I can't present code that reproduces the bug. I was
>> working with a lot of mid-size data sets, each with 100K rows and 10+
>> columns, 5 of which were set as the key.
>>
>> When I use a for loop to do calculations and then set the key, there is
>> only about a 5% chance that I get the following bug. I don't know why.
>>
>> I was using many NAs in two of the key factor columns. When I
>> transformed the data.table and tried to set the same key again, it
>> reacted as follows:
>>
>> Version 1.4 in 64-bit R on Windows: it (seemingly) ran into an infinite
>> loop. I couldn't break it manually, and it kept consuming CPU as
>> observed in Task Manager.
>> Version 1.5 in 32-bit R on Windows: it shortly threw an error message
>> saying something like (I'm not sure of the exact message) "sorting ran
>> into infinite loop/iteration?"
>>
>> However, the 32-bit session was using the image file saved by 64-bit R.
>> Therefore, I am not sure whether the above message is valid for this
>> bug.
>>
>> It looks like the bug has been noticed but couldn't be solved yet? I
>> also encountered many other problems, but the silent infinite freeze
>> always comes from setkey/key().
>>
>> It shouldn't run into an infinite loop, because when I replace all NAs
>> with a blank string/factor value "", setting the key and sorting work again.
>>
>> Currently, I have reset all my data to avoid NA when using data.table
>> (painful). That's why I can't reproduce the bug. I tried to fake the
>> data but it didn't work.
>>
>> I didn't see this issue discussed, so I am reporting it to everyone.
>>
>> Should I avoid using NA in data.table forever? Any comments are highly
>> appreciated.
>>
>> Best regards,