[datatable-help] behavior of unique on data.tables with strings

Steven C. Bagley steven.bagley at gmail.com
Tue Jan 3 15:40:41 CET 2012


It looks like a 32- vs. 64-bit problem. I just checked, and the example runs fine in the 32-bit build on this Mac.
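
In case it helps anyone else reproduce this, here is a minimal sketch for checking which build a given R session is running. It uses only base R (.Machine$sizeof.pointer and R.version$arch); the variable names bits and arch are just placeholders for illustration.

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build.
bits <- 8L * .Machine$sizeof.pointer

# Architecture string reported by R for this build.
arch <- R.version$arch

cat("R build:", bits, "bit, arch:", arch, "\n")
```

Running this in both builds before calling unique(foo2) should make it clear which session is which when comparing results.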

Chris: thanks. I added your text to my bug report.

--Steve

On Jan 3, 2012, at 3:11 AM, Chris Neff wrote:

> Also one addendum, since I have verbose on, I got the following when
> trying to do the unique(foo2) that doesn't work:
> 
> Non-first column 1 failed radixorder1, reverting to regularorder1
> 
> Don't know if that helps at all.
> 
> On 3 January 2012 05:48, Chris Neff <caneff at gmail.com> wrote:
>> I'll confirm that I get the same behavior Steven does on 64-bit linux
>> on 1.7.8.  So 64-bit sounds like the culprit?
>> 
>> On 3 January 2012 03:01, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>> 
>>> Ok thanks. Please file a bug report (mentioning it might be a 64-bit
>>> and/or Mac-only problem), so it's not forgotten. I'm trying to fix the crash
>>> Chris reported, so will have to come back to it ...
>>> 
>>> On Mon, 2012-01-02 at 20:13 -0800, Steven C. Bagley wrote:
>>>> It still happens. (I deleted R and all packages, then reinstalled just to check.)
>>>> 
>>>> test.data.table() completes without errors.
>>>> 
>>>> Here's the session info.
>>>> 
>>>>> sessionInfo()
>>>> R version 2.14.0 (2011-10-31)
>>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>> 
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>> 
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>> 
>>>> other attached packages:
>>>> [1] data.table_1.7.7
>>>> 
>>>>> .Machine$double.eps ^ 0.5
>>>> [1] 1.490116e-08
>>>> 
>>>> --Steve
>>>> 
>>>> On Jan 2, 2012, at 3:27 PM, Matthew Dowle wrote:
>>>> 
>>>>> Thanks for the nice report. Oddly though, it seems to work ok for me
>>>>> both in 1.7.7 and latest 1.7.8.
>>>>> 
>>>>> $ R --vanilla
>>>>> R version 2.14.1 (2011-12-22)
>>>>> Platform: i686-pc-linux-gnu (32-bit)
>>>>>> require(data.table)
>>>>> Loading required package: data.table
>>>>> data.table 1.7.7  For help type: help("data.table")
>>>>>> foo2=as.data.table(data.frame(a=c("1", "1"), b=c(2,2),
>>>>> stringsAsFactors=FALSE))
>>>>>> unique(foo2)
>>>>>     a b
>>>>> [1,] 1 2
>>>>>> str(foo2)
>>>>> Classes ‘data.table’ and 'data.frame':      2 obs. of  2 variables:
>>>>> $ a: chr  "1" "1"
>>>>> $ b: num  2 2
>>>>>> .Machine$double.eps ^ 0.5
>>>>> [1] 1.490116e-08
>>>>> 
>>>>> Could you rerun and confirm, please? If you are on 64-bit, please include
>>>>> sessionInfo(). I've included the tolerance as a long shot: the numeric 2's
>>>>> are considered equal by data.table's unique() using a tolerance. Perhaps
>>>>> that part is not working for you. Does test.data.table() work? It should
>>>>> test unique() and tolerance fairly thoroughly. Otherwise I can't think why
>>>>> unique() wouldn't like the character column; it should be ok.
>>>>> 
>>>>> A fast unique() for character columns is a good feature request; please
>>>>> could you add it to the tracker. That is now possible to implement as we
>>>>> now have fast character methods.
>>>>> 
>>>>> Matthew
>>>>> 
>>>>> On Mon, 2011-12-26 at 19:33 -0800, Steven C. Bagley wrote:
>>>>>> In data.table 1.7.7:
>>>>>> 
>>>>>> The function unique() works for data.tables (without keys) that have factor columns, but not if they have character (string) columns. In the latter case, setting the key converts the strings to factors, after which unique() works. I can't figure out from the documentation whether this is the intended behavior. (The documentation does say that keys can't be character columns.) It would be nice if unique() worked without having to convert strings to factors, because of the conversion cost in very large data.tables, but maybe this isn't possible.
>>>>>> 
>>>>>> --Steve
>>>>>> 
>>>>>>> library(data.table)
>>>>>>> foo1=as.data.table(data.frame(a=c("1", "1"), b=c(2,2)))
>>>>>>> foo1
>>>>>>     a b
>>>>>> [1,] 1 2
>>>>>> [2,] 1 2
>>>>>>> str(foo1)
>>>>>> Classes ‘data.table’ and 'data.frame':     2 obs. of  2 variables:
>>>>>> $ a: Factor w/ 1 level "1": 1 1
>>>>>> $ b: num  2 2
>>>>>>> unique(foo1)
>>>>>>     a b
>>>>>> [1,] 1 2
>>>>>>> foo2=as.data.table(data.frame(a=c("1", "1"), b=c(2,2), stringsAsFactors=FALSE))
>>>>>>> foo2
>>>>>>     a b
>>>>>> [1,] 1 2
>>>>>> [2,] 1 2
>>>>>>> str(foo2)
>>>>>> Classes ‘data.table’ and 'data.frame':     2 obs. of  2 variables:
>>>>>> $ a: chr  "1" "1"
>>>>>> $ b: num  2 2
>>>>>>> unique(foo2)
>>>>>>     a b
>>>>>> [1,] 1 2
>>>>>> [2,] 1 2
>>>>>>> setkey(foo2, a)
>>>>>>> str(foo2)
>>>>>> Classes ‘data.table’ and 'data.frame':     2 obs. of  2 variables:
>>>>>> $ a: Factor w/ 1 level "1": 1 1
>>>>>> $ b: num  2 2
>>>>>> - attr(*, "sorted")= chr "a"
>>>>>>> unique(foo2)
>>>>>>     a b
>>>>>> [1,] 1 2
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


