[datatable-help] behavior of unique on data.tables with strings

Chris Neff caneff at gmail.com
Tue Jan 3 12:11:56 CET 2012


One addendum: since I have verbose on, I got the following message when
running the unique(foo2) call that doesn't work:

Non-first column 1 failed radixorder1, reverting to regularorder1

Don't know if that helps at all.
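
For anyone who wants to check whether their own setup is affected, here is a minimal sketch that rebuilds foo2 from the original report and compares data.table's unique() against base R's data.frame method. The names n_dt and n_df are made up for illustration, and the data.frame comparison is only a reference point I'm assuming is useful, not something data.table itself does.

```r
library(data.table)

# Same construction as foo2 in the original report: a character column
# (not a factor) plus a numeric column, with two identical rows.
foo2 <- as.data.table(data.frame(a = c("1", "1"), b = c(2, 2),
                                 stringsAsFactors = FALSE))

# Row counts after de-duplication: once via data.table's unique(), once via
# base R's data.frame method (used here only as a reference point).
n_dt <- nrow(unique(foo2))
n_df <- nrow(unique(as.data.frame(foo2)))

# Print both counts together with the details a bug report would need.
cat("data.table unique rows:", n_dt, "\n")
cat("data.frame unique rows:", n_df, "\n")
cat("platform:", R.version$platform, "\n")
cat("data.table version:", as.character(packageVersion("data.table")), "\n")
```

If the first count is 2 and the second is 1, that matches what Steven and I are seeing; the platform and version lines can go straight into the bug report.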

On 3 January 2012 05:48, Chris Neff <caneff at gmail.com> wrote:
> I can confirm that I get the same behavior Steven does, on 64-bit Linux
> with 1.7.8.  So 64-bit sounds like the culprit?
>
> On 3 January 2012 03:01, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>> Ok thanks. Please file a bug report (mentioning it might be a 64-bit
>> and/or Mac-only problem), so it's not forgotten. I'm trying to fix the
>> crash Chris reported, so will have to come back to it ...
>>
>> On Mon, 2012-01-02 at 20:13 -0800, Steven C. Bagley wrote:
>>> It still happens. (I deleted R and all packages, then reinstalled just to check.)
>>>
>>> test.data.table() completes without errors.
>>>
>>> Here's the session info.
>>>
>>> > sessionInfo()
>>> R version 2.14.0 (2011-10-31)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] data.table_1.7.7
>>>
>>> > .Machine$double.eps ^ 0.5
>>> [1] 1.490116e-08
>>>
>>> --Steve
>>>
>>> On Jan 2, 2012, at 3:27 PM, Matthew Dowle wrote:
>>>
>>> > Thanks for the nice report. Oddly though, it seems to work ok for me
>>> > both in 1.7.7 and latest 1.7.8.
>>> >
>>> > $ R --vanilla
>>> > R version 2.14.1 (2011-12-22)
>>> > Platform: i686-pc-linux-gnu (32-bit)
>>> >> require(data.table)
>>> > Loading required package: data.table
>>> > data.table 1.7.7  For help type: help("data.table")
>>> >> foo2=as.data.table(data.frame(a=c("1", "1"), b=c(2,2), stringsAsFactors=FALSE))
>>> >> unique(foo2)
>>> >     a b
>>> > [1,] 1 2
>>> >> str(foo2)
>>> > Classes ‘data.table’ and 'data.frame':      2 obs. of  2 variables:
>>> > $ a: chr  "1" "1"
>>> > $ b: num  2 2
>>> >> .Machine$double.eps ^ 0.5
>>> > [1] 1.490116e-08
>>> >
>>> > Could you rerun and confirm please. If you are 64bit, please include
>>> > sessionInfo(). I've included tolerance as a long shot - the numeric 2's
>>> > are considered equal by data.table's unique() using tolerance. Perhaps
>>> > that part is not working for you. Does test.data.table() work? It should
>>> > test unique and tolerance fairly thoroughly. Otherwise I can't think why
>>> > unique would have trouble with the character column; it should be ok.
>>> >
>>> > A fast unique for character columns is a good feature request, please
>>> > could you add to the tracker. That is now possible to implement as we
>>> > now have fast character methods.
>>> >
>>> > Matthew
>>> >
>>> > On Mon, 2011-12-26 at 19:33 -0800, Steven C. Bagley wrote:
>>> >> In data.table 1.7.7:
>>> >>
>>> >> The function unique works for data.tables (without keys) that have factors, but not if they have strings. In the latter case, setting the key will convert the strings to factors. I can't figure out from the documentation whether this is the intended behavior. (The documentation does say that keys can't be characters/strings.) It would be nice if unique worked without having to convert strings to factors, because of the conversion cost in very large data.tables, but maybe this isn't possible.
>>> >>
>>> >> --Steve
>>> >>
>>> >>> library(data.table)
>>> >>> foo1=as.data.table(data.frame(a=c("1", "1"), b=c(2,2)))
>>> >>> foo1
>>> >>     a b
>>> >> [1,] 1 2
>>> >> [2,] 1 2
>>> >>> str(foo1)
>>> >> Classes ‘data.table’ and 'data.frame':     2 obs. of  2 variables:
>>> >> $ a: Factor w/ 1 level "1": 1 1
>>> >> $ b: num  2 2
>>> >>> unique(foo1)
>>> >>     a b
>>> >> [1,] 1 2
>>> >>> foo2=as.data.table(data.frame(a=c("1", "1"), b=c(2,2), stringsAsFactors=FALSE))
>>> >>> foo2
>>> >>     a b
>>> >> [1,] 1 2
>>> >> [2,] 1 2
>>> >>> str(foo2)
>>> >> Classes ‘data.table’ and 'data.frame':     2 obs. of  2 variables:
>>> >> $ a: chr  "1" "1"
>>> >> $ b: num  2 2
>>> >>> unique(foo2)
>>> >>     a b
>>> >> [1,] 1 2
>>> >> [2,] 1 2
>>> >>> setkey(foo2, a)
>>> >>> str(foo2)
>>> >> Classes ‘data.table’ and 'data.frame':     2 obs. of  2 variables:
>>> >> $ a: Factor w/ 1 level "1": 1 1
>>> >> $ b: num  2 2
>>> >> - attr(*, "sorted")= chr "a"
>>> >>> unique(foo2)
>>> >>     a b
>>> >> [1,] 1 2
>>> >> _______________________________________________
>>> >> datatable-help mailing list
>>> >> datatable-help at lists.r-forge.r-project.org
>>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> >
>>> >
>>>
>>
>>

