[datatable-help] behavior of unique on data.tables with strings

Matthew Dowle mdowle at mdowle.plus.com
Tue Jan 3 00:27:11 CET 2012


Thanks for the nice report. Oddly, though, it seems to work fine for me
in both 1.7.7 and the latest 1.7.8.

$ R --vanilla
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)
> require(data.table)
Loading required package: data.table
data.table 1.7.7  For help type: help("data.table")
> foo2=as.data.table(data.frame(a=c("1", "1"), b=c(2,2), stringsAsFactors=FALSE))
> unique(foo2)
     a b
[1,] 1 2
> str(foo2)
Classes ‘data.table’ and 'data.frame':	2 obs. of  2 variables:
 $ a: chr  "1" "1"
 $ b: num  2 2
> .Machine$double.eps ^ 0.5
[1] 1.490116e-08

Could you rerun and confirm, please? If you are on 64-bit, please include
sessionInfo(). I've included the tolerance as a long shot: data.table's
unique() treats the numeric 2's as equal within that tolerance, and
perhaps that part is not working for you. Does test.data.table() pass? It
should test unique and tolerance fairly thoroughly. Otherwise I can't
think why unique would fail on the character column; it should be fine.
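
To make the tolerance point concrete, here is a minimal base-R sketch of
what "equal within tolerance" means. It is only an illustration of the
idea, not data.table's internal code; the names tol, x, y, exact and
within_tol are made up for the example.

```r
# Base R only. Illustrates comparison within a tolerance;
# this is an assumption-laden sketch, not data.table's implementation.
tol <- .Machine$double.eps ^ 0.5   # the value printed in the session above

x <- 2
y <- 2 + 1e-10                     # differs from x by less than tol

exact      <- x == y               # exact comparison of the two doubles
within_tol <- abs(x - y) < tol     # comparison within the tolerance

print(exact)
print(within_tol)
print(isTRUE(all.equal(x, y)))     # all.equal() defaults to a tolerance of about 1.5e-8
```

If the last two comparisons do not come back TRUE on your machine, that
would be worth including in the report.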

A fast unique for character columns is a good feature request; could you
please add it to the tracker? It is now possible to implement, as we
now have fast character methods.
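
For comparison only, base R's duplicated() on a plain data.frame handles
character columns without any factor conversion. The sketch below uses
base R alone and is not the fast method being requested; the names df and
dedup are made up for the example, and it may be slow on very large data.

```r
# Base R only: drop duplicate rows from a data.frame that has a
# character column, without converting that column to a factor.
df <- data.frame(a = c("1", "1"), b = c(2, 2), stringsAsFactors = FALSE)

dedup <- df[!duplicated(df), , drop = FALSE]

print(dedup)
str(dedup)    # column a should still be chr
```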

Matthew

On Mon, 2011-12-26 at 19:33 -0800, Steven C. Bagley wrote:
> In data.table 1.7.7: 
> 
> The function unique works for datatables (without keys) that have factors, but not if they have strings. In the latter case, setting the key will convert the strings to factors. I can't figure out from the documentation if this is the intended behavior or not. (The documentation does say that keys can't be characters/strings). It would be nice if unique would work without having to convert strings to factors because of the conversion cost in very large datatables, but maybe this isn't possible.
> 
> --Steve
> 
> > library(data.table)
> > foo1=as.data.table(data.frame(a=c("1", "1"), b=c(2,2)))
> > foo1
>      a b
> [1,] 1 2
> [2,] 1 2
> > str(foo1)
> Classes ‘data.table’ and 'data.frame':	2 obs. of  2 variables:
>  $ a: Factor w/ 1 level "1": 1 1
>  $ b: num  2 2
> > unique(foo1)
>      a b
> [1,] 1 2
> > foo2=as.data.table(data.frame(a=c("1", "1"), b=c(2,2), stringsAsFactors=FALSE))
> > foo2
>      a b
> [1,] 1 2
> [2,] 1 2
> > str(foo2)
> Classes ‘data.table’ and 'data.frame':	2 obs. of  2 variables:
>  $ a: chr  "1" "1"
>  $ b: num  2 2
> > unique(foo2)
>      a b
> [1,] 1 2
> [2,] 1 2
> > setkey(foo2, a)
> > str(foo2)
> Classes ‘data.table’ and 'data.frame':	2 obs. of  2 variables:
>  $ a: Factor w/ 1 level "1": 1 1
>  $ b: num  2 2
>  - attr(*, "sorted")= chr "a"
> > unique(foo2)
>      a b
> [1,] 1 2



