[datatable-help] Factors may lose ordered class when used as keys

Matthew Dowle mdowle at mdowle.plus.com
Thu Mar 22 23:09:53 CET 2012


Just to clear up this thread for the archives: ordered factors are
supported as of v1.8.0, so much of the below is no longer true. If
there are any further issues with ordered factors, please let us know.

Matthew

On Tue, 2011-06-07 at 23:13 +0100, Matthew Dowle wrote:
> That's not the only place ordering comes into it, currently at least.
> Consider how the keys of 2 tables are matched when they contain the same
> character values but different integer factor values. There is a match
> of i levels to x levels first, to convert the i integers to x integers,
> before the binary search on the columns can start. That match of levels
> to levels is currently a binary search too, for speed when there are a
> large number of levels. So, two different binary searches are done.
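
The two searches described above can be sketched in base R. This is only an illustration of the idea, not data.table's internal code: `bsearch`, `level_map`, `i_codes` and `rows` are hypothetical names, and the sketch assumes the levels of x and the key column of x are already sorted.

```r
# Plain binary search over a sorted vector; returns the position of a
# matching element (not necessarily the first one), or NA if none is found.
bsearch <- function(sorted, target) {
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (sorted[mid] == target) return(mid)
    if (sorted[mid] < target) lo <- mid + 1L else hi <- mid - 1L
  }
  NA_integer_
}

# x plays the role of the keyed column: sorted levels, sorted values.
x <- factor(c("apple", "cherry", "cherry", "plum"),
            levels = c("apple", "cherry", "plum"))
# i holds the same strings but with different integer codes.
i <- factor(c("plum", "cherry"), levels = c("plum", "cherry"))

# Search 1: match the levels of i to the levels of x,
# then translate the integer codes of i into x's codes.
level_map <- vapply(levels(i), function(l) bsearch(levels(x), l), integer(1))
i_codes <- unname(level_map[as.integer(i)])

# Search 2: binary search for each translated code in the sorted key column.
x_codes <- as.integer(x)
rows <- vapply(i_codes, function(code) bsearch(x_codes, code), integer(1))

as.character(x)[rows]
```

Only the first search touches the character data; the second works purely on integers, which is why the level matching has to happen before it.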
> 
> That was the idea anyway but there is an efficiency issue in there
> somewhere that Tom found. Now that the global character cache in R
> itself has had some improvements (e.g. match and unique of character) we
> might be able to allow character columns in keys soon, and a lot of
> these issues go away. That depends on sorting character vectors quickly,
> and for that countingcharacterorder.c was added to data.table a few
> months ago but isn't exposed to users yet (it is released in the package
> and can be called using .Call).  If that works out, data.table would no
> longer convert character to factor when setting keys. So if the
> recommended type for speed were character columns going forward, then
> that might open the door for allowing unsorted levels in ordered factor
> key columns, for convenience.
> 
> In the meantime we could add a warning when ordered-ness is dropped;
> i.e. when is.unsorted(levels(..)) returns TRUE.
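
A minimal sketch of the check proposed above, in base R. The helper names `levels_unsorted` and `warn_if_order_dropped` are hypothetical and not part of data.table; the sketch only assumes that the condition of interest is an ordered factor whose levels are not in sorted order.

```r
# TRUE when f is an ordered factor whose levels are not sorted,
# i.e. the case in which keying would drop the ordered class.
levels_unsorted <- function(f) {
  is.ordered(f) && is.unsorted(levels(f))
}

# Hypothetical wrapper that raises the warning and returns the flag invisibly.
warn_if_order_dropped <- function(f) {
  dropped <- levels_unsorted(f)
  if (dropped) {
    warning("ordered factor has unsorted levels; ",
            "ordered-ness would be dropped when used as a key")
  }
  invisible(dropped)
}

f_rev <- factor(LETTERS[1:3], levels = rev(LETTERS), ordered = TRUE)
f_asc <- factor(LETTERS[1:3], ordered = TRUE)

levels_unsorted(f_rev)   # levels run Z..A, so they are unsorted
levels_unsorted(f_asc)   # levels run A..C, so they are sorted
```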
> 
> Long answer I know but hope it helps.
> 
> Matthew
> 
> 
> On Tue, 2011-06-07 at 20:35 +0100, Allan Engelhardt wrote:
> > On 07/06/11 19:35, Matthew Dowle wrote:
> > > The documentation could be improved, but ?setkey does say :
> > >
> > >    "The columns are sorted in ascending order always."
> > 
> > Yes, but my beef was not with the sort order but with the class of the
> > column changing without warning.  I have no problem with the "A" value
> > coming before "B" in the data.table, but can't you do that by sorting on
> > the levels() character information without changing the class of the
> > column?  That is: I do not expect that X[1,A] < X[2,A] when A is an
> > ordered factor; I expect that as.character(X[1,A]) < as.character(X[2,A]).
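
The distinction drawn above can be shown with a plain vector in base R, leaving data.table out of it. This is a sketch of the expectation only, not of how setkey behaves; `f` and `sorted` are illustrative names.

```r
# An ordered factor whose levels run Z..A, as in the original example.
f <- factor(c("B", "C", "A"), levels = rev(LETTERS), ordered = TRUE)

# Sort by the character values rather than by the level order.
sorted <- f[order(as.character(f))]

is.ordered(sorted)                                   # the class is kept
as.character(sorted)                                 # ascending as strings
sorted[1] < sorted[2]                                # compared by level order
as.character(sorted[1]) < as.character(sorted[2])    # compared as strings
```

Subsetting with `[` keeps the levels and the ordered class, so the values end up in ascending character order while comparisons between elements still follow the reversed level order.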
> > 
> > Allan
> > 
> > > More information in previous thread :
> > >
> > > http://r.789695.n4.nabble.com/Behavior-of-setkey-with-factors-tp2319612p2319612.html
> > >
> > >
> > > Matthew
> > >
> > >
> > >
> > > On Tue, 2011-06-07 at 07:50 +0100, Allan Engelhardt wrote:
> > >> Is it documented anywhere that factors may lose their ordered-ness when
> > >> used as keys?  E.g.
> > >>
> > >> library("data.table")
> > >> F <- factor(LETTERS[1:3], levels = rev(LETTERS), ordered = TRUE)
> > >> X <- data.table(A = F, B = F, key = "A")
> > >> str(X)                     # A is no longer ordered; B still is
> > >> stopifnot(is.ordered(X$B)) # OK
> > >> stopifnot(is.ordered(X$A)) # Fails!
> > >>
> > >> I can kind of see why it might happen, but it still caught me by
> > >> surprise, and if it ever happens on (some?) ad-hoc index lookups then it
> > >> will really cause bugs in my code....
> > >>
> > >> Allan
> > >> (Above is simpler version of example sent off-list to Matthew)
> > >>
> > >>   >  sessionInfo()
> > >> R version 2.13.0 (2011-04-13)
> > >> Platform: x86_64-unknown-linux-gnu (64-bit)
> > >>
> > >> locale:
> > >>    [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
> > >>    [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
> > >>    [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
> > >>    [7] LC_PAPER=en_GB.utf8       LC_NAME=C
> > >>    [9] LC_ADDRESS=C              LC_TELEPHONE=C
> > >> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
> > >>
> > >> attached base packages:
> > >> [1] stats     graphics  grDevices utils     datasets  methods   base
> > >>
> > >> other attached packages:
> > >> [1] data.table_1.6 ctv_0.7-2
> > >>
> > >> loaded via a namespace (and not attached):
> > >> [1] tools_2.13.0
> > >>
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> 
> 



