[datatable-help] Factors may lose ordered class when used as keys

Matthew Dowle mdowle at mdowle.plus.com
Wed Jun 8 00:13:20 CEST 2011


That's not the only place ordering comes into it, at least currently.
Consider how the keys of two tables are matched when they contain the same
character values but different integer factor codes. The i levels are
first matched to the x levels, to convert the i integers to x integers,
before the binary search on the columns can start. That match of levels
to levels is itself currently a binary search, for speed when there are a
large number of levels. So two different binary searches are done.
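
As a rough illustration of that two-step matching, here is a base R sketch.
It is not data.table's actual C code: match() and findInterval() merely
stand in for the two binary searches, and the names x, i, lev_map, i_as_x
and rows are made up for the example.

```r
# Two factor columns holding overlapping character values but with
# different level sets, hence different integer codes for the same value.
x <- factor(c("a", "b", "c", "d"))               # codes 1, 2, 3, 4
i <- factor(c("d", "b"), levels = c("b", "d"))   # codes 2, 1

# Search 1: match the levels of i to the levels of x.
# (Assumption: match() stands in for the binary search over sorted levels.)
lev_map <- match(levels(i), levels(x))

# Convert the i integers into x integers using that level map.
i_as_x <- lev_map[as.integer(i)]

# Search 2: look the converted codes up in the sorted key column of x.
# (Assumption: findInterval() stands in for the binary search on the key.)
rows <- findInterval(i_as_x, as.integer(x))

# The rows found in x hold the same character values as i.
stopifnot(identical(as.character(x[rows]), as.character(i)))
```

The level map is what makes the second search meaningful: without it, code 1
in i ("b") would wrongly be taken as code 1 in x ("a").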

That was the idea anyway, but there is an efficiency issue in there
somewhere that Tom found. Now that the global character cache in R
itself has had some improvements (e.g. match and unique of character), we
might be able to allow character columns in keys soon, and a lot of
these issues would go away. That depends on sorting character vectors
quickly, and for that countingcharacterorder.c was added to data.table a
few months ago but isn't exposed to users yet (it is released in the
package and can be called using .Call). If that works out, data.table
would no longer convert character to factor when setting keys. So if the
recommended type for speed were character columns going forward, then
that might open the door for allowing unsorted levels in ordered factor
key columns, for convenience.

In the meantime we could add a warning when ordered-ness is dropped,
i.e. when is.unsorted(levels(..)) returns TRUE.
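
A minimal sketch of what such a check might look like, in base R only.
The helper name warn_if_order_lost and the wording of the warning are
hypothetical. Nothing like this exists in data.table as far as this thread
says, and the is.ordered() guard is an added assumption:

```r
# Hypothetical helper (not part of data.table): warn when keying an
# ordered factor would drop its ordered class because the levels are
# not in sorted order.
warn_if_order_lost <- function(f) {
  lost <- is.ordered(f) && is.unsorted(levels(f))
  if (lost) {
    warning("levels are not sorted; ordered class would be dropped")
  }
  invisible(lost)
}

# Allan's example: levels run Z..A, so they are unsorted.
f <- factor(LETTERS[1:3], levels = rev(LETTERS), ordered = TRUE)

# Capture the warning text instead of letting it print.
msg <- tryCatch(warn_if_order_lost(f),
                warning = function(w) conditionMessage(w))

# An ordered factor whose levels are already sorted triggers nothing.
g <- factor(LETTERS[1:3], levels = LETTERS, ordered = TRUE)
quiet <- warn_if_order_lost(g)
```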

Long answer I know but hope it helps.

Matthew


On Tue, 2011-06-07 at 20:35 +0100, Allan Engelhardt wrote:
> On 07/06/11 19:35, Matthew Dowle wrote:
> > The documentation could be improved, but ?setkey does say :
> >
> >    "The columns are sorted in ascending order always."
> 
> Yes, but my beef was not with the sort order but that the class of the 
> column changes without warning.  I have no problems with the "A" value 
> coming before "B" in the data.table but you can do that by sorting the 
> levels() character information without changing the class of the 
> column?  That is: I do not expect that X[1,A] < X[2,A] when A is an 
> ordered factor; I expect that as.character(X[1,A]) < as.character(X[2,A]).
> 
> Allan
> 
> > More information in previous thread :
> >
> > http://r.789695.n4.nabble.com/Behavior-of-setkey-with-factors-tp2319612p2319612.html
> >
> >
> > Matthew
> >
> >
> >
> > On Tue, 2011-06-07 at 07:50 +0100, Allan Engelhardt wrote:
> >> Is it documented anywhere that factors may lose their ordered-ness when
> >> used as keys?  E.g.
> >>
> >> library("data.table")
> >> F <- factor(LETTERS[1:3], levels = rev(LETTERS), ordered = TRUE)
> >> X <- data.table(A = F, B = F, key = "A")
> >> str(X)                     # A is no longer ordered; B still is
> >> stopifnot(is.ordered(X$B)) # OK
> >> stopifnot(is.ordered(X$A)) # Fails!
> >>
> >> I can kind of see why it might happen, but it still caught me by
> >> surprise, and if it ever happens on (some?) ad-hoc index lookups then it
> >> will really cause bugs in my code....
> >>
> >> Allan
> >> (Above is simpler version of example sent off-list to Matthew)
> >>
> >>   >  sessionInfo()
> >> R version 2.13.0 (2011-04-13)
> >> Platform: x86_64-unknown-linux-gnu (64-bit)
> >>
> >> locale:
> >>    [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
> >>    [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
> >>    [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
> >>    [7] LC_PAPER=en_GB.utf8       LC_NAME=C
> >>    [9] LC_ADDRESS=C              LC_TELEPHONE=C
> >> [11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
> >>
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>
> >> other attached packages:
> >> [1] data.table_1.6 ctv_0.7-2
> >>
> >> loaded via a namespace (and not attached):
> >> [1] tools_2.13.0
> >>



