[datatable-help] Behavior of setkey with factors
Matthew Dowle
mdowle at mdowle.plus.com
Thu Mar 22 22:51:45 CET 2012
Just to clear up this thread for the archives, ordered factors are now
supported in v1.8.0 and the workarounds below are no longer needed.
Matthew
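
For readers landing here from the archives, a minimal sketch of what that looks like, assuming data.table 1.8.0 or later is installed. The course codes reproduce the COURSE column from the original post quoted below; the variable names are only illustrative. The key should follow the factor's level order rather than alphabetical order:

```r
library(data.table)  # assumption: data.table >= 1.8.0

# Level order is the canonical course sequence, not alphabetical.
course_levels <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Course codes for IDs 1..10, read off the first table in the original post.
codes <- c(5L, 4L, 4L, 5L, 3L, 4L, 3L, 2L, 4L, 3L)

dt <- data.table(
  ID     = 1:10,
  COURSE = factor(course_levels[codes], levels = course_levels, ordered = TRUE)
)
setkey(dt, COURSE)

# TRUE if the rows follow the level order (integer codes are non-decreasing).
sorted_by_level <- !is.unsorted(as.integer(dt$COURSE))
print(dt)
print(sorted_by_level)
```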
On Tue, 2010-08-10 at 14:00 +0100, Matthew Dowle wrote:
> Exactly. We often have thousands of levels, or even tens or hundreds of
> thousands. Internally sortedmatch() takes advantage of the fact that the
> levels are sorted.
>
> One simple solution, and what I do sometimes, is to put the ordering into
> the level name: "01. Basic math", "02. Calculus", "03. Algebra I". There
> is very little performance penalty as those strings get hashed by R
> anyway. If you need to remove the prefix for presentation purposes, just
> "substring(course,5)" afterwards.
>
> Matthew
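
The prefix workaround described above can be sketched in base R as follows. This is only an illustration: the course order is taken from the original question, and the names `courses`, `prefixed`, `sample_codes` and `display` are made up for the example.

```r
# Canonical course order (from the original question in this thread).
courses <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Prefix each name with a zero-padded rank, so that alphabetical order
# of the prefixed names equals the canonical order.
prefixed <- sprintf("%02d. %s", seq_along(courses), courses)

# A factor built from the prefixed names has alphabetically sorted levels.
sample_codes <- c(5L, 4L, 4L, 5L, 3L)
course <- factor(prefixed[sample_codes], levels = sort(prefixed))

# Strip the 4-character prefix ("NN. ") for presentation.
display <- substring(as.character(course), 5)
print(display)
```

Zero-padding matters once there are ten or more levels; without it "10. ..." would sort before "2. ...".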
>
>
> > Damian,
> >
> > The fast lookup of data.table relies on the keys being sorted
> > alphabetically. If you do dt["Algebra II"], data.table uses an alphabetic
> > lookup to find "Algebra II". Speed is the reason (I think). If you had
> > many levels in the factor, the lookup to map the character to the integer
> > would be slow.
> >
> > One way around this is to set a key based on an integer and use an
> > indexing data.table to look up the course. Here's an example:
> >
> >> set.seed(100)
> >> my.course.sample <- sample(1:5, 10, replace=TRUE)
> >> X <- 1:10
> >> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic Math",
> >> "Calculus", "Geometry", "Algebra I", "Algebra II"))
> >> my.dt <- data.table(ID=X, COURSE=Y, k = as.integer(Y), key="k")
> >> my.dt
> > ID COURSE k
> > [1,] 4 Basic Math 1
> > [2,] 10 Basic Math 1
> > [3,] 1 Calculus 2
> > [4,] 2 Calculus 2
> > [5,] 8 Calculus 2
> > [6,] 3 Geometry 3
> > [7,] 5 Geometry 3
> > [8,] 6 Geometry 3
> > [9,] 9 Geometry 3
> > [10,] 7 Algebra II 5
> >> idx <- data.table(k = 1:5, course=c("Basic Math", "Calculus",
> >> "Geometry", "Algebra I", "Algebra II"), key = "course")
> >> my.dt[J(idx["Basic Math", k]), mult="all"]
> > ID COURSE k
> > [1,] 4 Basic Math 1
> > [2,] 10 Basic Math 1
> >> my.dt[J(idx["Algebra II", k]), mult="all"]
> > ID COURSE k
> > [1,] 7 Algebra II 5
> >
> > If you use something like that a lot, you could create a little function
> > to improve the notation a bit.
> >
> > - Tom
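
One possible shape for such a helper, as a sketch only: `rows_for_course` is a hypothetical name, not part of data.table, and the data are typed in by hand to match the table printed above instead of relying on `set.seed()`. It assumes a data.table version in which character columns can be keys and `.()` is available as an alias for `J()`.

```r
library(data.table)

course_names <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Integer course code for IDs 1..10, read off the printed my.dt above.
k <- c(2L, 2L, 3L, 1L, 3L, 3L, 5L, 2L, 3L, 1L)

my.dt <- data.table(ID = 1:10,
                    COURSE = factor(course_names[k], levels = course_names),
                    k = k, key = "k")
idx <- data.table(k = 1:5, course = course_names, key = "course")

# Hypothetical helper: map a course name to its integer code through the
# lookup table, then join on the integer key of the main table.
rows_for_course <- function(dt, lookup, name) {
  codes <- lookup[.(name), k, nomatch = 0L]  # drop names with no match
  dt[.(codes), nomatch = 0L]
}

basic <- rows_for_course(my.dt, idx, "Basic Math")
alg2  <- rows_for_course(my.dt, idx, "Algebra II")
print(basic)
print(alg2)
```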
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: datatable-help-bounces at lists.r-forge.r-project.org
> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> On Behalf Of Damian Betebenner
> >> Sent: Tuesday, August 10, 2010 06:50
> >> To: datatable-help at lists.r-forge.r-project.org
> >> Subject: [datatable-help] Behavior of setkey with factors
> >>
> >> All,
> >>
> >> I was wondering how setkey orders a factor: does it respect
> >> the ordering of an ordered factor, or does it just sort the
> >> levels alphabetically?
> >>
> >> I would like to have the key observe the order of a factor
> >> (e.g., a course taken field may run from 1 to 5 with 1=Basic
> >> Math, 2=Calculus, 3=Geometry, 4=Algebra I and 5=Algebra II). I
> >> would like the sort imposed by data.table to "respect" the
> >> canonical ordering of the classes, not an alphabetical ordering.
> >>
> >> I can't, however, seem to get the key to behave the way I want.
> >>
> >> Here's an example:
> >>
> >> set.seed(123)
> >> my.course.sample <- sample(1:5, 10, replace=TRUE)
> >>
> >> X <- 1:10
> >> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic
> >> Math", "Calculus", "Geometry", "Algebra I", "Algebra II"))
> >>
> >> my.dt <- data.table(ID=X, COURSE=Y)
> >>
> >> > my.dt
> >> ID COURSE
> >> [1,] 1 Algebra II
> >> [2,] 2 Algebra I
> >> [3,] 3 Algebra I
> >> [4,] 4 Algebra II
> >> [5,] 5 Geometry
> >> [6,] 6 Algebra I
> >> [7,] 7 Geometry
> >> [8,] 8 Calculus
> >> [9,] 9 Algebra I
> >> [10,] 10 Geometry
> >>
> >>
> >> setkey(my.dt, COURSE)
> >>
> >> > my.dt
> >> ID COURSE
> >> [1,] 2 Algebra I
> >> [2,] 3 Algebra I
> >> [3,] 6 Algebra I
> >> [4,] 9 Algebra I
> >> [5,] 1 Algebra II
> >> [6,] 4 Algebra II
> >> [7,] 8 Calculus
> >> [8,] 5 Geometry
> >> [9,] 7 Geometry
> >> [10,] 10 Geometry
> >>
> >>
> >> ###
> >> ### The COURSE key is alphabetizing based upon the labels ###
> >>
> >> ###
> >> ### Now try to impose a different ordering ###
> >>
> >> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
> >> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
> >> "Algebra II"))
> >>
> >> my.dt <- data.table(ID=X, COURSE=Y)
> >>
> >> > my.dt
> >> ID COURSE
> >> [1,] 1 Algebra I
> >> [2,] 2 Calculus
> >> [3,] 3 Calculus
> >> [4,] 4 Algebra I
> >> [5,] 5 Geometry
> >> [6,] 6 Calculus
> >> [7,] 7 Geometry
> >> [8,] 8 Algebra II
> >> [9,] 9 Calculus
> >> [10,] 10 Geometry
> >>
> >> setkey(my.dt, COURSE)
> >>
> >> > my.dt
> >> ID COURSE
> >> [1,] 1 Algebra I
> >> [2,] 3 Algebra I
> >> [3,] 9 Algebra I
> >> [4,] 2 Algebra II
> >> [5,] 4 Algebra II
> >> [6,] 8 Algebra II
> >> [7,] 7 Basic Math
> >> [8,] 5 Calculus
> >> [9,] 6 Calculus
> >> [10,] 10 Geometry
> >>
> >>
> >> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
> >> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
> >> "Algebra II"), ordered=TRUE)
> >>
> >> my.dt <- data.table(ID=X, COURSE=Y)
> >>
> >> my.dt
> >>
> >> ID COURSE
> >> [1,] 1 Algebra I
> >> [2,] 2 Calculus
> >> [3,] 3 Calculus
> >> [4,] 4 Algebra I
> >> [5,] 5 Geometry
> >> [6,] 6 Calculus
> >> [7,] 7 Geometry
> >> [8,] 8 Algebra II
> >> [9,] 9 Calculus
> >> [10,] 10 Geometry
> >>
> >> setkey(my.dt, COURSE)
> >>
> >> my.dt
> >>
> >> ID COURSE
> >> [1,] 1 Algebra I
> >> [2,] 4 Algebra I
> >> [3,] 8 Algebra II
> >> [4,] 2 Calculus
> >> [5,] 3 Calculus
> >> [6,] 6 Calculus
> >> [7,] 9 Calculus
> >> [8,] 5 Geometry
> >> [9,] 7 Geometry
> >> [10,] 10 Geometry
> >>
> >>
> >> ### Setting COURSE as the key for an ordered factor seems to
> >> over-ride the ordering associated with the factor and impose
> >> an alphabetical order.
> >>
> >>
> >> I'd like the key to respect the order associated with the factor.
> >>
> >>
> >> Any help with this greatly appreciated.
> >>
> >>
> >> Best regards,
> >>
> >>
> >>
> >> Damian Betebenner
> >> Center for Assessment
> >> PO Box 351
> >> Dover, NH 03821-0351
> >>
> >> Phone (office): (603) 516-7900
> >> Phone (cell): (857) 234-2474
> >> Fax: (603) 516-7910
> >>
> >> dbetebenner at nciea.org
> >> www.nciea.org
> >>
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>