[datatable-help] Behavior of setkey with factors

Matthew Dowle mdowle at mdowle.plus.com
Thu Mar 22 22:51:45 CET 2012


Just to clear up this thread for the archives, ordered factors are now
supported in v1.8.0 and the workarounds below are no longer needed.

Matthew
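
For readers arriving from the archives, here is a minimal sketch of keying on an ordered factor with a recent data.table release. The sample values and object names are invented for illustration. It also assumes, which this thread does not spell out, that setkey() sorts a factor column by its level order rather than alphabetically by label.

```r
library(data.table)

# Canonical course order, as in the original question further down the thread.
course_levels <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Illustrative data only; these course values are invented for this sketch.
dt <- data.table(
  ID = 1:6,
  COURSE = factor(
    c("Algebra II", "Basic Math", "Geometry", "Calculus", "Algebra I", "Basic Math"),
    levels = course_levels, ordered = TRUE
  )
)

setkey(dt, COURSE)

# Assumption: rows now follow the level order, not the alphabetical order.
print(dt)

# Keyed lookup by label.
print(dt[.("Algebra II")])
```

If the printed rows come out alphabetically instead, the installed version behaves differently from what is assumed here, and the workarounds quoted below still apply.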

On Tue, 2010-08-10 at 14:00 +0100, Matthew Dowle wrote:
> Exactly. We often have thousands of levels, or even tens or hundreds of
> thousands. Internally, sortedmatch() takes advantage of the fact that the
> levels are sorted.
> 
> One simple solution, and what I do sometimes, is to put the ordering into
> the level name: "01. Basic math", "02. Calculus", "03. Algebra I".  There
> is very little performance penalty as those strings get hashed by R
> anyway. If you need to remove the prefix for presentation purposes, just
> use substring(course, 5) afterwards.
> 
> Matthew
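
The prefix trick described above can be sketched in base R as follows. The course names come from the original question. The two-digit padding is an assumption that only covers up to 99 levels; wider padding would need a matching change to the substring() offset.

```r
# Canonical order from the original question (used here only as sample data).
courses <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Prepend a zero-padded rank so alphabetical order equals the canonical order.
# The "NN. " prefix is exactly 4 characters wide.
prefixed <- sprintf("%02d. %s", seq_along(courses), courses)

# Shuffle, then sort alphabetically, the way sorted factor levels would be.
# method = "radix" sorts in the C locale, so the result is locale independent.
set.seed(1)
sorted <- sort(sample(prefixed), method = "radix")

# Drop the 4-character prefix for presentation.
shown <- substring(sorted, 5)
print(shown)
```

Because the rank is part of the string, any alphabetical sort recovers the intended order without a separate lookup table.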
> 
> 
> > Damian,
> >
> > The fast lookup of data.table relies on the keys being sorted
> > alphabetically. If you do dt["Algebra II"], data.table uses an alphabetic
> > lookup to find "Algebra II". Speed is the reason (I think). If you had
> > many levels in the factor, the lookup to map the character to the integer
> > would be slow.
> >
> > One way around this is to set a key based on an integer and use an
> > indexing data.table to look up the course. Here's an example:
> >
> >> set.seed(100)
> >> my.course.sample <- sample(1:5, 10, replace=TRUE)
> >> X <- 1:10
> >> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic Math",
> >> "Calculus", "Geometry", "Algebra I", "Algebra II"))
> >> my.dt <- data.table(ID=X, COURSE=Y, k = as.integer(Y), key="k")
> >> my.dt
> >       ID     COURSE k
> >  [1,]  4 Basic Math 1
> >  [2,] 10 Basic Math 1
> >  [3,]  1   Calculus 2
> >  [4,]  2   Calculus 2
> >  [5,]  8   Calculus 2
> >  [6,]  3   Geometry 3
> >  [7,]  5   Geometry 3
> >  [8,]  6   Geometry 3
> >  [9,]  9   Geometry 3
> > [10,]  7 Algebra II 5
> >> idx <- data.table(k = 1:5,  course=c("Basic Math", "Calculus",
> >> "Geometry", "Algebra I", "Algebra II"), key = "course")
> >> my.dt[J(idx["Basic Math", k]), mult="all"]
> >      ID     COURSE k
> > [1,]  4 Basic Math 1
> > [2,] 10 Basic Math 1
> >> my.dt[J(idx["Algebra II", k]), mult="all"]
> >      ID     COURSE k
> > [1,]  7 Algebra II 5
> >
> > If you use something like that a lot, you could create a little function
> > to improve the notation a bit.
> >
> > - Tom
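
The "little function" mentioned above might look something like the sketch below. The name by_course is hypothetical and not part of data.table. The k values are copied from the printed table in the example. The nomatch = NULL argument is an assumption that requires a reasonably recent data.table release.

```r
library(data.table)

course_names <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# k values copied from the printed my.dt table in the example above.
my.dt <- data.table(ID = 1:10,
                    k = c(2L, 2L, 3L, 1L, 3L, 3L, 5L, 2L, 3L, 1L),
                    key = "k")
idx <- data.table(k = 1:5, course = course_names, key = "course")

# Hypothetical helper (not part of data.table): map course names to their
# integer codes through idx, then do the keyed lookup on dt.
by_course <- function(dt, idx, name) {
  codes <- idx[.(name), k, nomatch = NULL]  # unknown names yield no codes
  dt[.(codes), nomatch = NULL]
}

print(by_course(my.dt, idx, "Basic Math"))
print(by_course(my.dt, idx, "Algebra II"))
```

Passing both tables as arguments keeps the helper free of global state, at the cost of slightly longer calls.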
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: datatable-help-bounces at lists.r-forge.r-project.org
> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> On Behalf Of Damian Betebenner
> >> Sent: Tuesday, August 10, 2010 06:50
> >> To: datatable-help at lists.r-forge.r-project.org
> >> Subject: [datatable-help] Behavior of setkey with factors
> >>
> >> All,
> >>
> >> I was wondering how setkey orders a factor: does it respect the
> >> ordering of an ordered factor, or does it simply order the levels
> >> alphabetically?
> >>
> >> I would like to have the key observe the order of a factor
> >> (e.g., a course-taken field may run from 1 to 5 with 1=Basic
> >> Math, 2=Calculus, 3=Geometry, 4=Algebra I and 5=Algebra II). I
> >> would like the sort imposed by data.table to "respect" the
> >> canonical ordering of the classes, not an alphabetical ordering.
> >>
> >> I can't, however, seem to get the key to behave the way I want.
> >>
> >> Here's an example:
> >>
> >> set.seed(123)
> >> my.course.sample <- sample(1:5, 10, replace=TRUE)
> >>
> >> X <- 1:10
> >> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic
> >> Math", "Calculus", "Geometry", "Algebra I", "Algebra II"))
> >>
> >> my.dt <- data.table(ID=X, COURSE=Y)
> >>
> >> > my.dt
> >>       ID     COURSE
> >>  [1,]  1 Algebra II
> >>  [2,]  2  Algebra I
> >>  [3,]  3  Algebra I
> >>  [4,]  4 Algebra II
> >>  [5,]  5   Geometry
> >>  [6,]  6  Algebra I
> >>  [7,]  7   Geometry
> >>  [8,]  8   Calculus
> >>  [9,]  9  Algebra I
> >> [10,] 10   Geometry
> >>
> >>
> >> setkey(my.dt, COURSE)
> >>
> >> > my.dt
> >>       ID     COURSE
> >>  [1,]  2  Algebra I
> >>  [2,]  3  Algebra I
> >>  [3,]  6  Algebra I
> >>  [4,]  9  Algebra I
> >>  [5,]  1 Algebra II
> >>  [6,]  4 Algebra II
> >>  [7,]  8   Calculus
> >>  [8,]  5   Geometry
> >>  [9,]  7   Geometry
> >> [10,] 10   Geometry
> >>
> >>
> >> ###
> >> ### The COURSE key is alphabetizing based upon the labels ###
> >>
> >> ###
> >> ### Now try to impose a different ordering ###
> >>
> >> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
> >> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
> >> "Algebra II"))
> >>
> >> my.dt <- data.table(ID=X, COURSE=Y)
> >>
> >> > my.dt
> >>       ID     COURSE
> >>  [1,]  1  Algebra I
> >>  [2,]  2   Calculus
> >>  [3,]  3   Calculus
> >>  [4,]  4  Algebra I
> >>  [5,]  5   Geometry
> >>  [6,]  6   Calculus
> >>  [7,]  7   Geometry
> >>  [8,]  8 Algebra II
> >>  [9,]  9   Calculus
> >> [10,] 10   Geometry
> >>
> >> setkey(my.dt, COURSE)
> >>
> >> > my.dt
> >>       ID     COURSE
> >>  [1,]  1  Algebra I
> >>  [2,]  3  Algebra I
> >>  [3,]  9  Algebra I
> >>  [4,]  2 Algebra II
> >>  [5,]  4 Algebra II
> >>  [6,]  8 Algebra II
> >>  [7,]  7 Basic Math
> >>  [8,]  5   Calculus
> >>  [9,]  6   Calculus
> >> [10,] 10   Geometry
> >>
> >>
> >> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
> >> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
> >> "Algebra II"), ordered=TRUE)
> >>
> >> my.dt <- data.table(ID=X, COURSE=Y)
> >>
> >> my.dt
> >>
> >>       ID     COURSE
> >>  [1,]  1  Algebra I
> >>  [2,]  2   Calculus
> >>  [3,]  3   Calculus
> >>  [4,]  4  Algebra I
> >>  [5,]  5   Geometry
> >>  [6,]  6   Calculus
> >>  [7,]  7   Geometry
> >>  [8,]  8 Algebra II
> >>  [9,]  9   Calculus
> >> [10,] 10   Geometry
> >>
> >> setkey(my.dt, COURSE)
> >>
> >> my.dt
> >>
> >>       ID     COURSE
> >>  [1,]  1  Algebra I
> >>  [2,]  4  Algebra I
> >>  [3,]  8 Algebra II
> >>  [4,]  2   Calculus
> >>  [5,]  3   Calculus
> >>  [6,]  6   Calculus
> >>  [7,]  9   Calculus
> >>  [8,]  5   Geometry
> >>  [9,]  7   Geometry
> >> [10,] 10   Geometry
> >>
> >>
> >> ### Setting COURSE as the key for an ordered factor seems to
> >> ### override the ordering associated with the factor and impose
> >> ### an alphabetical order.
> >>
> >>
> >> I'd like the key to respect the order associated with the factor.
> >>
> >>
> >> Any help with this is greatly appreciated.
> >>
> >>
> >> Best regards,
> >>
> >>
> >>
> >> Damian Betebenner
> >> Center for Assessment
> >> PO Box 351
> >> Dover, NH   03821-0351
> >>  
> >> Phone (office): (603) 516-7900
> >> Phone (cell): (857) 234-2474
> >> Fax: (603) 516-7910
> >>
> >> dbetebenner at nciea.org
> >> www.nciea.org
> >>
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >>
> >
> 
> 



