[datatable-help] Behavior of setkey with factors

Short, Tom TShort at epri.com
Tue Aug 10 13:34:41 CEST 2010


Damian, 

The fast lookup in data.table relies on the key being sorted alphabetically. If you do dt["Algebra II"], data.table searches the alphabetically sorted key to find "Algebra II". I think speed is the reason: if the levels were kept in some other order and the factor had many levels, mapping the character string to its integer code would be slow.
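
To illustrate why a sorted key makes lookup fast, here is a minimal binary search over a sorted character vector in plain R. It is only a sketch of the general idea, not data.table's internal code, and binary_lookup is a made-up name for this example.

```r
# Illustrative binary search over an alphabetically sorted character vector.
# Returns the position of `target` in `keys`, or NA if it is not present.
# Assumes `keys` is already sorted with the same collation that `<` uses.
binary_lookup <- function(keys, target) {
  lo <- 1L
  hi <- length(keys)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L          # middle of the remaining range
    if (keys[mid] == target) return(mid)
    if (keys[mid] < target) {
      lo <- mid + 1L                 # target sorts after keys[mid]
    } else {
      hi <- mid - 1L                 # target sorts before keys[mid]
    }
  }
  NA_integer_
}

keys <- sort(c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II"))
binary_lookup(keys, "Algebra II")    # position in the sorted vector
binary_lookup(keys, "Physics")       # NA: not one of the courses
```

Each step halves the range still to be searched, which only works because the keys are sorted; with the levels in an arbitrary order, every level would have to be compared in turn.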

One way around this is to set a key based on an integer and use an indexing data.table to look up the course. Here's an example:

> set.seed(100)
> my.course.sample <- sample(1:5, 10, replace=TRUE)
> X <- 1:10
> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II"))
> my.dt <- data.table(ID=X, COURSE=Y, k = as.integer(Y), key="k")
> my.dt
      ID     COURSE k
 [1,]  4 Basic Math 1
 [2,] 10 Basic Math 1
 [3,]  1   Calculus 2
 [4,]  2   Calculus 2
 [5,]  8   Calculus 2
 [6,]  3   Geometry 3
 [7,]  5   Geometry 3
 [8,]  6   Geometry 3
 [9,]  9   Geometry 3
[10,]  7 Algebra II 5
> idx <- data.table(k = 1:5,  course=c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II"), key = "course")
> my.dt[J(idx["Basic Math", k]), mult="all"]
     ID     COURSE k
[1,]  4 Basic Math 1
[2,] 10 Basic Math 1
> my.dt[J(idx["Algebra II", k]), mult="all"]
     ID     COURSE k
[1,]  7 Algebra II 5

If you use something like that a lot, you could create a little function to improve the notation a bit.
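
Such a function might look like the sketch below. The name course_rows and its argument names are invented for this example, and the course codes are typed in by hand from the printed table above rather than regenerated with sample(), so the result does not depend on the random number generator.

```r
library(data.table)

course_names <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Course codes for IDs 1..10, copied from the printed my.dt above.
codes <- c(2L, 2L, 3L, 1L, 3L, 3L, 5L, 2L, 3L, 1L)

# Main table, keyed on the integer code so the course order is respected.
my.dt <- data.table(ID = 1:10,
                    COURSE = factor(codes, levels = 1:5, labels = course_names),
                    k = codes,
                    key = "k")

# Index table, keyed on the course name, mapping each name to its code.
idx <- data.table(k = 1:5, course = course_names, key = "course")

# Hypothetical helper: return all rows of `dt` for one course name.
# `course_name` is deliberately not the name of any column in `idx` or `dt`.
course_rows <- function(dt, idx, course_name) {
  code <- idx[J(course_name), k]   # course name -> integer code
  dt[J(code), mult = "all"]        # integer code -> all matching rows
}

course_rows(my.dt, idx, "Basic Math")
course_rows(my.dt, idx, "Algebra II")
```

The helper just wraps the two-step lookup shown above, so the call site no longer needs the nested J(idx[...]) expression.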

- Tom


 

> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Damian Betebenner
> Sent: Tuesday, August 10, 2010 06:50
> To: datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] Behavior of setkey with factors
> 
> All,
> 
> I was wondering how setkey orders a factor, and whether it 
> respects the ordering of an ordered factor or just orders the 
> factor alphabetically.
> 
> I would like to have the key observe the order of a factor 
> (e.g., a course-taken field may run from 1 to 5 with 1=Basic 
> Math, 2=Calculus, 3=Geometry, 4=Algebra I and 5=Algebra II). I 
> would like the sort imposed by data.table to "respect" the 
> canonical ordering of the classes, not an alphabetical ordering.
> 
> I can't, however, seem to get the key to behave the way I want.
> 
> Here's an example:
> 
> set.seed(123)
> my.course.sample <- sample(1:5, 10, replace=TRUE)
> 
> X <- 1:10
> Y <- factor(my.course.sample, levels=1:5, 
> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I", 
> "Algebra II"))
> 
> my.dt <- data.table(ID=X, COURSE=Y)
> 
> > my.dt
>       ID     COURSE
>  [1,]  1 Algebra II
>  [2,]  2  Algebra I
>  [3,]  3  Algebra I
>  [4,]  4 Algebra II
>  [5,]  5   Geometry
>  [6,]  6  Algebra I
>  [7,]  7   Geometry
>  [8,]  8   Calculus
>  [9,]  9  Algebra I
> [10,] 10   Geometry
> 
> 
> setkey(my.dt, COURSE)
> 
> > my.dt
>       ID     COURSE
>  [1,]  2  Algebra I
>  [2,]  3  Algebra I
>  [3,]  6  Algebra I
>  [4,]  9  Algebra I
>  [5,]  1 Algebra II
>  [6,]  4 Algebra II
>  [7,]  8   Calculus
>  [8,]  5   Geometry
>  [9,]  7   Geometry
> [10,] 10   Geometry
> 
> 
> ###
> ### The COURSE key is alphabetizing based upon the labels ###
> 
> ###
> ### Now try to impose a different ordering ###
> 
> Y <- factor(my.course.sample, levels=c(1,4,3,5,2), 
> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I", 
> "Algebra II"))
> 
> my.dt <- data.table(ID=X, COURSE=Y)
> 
> > my.dt
>       ID     COURSE
>  [1,]  1  Algebra I
>  [2,]  2   Calculus
>  [3,]  3   Calculus
>  [4,]  4  Algebra I
>  [5,]  5   Geometry
>  [6,]  6   Calculus
>  [7,]  7   Geometry
>  [8,]  8 Algebra II
>  [9,]  9   Calculus
> [10,] 10   Geometry
> 
> setkey(my.dt, COURSE)
> 
> > my.dt
>       ID     COURSE
>  [1,]  1  Algebra I
>  [2,]  4  Algebra I
>  [3,]  8 Algebra II
>  [4,]  2   Calculus
>  [5,]  3   Calculus
>  [6,]  6   Calculus
>  [7,]  9   Calculus
>  [8,]  5   Geometry
>  [9,]  7   Geometry
> [10,] 10   Geometry
> 
> 
> Y <- factor(my.course.sample, levels=c(1,4,3,5,2), 
> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I", 
> "Algebra II"), ordered=TRUE)
> 
> my.dt <- data.table(ID=X, COURSE=Y)
> 
> my.dt
> 
>       ID     COURSE
>  [1,]  1  Algebra I
>  [2,]  2   Calculus
>  [3,]  3   Calculus
>  [4,]  4  Algebra I
>  [5,]  5   Geometry
>  [6,]  6   Calculus
>  [7,]  7   Geometry
>  [8,]  8 Algebra II
>  [9,]  9   Calculus
> [10,] 10   Geometry
> 
> setkey(my.dt, COURSE)
> 
> my.dt
> 
>       ID     COURSE
>  [1,]  1  Algebra I
>  [2,]  4  Algebra I
>  [3,]  8 Algebra II
>  [4,]  2   Calculus
>  [5,]  3   Calculus
>  [6,]  6   Calculus
>  [7,]  9   Calculus
>  [8,]  5   Geometry
>  [9,]  7   Geometry
> [10,] 10   Geometry
> 
> 
> ### Setting COURSE as the key for an ordered factor seems to 
> override the ordering associated with the factor and impose 
> an alphabetical order.
> 
> 
> I'd like the key to respect the order associated with the factor.
> 
> 
> Any help with this is greatly appreciated.
> 
> 
> Best regards,
> 
> 
> 
> Damian Betebenner
> Center for Assessment
> PO Box 351
> Dover, NH   03821-0351
>  
> Phone (office): (603) 516-7900
> Phone (cell): (857) 234-2474
> Fax: (603) 516-7910
> 
> dbetebenner at nciea.org
> www.nciea.org
> 
> 