[datatable-help] Behavior of setkey with factors

Matthew Dowle mdowle at mdowle.plus.com
Tue Aug 10 15:00:50 CEST 2010


Exactly. We often have thousands of levels, or even tens or hundreds of
thousands. Internally, sortedmatch() takes advantage of the fact that the
levels are sorted.

One simple solution, and what I do sometimes, is to put the ordering into
the level name: "01. Basic math", "02. Calculus", "03. Algebra I".  There
is very little performance penalty as those strings get hashed by R
anyway. If you need to remove the prefix for presentation purposes, just
"substring(course,5)" afterwards.

Matthew
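
A minimal sketch of the prefix approach described above, in base R only.
The sample codes and the two-digit padding width are assumptions made for
illustration; they are not part of the original message.

```r
# Canonical course order (taken from the original question in this thread).
courses <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Prefix each label with its zero-padded rank:
# "01. Basic Math", "02. Calculus", ...
prefixed <- sprintf("%02d. %s", seq_along(courses), courses)

# A small sample of course codes (hypothetical data).
codes <- c(5L, 4L, 1L, 2L, 3L, 4L)
course <- prefixed[codes]

# Alphabetical order of the prefixed labels now matches the canonical order.
sorted <- sort(course)

# Strip the 4-character prefix ("NN. ") for presentation.
display <- substring(sorted, 5)
print(display)
```

With more than 99 levels the padding width and the substring offset have to
grow together, for example "%05d. %s" with substring(x, 8).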


> Damian,
>
> The fast lookup of data.table relies on the keys being sorted
> alphabetically. If you do dt["Algebra II"], data.table uses an alphabetic
> lookup to find "Algebra II". Speed is the reason (I think). If you had
> many levels in the factor, the lookup to map the character to the integer
> would be slow.
>
> One way around this is to set a key based on an integer and use an
> indexing data.table to look up the course. Here's an example:
>
>> set.seed(100)
>> my.course.sample <- sample(1:5, 10, replace=TRUE)
>> X <- 1:10
>> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic Math",
>> "Calculus", "Geometry", "Algebra I", "Algebra II"))
>> my.dt <- data.table(ID=X, COURSE=Y, k = as.integer(Y), key="k")
>> my.dt
>       ID     COURSE k
>  [1,]  4 Basic Math 1
>  [2,] 10 Basic Math 1
>  [3,]  1   Calculus 2
>  [4,]  2   Calculus 2
>  [5,]  8   Calculus 2
>  [6,]  3   Geometry 3
>  [7,]  5   Geometry 3
>  [8,]  6   Geometry 3
>  [9,]  9   Geometry 3
> [10,]  7 Algebra II 5
>> idx <- data.table(k = 1:5,  course=c("Basic Math", "Calculus",
>> "Geometry", "Algebra I", "Algebra II"), key = "course")
>> my.dt[J(idx["Basic Math", k]), mult="all"]
>      ID     COURSE k
> [1,]  4 Basic Math 1
> [2,] 10 Basic Math 1
>> my.dt[J(idx["Algebra II", k]), mult="all"]
>      ID     COURSE k
> [1,]  7 Algebra II 5
>
> If you use something like that a lot, you could create a little function
> to improve the notation a bit.
>
> - Tom
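
A helper of the kind suggested above might look like the following sketch.
The function name by_course and its argument names are hypothetical, and the
code assumes a data.table version in which J() and nomatch = 0L are accepted
inside [ ].

```r
library(data.table)

courses <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

set.seed(100)
codes <- sample(1:5, 10, replace = TRUE)

# Main table, keyed on the integer code so rows follow the canonical order.
my.dt <- data.table(ID = 1:10,
                    COURSE = factor(codes, levels = 1:5, labels = courses),
                    k = codes,
                    key = "k")

# Lookup table from course name to integer code, keyed on the name.
idx <- data.table(k = 1:5, course = courses, key = "course")

# Hypothetical helper: translate a course name to its code via idx,
# then join on the integer key of dt. Unknown names return zero rows.
by_course <- function(dt, idx, course_name) {
  kval <- idx[J(course_name), k, nomatch = 0L]
  dt[J(kval), nomatch = 0L]
}

res <- by_course(my.dt, idx, "Algebra II")
print(res)
```

Keeping the name-to-code mapping in a separate keyed table means the main
table only ever needs its integer key, whatever the labels look like.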
>
>
>
>
>> -----Original Message-----
>> From: datatable-help-bounces at lists.r-forge.r-project.org
>> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> On Behalf Of Damian Betebenner
>> Sent: Tuesday, August 10, 2010 06:50
>> To: datatable-help at lists.r-forge.r-project.org
>> Subject: [datatable-help] Behavior of setkey with factors
>>
>> All,
>>
>> I was wondering how setkey orders a factor: does it observe
>> whether the factor is ordered, or does it just order the
>> factor alphabetically?
>>
>> I would like to have the key observe the order of a factor
>> (e.g., a course-taken field may run from 1 to 5, with 1=Basic
>> Math, 2=Calculus, 3=Geometry, 4=Algebra I and 5=Algebra II). I
>> would like the sort imposed by data.table to "respect" the
>> canonical ordering of the classes, not an alphabetical ordering.
>>
>> I can't, however, seem to get the key to behave the way I want.
>>
>> Here's an example:
>>
>> set.seed(123)
>> my.course.sample <- sample(1:5, 10, replace=TRUE)
>>
>> X <- 1:10
>> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic
>> Math", "Calculus", "Geometry", "Algebra I", "Algebra II"))
>>
>> my.dt <- data.table(ID=X, COURSE=Y)
>>
>> > my.dt
>>       ID     COURSE
>>  [1,]  1 Algebra II
>>  [2,]  2  Algebra I
>>  [3,]  3  Algebra I
>>  [4,]  4 Algebra II
>>  [5,]  5   Geometry
>>  [6,]  6  Algebra I
>>  [7,]  7   Geometry
>>  [8,]  8   Calculus
>>  [9,]  9  Algebra I
>> [10,] 10   Geometry
>>
>>
>> setkey(my.dt, COURSE)
>>
>> > my.dt
>>       ID     COURSE
>>  [1,]  2  Algebra I
>>  [2,]  3  Algebra I
>>  [3,]  6  Algebra I
>>  [4,]  9  Algebra I
>>  [5,]  1 Algebra II
>>  [6,]  4 Algebra II
>>  [7,]  8   Calculus
>>  [8,]  5   Geometry
>>  [9,]  7   Geometry
>> [10,] 10   Geometry
>>
>>
>> ###
>> ### The COURSE key is alphabetizing based upon the labels ###
>>
>> ###
>> ### Now try to impose a different ordering ###
>>
>> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
>> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
>> "Algebra II"))
>>
>> my.dt <- data.table(ID=X, COURSE=Y)
>>
>> > my.dt
>>       ID     COURSE
>>  [1,]  1  Algebra I
>>  [2,]  2   Calculus
>>  [3,]  3   Calculus
>>  [4,]  4  Algebra I
>>  [5,]  5   Geometry
>>  [6,]  6   Calculus
>>  [7,]  7   Geometry
>>  [8,]  8 Algebra II
>>  [9,]  9   Calculus
>> [10,] 10   Geometry
>>
>> setkey(my.dt, COURSE)
>>
>> > my.dt
>>       ID     COURSE
>>  [1,]  1  Algebra I
>>  [2,]  3  Algebra I
>>  [3,]  9  Algebra I
>>  [4,]  2 Algebra II
>>  [5,]  4 Algebra II
>>  [6,]  8 Algebra II
>>  [7,]  7 Basic Math
>>  [8,]  5   Calculus
>>  [9,]  6   Calculus
>> [10,] 10   Geometry
>>
>>
>> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
>> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
>> "Algebra II"), ordered=TRUE)
>>
>> my.dt <- data.table(ID=X, COURSE=Y)
>>
>> my.dt
>>
>>       ID     COURSE
>>  [1,]  1  Algebra I
>>  [2,]  2   Calculus
>>  [3,]  3   Calculus
>>  [4,]  4  Algebra I
>>  [5,]  5   Geometry
>>  [6,]  6   Calculus
>>  [7,]  7   Geometry
>>  [8,]  8 Algebra II
>>  [9,]  9   Calculus
>> [10,] 10   Geometry
>>
>> setkey(my.dt, COURSE)
>>
>> my.dt
>>
>>       ID     COURSE
>>  [1,]  1  Algebra I
>>  [2,]  4  Algebra I
>>  [3,]  8 Algebra II
>>  [4,]  2   Calculus
>>  [5,]  3   Calculus
>>  [6,]  6   Calculus
>>  [7,]  9   Calculus
>>  [8,]  5   Geometry
>>  [9,]  7   Geometry
>> [10,] 10   Geometry
>>
>>
>> ### Setting COURSE as the key for an ordered factor seems to
>> over-ride the ordering associated with the factor and impose
>> an alphabetical order.
>>
>>
>> I'd like the key to respect the order associated with the factor
>>
>>
>> Any help with this greatly appreciated.
>>
>>
>> Best regards,
>>
>>
>>
>> Damian Betebenner
>> Center for Assessment
>> PO Box 351
>> Dover, NH   03821-0351
>>  
>> Phone (office): (603) 516-7900
>> Phone (cell): (857) 234-2474
>> Fax: (603) 516-7910
>>
>> dbetebenner at nciea.org
>> www.nciea.org
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>



