[datatable-help] Behavior of setkey with factors
Matthew Dowle
mdowle at mdowle.plus.com
Tue Aug 10 15:00:50 CEST 2010
Exactly. We often have thousands of levels, or even tens or hundreds of
thousands. Internally, sortedmatch() takes advantage of the fact that the
levels are sorted.
One simple solution, and what I do sometimes, is to put the ordering into
the level name: "01. Basic math", "02. Calculus", "03. Algebra I". There
is very little performance penalty, as those strings get hashed by R
anyway. If you need to remove the prefix for presentation purposes, just
use "substring(course,5)" afterwards.
Matthew
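
A minimal base-R sketch of the level-name prefix idea described above. The
names courses, prefixed, codes and course are made up for illustration, and
the sample data is hypothetical. It assumes fewer than 100 levels, so a
two-digit prefix is enough and the prefix is always 4 characters long.

```r
# Canonical (non-alphabetical) course order, as given in the thread.
courses <- c("Basic Math", "Calculus", "Geometry", "Algebra I", "Algebra II")

# Prefix each label with a zero-padded rank, e.g. "01. Basic Math",
# so that alphabetical order of the labels matches the canonical order.
prefixed <- sprintf("%02d. %s", seq_along(courses), courses)

# Hypothetical sample data: 10 course codes drawn from 1..5.
set.seed(100)
codes <- sample(1:5, 10, replace = TRUE)

# Build the factor with alphabetically sorted levels, which is what a
# keyed factor column ends up with.
course <- factor(prefixed[codes], levels = sort(prefixed))

# Sorting the factor now follows the canonical course order.
sorted_course <- sort(course)

# Strip the 4-character prefix ("NN. ") for presentation.
labels <- substring(as.character(sorted_course), 5)
print(labels)
```

With more than 99 levels the prefix would need to be wider (for example
"%03d. "), and the start position passed to substring() would change to match.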
> Damian,
>
> The fast lookup of data.table relies on the keys being sorted
> alphabetically. If you do dt["Algebra II"], data.table uses an alphabetic
> lookup to find "Algebra II". Speed is the reason (I think). If you had
> many levels in the factor, the lookup to map the character to the integer
> would be slow.
>
> One way around this is to set a key based on an integer and use an
> indexing data.table to look up the course. Here's an example:
>
>> set.seed(100)
>> my.course.sample <- sample(1:5, 10, replace=TRUE)
>> X <- 1:10
>> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic Math",
>> "Calculus", "Geometry", "Algebra I", "Algebra II"))
>> my.dt <- data.table(ID=X, COURSE=Y, k = as.integer(Y), key="k")
>> my.dt
> ID COURSE k
> [1,] 4 Basic Math 1
> [2,] 10 Basic Math 1
> [3,] 1 Calculus 2
> [4,] 2 Calculus 2
> [5,] 8 Calculus 2
> [6,] 3 Geometry 3
> [7,] 5 Geometry 3
> [8,] 6 Geometry 3
> [9,] 9 Geometry 3
> [10,] 7 Algebra II 5
>> idx <- data.table(k = 1:5, course=c("Basic Math", "Calculus",
>> "Geometry", "Algebra I", "Algebra II"), key = "course")
>> my.dt[J(idx["Basic Math", k]), mult="all"]
> ID COURSE k
> [1,] 4 Basic Math 1
> [2,] 10 Basic Math 1
>> my.dt[J(idx["Algebra II", k]), mult="all"]
> ID COURSE k
> [1,] 7 Algebra II 5
>
> If you use something like that a lot, you could create a little function
> to improve the notation a bit.
>
> - Tom
>
>
>
>
>> -----Original Message-----
>> From: datatable-help-bounces at lists.r-forge.r-project.org
>> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> On Behalf Of Damian Betebenner
>> Sent: Tuesday, August 10, 2010 06:50
>> To: datatable-help at lists.r-forge.r-project.org
>> Subject: [datatable-help] Behavior of setkey with factors
>>
>> All,
>>
>> I was wondering how setkey orders a factor: does it respect the
>> ordering of an ordered factor, or does it just order the levels
>> alphabetically?
>>
>> I would like to have the key observe the order of a factor
>> (e.g., a course taken field may run from 1 to 5 with 1=Basic
>> Math, 2=Calculus, 3=Geometry, 4=Algebra I and 5=Algebra II). I
>> would like the sort imposed by data.table to "respect" the
>> canonical ordering of the classes, not an alphabetical ordering.
>>
>> I can't, however, seem to get the key to behave the way I want.
>>
>> Here's an example:
>>
>> set.seed(123)
>> my.course.sample <- sample(1:5, 10, replace=TRUE)
>>
>> X <- 1:10
>> Y <- factor(my.course.sample, levels=1:5, labels=c("Basic Math",
>> "Calculus", "Geometry", "Algebra I", "Algebra II"))
>>
>> my.dt <- data.table(ID=X, COURSE=Y)
>>
>> > my.dt
>> ID COURSE
>> [1,] 1 Algebra II
>> [2,] 2 Algebra I
>> [3,] 3 Algebra I
>> [4,] 4 Algebra II
>> [5,] 5 Geometry
>> [6,] 6 Algebra I
>> [7,] 7 Geometry
>> [8,] 8 Calculus
>> [9,] 9 Algebra I
>> [10,] 10 Geometry
>>
>>
>> setkey(my.dt, COURSE)
>>
>> > my.dt
>> ID COURSE
>> [1,] 2 Algebra I
>> [2,] 3 Algebra I
>> [3,] 6 Algebra I
>> [4,] 9 Algebra I
>> [5,] 1 Algebra II
>> [6,] 4 Algebra II
>> [7,] 8 Calculus
>> [8,] 5 Geometry
>> [9,] 7 Geometry
>> [10,] 10 Geometry
>>
>>
>> ###
>> ### The COURSE key is alphabetizing based upon the labels ###
>>
>> ###
>> ### Now try to impose a different ordering ###
>>
>> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
>> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
>> "Algebra II"))
>>
>> my.dt <- data.table(ID=X, COURSE=Y)
>>
>> > my.dt
>> ID COURSE
>> [1,] 1 Algebra I
>> [2,] 2 Calculus
>> [3,] 3 Calculus
>> [4,] 4 Algebra I
>> [5,] 5 Geometry
>> [6,] 6 Calculus
>> [7,] 7 Geometry
>> [8,] 8 Algebra II
>> [9,] 9 Calculus
>> [10,] 10 Geometry
>>
>> setkey(my.dt, COURSE)
>>
>> > my.dt
>> ID COURSE
>> [1,] 1 Algebra I
>> [2,] 3 Algebra I
>> [3,] 9 Algebra I
>> [4,] 2 Algebra II
>> [5,] 4 Algebra II
>> [6,] 8 Algebra II
>> [7,] 7 Basic Math
>> [8,] 5 Calculus
>> [9,] 6 Calculus
>> [10,] 10 Geometry
>>
>>
>> Y <- factor(my.course.sample, levels=c(1,4,3,5,2),
>> labels=c("Basic Math", "Calculus", "Geometry", "Algebra I",
>> "Algebra II"), ordered=TRUE)
>>
>> my.dt <- data.table(ID=X, COURSE=Y)
>>
>> my.dt
>>
>> ID COURSE
>> [1,] 1 Algebra I
>> [2,] 2 Calculus
>> [3,] 3 Calculus
>> [4,] 4 Algebra I
>> [5,] 5 Geometry
>> [6,] 6 Calculus
>> [7,] 7 Geometry
>> [8,] 8 Algebra II
>> [9,] 9 Calculus
>> [10,] 10 Geometry
>>
>> setkey(my.dt, COURSE)
>>
>> my.dt
>>
>> ID COURSE
>> [1,] 1 Algebra I
>> [2,] 4 Algebra I
>> [3,] 8 Algebra II
>> [4,] 2 Calculus
>> [5,] 3 Calculus
>> [6,] 6 Calculus
>> [7,] 9 Calculus
>> [8,] 5 Geometry
>> [9,] 7 Geometry
>> [10,] 10 Geometry
>>
>>
>> ### Setting COURSE as the key for an ordered factor seems to
>> override the ordering associated with the factor and impose
>> an alphabetical order.
>>
>>
>> I'd like the key to respect the order associated with the factor
>>
>>
>> Any help with this greatly appreciated.
>>
>>
>> Best regards,
>>
>>
>>
>> Damian Betebenner
>> Center for Assessment
>> PO Box 351
>> Dover, NH 03821-0351
>>
>> Phone (office): (603) 516-7900
>> Phone (cell): (857) 234-2474
>> Fax: (603) 516-7910
>>
>> dbetebenner at nciea.org
>> www.nciea.org
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>