[datatable-help] data.table indexing approaches

Short, Tom TShort at epri.com
Tue May 18 04:11:49 CEST 2010


This email discusses ways to index data tables for more complicated
queries. I'm interested in feedback, especially on how to implement
"AND" for multiple columns. Consider the following data table:

    N <- 1e2
    dt <- data.table(l = letters[rep(1:10, each = N/10)],
                     L = LETTERS[rep(1:5, N/5)],
                     key = "l")

***** Binary search using secondary keys

One thing I've found useful is using a data table as a secondary key
to index on columns other than "l". This is especially helpful for
cases where I may want to pull out data by different keys, and I don't
have the memory to hold separate copies of the data table for each
sorting option. Here's how to set up the keying data table:

    i.L <- data.table(k = dt$L, .i = 1:nrow(dt), key = "k")

This is a data table that's indexed on "L" from my original data table,
and it returns indices to the original data table in the ".i" column.
To query on column "L", use:

    # query dt on L
    dt[ i.L[c("A","C"), mult = "all"]$.i ]

You can wrap these operations in a couple of helper functions:

    idt <- function(x, i)
        x[i, mult = "all"]$.i
     
    secondarykey <- function(dt, column) 
        data.table(.k = dt[[column]], .i = 1:nrow(dt), key = ".k")

    # query dt on L using idt()
    i.L <- secondarykey(dt, "L")
    dt[ idt(i.L, c("A","C")) ]

This is relatively efficient. You do have to be careful to keep your
secondary keys in sync with your original data.

***** Combining queries (AND)

Now, I'd like a more efficient version of the following:

    dt[l %in% c("a","b") & L == "C"]

If dt is key'd by both "l" and "L", it's easy:

    dt[J(c("a","b"), "C"), mult = "all"]

Sometimes, that's not feasable with several columns where you may have
different combinations. One option is to use a binary search on the
primary key followed by a vector scan:

    # binary search on key'd term followed by vector scan
    dt[c("b","a"), mult = "all"][L == "C"]

Another option is to find indices for each of the queries, and then
find the intersection for AND (and union for OR). 

    # intersecting indices
    dt[ intersect(dt[c("a","b"), mult = "all", which = TRUE],
                  idt(i.L, "C")) ] # idt is from above

If some of the indices are quite long, finding the intersection may be
time consuming. One option is to cascade queries and re-sort at each
step.

    # cascading queries with re-sorting
    tmp <- dt[c("b","a"), mult = "all"]
    setkey(tmp, L)
    tmp["C", mult = "all"]

In a few timing tests, I didn't find much difference between these
methods. None are as lightning fast as directly searching via keys.

Comments welcome, especially any insights on doing faster AND's.

- Tom

Tom Short


More information about the datatable-help mailing list