[datatable-help] data.table indexing approaches

Matthew Dowle mdowle at mdowle.plus.com
Thu Jul 29 02:17:36 CEST 2010


Tom,

Sorry for delay on this one.

On secondary keys a todo is in setkey.R :

# TO DO: allow secondary index, which stores the sorted key + a column
of integer rows in the primary sorted table. Will need to either drop or
maintain keys with updates or inserts.
    
'sorted key' meant a character vector like the "sorted" attribute e.g.
c("col3","col2"), paired with the 'integer rows' meaning the result of
DT[,order(col3,col2)].  The "sorted2nd" (?) attribute could be set by
'set2key(DT,col3,col2)'.  The binary search could then proceeds on this
ordering vector, using the data from the DT columns at those positions.
This has the advantage that only one integer vector need be stored for
each secondary key, rather than a secondary key table containing as many
columns as the secondary key.

Trouble with that is how to tell X[Y] syntax which one of X's secondary
keys to use rather than the primary key. At the risk of yet another
parameter maybe X[Y,usekey=3] would mean use X's 2nd secondary key (!).
The default for usekey would be 1 meaning the primary key as now. 

On the faster ANDs, not sure. FR#415 seems related but I didn't write it
well at the time.  Also #FR203. Both involve allowing more complicated
structures either passed to i, or non-vectors as columns in a data.table
which mean special things when that data.table is passed as i. The
FR#415 mentions chained grouping, but not sure if it helps for this?

Matthew


On Mon, 2010-05-17 at 19:11 -0700, Short, Tom wrote:
> This email discusses ways to index data tables for more complicated
> queries. I'm interested in feedback, especially on how to implement
> "AND" for multiple columns. Consider the following data table:
> 
>     N <- 1e2
>     dt <- data.table(l = letters[rep(1:10, each = N/10)],
>                      L = LETTERS[rep(1:5, N/5)],
>                      key = "l")
> 
> ***** Binary search using secondary keys
> 
> One thing I've found useful is using a data table as a secondary key
> to index on columns other than "l". This is especially helpful for
> cases where I may want to pull out data by different keys, and I don't
> have the memory to hold separate copies of the data table for each
> sorting option. Here's how to set up the keying data table:
> 
>     i.L <- data.table(k = dt$L, .i = 1:nrow(dt), key = "k")
> 
> This is a data table that's indexed on "L" from my original data table,
> and it returns indices to the original data table in the ".i" column.
> To query on column "L", use:
> 
>     # query dt on L
>     dt[ i.L[c("A","C"), mult = "all"]$.i ]
> 
> You can wrap these operations in a couple of helper functions:
> 
>     idt <- function(x, i)
>         x[i, mult = "all"]$.i
>      
>     secondarykey <- function(dt, column) 
>         data.table(.k = dt[[column]], .i = 1:nrow(dt), key = ".k")
> 
>     # query dt on L using idt()
>     i.L <- secondarykey(dt, "L")
>     dt[ idt(i.L, c("A","C")) ]
> 
> This is relatively efficient. You do have to be careful to keep your
> secondary keys in sync with your original data.
> 
> ***** Combining queries (AND)
> 
> Now, I'd like a more efficient version of the following:
> 
>     dt[l %in% c("a","b") & L == "C"]
> 
> If dt is key'd by both "l" and "L", it's easy:
> 
>     dt[J(c("a","b"), "C"), mult = "all"]
> 
> Sometimes, that's not feasable with several columns where you may have
> different combinations. One option is to use a binary search on the
> primary key followed by a vector scan:
> 
>     # binary search on key'd term followed by vector scan
>     dt[c("b","a"), mult = "all"][L == "C"]
> 
> Another option is to find indices for each of the queries, and then
> find the intersection for AND (and union for OR). 
> 
>     # intersecting indices
>     dt[ intersect(dt[c("a","b"), mult = "all", which = TRUE],
>                   idt(i.L, "C")) ] # idt is from above
> 
> If some of the indices are quite long, finding the intersection may be
> time consuming. One option is to cascade queries and re-sort at each
> step.
> 
>     # cascading queries with re-sorting
>     tmp <- dt[c("b","a"), mult = "all"]
>     setkey(tmp, L)
>     tmp["C", mult = "all"]
> 
> In a few timing tests, I didn't find much difference between these
> methods. None are as lightning fast as directly searching via keys.
> 
> Comments welcome, especially any insights on doing faster AND's.
> 
> - Tom
> 
> Tom Short
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list