[datatable-help] Faster "CJ"

Arunkumar Srinivasan aragorn168b at gmail.com
Fri Aug 23 11:49:08 CEST 2013


Hi everybody, 

I think a faster version of the `CJ` function is possible. Currently the sort is done at the very end with `setkey`, which runs on the data *after* all the combinations have been generated, and therefore has to sort a huge number of entries.

A faster way is to sort each input vector first (before working out the combinations) and then mark the result as keyed with this hack:

setattr(l, 'sorted', names(l))

Basically, only 2 lines need to change (see the bottom of this post).
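
To illustrate why sorting first is enough, here is a minimal base-R sketch. It is not data.table code: `expand_sorted` is a hypothetical helper, and `rev(cumprod(rev(n)))[-1]` is my stand-in for the internal `data.table:::take` call used in the function at the bottom of the post. It sorts each small input, expands with `rep`, and then checks that the rows already come out in key order.

```r
# Hypothetical helper (not part of data.table): sort each input, then expand
# into all combinations with the first column varying slowest.
expand_sorted <- function(...) {
  l <- lapply(list(...), sort, na.last = TRUE)  # sort the small inputs first
  n <- vapply(l, length, integer(1))
  nrow <- prod(n)
  # each[i] = how many times every value of column i repeats consecutively
  each <- c(rev(cumprod(rev(n)))[-1], 1)
  for (i in seq_along(l)) {
    l[[i]] <- rep(l[[i]], each = each[i], length.out = nrow)
  }
  names(l) <- paste0("V", seq_along(l))
  as.data.frame(l, stringsAsFactors = FALSE)
}

res <- expand_sorted(c(3L, 1L, 2L), c("b", "a"))

# Re-ordering by all columns is a no-op, i.e. the rows are already sorted.
ord <- do.call(order, unname(as.list(res)))
stopifnot(identical(ord, seq_len(nrow(res))))
```

Note that `sort()` and `order()` both follow the session's collation for character vectors; whether that matches the order `setkey` produces for mixed-case strings is something to check before relying on the attribute.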

---------
First, here are some benchmarks of `CJ_fast` (see below) against `CJ` on relatively big data:

w <- sample(1e4, 1e3)
x <- sample(letters, 12)
y <- sample(letters, 12)
z <- sample(letters, 12)

system.time(t1 <- do.call(CJ, list(w,x,y,z)))
   user  system elapsed 
  0.775   0.052   0.835 

system.time(t2 <- do.call(CJ_fast, list(w,x,y,z)))
   user  system elapsed 
  0.220   0.001   0.221 


identical(t1, t2)
[1] TRUE
---------
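
A rough way to see where the gap comes from is to count how many elements each approach has to sort for inputs of the benchmark's sizes. This is only a base-R sketch of the arithmetic; the variable names and the seed are mine, and it says nothing about the actual cost of the sorting algorithms involved.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable
w <- sample(1e4, 1e3)
x <- sample(letters, 12)
y <- sample(letters, 12)
z <- sample(letters, 12)
inputs <- list(w, x, y, z)

n <- vapply(inputs, length, integer(1))

# Sorting after expansion has to order every row of the cross join:
# 1000 * 12 * 12 * 12 rows.
rows_sorted_late <- prod(n)

# Sorting before expansion only touches the input vectors:
# 1000 + 12 + 12 + 12 elements in total.
elems_sorted_early <- sum(n)

stopifnot(rows_sorted_late == 1728000, elems_sorted_early == 1036)
```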

The function (there are only two changes):

CJ_fast <- function (...) 
{
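    l = list(...)
    # 1) SORT HERE: sort each input once, before expanding, so that the
    #    "sorted" attribute set below also holds for a single argument
    for (i in seq_along(l)) l[[i]] = sort(l[[i]], na.last = TRUE)
    if (length(l) > 1) {
        n = sapply(l, length)
        nrow = prod(n)
        x = c(rev(data.table:::take(cumprod(rev(n)))), 1L)
        for (i in seq_along(x)) l[[i]] = rep(l[[i]], each = x[i], 
            length.out = nrow)
    }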
    setattr(l, "row.names", .set_row_names(length(l[[1]])))
    setattr(l, "class", c("data.table", "data.frame"))
    vnames = names(l)
    if (is.null(vnames)) 
        vnames = rep("", length(l))
    tt = vnames == ""
    if (any(tt)) {
        vnames[tt] = paste("V", which(tt), sep = "")
        setattr(l, "names", vnames)
    }
    data.table:::settruelength(l, 0L)
    l = alloc.col(l)
    # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED"
    setattr(l, 'sorted', names(l))
    l
}


Arun
