[datatable-help] Faster "CJ"

Arunkumar Srinivasan aragorn168b at gmail.com
Fri Aug 23 12:21:59 CEST 2013


Filed this as FR #4849 here: 
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4849&group_id=240&atid=978

Arun


On Friday, August 23, 2013 at 11:49 AM, Arunkumar Srinivasan wrote:

> Hi everybody, 
> 
> I think a faster version of the "CJ" function is possible. Currently the sort is done at the very end with `setkey`, which operates on the data *after* all the combinations have been generated, and therefore sorts a huge number of entries.
> 
> However, a faster way would be to sort each input vector first (before working out the combinations) and then mark the result as sorted with this hack:
> 
> setattr(l, 'sorted', names(l))
> 
> Basically, only 2 lines need to change (see the bottom of the post).
> 
> ---------
> First, here are some benchmarks of `CJ_fast` (see below) against `CJ` on relatively big data:
> 
> w <- sample(1e4, 1e3)
> x <- sample(letters, 12)
> y <- sample(letters, 12)
> z <- sample(letters, 12)
> 
> system.time(t1 <- do.call(CJ, list(w,x,y,z)))
>    user  system elapsed 
>   0.775   0.052   0.835 
> 
> system.time(t2 <- do.call(CJ_fast, list(w,x,y,z)))
>    user  system elapsed 
>   0.220   0.001   0.221 
> 
> 
> identical(t1, t2)
> [1] TRUE
> ---------
> 
> The function: (there are only two changes)
> 
> CJ_fast <- function (...) 
> {
>     l = list(...)
>     if (length(l) > 1) {
>         n = sapply(l, length)
>         nrow = prod(n)
>         x = c(rev(data.table:::take(cumprod(rev(n)))), 1L)
>         # 1) SORT HERE
>         for (i in seq(along = x)) l[[i]] = rep(sort(l[[i]], na.last = TRUE), each = x[i], 
>             length = nrow)
>     }
>     setattr(l, "row.names", .set_row_names(length(l[[1]])))
>     setattr(l, "class", c("data.table", "data.frame"))
>     vnames = names(l)
>     if (is.null(vnames)) 
>         vnames = rep("", length(l))
>     tt = vnames == ""
>     if (any(tt)) {
>         vnames[tt] = paste("V", which(tt), sep = "")
>         setattr(l, "names", vnames)
>     }
>     data.table:::settruelength(l, 0L)
>     l = alloc.col(l)
>     # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED"
>     setattr(l, 'sorted', names(l))
>     l
> }
> 
> 
> Arun
> 
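
For illustration, here is a minimal base-R sketch of the idea in the quoted message: sort each input vector first, then expand with `rep()`, so that the rows of the cross product come out already ordered by every column and no sort over the full result is needed. The name `cross_sorted` and the `V1`, `V2`, ... column names are assumptions made for this example. It returns a plain data.frame and does not use data.table internals, so it is not the `CJ_fast` implementation itself.

```r
# Hypothetical helper (not part of data.table): cross product of pre-sorted inputs.
cross_sorted <- function(...) {
  # Sort each input up front; this is cheap because the inputs are short.
  l <- lapply(list(...), sort, na.last = TRUE)
  n <- vapply(l, length, integer(1))
  nrow <- prod(n)
  # Repeat count for vector i = product of the lengths of the vectors to its right.
  each <- c(rev(cumprod(rev(n)))[-1L], 1L)
  for (i in seq_along(l)) {
    l[[i]] <- rep(l[[i]], each = each[i], length.out = nrow)
  }
  names(l) <- paste0("V", seq_along(l))
  as.data.frame(l, stringsAsFactors = FALSE)
}

res <- cross_sorted(c(3L, 1L, 2L), c("b", "a"))
# TRUE when the rows are already in sorted order across all columns.
already_sorted <- identical(do.call(order, unname(as.list(res))), seq_len(nrow(res)))
```

Because the rows are already in order, setting the 'sorted' attribute as in the message above only records an ordering that already holds. On data that is not actually ordered, that attribute would be wrong.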
