[datatable-help] Faster "CJ"
Arunkumar Srinivasan
aragorn168b at gmail.com
Fri Aug 23 11:49:08 CEST 2013
Hi everybody,
I think a faster version of the `CJ` function is possible. The issue is that the sort is currently done at the very end using `setkey`, which operates on the data *after* all the combinations have been generated, and therefore has to sort a huge number of entries.
However, a faster way would be to sort each input vector first (before working out the combinations) and then use the hack:
setattr(l, 'sorted', names(l))
Basically there are just 2 lines that need to change (see the function at the bottom of the post).
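To illustrate why sorting the inputs first is enough, here is a minimal base-R sketch for the two-vector case. The helper `cross_sorted` is hypothetical (it is not part of data.table) and only mimics the `rep()`-based construction: if each input is sorted before being expanded, the resulting rows are already in key order, so re-sorting the full table does nothing.

```r
# Hypothetical helper, for illustration only: cross join of two vectors
# built with rep(), after sorting each input individually.
cross_sorted <- function(a, b) {
  a <- sort(a)
  b <- sort(b)
  n <- length(a) * length(b)
  data.frame(
    V1 = rep(a, each = length(b), length.out = n),  # slowest-varying column
    V2 = rep(b, each = 1L, length.out = n),         # fastest-varying column
    stringsAsFactors = FALSE
  )
}

set.seed(1)
a <- sample(100, 5)
b <- sample(letters, 4)
res <- cross_sorted(a, b)

# Ordering the full result by (V1, V2) returns the rows in their existing
# order, i.e. the table was already sorted.
ord <- order(res$V1, res$V2)
stopifnot(identical(ord, seq_len(nrow(res))))
```

Only the short input vectors are sorted here, never the full set of combinations, which is where the saving comes from.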
---------
First, here are some benchmarks comparing `CJ_fast` (see below) with `CJ` on relatively big data:
w <- sample(1e4, 1e3)
x <- sample(letters, 12)
y <- sample(letters, 12)
z <- sample(letters, 12)
system.time(t1 <- do.call(CJ, list(w,x,y,z)))
user system elapsed
0.775 0.052 0.835
system.time(t2 <- do.call(CJ_fast, list(w,x,y,z)))
user system elapsed
0.220 0.001 0.221
identical(t1, t2)
[1] TRUE
---------
The function: (there are only two changes)
CJ_fast <- function (...)
{
    l = list(...)
    if (length(l) > 1) {
        n = sapply(l, length)
        nrow = prod(n)
        x = c(rev(data.table:::take(cumprod(rev(n)))), 1L)
        # 1) SORT HERE
        for (i in seq(along = x)) l[[i]] = rep(sort(l[[i]], na.last = TRUE), each = x[i],
            length = nrow)
    }
    setattr(l, "row.names", .set_row_names(length(l[[1]])))
    setattr(l, "class", c("data.table", "data.frame"))
    vnames = names(l)
    if (is.null(vnames))
        vnames = rep("", length(l))
    tt = vnames == ""
    if (any(tt)) {
        vnames[tt] = paste("V", which(tt), sep = "")
        setattr(l, "names", vnames)
    }
    data.table:::settruelength(l, 0L)
    l = alloc.col(l)
    # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED"
    setattr(l, 'sorted', names(l))
    l
}
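As a worked example of the `x` vector used in the loop above, the sketch below reproduces the arithmetic in base R for the benchmark sizes (1000, 12, 12, 12). It assumes that the internal `data.table:::take` drops the last element of its argument, which is what the arithmetic requires; `head(..., -1)` stands in for it here.

```r
# Lengths of the four input vectors from the benchmark (w, x, y, z).
n <- c(1000, 12, 12, 12)

# Total number of rows in the cross join.
nrow <- prod(n)

# Assumption: head(v, -1) stands in for data.table:::take(v),
# i.e. it drops the last element of the cumulative product.
x <- c(rev(head(cumprod(rev(n)), -1)), 1)

# x[i] is how many times each value of column i is repeated in a row:
# the product of the lengths of all columns to its right.
print(x)     # 1728 144 12 1
print(nrow)  # 1728000
```

So the first column repeats each of its values 1728 times in a row, while the last column cycles through its values one at a time, and `length = nrow` recycles each column to the full 1,728,000 rows.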
Arun