<div>
Filed this as FR #4849 here:
</div><div><a href="https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4849&group_id=240&atid=978">https://r-forge.r-project.org/tracker/index.php?func=detail&aid=4849&group_id=240&atid=978</a></div>
<div><div><br></div><div>Arun</div><div><br></div></div>
<p style="color: #A0A0A8;">On Friday, August 23, 2013 at 11:49 AM, Arunkumar Srinivasan wrote:</p>
<blockquote type="cite" style="border-left-style:solid;border-width:1px;margin-left:0px;padding-left:10px;">
<span><div><div>
<div>
Hi everybody,
</div><div><br></div><div>I think a faster version of the `CJ` function is possible. Currently the sort is done at the very end, via `setkey`, which works on the data *after* all the combinations have been generated and therefore has to sort a huge number of entries.</div><div><br></div><div>A faster way would be to sort each input first (before working out the combinations) and then mark the result as keyed with this hack:</div><div><br></div><div>setattr(l, 'sorted', names(l))</div><div><br></div><div>Basically, only two lines need to change (see the bottom of the post).</div><div><br></div><div>---------</div><div>First, here are some benchmarks of `CJ_fast` (see below) and `CJ` on a relatively large data set:</div><div><br></div><div>w <- sample(1e4, 1e3)</div><div>x <- sample(letters, 12)</div><div>y <- sample(letters, 12)</div><div>z <- sample(letters, 12)</div><div><br></div><div>system.time(t1 <- do.call(CJ, list(w,x,y,z)))</div><div><div> user system elapsed </div><div> 0.775 0.052 0.835 </div></div><div>system.time(t2 <- do.call(CJ_fast, list(w,x,y,z)))</div><div><div> user system elapsed </div><div> 0.220 0.001 0.221 </div></div><div><br></div><div>identical(t1, t2)</div><div>[1] TRUE</div><div>---------</div><div><br></div>
<div><div>The function: (there are only two changes)</div><div><br></div><div><div>CJ_fast <- function (...) </div><div>{</div><div> l = list(...)</div><div> if (length(l) > 1) {</div><div> n = sapply(l, length)</div><div> nrow = prod(n)</div><div> x = c(rev(data.table:::take(cumprod(rev(n)))), 1L)</div><div> # 1) SORT HERE</div><div> for (i in seq(along = x)) l[[i]] = rep(sort(l[[i]], na.last = TRUE), each = x[i], </div><div> length = nrow)</div><div> }</div><div> setattr(l, "row.names", .set_row_names(length(l[[1]])))</div><div> setattr(l, "class", c("data.table", "data.frame"))</div><div> vnames = names(l)</div><div> if (is.null(vnames)) </div><div> vnames = rep("", length(l))</div><div> tt = vnames == ""</div><div> if (any(tt)) {</div><div> vnames[tt] = paste("V", which(tt), sep = "")</div><div> setattr(l, "names", vnames)</div><div> }</div><div> data.table:::settruelength(l, 0L)</div><div> l = alloc.col(l)</div><div> # 2) REPLACE SETKEY WITH ATTRIBUTE "SORTED"</div><div> setattr(l, 'sorted', names(l))</div><div> l</div><div>}</div></div><div><br></div><div>Arun</div><div><br></div></div>
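<div>To illustrate why sorting the inputs up front should be enough, here is a minimal base-R sketch of the same expansion. It is only an illustration under stated assumptions: `cross_sorted` is a hypothetical helper name, it returns a plain data.frame rather than a data.table, and it replaces the internal `data.table:::take` call with ordinary indexing.</div><div><br></div>

```r
# Hypothetical helper (not part of data.table): build the cross product of
# the inputs with every column already in key order, using base R only.
cross_sorted <- function(...) {
  l <- list(...)
  n <- vapply(l, length, integer(1))
  nrow <- prod(n)
  # Column i holds each value in a run whose length is the product of the
  # lengths of all later inputs; the last column has runs of length 1.
  each <- c(rev(cumprod(rev(n)))[-1], 1)
  for (i in seq_along(l)) {
    # Sort the short input once, then expand it to the full length.
    l[[i]] <- rep(sort(l[[i]], na.last = TRUE), each = each[i], length.out = nrow)
  }
  names(l) <- paste0("V", seq_along(l))
  as.data.frame(l, stringsAsFactors = FALSE)
}

a <- c(3, 1, 2)
b <- c("b", "a")
res <- cross_sorted(a, b)

# The rows come out ordered by (V1, V2) without sorting the expanded result.
in_order <- identical(order(res$V1, res$V2), seq_len(nrow(res)))
print(in_order)
```

<div>Setting the 'sorted' attribute by hand is only safe because this construction guarantees the row order; if the inputs were not sorted before expansion, the attribute would claim a key that the data does not actually satisfy.</div><div><br></div>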
</div></div></span>
</blockquote>
<div>
<br>
</div>