[datatable-help] FR #2722 testing

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Mar 19 03:00:52 CET 2014


Hi everybody,

FR #2722 has now been implemented and committed. It'd be great if people who are used to running devel versions could test it and let us know whether everything works as expected.

Here's an explanation of what the FR is and what's being optimised: 
Assuming a data.table with 4 columns x,y,z,grp, something like:

DT[, c(sum(y), lapply(.SD, sum), .N, .I, lapply(.SD, mean)), by=grp]
is usually quite slow because it relies on eval with lapply. This will now be optimised to:

DT[, list(sum(y), sum(x), sum(y), sum(z), .N, .I, mean(x), mean(y), mean(z)), by=grp]
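
As a small illustration of the rewrite above, here is a minimal sketch that compares the shorthand form with a hand-expanded list() on a toy table. The data and the names DT, short, long and total are made up for the example, and the expanded call is written by hand rather than taken from data.table's internals.

```r
library(data.table)

# Hypothetical toy data: three value columns and one grouping column.
DT <- data.table(x = 1:6, y = c(2, 4, 6, 8, 10, 12), z = 6:1,
                 grp = rep(c("a", "b"), each = 3))

# Shorthand form: a named scalar combined with lapply(.SD, fun).
short <- DT[, c(total = sum(y), lapply(.SD, mean)), by = grp]

# Hand-expanded form that the shorthand is meant to be equivalent to.
long <- DT[, list(total = sum(y), x = mean(x), y = mean(y), z = mean(z)), by = grp]

print(short)
# Compare as plain data.frames so only names and values are checked.
print(all.equal(as.data.frame(short), as.data.frame(long)))
```

If the two results ever differ on your own data, that would be worth reporting to the list.
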
However, we don't optimise when .SD appears inside c(...) in j in any form other than lapply(.SD, fun), because there are quite a few possibilities with .SD:

DT[, c(.SD, .SD[1], .SD+a, .SD[x>1], .SD[J(.)], .SD[.(.)], lapply(.SD, sum)), by=grp]
Also, consider the case .SD[sample(.N, 1)]: this obviously can't be optimised to list(x=x[sample(.)], y=y[sample(.)], z=z[sample(.)]), since each column would then draw its own sample. So the expression inside .SD[...] has to be evaluated first, its type checked (logical, numeric, integer, data.table?), and the call then optimised accordingly.

Therefore, this will be postponed until it can be done in a clear way, if that is possible at all. However, we've not come across such a case here on the mailing list or on SO yet, so I'm assuming it's a very rare case, which is good.
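
To make the .SD[sample(.N, 1)] point more concrete, here is a rough sketch on made-up data (the names DT, whole_row and per_column are only for illustration). It contrasts subsetting .SD with one sampled row index against a naive column-wise expansion in which every column calls sample() on its own.

```r
library(data.table)
set.seed(1L)

# Hypothetical toy data in which x, y and z agree within every row.
DT <- data.table(x = 1:10, y = 1:10, z = 1:10,
                 grp = rep(c("a", "b"), each = 5))

# .SD[sample(.N, 1)] draws ONE row index per group and applies it to all
# columns, so each returned row is a genuine row of DT.
whole_row <- DT[, .SD[sample(.N, 1)], by = grp]

# A naive column-wise expansion calls sample() once per column, so x, y
# and z may come from different rows of the same group.
per_column <- DT[, list(x = x[sample(.N, 1)],
                        y = y[sample(.N, 1)],
                        z = z[sample(.N, 1)]), by = grp]

print(whole_row)
print(per_column)
```

In whole_row the three columns always agree within a row, whereas per_column carries no such guarantee, which is why the two forms can't be treated as interchangeable.
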

Summary: The most common cases should therefore be very fast. Here's a benchmark comparing the timings with and without optimisation:

require(data.table)
set.seed(1L)
dt <- data.table(x=rep(1:1e6, each=10), y=sample(10), z=sample(2))

options(datatable.verbose=TRUE) # not pasting verbose messages here.

# without optimisation
options(datatable.optimize=0L)
system.time(ans1 <- dt[, c(bla = sum(y), lapply(.SD, mean)), by=x])
#   user  system elapsed 
# 90.705   5.184 121.274 

# with optimisation
options(datatable.optimize=Inf)
system.time(ans2 <- dt[, c(bla = sum(y), lapply(.SD, mean)), by=x])
#   user  system elapsed 
#  0.450   0.128   0.690 
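
For anyone testing, a quick sanity check along the following lines may be useful. It is only a sketch on a small made-up table (DT, ans_off and ans_on are names chosen for the example). It runs the same query with optimisation switched off and on, using the datatable.optimize option shown above, and compares the two results.

```r
library(data.table)

# Hypothetical small table, so the unoptimised run is also quick.
DT <- data.table(x = rep(1:3, each = 4), y = 1:12, z = rep(c(0.5, 1.5), 6))

old <- options(datatable.optimize = 0L)   # optimisation off
ans_off <- DT[, c(bla = sum(y), lapply(.SD, mean)), by = x]

options(datatable.optimize = Inf)         # optimisation on
ans_on <- DT[, c(bla = sum(y), lapply(.SD, mean)), by = x]

options(old)                              # restore the previous setting

# Both runs should give the same answer; only the speed should differ.
print(all.equal(as.data.frame(ans_off), as.data.frame(ans_on)))
```
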
Note that the case DT[, c(sum(y), lapply(.SD, sum)), by=grp, .SDcols=..] is still not implemented (FR #5222), so the optimised version will also result in an "object not found" error. Once that FR is taken care of, the optimisation will work for this case automatically.



Arun