[datatable-help] data.table() function regarding

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Aug 14 18:26:05 CEST 2013


This question comes from a recent SO question, "Why is transform.data.table so much slower than transform.data.frame?" (http://stackoverflow.com/questions/18216658/why-is-transform-data-table-so-much-slower-than-transform-data-frame)

Suppose I have:

DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5)) 

And I want to transform this data.table by adding an extra column z = 1 (I'm aware of the idiomatic way using :=, but let's set that aside for the moment). I'd do:

transform(DT, z = 1)
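For reference, the idiomatic := route mentioned above looks roughly like the sketch below. It uses a small stand-in table (not the 1e5-row DT from this post) and assumes the data.table package is installed.

```r
library(data.table)

# Small stand-in table, for illustration only.
DT_small <- data.table(x = 1:3, y = 4:6)

# := adds (or overwrites) the column by reference, without copying the table.
DT_small[, z := 1]
```

Because := modifies the table in place, it never goes through do.call("data.table", ...), which is the code path discussed in the rest of this post.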

However, this is terribly slow. I debugged the code and found the reason for the slowness. In short, transform.data.table calls:

ans <- do.call("data.table", c(list(`_data`), e[!matched])) 

which in turn calls data.table(), and the slowness happens on this line:

exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`! 
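A minimal base-R sketch of why this line behaves so differently under do.call. The helper capture_args is hypothetical and only mimics how data.table() captures its unevaluated arguments. With a direct call the captured list holds symbols, but do.call builds the call from already-evaluated values, so the captured list holds the full vectors, and as.character() then has to convert every one of them to text.

```r
# Hypothetical helper: capture the arguments the way
# as.list(substitute(list(...)))[-1L] would inside a function.
capture_args <- function(...) as.list(substitute(list(...)))[-1L]

x <- 1:5

tt_direct <- capture_args(x)                  # holds the symbol `x`
tt_docall <- do.call(capture_args, list(x))   # holds the integer vector itself

is.name(tt_direct[[1L]])     # TRUE: just a name, cheap to convert
is.integer(tt_docall[[1L]])  # TRUE: the whole vector was captured

as.character(tt_direct)      # "x"
```

With 1e5-element columns instead of 1:5, the do.call case has to turn each full column into a character string, which is where the time goes.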

Now, the point is that exptxt is used in only one other place, the if-statement pasted below.

if (any(novname) && length(exptxt)==length(vnames)) {
    okexptxt = exptxt[novname] == make.names(exptxt[novname])
    vnames[novname][okexptxt] = exptxt[novname][okexptxt]
}
tt = vnames==""

This statement is useful when, for example, one does:

x <- 1:5
y <- 6:10
DT <- data.table(x, y)
   x  y
1: 1  6
2: 2  7
3: 3  8
4: 4  9
5: 5 10

This gives a data.table with column names the same as input variables instead of giving V1 and V2.
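The make.names() comparison in that if-statement is what decides whether an argument's text is usable as a column name. A small base-R illustration, with made-up values for exptxt:

```r
# Made-up argument texts: a plain symbol, an expression, and a deparsed value.
exptxt <- c("x", "x + 1", "1:5")

# A string is kept as a column name only if make.names() leaves it unchanged,
# i.e. only if it is already a syntactically valid name.
ok <- exptxt == make.names(exptxt)
ok  # TRUE FALSE FALSE
```

So only bare variable names such as x survive, and anything else falls back to the default V1, V2, ... naming.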

But this is exactly what slows down do.call("data.table", ...). For example,

ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1)
system.time(do.call("data.table", ll)) # 30 seconds on my mac

However, the exptxt = as.character(tt) line and the if-statement above can be replaced with the following (borrowing from the data.frame function):

for (i in which(novname)) {
    tmp <- deparse(tt[[i]])
    if (tmp == make.names(tmp)) vnames[i] <- tmp
}

With this replacement, running do.call("data.table", ...) takes 0.04 seconds. Also, data.table(x, y) still gives the intended result, with column names x and y.
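To try the naming logic outside data.table(), here is a standalone base-R sketch. The wrapper guess_names is hypothetical, and it takes only the first element of deparse() (an assumption beyond the loop above), since deparse() can return several strings for long expressions and if() needs a single TRUE or FALSE.

```r
# Hypothetical wrapper around the proposed loop; not part of data.table.
guess_names <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]   # unevaluated arguments
  vnames <- names(tt)
  if (is.null(vnames)) vnames <- rep.int("", length(tt))
  novname <- vnames == ""
  for (i in which(novname)) {
    # Assumption: keep only the first deparsed line so the comparison is scalar.
    tmp <- deparse(tt[[i]])[1L]
    if (tmp == make.names(tmp)) vnames[i] <- tmp
  }
  vnames
}

x <- 1:5
y <- 6:10
guess_names(x, y)             # "x" "y"
guess_names(x, b = y, x + 1)  # "x" "b" ""
```

Only arguments without an explicit name are deparsed, so named arguments such as z= and w= in the do.call example above are never converted to text at all.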

In essence, replacing the lines mentioned above leaves the behaviour of data.table() unchanged while making do.call("data.table", ...) faster (and hence also transform and other functions that depend on it).

What do you think? To my knowledge, this doesn't seem to break anything else...

