[datatable-help] data.table() function regarding
Steve Lianoglou
lianoglou.steve at gene.com
Wed Aug 14 19:18:40 CEST 2013
Hi Arun,
Thanks for this very detailed analysis!
The slowness of transform.data.table is something that's been bugging
me for a while, but I have not had the time to dig into it myself yet,
so this is really great.
I quickly tried to apply your proposed fix and recompiled/reinstalled
data.table. It looks like there are some errors that do pop up after
running test.data.table(), but I *think* they are trivial -- I don't
have time to investigate further right now, but will do so in due time
if Matthew (or you :-) don't beat me to it.
Thanks again,
-steve
On Wed, Aug 14, 2013 at 9:26 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Hello,
>
> This question comes from a recent SO question, "Why is transform.data.table
> so much slower than transform.data.frame?"
>
> Suppose I have:
>
> DT <- data.table(x=sample(1e5), y=sample(1e5), z=sample(1e5))
>
> If I want to transform this data.table by setting a column z = 1 (I'm
> aware of the idiomatic way of using :=, but let's keep that aside for the
> moment), I'd do:
>
> transform(DT, z = 1)
>
> However, this is terribly slow. I debugged the code and found the reason
> for the slowness. In short, transform.data.table calls:
>
> ans <- do.call("data.table", c(list(`_data`), e[!matched]))
>
> which calls data.table(), where the slowness occurs at this line:
>
> exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`!
>
> Now, the point is that exptxt is only used in one other place, the
> if-statement pasted below.
>
> if (any(novname) && length(exptxt)==length(vnames)) {
>     okexptxt = exptxt[novname] == make.names(exptxt[novname])
>     vnames[novname][okexptxt] = exptxt[novname][okexptxt]
> }
> tt = vnames==""
>
> This statement is useful when, for example, one does:
>
> x <- 1:5
> y <- 6:10
> DT <- data.table(x, y)
>    x  y
> 1: 1  6
> 2: 2  7
> 3: 3  8
> 4: 4  9
> 5: 5 10
>
> This gives a data.table with column names the same as input variables
> instead of giving V1 and V2.
>
> But this is what slows down a do.call("data.table", ...) call. For
> example,
>
> ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1)
> system.time(do.call("data.table", ll)) # 30 seconds on my mac
>
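An aside on why do.call makes a difference here, as far as I understand
it: do.call evaluates the elements of the list before building the call,
so inside the function substitute() sees the values themselves rather than
short symbols like x, and as.character() then has to turn whole vectors
into text. A minimal base-R sketch (arg_kinds is a hypothetical helper for
illustration only, not part of data.table):

```r
# Hypothetical helper: report what each argument looks like *before*
# evaluation, i.e. what substitute() captures for it.
arg_kinds <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]
  vapply(tt, function(e) class(e)[1L], character(1))
}

x <- runif(10)

# Direct call: the captured argument is the symbol `x`.
direct <- arg_kinds(x)

# Via do.call: the list element is already evaluated, so the captured
# argument is the numeric vector itself.
via_call <- do.call("arg_kinds", list(x))

print(direct)
print(via_call)
```

With a 1e5-row table in the list, the captured "expression" is the whole
table, which would explain why converting it to text is so expensive.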
> However, the exptxt <- as.character(tt) line and the above-mentioned
> if-statement can be replaced with (borrowing from the data.frame function):
>
> for (i in which(novname)) {
>     tmp <- deparse(tt[[i]])
>     if (tmp == make.names(tmp))
>         vnames[i] <- tmp
> }
>
> With this replacement, running do.call("data.table", ...) takes 0.04
> seconds. Also, data.table(x, y) still gives the intended result, with
> column names x and y.
>
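For illustration, the name-inference idea can be sketched in base R as
below. infer_names is a hypothetical stand-in, not data.table's actual
code; the nlines = 1L argument and the length check are my own additions
(assumptions), so that a long deparsed value never reaches the comparison.

```r
# Hypothetical sketch: infer names for unnamed arguments from the
# expressions they were passed as, keeping only syntactically valid names.
infer_names <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]
  vnames <- names(tt)
  if (is.null(vnames)) vnames <- rep.int("", length(tt))
  novname <- vnames == ""
  for (i in which(novname)) {
    # Deparse at most one line; a bare symbol such as `x` deparses to "x".
    tmp <- deparse(tt[[i]], nlines = 1L)
    if (length(tmp) == 1L && tmp == make.names(tmp))
      vnames[i] <- tmp
  }
  vnames
}

x <- 1:5
y <- 6:10
direct_names  <- infer_names(x, y)        # arguments passed as symbols
mixed_names   <- infer_names(x, w = 1)    # explicit names are kept
docall_names  <- do.call("infer_names", list(x, y))  # arguments passed as values
```

Called directly, the symbols x and y become the names; through do.call the
unnamed arguments arrive as values, so no valid name is found and they stay
empty, leaving them to whatever default naming follows.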
> In essence, replacing the above-mentioned lines leaves the intended
> behavior of data.table unchanged while making do.call("data.table", ...)
> faster (and hence transform and other functions that depend on it).
>
> What do you think? To my knowledge, this doesn't seem to break anything
> else...
>
> Arun
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech