[datatable-help] data.table() function regarding

Steve Lianoglou lianoglou.steve at gene.com
Wed Aug 14 19:24:08 CEST 2013


In fact, we already had a ticket on the tracker for this, so I'm just
linking the two: the ticket now points to this thread, and here it is:

https://r-forge.r-project.org/tracker/?group_id=240&atid=975&func=detail&aid=2599
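
For anyone arriving here from the ticket, the cause described in the quoted
message below can be illustrated in base R alone. This is only a sketch:
arg_text is a hypothetical helper that mimics the as.character(tt) step,
not data.table's actual code, and the vector size is arbitrary.

```r
# Hypothetical helper: turn every argument expression into text, as the
# as.character(tt) step discussed below does.
arg_text <- function(...) as.character(substitute(list(...)))[-1L]

x <- runif(1e4)

# Direct call: the argument expression is just the symbol `x`.
direct <- arg_text(x)

# do.call: the argument is the evaluated vector itself, so every one of
# its values ends up in the text.
via_do_call <- do.call("arg_text", list(x))

nchar(direct)       # 1
nchar(via_do_call)  # many thousands of characters
```

The direct call only has to convert a one-character symbol, while the
do.call version converts the whole vector, which would explain why the cost
grows with the size of the columns.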

On Wed, Aug 14, 2013 at 10:18 AM, Steve Lianoglou
<lianoglou.steve at gene.com> wrote:
> Hi Arun,
>
> Thanks for this very detailed analysis!
>
> The slowness of transform.data.table is something that's been bugging
> me for a while, but I have not had the time to dig into it myself yet,
> so this is really great.
>
> I quickly tried to apply your proposed fix and recompiled/reinstalled
> data.table. It looks like some errors do pop up after running
> test.data.table(), but I *think* they are trivial -- I don't have time
> to investigate further right now, but will do so in due time if
> Matthew (or you :-) don't beat me to it.
>
> Thanks again,
> -steve
>
>
> On Wed, Aug 14, 2013 at 9:26 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>> Hello,
>>
>> This question comes from a recent SO question, "Why is
>> transform.data.table so much slower than transform.data.frame?"
>>
>> Suppose I have:
>>
>> DT <- data.table(x=sample(1e5), y=sample(1e5))
>>
>> To transform this data.table by adding an extra column z = 1 (I'm aware of
>> the idiomatic way using :=, but let's set that aside for the moment), I'd
>> do:
>>
>> transform(DT, z = 1)
>>
>> However, this is terribly slow. I debugged the code and found the reason
>> for the slowness. In short, transform.data.table calls:
>>
>> ans <- do.call("data.table", c(list(`_data`), e[!matched]))
>>
>> which calls data.table(), and the slowness comes from this line there:
>>
>> exptxt = as.character(tt) # <~~~~~~~~ SLOW when called with `do.call`!
>>
>> Now, the point is that exptxt is used in only one other place, the
>> if-statement pasted below.
>>
>> if (any(novname) && length(exptxt)==length(vnames)) {
>>     okexptxt =  exptxt[novname] == make.names(exptxt[novname])
>>     vnames[novname][okexptxt] = exptxt[novname][okexptxt]
>> }
>> tt = vnames==""
>>
>> This statement is what makes the following work, for example:
>>
>> x <- 1:5
>> y <- 6:10
>> DT <- data.table(x, y)
>>    x  y
>> 1: 1  6
>> 2: 2  7
>> 3: 3  8
>> 4: 4  9
>> 5: 5 10
>>
>> This gives a data.table with column names the same as input variables
>> instead of giving V1 and V2.
>>
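>> But this is what slows down do.call("data.table", ...): do.call passes the
>> already-evaluated arguments, so as.character(tt) has to convert whole
>> columns to text rather than short variable names. For example,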
>>
>> ll <- list(data.table(x=runif(1e5), y=runif(1e5)), z=runif(1e5), w=1)
>> system.time(do.call("data.table", ll)) # 30 seconds on my mac
>>
>> However, exptxt <- as.character(tt) and the if-statement above can be
>> replaced with the following (adapted from the data.frame function):
>>
>> for (i in which(novname)) {
>>     tmp <- deparse(tt[[i]])
>>     if (tmp == make.names(tmp))
>>         vnames[i] <- tmp
>> }
>>
>> With this replacement, do.call("data.table", ...) takes 0.04 seconds, and
>> data.table(x, y) still gives the intended result, with column names x
>> and y.
>>
>> In essence, replacing those lines leaves the behaviour of data.table
>> unchanged while making do.call("data.table", ...) faster (and hence also
>> transform and other functions that depend on it).
>>
>> What do you think? To my knowledge, this doesn't seem to break anything
>> else...
>>
>> Arun
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
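
The naming loop proposed in the quoted message can be tried outside
data.table with a small base-R sketch. infer_names is a hypothetical
stand-alone helper, not data.table's code. It also assumes one addition to
the quoted loop: nlines = 1L and [1L] keep tmp to a single string, since
deparse can return several lines for a long inlined value.

```r
# Hypothetical helper: name unnamed arguments after the expression that
# was passed, but only when that expression is a syntactically valid name.
infer_names <- function(...) {
  tt <- as.list(substitute(list(...)))[-1L]  # unevaluated argument expressions
  vnames <- names(tt)
  if (is.null(vnames)) vnames <- rep.int("", length(tt))
  novname <- vnames == ""
  for (i in which(novname)) {
    # Assumption: keep only the first deparsed line so tmp has length one.
    tmp <- deparse(tt[[i]], nlines = 1L)[1L]
    if (tmp == make.names(tmp)) vnames[i] <- tmp
  }
  vnames
}

x <- 1:5
y <- 6:10
infer_names(x, y)                                 # "x" "y"
infer_names(x, w = 1, y + 1)                      # "x" "w" ""
do.call("infer_names", list(runif(1e5), z = 1))   # ""  "z"
```

Arguments that still have an empty name at the end are the ones that would
fall back to default names such as V1 and V2.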



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

