[datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Arunkumar Srinivasan aragorn168b at gmail.com
Wed May 21 09:23:12 CEST 2014


Great. That makes total sense to me, and no defaults are affected either. Thanks again.

Arun

From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 21, 2014 at 1:03:03 AM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject:  Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments  

In that case I suggest just changing rbindlist to have use.names =  
fill and leave rbind as is.  
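
A minimal sketch of that suggestion, in base R only. The function name
bind_sketch is hypothetical and merely stands in for rbindlist; it reports
the resolved flags instead of binding anything. It relies on R evaluating
default arguments lazily inside the function, which is why the default for
use.names may refer to fill:

```r
# Hypothetical stand-in for rbindlist(); it only returns the resolved flags.
bind_sketch <- function(l, use.names = fill, fill = FALSE) {
  # Defaults are evaluated lazily in the function's own environment,
  # so use.names picks up whatever value fill ends up with.
  c(use.names = use.names, fill = fill)
}

bind_sketch(list())                    # use.names = FALSE, fill = FALSE
bind_sketch(list(), fill = TRUE)       # use.names = TRUE,  fill = TRUE
bind_sketch(list(), use.names = TRUE)  # use.names = TRUE,  fill = FALSE
```

With this signature the fastest combination stays the default, asking for
fill=TRUE switches use.names on without a warning, and the dependency is
visible from the argument list alone.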

On Tue, May 20, 2014 at 5:16 PM, Arunkumar Srinivasan  
<aragorn168b at gmail.com> wrote:  
> In the current CRAN:  
>  
> rbindlist corresponds to use.names=FALSE and fill = FALSE  
> rbind corresponds to use.names=TRUE and fill = FALSE  
>  
> Just to be clear, again, are you suggesting that I change *just* rbindlist's  
> defaults to use.names=fill and fill=FALSE or for both?  
> Arun  
>  
> From: Gabor Grothendieck ggrothendieck at gmail.com  
> Reply: Gabor Grothendieck ggrothendieck at gmail.com  
> Date: May 20, 2014 at 11:14:15 PM  
>  
> To: Arunkumar Srinivasan aragorn168b at gmail.com  
> Cc: datatable-help at lists.r-forge.r-project.org  
> datatable-help at lists.r-forge.r-project.org  
> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill  
> arguments  
>  
> Yes. That is what I intended.  
>  
> rbindlist on CRAN currently has no fill or use.names arguments. What
> combination of the new fill and use.names does the current CRAN rbindlist
> correspond to?
>  
>  
>  
> On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan  
> <aragorn168b at gmail.com> wrote:  
>> I think I understand now what you’re trying to say. Going back to an
>> earlier post, you wrote:
>>  
>> Then why not make the default of `use.names` be `fill`. Then you don't get  
>> the warning and you can tell just from the argument list what the  
>> dependencies are.  
>>  
>> You mean to basically do?  
>>  
>> rbindlist <- function(l, use.names=fill, fill=FALSE)  
>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)  
>>  
>> Is this what you mean? If so, the defaults from the previous versions will
>> change: anyone who calls rbind directly without setting use.names will
>> get different results (assuming I understand you correctly this time).
>>  
>>  
>> Arun  
>>  
>> From: Gabor Grothendieck ggrothendieck at gmail.com  
>> Reply: Gabor Grothendieck ggrothendieck at gmail.com  
>> Date: May 20, 2014 at 10:49:54 PM  
>>  
>> To: Arunkumar Srinivasan aragorn168b at gmail.com  
>> Cc: datatable-help at lists.r-forge.r-project.org  
>> datatable-help at lists.r-forge.r-project.org  
>> Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and  
>> fill  
>> arguments  
>>  
>> If I understand this right then the table below shows the valid  
>> logical combinations in order of speed (slowest first). Is that  
>> right? If so then if fill = FALSE and use.names = fill then we get  
>> the fastest case by default.  
>>  
>> Furthermore if you were concerned that we might be T/T when F/T would  
>> be sufficient I don't think that is likely since getting F/T is done  
>> by setting use.names = TRUE.  
>>  
>> fill/use.names  
>> T/T (slowest)  
>> F/T  
>> F/F (fastest)
>>  
>>  
>> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan  
>> <aragorn168b at gmail.com> wrote:  
>>> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be  
>>> awesome to have.  
>>>  
>>> One feature I forgot to point out in the previous post is that, even when
>>> there are duplicate names, rbind/rbindlist binds them consistently with
>>> ‘base’ when use.names=TRUE. It also fills the duplicate columns properly
>>> (in the order of occurrence) when fill=TRUE.
>>>  
>>> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
>>> columns V1 to V500 in random order (all integers for simplicity). Only
>>> use.names=TRUE is needed here, as all columns are present in every
>>> data.table.
>>>  
>>> I think this data is big enough to illustrate the point. Also, I was curious
>>> to see a comparison against dplyr’s rbind_all (commit 1504 devel version),
>>> so I’ve added it to the benchmarks as well.
>>>  
>>> Here’s the data generation. Note: this step takes a while to finish.
>>>  
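>>> require(data.table) ## 1.9.3 commit 1267
>>> require(dplyr) ## commit 1504 devel
>>> set.seed(1L)
>>> ## build a data.table with k integer columns (default names V1..Vk)
>>> foo <- function(k) {
>>>   ans = setDT(lapply(1:k, function(x) sample(10)))
>>> }
>>> ## rename the columns of ans to n names sampled from V1..Vk
>>> ## (a random permutation when n == k)
>>> bar <- function(ans, k, n) {
>>>   bla = sample(paste0("V", 1:k), n)
>>>   setnames(ans, bla)
>>> }
>>> n = 10000L
>>> ll = vector("list", n)
>>> for (i in 1:n) {
>>>   bla = bar(foo(500L), 500L, 500L)
>>>   ## internal (unexported) data.table C routine; assigns ll[[i]] in place
>>>   .Call("Csetlistelt", ll, i, bla)
>>> }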
>>>  
>>> And here are the timings:  
>>>  
>>> ## data.table v1.9.3 commit 1267's rbindlist  
>>> ## Timings of three consecutive runs:  
>>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))  
>>> user system elapsed  
>>> 10.909 0.449 11.843  
>>>  
>>> user system elapsed  
>>> 5.219 0.386 5.640  
>>>  
>>> user system elapsed  
>>> 5.355 0.429 5.898  
>>>  
>>> ## dplyr's rbind_all  
>>> ## Timings for three consecutive runs  
>>> system.time(ans2 <- rbind_all(ll))  
>>> user system elapsed  
>>> 62.769 0.247 63.941  
>>>  
>>> user system elapsed  
>>> 62.010 0.335 65.876  
>>>  
>>> user system elapsed  
>>> 55.345 0.359 60.193  
>>>  
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>  
>>> ## data.table v1.9.2's rbind version:  
>>> ## ran only once as it took a bit longer.
>>> system.time(ans1 <- do.call("rbind", ll))  
>>> user system elapsed  
>>> 125.356 2.247 139.000  
>>>  
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>>  
>>> In summary, the newer implementation is ~11–23x faster than data.table’s
>>> older implementation and ~5.5–10x faster than dplyr on this (relatively
>>> huge) data.
>>>  
>>> Arun  
>>>  
>>> From: Arunkumar Srinivasan aragorn168b at gmail.com  
>>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com  
>>> Date: May 20, 2014 at 9:27:56 PM  
>>> To: datatable-help at lists.r-forge.r-project.org  
>>> datatable-help at lists.r-forge.r-project.org  
>>> Subject: FR #5249 - rbindlist gains use.names and fill arguments  
>>>  
>>> Hello everyone,  
>>>  
>>> With the latest commit #1266, the extra functionality offered via rbind
>>> (use.names and fill) is now also available in rbindlist. In addition, the
>>> implementation has moved completely to C and is therefore tremendously
>>> fast, especially when binding with use.names=TRUE and/or fill=TRUE. I’ll
>>> try to put out a benchmark comparing speed differences with the older
>>> implementation ASAP.
>>>  
>>> Note that this change comes at a very low cost to the default speed of
>>> rbindlist (use.names=FALSE and fill=FALSE). As an example, binding
>>> 10,000 data.tables with 20 columns each took 0.107 seconds with the new
>>> version, whereas the older version took 0.095 seconds.
>>>  
>>> In addition, the documentation for ?rbindlist has also been improved
>>> (#5158 from Alexander). Here’s the change log from NEWS:
>>>  
>>> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now  
>>> implemented entirely in C. Closes #5249  
>>> -> use.names by default is FALSE for backwards compatibility  
>>> (doesn't bind by names by default)  
>>> -> rbind(...) now just calls rbindlist() internally, except that  
>>> 'use.names' is TRUE by default,  
>>> for compatibility with base (and backwards compatibility).  
>>> -> fill by default is FALSE. If fill is TRUE, use.names has to be  
>>> TRUE.  
>>> -> At least one item of the input list has to have non-null column  
>>> names.  
>>> -> Duplicate columns are bound in the order of occurrence, like  
>>> base.  
>>> -> Attributes that might exist in individual items would be lost in  
>>> the bound result.  
>>> -> Columns are coerced to the highest SEXPTYPE, if they are  
>>> different, if/when possible.  
>>> -> And incredibly fast ;).  
>>> -> Documentation updated in much detail. Closes DR #5158.  
>>> Eddi's (excellent) work on finding factor levels, type coercion of  
>>> columns etc. are all retained.  
>>>  
>>> Please try it and write back if things aren’t working as they were before.
>>> The tests that had to be fixed cover extremely rare cases, so I suspect
>>> there should be minimal issues, if any, in this version. I do find,
>>> though, that the changes here bring consistency to the function.
>>>  
>>> One (very rare) feature that is not available due to this implementation
>>> is the ability to recycle.
>>>  
>>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))  
>>> lst1 <- list(x=4, y=5, z=as.list(1:3))  
>>>  
>>> rbind(dt1, lst1)  
>>> # x y z  
>>> # 1: 1 4 1,2  
>>> # 2: 2 5 1,2,3  
>>> # 3: 3 6 1,2,3,4  
>>> # 4: 4 5 1  
>>> # 5: 4 5 2  
>>> # 6: 4 5 3  
>>>  
>>> The 4,5 are recycled very nicely here. This is not possible at the moment,
>>> because the earlier rbind implementation relied on as.data.table to
>>> convert to data.table, and that takes a copy (very inefficient on huge /
>>> many tables). I’d love to add this feature in C as well, as it would help
>>> incredibly for use within [.data.table (now that we can fill columns and
>>> bind by names faster). Will add a FR.
>>>  
>>> In summary, I think there should be minimal issues, if any, and it should
>>> be much faster (for rbind cases). Please write back with what you think if
>>> you happen to try it out.
>>>  
>>>  
>>>  
>>> Arun  
>>>  
>>>  
>>  
>>  
>>  
>> --  
>> Statistics & Software Consulting  
>> GKX Group, GKX Associates Inc.  
>> tel: 1-877-GKX-GROUP  
>> email: ggrothendieck at gmail.com  