[datatable-help] FR #5249 - rbindlist gains use.names and fill arguments

Tue May 20 23:13:55 CEST 2014

Yes.  That is what I intended.
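
To spell out the dependency: R evaluates default arguments lazily, so one
default can refer to another argument. Below is a minimal sketch of the idea.
It uses a hypothetical wrapper (rbindlist_dep is a name made up for
illustration, not a data.table function) and assumes a data.table build whose
rbindlist already accepts use.names and fill.

```r
library(data.table)

# Hypothetical wrapper for illustration only; not part of data.table.
# use.names defaults to the value of fill, so fill = TRUE implies
# use.names = TRUE unless the caller overrides it.
rbindlist_dep <- function(l, use.names = fill, fill = FALSE) {
  rbindlist(l, use.names = use.names, fill = fill)
}

a <- data.table(x = 1:2, y = 3:4)
b <- data.table(x = 5L, z = 6L)

# Setting fill = TRUE alone is enough; use.names follows it.
res <- rbindlist_dep(list(a, b), fill = TRUE)
print(res)
```

With this signature the argument list itself documents that use.names
follows fill, and a caller can still set use.names = TRUE on its own.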

rbindlist on CRAN currently has no fill or use.names arguments.  What
combination of the new fill and use.names does the current CRAN rbindlist
correspond to?
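
To make the three valid combinations concrete, here is a small sketch. It
assumes a data.table build in which rbindlist accepts both arguments; the
tables and variable names are made up for illustration.

```r
library(data.table)

d1 <- data.table(x = 1:2, y = 3:4)
d2 <- data.table(y = 5:6, x = 7:8)   # same columns, different order
d3 <- data.table(x = 9L, z = 10L)    # has a column the others lack

# fill = FALSE, use.names = FALSE: bind by position, names ignored
by_pos <- rbindlist(list(d1, d2), use.names = FALSE, fill = FALSE)

# fill = FALSE, use.names = TRUE: bind by column name
by_name <- rbindlist(list(d1, d2), use.names = TRUE, fill = FALSE)

# fill = TRUE, use.names = TRUE: bind by name, pad missing columns with NA
filled <- rbindlist(list(d1, d3), use.names = TRUE, fill = TRUE)

print(by_pos)
print(by_name)
print(filled)
```

In by_pos the values of d2's y column land under x, whereas by_name lines
them up under y.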

On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> I think I understand now what you’re trying to say. Going back to an earlier
> post, you wrote:
>
> Then why not make the default of `use.names` be `fill`. Then you don't get
> the warning and you can tell just from the argument list what the
> dependencies are.
>
> You mean to basically do?
>
> rbindlist <- function(l, use.names=fill, fill=FALSE)
> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
>
> Is this what you mean? If so, the defaults from the previous versions will
> change. Anyone who calls rbind directly without setting use.names will
> get different results (assuming I understand you correctly this time).
>
>
> Arun
>
> From: Gabor Grothendieck ggrothendieck at gmail.com
> Reply: Gabor Grothendieck ggrothendieck at gmail.com
> Date: May 20, 2014 at 10:49:54 PM
>
> To: Arunkumar Srinivasan aragorn168b at gmail.com
> Cc: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject:  Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill
> arguments
>
> If I understand this right then the table below shows the valid
> logical combinations in order of speed (slowest first). Is that
> right? If so then if fill = FALSE and use.names = fill then we get
> the fastest case by default.
>
> Furthermore if you were concerned that we might be T/T when F/T would
> be sufficient I don't think that is likely since getting F/T is done
> by setting use.names = TRUE.
>
> fill/use.names
> T/T (slowest)
> F/T
> F/F (fastest)
>
>
> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be
>> awesome to have.
>>
>> One feature I forgot to point out in the previous post is that, even when
>> there are duplicate names, rbind/rbindlist binds them consistently with
>> ‘base’ when use.names=TRUE. It also fills the duplicate columns properly
>> (in the order of occurrence) when fill=TRUE.
>>
>> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
>> columns ranging from V1 to V500 in random order (all integers for
>> simplicity). We’ll need to just use use.names=TRUE (as all columns are
>> available in all data.tables).
>>
>> I think this data is big enough to illustrate the point. Also, I was
>> curious to see a comparison against dplyr’s rbind_all (commit 1504 devel
>> version), so I’ve added it to the benchmarks as well.
>>
>> Here’s the data generation. Note: this step takes a while to finish.
>>
>> require(data.table) ## 1.9.3 commit 1267
>> require(dplyr) ## commit 1504 devel
>> set.seed(1L)
>> foo <- function(k) {
>>     ans = setDT(lapply(1:k, function(x) sample(10)))
>> }
>> bar <- function(ans, k, n) {
>>     bla = sample(paste0("V", 1:k), n)
>>     setnames(ans, bla)
>> }
>> n = 10000L
>> ll = vector("list", n)
>> for (i in 1:n) {
>>     bla = bar(foo(500L), 500L, 500L)
>>     .Call("Csetlistelt", ll, i, bla)
>> }
>>
>> And here are the timings:
>>
>> ## data.table v1.9.3 commit 1267's rbindlist
>> ## Timings of three consecutive runs:
>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
>> user system elapsed
>> 10.909 0.449 11.843
>>
>> user system elapsed
>> 5.219 0.386 5.640
>>
>> user system elapsed
>> 5.355 0.429 5.898
>>
>> ## dplyr's rbind_all
>> ## Timings for three consecutive runs
>> system.time(ans2 <- rbind_all(ll))
>> user system elapsed
>> 62.769 0.247 63.941
>>
>> user system elapsed
>> 62.010 0.335 65.876
>>
>> user system elapsed
>> 55.345 0.359 60.193
>>
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>
>> ## data.table v1.9.2's rbind version:
>> ## ran only once as it took a bit more.
>> system.time(ans1 <- do.call("rbind", ll))
>> user system elapsed
>> 125.356 2.247 139.000
>>
>>> identical(ans1, setDT(ans2)) # [1] TRUE
>>
>> In summary, the newer implementation is roughly 11–23x faster than
>> data.table’s older implementation and roughly 5.5–10x faster than dplyr
>> on this (relatively huge) data.
>>
>> Arun
>>
>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>> Date: May 20, 2014 at 9:27:56 PM
>> To: datatable-help at lists.r-forge.r-project.org
>> datatable-help at lists.r-forge.r-project.org
>> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>>
>> Hello everyone,
>>
>> With the latest commit #1266, the extra functionality offered via rbind
>> (use.names and fill) is now also available in rbindlist. In addition, the
>> implementation has moved completely to C and is therefore tremendously
>> fast, especially for cases where one has to bind with use.names=TRUE
>> and/or fill=TRUE. I’ll try to put out a benchmark comparing speed
>> differences with the older implementation ASAP.
>>
>> Note that this change comes at a very low cost to the default speed of
>> rbindlist, i.e. with use.names=FALSE and fill=FALSE. As an example,
>> binding 10,000 data.tables with 20 columns each took 0.107 seconds with
>> the new version, whereas the older version took 0.095 seconds.
>>
>> In addition, the documentation for ?rbindlist has been improved (#5158
>> from Alexander). Here’s the change log from NEWS:
>>
>>   o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
>>     implemented entirely in C. Closes #5249
>>       -> use.names by default is FALSE for backwards compatibility
>>          (doesn't bind by names by default)
>>       -> rbind(...) now just calls rbindlist() internally, except that
>>          'use.names' is TRUE by default,
>>          for compatibility with base (and backwards compatibility).
>>       -> fill by default is FALSE. If fill is TRUE, use.names has to be
>>          TRUE.
>>       -> At least one item of the input list has to have non-null column
>>          names.
>>       -> Duplicate columns are bound in the order of occurrence, like
>>          base.
>>       -> Attributes that might exist in individual items would be lost in
>>          the bound result.
>>       -> Columns are coerced to the highest SEXPTYPE, if they are
>>          different, if/when possible.
>>       -> And incredibly fast ;).
>>       -> Documentation updated in much detail. Closes DR #5158.
>>     Eddi's (excellent) work on finding factor levels, type coercion of
>>     columns etc. are all retained.
>>
>> Please try it and write back if things aren’t working as they were
>> before. The tests that had to be fixed cover extremely rare cases. I
>> suspect there should be minimal issues, if any, in this version. However,
>> I do find that the changes here bring consistency to the function.
>>
>> One (very rarely used) feature that is not available due to this
>> implementation is the ability to recycle.
>>
>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
>> lst1 <- list(x=4, y=5, z=as.list(1:3))
>>
>> rbind(dt1, lst1)
>> # x y z
>> # 1: 1 4 1,2
>> # 2: 2 5 1,2,3
>> # 3: 3 6 1,2,3,4
>> # 4: 4 5 1
>> # 5: 4 5 2
>> # 6: 4 5 3
>>
>> The 4,5 are recycled very nicely here. This is not possible at the
>> moment, because the earlier rbind implementation used as.data.table to
>> convert to data.table, which takes a copy (very inefficient on huge /
>> many tables). I’d love to add this feature in C as well, as it would help
>> incredibly for use within [.data.table (now that we can fill columns and
>> bind by names faster). Will add a FR.
>>
>> In summary, I think there should be minimal issues, if any, and it should
>> be much faster (for rbind cases). Please write back with what you think
>> if you happen to try it out.
>>
>>
>>
>> Arun
>>
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
