[datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
Arunkumar Srinivasan
aragorn168b at gmail.com
Tue May 20 23:01:52 CEST 2014
I think I understand now what you’re trying to say. Going back to an earlier post, you wrote:
Then why not make the default of `use.names` be `fill`. Then you don't get the warning and you can tell just from the argument list what the dependencies are.
Do you mean basically this?
rbindlist <- function(l, use.names=fill, fill=FALSE)
.rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)
Is this what you mean? If so, the defaults from previous versions would change: anyone who calls rbind directly without setting use.names would get different results (assuming I understand you correctly this time).
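
To make the proposal concrete, here is a minimal sketch in plain R of how a default of `use.names=fill` would resolve. The functions `rbindlist_sketch` and `rbind_sketch` are hypothetical stand-ins, not data.table's code; they only report the flag values the body would see. The sketch relies on R evaluating default arguments lazily inside the function, so `use.names` picks up whatever `fill` turns out to be.

```r
# Hypothetical stand-ins (NOT data.table's implementation): they only
# return the flag values that the function body would see.
rbindlist_sketch <- function(l, use.names = fill, fill = FALSE) {
  c(use.names = use.names, fill = fill)
}
rbind_sketch <- function(..., use.names = fill, fill = FALSE) {
  c(use.names = use.names, fill = fill)
}

# Nothing set: use.names follows fill, so both are FALSE (the fast path).
r1 <- rbindlist_sketch(list())

# Only fill set: use.names becomes TRUE automatically, so no warning is needed.
r2 <- rbindlist_sketch(list(), fill = TRUE)

# Only use.names set: bind by name without filling.
r3 <- rbindlist_sketch(list(), use.names = TRUE)

# The compatibility concern: a bare rbind() call would now see
# use.names = FALSE, whereas rbind currently defaults to use.names = TRUE.
r4 <- rbind_sketch()

print(rbind(r1, r2, r3, r4))
```

Under this scheme a bare call to the rbind stand-in sees use.names=FALSE, which is exactly the change in default behaviour raised above.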
Arun
From: Gabor Grothendieck ggrothendieck at gmail.com
Reply: Gabor Grothendieck ggrothendieck at gmail.com
Date: May 20, 2014 at 10:49:54 PM
To: Arunkumar Srinivasan aragorn168b at gmail.com
Cc: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments
If I understand this right then the table below shows the valid
logical combinations in order of speed (slowest first). Is that
right? If so then if fill = FALSE and use.names = fill then we get
the fastest case by default.
Furthermore if you were concerned that we might be T/T when F/T would
be sufficient I don't think that is likely since getting F/T is done
by setting use.names = TRUE.
fill/use.names
T/T (slowest)
F/T
F/F (fastest)
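
For reference, a small sketch of what the three combinations in the table do to the result (it says nothing about their speed). It assumes a data.table version whose rbindlist has the use.names and fill arguments (1.9.3 or later); the tables `a`, `b` and `d` are made up for illustration.

```r
library(data.table)

# Made-up inputs: b has the same columns as a in a different order,
# d lacks column y and has an extra column z.
a <- data.table(x = 1:2, y = 3:4)
b <- data.table(y = 5L, x = 6L)
d <- data.table(x = 7L, z = 8L)

# fill = FALSE, use.names = FALSE: bind by position, names of a are kept,
# so b's y value lands in column x.
ff <- rbindlist(list(a, b), use.names = FALSE, fill = FALSE)

# fill = FALSE, use.names = TRUE: bind by column name.
ft <- rbindlist(list(a, b), use.names = TRUE, fill = FALSE)

# fill = TRUE, use.names = TRUE: bind by name, missing columns become NA.
tt <- rbindlist(list(a, d), use.names = TRUE, fill = TRUE)

print(ff)
print(ft)
print(tt)
```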
On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be
> awesome to have.
>
> One feature I forgot to point out in the previous post is that, even when
> there are duplicate names, rbind/rbindlist binds them consistently with ‘base’
> when use.names=TRUE. It also fills the duplicate columns properly (in the
> order of occurrence) when fill=TRUE.
>
> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with
> columns ranging from V1 to V500 in random order (all integers for
> simplicity). We only need use.names=TRUE here (as all columns are
> available in all data.tables).
>
> I think this data is big enough to illustrate the point. Also, I was curious
> to see a comparison against dplyr’s rbind_all (commit 1504 devel version).
> So, I’ve added it as well to the benchmarks.
>
> Here’s the data generation. Note: It takes a while for this step to finish.
>
> require(data.table) ## 1.9.3 commit 1267
> require(dplyr) ## commit 1504 devel
> set.seed(1L)
> foo <- function(k) {
>     ans = setDT(lapply(1:k, function(x) sample(10)))
> }
> bar <- function(ans, k, n) {
>     bla = sample(paste0("V", 1:k), n)
>     setnames(ans, bla)
> }
> n = 10000L
> ll = vector("list", n)
> for (i in 1:n) {
>     bla = bar(foo(500L), 500L, 500L)
>     .Call("Csetlistelt", ll, i, bla)
> }
>
> And here are the timings:
>
> ## data.table v1.9.3 commit 1267's rbindlist
> ## Timings of three consecutive runs:
> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
> user system elapsed
> 10.909 0.449 11.843
>
> user system elapsed
> 5.219 0.386 5.640
>
> user system elapsed
> 5.355 0.429 5.898
>
> ## dplyr's rbind_all
> ## Timings for three consecutive runs
> system.time(ans2 <- rbind_all(ll))
> user system elapsed
> 62.769 0.247 63.941
>
> user system elapsed
> 62.010 0.335 65.876
>
> user system elapsed
> 55.345 0.359 60.193
>
>> identical(ans1, setDT(ans2)) # [1] TRUE
>
> ## data.table v1.9.2's rbind version:
> ## ran only once, as it took a bit longer.
> system.time(ans1 <- do.call("rbind", ll))
> user system elapsed
> 125.356 2.247 139.000
>
>> identical(ans1, setDT(ans2)) # [1] TRUE
>
> In summary, the newer implementation is ~11–23x faster than
> data.table’s older implementation and ~5.5–10x faster than dplyr’s
> rbind_all on this (relatively huge) data.
>
> Arun
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> Date: May 20, 2014 at 9:27:56 PM
> To: datatable-help at lists.r-forge.r-project.org
> datatable-help at lists.r-forge.r-project.org
> Subject: FR #5249 - rbindlist gains use.names and fill arguments
>
> Hello everyone,
>
> With the latest commit #1266, the extra functionality offered via rbind
> (use.names and fill) is now also available in rbindlist. In addition, the
> implementation has been moved entirely to C and is therefore tremendously
> fast, especially for cases where one has to bind with use.names=TRUE and/or
> fill=TRUE. I’ll try to put out a benchmark comparing speed differences
> with the older implementation ASAP.
>
> Note that this change comes at a very low cost to the default speed of
> rbindlist, i.e. with use.names=FALSE and fill=FALSE. As an example, binding
> 10,000 data.tables with 20 columns each took 0.107 seconds with the new
> version, whereas the older version took 0.095 seconds.
>
> In addition, the documentation for ?rbindlist has also been improved (#5158
> from Alexander). Here’s the change log from NEWS:
>
> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now
> implemented entirely in C. Closes #5249
> -> use.names by default is FALSE for backwards compatibility
> (doesn't bind by names by default)
> -> rbind(...) now just calls rbindlist() internally, except that
> 'use.names' is TRUE by default,
> for compatibility with base (and backwards compatibility).
> -> fill by default is FALSE. If fill is TRUE, use.names has to be
> TRUE.
> -> At least one item of the input list has to have non-null column
> names.
> -> Duplicate columns are bound in the order of occurrence, like
> base.
> -> Attributes that might exist in individual items would be lost in
> the bound result.
> -> Columns are coerced to the highest SEXPTYPE, if they are
> different, if/when possible.
> -> And incredibly fast ;).
> -> Documentation updated in much detail. Closes DR #5158.
> Eddi's (excellent) work on finding factor levels, type coercion of
> columns etc. are all retained.
>
> Please try it and write back if things aren’t working as they did before. The
> tests that had to be fixed cover extremely rare cases. I suspect there should
> be minimal issues, if any, in this version. However, I do find that the
> changes here bring consistency to the function.
>
> One (very rare) feature that is not available due to this implementation is
> the ability to recycle.
>
> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
> lst1 <- list(x=4, y=5, z=as.list(1:3))
>
> rbind(dt1, lst1)
> # x y z
> # 1: 1 4 1,2
> # 2: 2 5 1,2,3
> # 3: 3 6 1,2,3,4
> # 4: 4 5 1
> # 5: 4 5 2
> # 6: 4 5 3
>
> The 4 and 5 are recycled very nicely here. This is not possible at the
> moment, because the earlier rbind implementation used as.data.table to
> convert to data.table, which takes a copy (very inefficient on huge /
> many tables). I’d love to add this feature in C as well, as it would help
> incredibly for use within [.data.table (now that we can fill columns and
> bind by names faster). Will add a FR.
>
> In summary, I think there should be minimal issues, if any, and it should
> be much faster (for rbind cases). Please write back with what you think, if
> you happen to try it out.
>
>
>
> Arun
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com