<html><head><style>body{font-family:Helvetica,Arial;font-size:13px}</style></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">In the current CRAN version:</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">rbindlist corresponds to use.names=FALSE and fill=FALSE</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">rbind corresponds to use.names=TRUE and fill=FALSE</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">Just to be clear, again: are you suggesting that I change *just* rbindlist's defaults to use.names=fill and fill=FALSE, or the defaults of both rbindlist and rbind?</div> <div id="bloop_sign_1400620509831427072" class="bloop_sign"><div style="font-family:helvetica,arial;font-size:13px">Arun</div></div> <div style="color:black"><br>From: <span style="color:black">Gabor Grothendieck</span> <a href="mailto:ggrothendieck@gmail.com">ggrothendieck@gmail.com</a><br>Reply: <span style="color:black">Gabor Grothendieck</span> <a href="mailto:ggrothendieck@gmail.com">ggrothendieck@gmail.com</a><br>Date: <span style="color:black">May 20, 2014 at 11:14:15 PM</span><br>To: <span style="color:black">Arunkumar Srinivasan</span> <a href="mailto:aragorn168b@gmail.com">aragorn168b@gmail.com</a><br>Cc: <span style="color:black">datatable-help@lists.r-forge.r-project.org</span> 
<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>Subject: <span style="color:black"> Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill arguments <br></span></div><br> <blockquote type="cite" class="clean_bq"><span><div><div></div><div>Yes.  That is what I intended.

<br>

<br>rbindlist on CRAN currently has no fill or use.names arguments.  What

<br>combo of the new fill and use.names does the current CRAN rbindlist

<br>correspond to?

<br>

<br>

<br>

<br>On Tue, May 20, 2014 at 5:01 PM, Arunkumar Srinivasan

<br>&lt;aragorn168b@gmail.com&gt; wrote:

<br>> I think I understand now what you’re trying to say. Going back to an earlier

<br>> post, you wrote:

<br>>

<br>> Then why not make the default of `use.names` be `fill`. Then you don't get

<br>> the warning and you can tell just from the argument list what the

<br>> dependencies are.

<br>>

<br>> You mean to basically do?

<br>>

<br>> rbindlist <- function(l, use.names=fill, fill=FALSE)

<br>> .rbind.data.table <- function(..., use.names=fill, fill=TRUE/FALSE)

<br>>

<br>> Is this what you mean? If so, the defaults from the previous versions will

<br>> be changed. The ones who use rbind directly without setting use.names will

<br>> have different results (assuming I understand you correctly this time).

<br>>

<br>>

<br>> Arun

<br>>

<br>> From: Gabor Grothendieck ggrothendieck@gmail.com

<br>> Reply: Gabor Grothendieck ggrothendieck@gmail.com

<br>> Date: May 20, 2014 at 10:49:54 PM

<br>>

<br>> To: Arunkumar Srinivasan aragorn168b@gmail.com

<br>> Cc: datatable-help@lists.r-forge.r-project.org

<br>> datatable-help@lists.r-forge.r-project.org

<br>> Subject:  Re: [datatable-help] FR #5249 - rbindlist gains use.names and fill

<br>> arguments

<br>>

<br>> If I understand this right then the table below shows the valid

<br>> logical combinations in order of speed (slowest first). Is that

<br>> right? If so then if fill = FALSE and use.names = fill then we get

<br>> the fastest case by default.

<br>>

<br>> Furthermore if you were concerned that we might be T/T when F/T would

<br>> be sufficient I don't think that is likely since getting F/T is done

<br>> by setting use.names = TRUE.

<br>>

<br>> fill/use.names

<br>> T/T (slowest)

<br>> F/T

<br>> F/F (fastest)

<br>>

<br>>

<br>> On Tue, May 20, 2014 at 4:28 PM, Arunkumar Srinivasan

<br>> &lt;aragorn168b@gmail.com&gt; wrote:

<br>>> I’ve filed FR #5690 to remind myself of the recycling feature; that’d be

<br>>> awesome to have.

<br>>>

<br>>> One feature I forgot to point out in the previous post is that, even when

<br>>> there are duplicate names, rbind/rbindlist binds them consistent with

<br>>> ‘base’

<br>>> when use.names=TRUE. And it fills the duplicate columns properly (in the

<br>>> order of occurrence) also when fill=TRUE.

<br>>>

<br>>> Okay, on to benchmarks. I took a set of 10,000 data.tables, each with

<br>>> columns ranging from V1 to V500 in random order (all integers for

<br>>> simplicity). We’ll need to just use use.names=TRUE (as all columns are

<br>>> available in all data.tables).

<br>>>

<br>>> I think this data is big enough to illustrate the point. Also, I was

<br>>> curious

<br>>> to see a comparison against dplyr’s rbind_all (commit 1504 devel version).

<br>>> So, I’ve added it as well to the benchmarks.

<br>>>

<br>>> Here’s the data generation. Note: It takes a while for this step to

<br>>> finish.

<br>>>

<br>>> require(data.table) ## 1.9.3 commit 1267

<br>>> require(dplyr) ## commit 1504 devel

<br>>> set.seed(1L)

<br>>> foo <- function(k) {

<br>>> ans = setDT(lapply(1:k, function(x) sample(10)))

<br>>> }

<br>>> bar <- function(ans, k, n) {

<br>>> bla = sample(paste0("V", 1:k), n)

<br>>> setnames(ans, bla)

<br>>> }

<br>>> n = 10000L

<br>>> ll = vector("list", n)

<br>>> for (i in 1:n) {

<br>>> bla = bar(foo(500L), 500L, 500L)

<br>>> .Call("Csetlistelt", ll, i, bla)

<br>>> }

<br>>>

<br>>> And here are the timings:

<br>>>

<br>>> ## data.table v1.9.3 commit 1267's rbindlist

<br>>> ## Timings of three consecutive runs:

<br>>> system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))

<br>>> user system elapsed

<br>>> 10.909 0.449 11.843

<br>>>

<br>>> user system elapsed

<br>>> 5.219 0.386 5.640

<br>>>

<br>>> user system elapsed

<br>>> 5.355 0.429 5.898

<br>>>

<br>>> ## dplyr's rbind_all

<br>>> ## Timings for three consecutive runs

<br>>> system.time(ans2 <- rbind_all(ll))

<br>>> user system elapsed

<br>>> 62.769 0.247 63.941

<br>>>

<br>>> user system elapsed

<br>>> 62.010 0.335 65.876

<br>>>

<br>>> user system elapsed

<br>>> 55.345 0.359 60.193

<br>>>

<br>>>> identical(ans1, setDT(ans2)) # [1] TRUE

<br>>>

<br>>> ## data.table v1.9.2's rbind version:

<br>>> ## ran only once, as it took much longer.

<br>>> system.time(ans1 <- do.call("rbind", ll))

<br>>> user system elapsed

<br>>> 125.356 2.247 139.000

<br>>>

<br>>>> identical(ans1, setDT(ans2)) # [1] TRUE

<br>>>

<br>>> In summary, the newer implementation is about 11–23x faster than

<br>>> data.table’s older implementation and about 5.5–10x faster than dplyr on

<br>>> this (relatively huge) data.

<br>>>

<br>>> Arun

<br>>>

<br>>> From: Arunkumar Srinivasan aragorn168b@gmail.com

<br>>> Reply: Arunkumar Srinivasan aragorn168b@gmail.com

<br>>> Date: May 20, 2014 at 9:27:56 PM

<br>>> To: datatable-help@lists.r-forge.r-project.org

<br>>> datatable-help@lists.r-forge.r-project.org

<br>>> Subject: FR #5249 - rbindlist gains use.names and fill arguments

<br>>>

<br>>> Hello everyone,

<br>>>

<br>>> With the latest commit #1266, the extra functionality offered via rbind

<br>>> (use.names and fill) is also now available to rbindlist. In addition, the

<br>>> implementation is completely moved to C, and is therefore tremendously

<br>>> fast,

<br>>> especially for cases where one has to bind with use.names=TRUE

<br>>> and/or

<br>>> with fill=TRUE. I’ll try to put out a benchmark comparing speed

<br>>> differences

<br>>> with the older implementation ASAP.

<br>>>

<br>>> Note that this change comes with a very low cost to the default speed of

<br>>> rbindlist - with use.names=FALSE and fill=FALSE. As an example, binding

<br>>> 10,000 data.tables with 20 columns each, resulted in the new version

<br>>> running

<br>>> in 0.107 seconds, whereas the older version ran in 0.095 seconds.

<br>>>

<br>>> In addition, the documentation for ?rbindlist has also been improved (#5158

<br>>> from Alexander). Here’s the change log from NEWS:

<br>>>

<br>>> o 'rbindlist' gains 'use.names' and 'fill' arguments and is now

<br>>> implemented entirely in C. Closes #5249

<br>>> -> use.names by default is FALSE for backwards compatibility

<br>>> (doesn't bind by names by default)

<br>>> -> rbind(...) now just calls rbindlist() internally, except that

<br>>> 'use.names' is TRUE by default,

<br>>> for compatibility with base (and backwards compatibility).

<br>>> -> fill by default is FALSE. If fill is TRUE, use.names has to be

<br>>> TRUE.

<br>>> -> At least one item of the input list has to have non-null column

<br>>> names.

<br>>> -> Duplicate columns are bound in the order of occurrence, like

<br>>> base.

<br>>> -> Attributes that might exist in individual items would be lost in

<br>>> the bound result.

<br>>> -> Columns are coerced to the highest SEXPTYPE, if they are

<br>>> different, if/when possible.

<br>>> -> And incredibly fast ;).

<br>>> -> Documentation updated in much detail. Closes DR #5158.

<br>>> Eddi's (excellent) work on finding factor levels, type coercion of

<br>>> columns etc. are all retained.

<br>>>

<br>>> Please try it and write back if things aren’t working as they did before.

<br>>> The

<br>>> tests that had to be fixed are extremely rare cases. I suspect there

<br>>> should

<br>>> be minimal issues, if any, in this version. However, I do find the

<br>>> changes

<br>>> here bring consistency to the function.

<br>>>

<br>>> One (very rare) feature that is not available due to this implementation

<br>>> is

<br>>> the ability to recycle.

<br>>>

<br>>> dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))

<br>>> lst1 <- list(x=4, y=5, z=as.list(1:3))

<br>>>

<br>>> rbind(dt1, lst1)

<br>>> # x y z

<br>>> # 1: 1 4 1,2

<br>>> # 2: 2 5 1,2,3

<br>>> # 3: 3 6 1,2,3,4

<br>>> # 4: 4 5 1

<br>>> # 5: 4 5 2

<br>>> # 6: 4 5 3

<br>>>

<br>>> The 4,5 are recycled very nicely here. This is not possible at the

<br>>> moment.

<br>>> This is because the earlier rbind implementation used as.data.table to

<br>>> convert to data.table, however it takes a copy (very inefficient on huge /

<br>>> many tables). I’d love to add this feature in C as well, as it would help

<br>>> incredibly for use within [.data.table (now that we can fill columns and

<br>>> bind by names faster). Will add a FR.

<br>>>

<br>>> In summary, I think there should be minimal issues, if any and should be

<br>>> much faster (for rbind cases). Please write back what you think, if you

<br>>> happen to try it out.

<br>>>

<br>>>

<br>>>

<br>>> Arun

<br>>>

<br>>>

<br>>

<br>>

<br>>

<br>> --

<br>> Statistics & Software Consulting

<br>> GKX Group, GKX Associates Inc.

<br>> tel: 1-877-GKX-GROUP

<br>> email: ggrothendieck at gmail.com

<br>

<br>

<br>

<br>--  

<br>Statistics & Software Consulting

<br>GKX Group, GKX Associates Inc.

<br>tel: 1-877-GKX-GROUP

<br>email: ggrothendieck at gmail.com

<br></div></div></span></blockquote></body></html>