<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><p>I’ve filed <a href="https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5690&group_id=240&atid=978">FR #5690</a> to remind myself of the recycling feature; that’d be awesome to have. </p>
<p>One feature I forgot to point out in the previous post: even when there are duplicate column names, <code>rbind/rbindlist</code> binds them consistently with base R when <code>use.names=TRUE</code>. Duplicate columns are also filled properly (in the order of occurrence) when <code>fill=TRUE</code>.</p>
<p>Okay, on to the benchmarks. I took a set of 10,000 data.tables, each with columns <code>V1</code> to <code>V500</code> in random order (all integers for simplicity). Only <code>use.names=TRUE</code> is needed here; <code>fill</code> is not, as every column is present in every data.table.</p>
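<p>A minimal sketch of this behaviour, assuming a data.table version that has these arguments (the tables <code>d1</code> and <code>d2</code> are made up for illustration): both inputs carry the column name <code>a</code> twice, and the duplicates are matched in order of occurrence.</p>
<pre><code>library(data.table)
## hypothetical inputs: both tables have the column name "a" twice
d1 = data.table(a = 1:2, a = 3:4, b = 5:6)
d2 = data.table(b = 7:8, a = 9:10, a = 11:12)
## bind by name; the first "a" of d2 goes with the first "a" of d1, and so on
ans = rbindlist(list(d1, d2), use.names = TRUE)
names(ans)
</code></pre>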
<p>I think this data is big enough to illustrate the point. I was also curious to see a comparison against <code>dplyr</code>’s <code>rbind_all</code> (commit 1504, devel version), so I’ve added it to the benchmarks as well.</p>
<p>Here’s the data generation. Note that this step takes a while to finish.</p>
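<pre><code>require(data.table) ## 1.9.3 commit 1267
require(dplyr) ## commit 1504 devel
set.seed(1L)
## build a data.table with k integer columns (default names V1..Vk), 10 rows each
foo <- function(k) {
    ans = setDT(lapply(1:k, function(x) sample(10)))
}
## rename the columns of 'ans' (by reference) to n names drawn at random from V1..Vk
bar <- function(ans, k, n) {
    bla = sample(paste0("V", 1:k), n)
    setnames(ans, bla)
}
n = 10000L
ll = vector("list", n)
for (i in 1:n) {
    bla = bar(foo(500L), 500L, 500L)
    ## internal data.table C routine, assumed here to set ll[[i]] in place;
    ## ll[[i]] = bla is the portable form
    .Call("Csetlistelt", ll, i, bla)
}
</code></pre>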
<p>And here are the timings:</p>
<pre><code>## data.table v1.9.3 commit 1267's rbindlist
## Timings of three consecutive runs:
system.time(ans1 <- rbindlist(ll, use.names=TRUE, fill=FALSE))
   user  system elapsed
 10.909   0.449  11.843
   user  system elapsed
  5.219   0.386   5.640
   user  system elapsed
  5.355   0.429   5.898

## dplyr's rbind_all
## Timings of three consecutive runs:
system.time(ans2 <- rbind_all(ll))
   user  system elapsed
 62.769   0.247  63.941
   user  system elapsed
 62.010   0.335  65.876
   user  system elapsed
 55.345   0.359  60.193
> identical(ans1, setDT(ans2)) # [1] TRUE

## data.table v1.9.2's rbind version:
## ran only once, as it takes much longer.
system.time(ans1 <- do.call("rbind", ll))
   user  system elapsed
125.356   2.247 139.000
> identical(ans1, setDT(ans2)) # [1] TRUE
</code></pre>
<p>In summary, on this (relatively huge) data the newer implementation is roughly 11–23x faster than <code>data.table</code>’s older implementation and roughly 5.5–10x faster than <code>dplyr</code>.</p>
<p><style>body{font-family:Helvetica,Arial;font-size:13px}</style><style>body {
font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
padding:1em;
margin:auto;
background:#fefefe;
}
h1, h2, h3, h4, h5, h6 {
font-weight: bold;
}
h1 {
color: #000000;
font-size: 28pt;
}
h2 {
border-bottom: 1px solid #CCCCCC;
color: #000000;
font-size: 24px;
}
h3 {
font-size: 18px;
}
h4 {
font-size: 16px;
}
h5 {
font-size: 14px;
}
h6 {
color: #777777;
background-color: inherit;
font-size: 14px;
}
hr {
height: 0.2em;
border: 0;
color: #CCCCCC;
background-color: #CCCCCC;
}
p, blockquote, ul, ol, dl, li, table, pre {
margin: 15px 0;
}
a, a:visited {
color: #4183C4;
background-color: inherit;
text-decoration: none;
}
#message {
border-radius: 6px;
border: 1px solid #ccc;
display:block;
width:100%;
height:60px;
margin:6px 0px;
}
button, #ws {
font-size: 12pt;
padding: 4px 6px;
border-radius: 5px;
border: 1px solid #bbb;
background-color: #eee;
}
code, pre, #ws, #message {
font-family: Monaco;
font-size: 10pt;
border-radius: 3px;
background-color: #F8F8F8;
color: inherit;
}
code {
border: 1px solid #EAEAEA;
margin: 0 2px;
padding: 0 5px;
}
pre {
border: 1px solid #CCCCCC;
overflow: auto;
padding: 4px 8px;
}
pre > code {
border: 0;
margin: 0;
padding: 0;
}
#ws { background-color: #f8f8f8; }
table {
border-collapse: collapse;
font-family: Helvetica, arial, freesans, clean, sans-serif;
color: rgb(51, 51, 51);
font-size: 15px; line-height: 25px;
padding: 0; }
table tr {
border-top: 1px solid #cccccc;
background-color: white;
margin: 0;
padding: 0; }
table tr:nth-child(2n) {
background-color: #f8f8f8; }
table tr th {
font-weight: bold;
border: 1px solid #cccccc;
margin: 0;
padding: 6px 13px; }
table tr td {
border: 1px solid #cccccc;
margin: 0;
padding: 6px 13px; }
table tr th :first-child, table tr td :first-child {
margin-top: 0; }
table tr th :last-child, table tr td :last-child {
margin-bottom: 0; }
.send { color:#77bb77; }
.server { color:#7799bb; }
.error { color:#AA0000; }</style></p><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><span style="font-family: helvetica, arial; ">Arun</span></div> <div style="color:black"><br>From: <span style="color:black">Arunkumar Srinivasan</span> <a href="mailto:aragorn168b@gmail.com">aragorn168b@gmail.com</a><br>Reply: <span style="color:black">Arunkumar Srinivasan</span> <a href="mailto:aragorn168b@gmail.com">aragorn168b@gmail.com</a><br>Date: <span style="color:black">May 20, 2014 at 9:27:56 PM</span><br>To: <span style="color:black">datatable-help@lists.r-forge.r-project.org</span> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>Subject: <span style="color:black"> FR #5249 - rbindlist gains use.names and fill arguments <br></span></div><br> <blockquote type="cite" class="clean_bq"><span><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div></div><div>
<p>Hello everyone,</p>
<p>With the latest commit #1266, the extra functionality offered
via <code>rbind</code> (<code>use.names</code> and
<code>fill</code>) is now also available in <code>rbindlist</code>.
In addition, the implementation has been moved entirely to C and is
therefore tremendously fast, especially for cases where one has to
bind with <code>use.names=TRUE</code> and/or
<code>fill=TRUE</code>. I’ll try to put out a benchmark comparing
speed differences with the older implementation ASAP.</p>
<p>Note that this change comes at a <em>very</em> low cost to the
default speed of <code>rbindlist</code>, i.e. with
<code>use.names=FALSE</code> and <code>fill=FALSE</code>. As an
example, binding 10,000 data.tables with 20 columns each took
0.107 seconds with the new version, whereas the older
version took 0.095 seconds.</p>
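<p>A rough way to time the default path on similar input (a sketch only: the list <code>ll20</code> below is made up, as the exact tables behind the numbers above are not shown, and timings will vary by machine and version):</p>
<pre><code>library(data.table)
set.seed(1L)
## hypothetical input: 10,000 data.tables with 20 integer columns (V1..V20) each
ll20 = lapply(1:10000, function(i) setDT(lapply(1:20, function(j) sample(10))))
## default path: bind by position, no filling
tm = system.time({ ans20 = rbindlist(ll20, use.names = FALSE, fill = FALSE) })
print(tm)
dim(ans20)
</code></pre>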
<p>In addition, the documentation for <code>?rbindlist</code> has
also been improved (#5158 from Alexander). Here’s the change log
from NEWS:</p>
<pre><code>  o  'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249
      -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default)
      -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default,
         for compatibility with base (and backwards compatibility).
      -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE.
      -> At least one item of the input list has to have non-null column names.
      -> Duplicate columns are bound in the order of occurrence, like base.
      -> Attributes that might exist in individual items would be lost in the bound result.
      -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible.
      -> And incredibly fast ;).
      -> Documentation updated in much detail. Closes DR #5158.
     Eddi's (excellent) work on finding factor levels, type coercion of columns etc. are all retained.
</code>
</pre>
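<p>To make the first few items concrete, here is a small sketch (the tables <code>a1</code>, <code>a2</code> and <code>a3</code> are hypothetical, and the arguments are spelled out rather than relying on defaults):</p>
<pre><code>library(data.table)
a1 = data.table(x = 1:2, y = c("a", "b"))
a2 = data.table(y = c("c", "d"), x = 3:4)   ## same columns, different order
a3 = data.table(x = 5L, z = TRUE)           ## "y" missing, extra column "z"
## bind by name rather than by position
byname = rbindlist(list(a1, a2), use.names = TRUE)
## fill=TRUE needs use.names=TRUE; columns missing from an item are filled with NA
filled = rbindlist(list(a1, a3), use.names = TRUE, fill = TRUE)
</code></pre>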
<p>Please try it and write back if things aren’t working as they did
before. The tests that had to be fixed cover extremely rare cases, so I
expect there to be minimal issues, if any, in this version.
In any case, I find that the changes bring consistency to the
function.</p>
<p>One (very rare) feature that is no longer available with this
implementation is the ability to <em>recycle</em>, as in this example from the earlier <code>rbind</code>:</p>
<pre><code>dt1 <- data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 <- list(x=4, y=5, z=as.list(1:3))
rbind(dt1, lst1)
# x y z
# 1: 1 4 1,2
# 2: 2 5 1,2,3
# 3: 3 6 1,2,3,4
# 4: 4 5 1
# 5: 4 5 2
# 6: 4 5 3
</code>
</pre>
<p>The 4 and 5 are recycled very nicely here. This is not possible at
the moment: the earlier <code>rbind</code>
implementation got recycling from <code>as.data.table</code>, which it used to convert each item to a
data.table, but that takes a copy (very inefficient on huge /
many tables). I’d love to add this feature in C as well, as it
would help incredibly for use within <code>[.data.table</code> (now
that we can fill columns and bind by names faster). Will add a
FR.</p>
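<p>Until then, one possible workaround (a sketch, assuming <code>as.data.table</code> recycles length-1 list items as in the earlier <code>rbind</code>, and accepting the copy it makes) is to convert the list explicitly before binding:</p>
<pre><code>library(data.table)
dt1 = data.table(x=1:3, y=4:6, z=list(1:2, 1:3, 1:4))
lst1 = list(x=4, y=5, z=as.list(1:3))
## as.data.table() recycles x and y to the length of z, at the cost of a copy
dt2 = as.data.table(lst1)
ans = rbindlist(list(dt1, dt2), use.names=TRUE)
</code></pre>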
<p>In summary, I think there should be minimal issues, if any, and
things should be much faster (for <code>rbind</code> cases). Please write
back with what you think if you happen to try it out.</p>
<div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">
<br></div>
<br>
<div id="bloop_sign_1400611553116250112" class="bloop_sign">
<div style="font-family:helvetica,arial;font-size:13px">Arun</div>
</div>
</div></div></span></blockquote><p></p></body></html>