<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><p>Nathaniel, Thanks.</p>

<pre><code>First, I use rbindlist pretty often, and I've been quite happy with it.  The new  use.names and fill features definitely scratch an itch for me; I wound up using rbind_all from dplyr (which worked well, I'm not complaining), but I'm looking forward to having a data.table implementation.  

</code></pre>

<p>A <code>data.table</code> implementation (in <code>rbind</code>) has been available since the last release (v1.9.0/2). The new <code>rbindlist</code> features just build on it.</p>
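<p>As a rough illustration (a minimal sketch, not taken from the release notes): the tables <code>DT1</code> and <code>DT2</code> below are made up, and the <code>rbindlist</code> call assumes a version that already has the new <code>use.names</code> and <code>fill</code> arguments.</p>

<pre><code>library(data.table)

# Two made-up tables that share only column B
DT1 = data.table(A = 1:2, B = c("a", "b"))
DT2 = data.table(B = "c", C = TRUE)

# rbind matches columns by name; fill = TRUE pads the missing ones with NA
res = rbind(DT1, DT2, fill = TRUE)

# The same operation on a list of tables (assumes the newer rbindlist arguments)
res_list = rbindlist(list(DT1, DT2), use.names = TRUE, fill = TRUE)
</code></pre>

<p>Both calls should give three rows with columns <code>A</code>, <code>B</code> and <code>C</code>.</p>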

<div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div> <div id="bloop_sign_1400669930351984128" class="bloop_sign"><div style="font-family:helvetica,arial;font-size:13px">Arun</div></div> <div style="color:black"><br>From: <span style="color:black">Nathaniel Graham</span> <a href="mailto:npgraham1@gmail.com">npgraham1@gmail.com</a><br>Reply: <span style="color:black">Nathaniel Graham</span> <a href="mailto:npgraham1@gmail.com">npgraham1@gmail.com</a><br>Date: <span style="color:black">May 21, 2014 at 2:20:44 AM</span><br>To: <span style="color:black">data.table source forge</span> <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>Subject: <span style="color:black"> [datatable-help] rbindlist and unique <br></span></div><br> <blockquote type="cite" class="clean_bq"><span><div><div></div><div>


<div dir="ltr">First, I use rbindlist pretty often, and I've been

quite happy with it.  The new use.names and fill features

definitely scratch an itch for me; I wound up using rbind_all from

dplyr (which worked well, I'm not complaining), but I'm looking

forward to having a data.table implementation.  The speed

increase is also welcome.  So thank you for the new features!

 I don't personally have a preference with respect to the

use.names and fill defaults, so whatever you guys decide will be

fine with me.

<div><br></div>

<div>I do have a question regarding unique, which I use very, very

frequently, and often after rbindlist.  I have a fairly large

data set (tens of millions of raw observations), many of which are

duplicates.  The observations come from a variety of sources,

but the formats and variable names are (nearly) identical.</div>

<div><br></div>

<div>The problem is that many "duplicates" aren't perfect

duplicates, and some rows have more information than others.

 A simple example might look like this:</div>

<div><br></div>

<div>

<div>&gt; foo</div>

<div>   V1 V2   V3</div>

<div>1:  1  3 TRUE</div>

<div>2:  1  4 TRUE</div>

<div>3:  2  3   NA</div>

<div>4:  2  4 TRUE</div>

<div>5:  1  3 TRUE</div>

<div>6:  1  4   NA</div>

<div>7:  2  3 TRUE</div>

<div>8:  2  4 TRUE</div>

<div>9:  3  1   NA</div>

<div>&gt; unique(foo, by = c("V1", "V2"))</div>

<div>   V1 V2   V3</div>

<div>1:  1  3 TRUE</div>

<div>2:  1  4 TRUE</div>

<div>3:  2  3   NA</div>

<div>4:  2  4 TRUE</div>

<div>5:  3  1   NA</div>

<div><br></div>

<div><br></div>

<div>Sometimes V3 is present and sometimes it isn't.  V1 and

V2 (in my story) uniquely identify an observation, but if there's a

row where I also have V3, I'd prefer to have that row rather than a

row where it's missing.  You can see that a naive use of

unique here gets me the less-preferable 2,3 row.  If I only

had three columns, this would be easy to solve (sort/setkey first

would do it).  However, I have more than a dozen additional

columns, and when I drop duplicates I want to retain the row with

the greatest number of non-missing values.  Additionally, some

columns are more important than others.  If (to refer again to

the example above), there are no rows that have V3 for a given V1

&amp; V2 (like 3,1), I still need to retain a row, so I can't just

condition on !is.na(V3).</div>

<div><br></div>

<div>Does anybody have any insight or techniques for this sort of

thing?  I'm currently sorting on all columns prior to unique,

but I'm quite sure that this loses some information.</div>
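<div><br></div>

<div>One possible approach to the question above, as a minimal sketch only: count the non-missing values in each row, sort so the most complete row comes first within each V1/V2 pair, then call unique. The helper column <code>n_obs</code> is a made-up name, and the toy table just mirrors the example above.</div>

<pre><code>library(data.table)

# Toy data mirroring the example above
foo = data.table(V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
                 V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
                 V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA))

# Hypothetical helper column: number of non-missing values in each row
foo[, n_obs := rowSums(!is.na(.SD))]

# Within each (V1, V2) pair, put the most complete row first
foo = foo[order(V1, V2, -n_obs)]

# unique() keeps the first row of each group, i.e. the most complete one
res = unique(foo, by = c("V1", "V2"))

# Drop the helper column again
res[, n_obs := NULL]
</code></pre>

<div>If some columns matter more than others, one option would be to add terms such as <code>is.na(V3)</code> ahead of <code>-n_obs</code> in the sort, so rows with that column present are preferred. A pair like 3,1 that never has V3 still keeps one row.</div>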

<div><br></div>

<div><br></div>

<div>

<div>

<div dir="ltr">-------<br>

Nathaniel Graham<br>

<a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br>

<a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a>

<div><a href="https://sites.google.com/site/npgraham1/" target="_blank">https://sites.google.com/site/npgraham1/</a><br></div>

</div>

</div>

</div>

</div>

</div>

_______________________________________________

<br>datatable-help mailing list

<br>datatable-help@lists.r-forge.r-project.org

<br>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</div></div></span></blockquote><p></p></body></html>