<div dir="ltr">First, I use rbindlist pretty often, and I've been quite happy with it. The new use.names and fill features definitely scratch an itch for me; I wound up using rbind_all from dplyr (which worked well, I'm not complaining), but I'm looking forward to having a data.table implementation. The speed increase is also welcome. So thank you for the new features! I don't personally have a preference with respect to the use.names and fill defaults, so whatever you guys decide will be fine with me.<div>
<br></div><div>I do have a question regarding unique, which I use very, very frequently, and often after rbindlist. I have a fairly large data set (tens of millions of raw observations), many of which are duplicates. The observations come from a variety of sources, but the formats and variable names are (nearly) identical.</div>
<div><br></div><div>The problem is that many "duplicates" aren't perfect duplicates, and some rows have more information than others. A simple example might look like this:</div><div><br></div><div><div>> foo</div>
<div> V1 V2 V3</div><div>1: 1 3 TRUE</div><div>2: 1 4 TRUE</div><div>3: 2 3 NA</div><div>4: 2 4 TRUE</div><div>5: 1 3 TRUE</div><div>6: 1 4 NA</div><div>7: 2 3 TRUE</div><div>8: 2 4 TRUE</div><div>
9: 3 1 NA</div><div>> unique(foo, by = c("V1", "V2"))</div><div> V1 V2 V3</div><div>1: 1 3 TRUE</div><div>2: 1 4 TRUE</div><div>3: 2 3 NA</div><div>4: 2 4 TRUE</div><div>5: 3 1 NA</div>
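<div><br></div><div>In case anyone wants to reproduce this, here is a minimal sketch that rebuilds the toy table and the call above. The name first_seen is just something I made up for illustration, and I'm assuming a data.table version where unique accepts the by argument.</div>

```r
library(data.table)

# Rebuild the nine-row example table shown above
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

# unique() keeps the first row it encounters for each (V1, V2) pair,
# regardless of whether V3 is missing in that row
first_seen <- unique(foo, by = c("V1", "V2"))
print(first_seen)
```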
<div><br></div><div><br></div><div>Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my story) uniquely identify an observation, but if there's a row where I also have V3, I'd prefer to have that row rather than a row where it's missing. You can see that a naive use of unique here gets me the less-preferable 2,3 row. If I only had three columns, this would be easy to solve (sort/setkey first would do it). However, I have more than a dozen additional columns, and when I drop duplicates I want to retain the row with the greatest number of non-missing values. Additionally, some columns are more important than others. If, to refer again to the example above, there are no rows that have V3 for a given V1 &amp; V2 (like 3,1), I still need to retain a row, so I can't just condition on !is.na(V3).</div>
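<div><br></div><div>To make the goal concrete, here is a rough sketch of the behaviour I'm after on the toy table: score each row by a weighted count of its non-missing values, sort so the best-scoring row comes first within each V1/V2 group, and then call unique. The names col_weights, row_score, ranked and best are hypothetical, and the weights are made up for illustration. I'm not claiming this is the right or the fastest way to do it on tens of millions of rows.</div>

```r
library(data.table)

# Toy table from the example above
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

# Hypothetical weights: a larger weight means the column matters more.
# With a dozen extra columns, each one would get an entry here.
col_weights <- c(V3 = 1)

# Weighted count of non-missing values in each row
row_score <- numeric(nrow(foo))
for (col in names(col_weights)) {
  row_score <- row_score + col_weights[[col]] * as.numeric(!is.na(foo[[col]]))
}

# Work on a copy so foo itself is left untouched
ranked <- copy(foo)
ranked[, score := row_score]

# Highest score first within each (V1, V2) group
setorder(ranked, V1, V2, -score)

# unique() keeps the first row per group, which is now the best-scoring one
best <- unique(ranked, by = c("V1", "V2"))
best[, score := NULL]
print(best)
```

<div>On the toy table this keeps the 2,3 row that has V3, and it still retains the 3,1 row even though V3 is missing there.</div>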
<div><br></div><div>Does anybody have any insight or techniques for this sort of thing? I'm currently sorting on all columns prior to unique, but I'm quite sure that this loses some information.</div><div><br></div>
<div><br></div><div><div><div dir="ltr">-------<br>Nathaniel Graham<br><a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br><a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a><div>
<a href="https://sites.google.com/site/npgraham1/" target="_blank">https://sites.google.com/site/npgraham1/</a><br></div></div></div>
</div></div></div>