<div dir="ltr">Thanks!  That's a good idea, and a lot simpler than what I was concocting in my head.  I'll give that a try.  I think--just for posterity--you mean<div><br></div><div>DT[, importance := 0 - is.na(V3)]</div>

<div><br></div><div>rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.</div></div><div class="gmail_extra"><br clear="all"><div><div dir="ltr">-------<br>Nathaniel Graham<br>

<a href="mailto:npgraham1@gmail.com" target="_blank">npgraham1@gmail.com</a><br><a href="mailto:npgraham1@uky.edu" target="_blank">npgraham1@uky.edu</a><div><a href="https://sites.google.com/site/npgraham1/" target="_blank">https://sites.google.com/site/npgraham1/</a><br>

</div></div></div>

<br><br><div class="gmail_quote">On Tue, May 20, 2014 at 8:34 PM, Gabor Grothendieck <span dir="ltr"><<a href="mailto:ggrothendieck@gmail.com" target="_blank">ggrothendieck@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5">On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham <<a href="mailto:npgraham1@gmail.com">npgraham1@gmail.com</a>> wrote:<br>

> First, I use rbindlist pretty often, and I've been quite happy with it.  The<br>

> new use.names and fill features definitely scratch an itch for me; I wound<br>

> up using rbind_all from dplyr (which worked well, I'm not complaining), but<br>

> I'm looking forward to having a data.table implementation.  The speed<br>

> increase is also welcome.  So thank you for the new features!  I don't<br>

> personally have a preference with respect to the use.names and fill<br>

> defaults, so whatever you guys decide will be fine with me.<br>

><br>

> I do have a question regarding unique, which I use very, very frequently,<br>

> and often after rbindlist.  I have a fairly large data set (tens of millions<br>

> of raw observations), many of which are duplicates.  The observations come<br>

> from a variety of sources, but the formats and variable names are (nearly)<br>

> identical.<br>

><br>

> The problem is that many "duplicates" aren't perfect duplicates, and some<br>

> rows have more information than others.  A simple example might look like<br>

> this:<br>

><br>

>> foo<br>

>    V1 V2   V3<br>

> 1:  1  3 TRUE<br>

> 2:  1  4 TRUE<br>

> 3:  2  3   NA<br>

> 4:  2  4 TRUE<br>

> 5:  1  3 TRUE<br>

> 6:  1  4   NA<br>

> 7:  2  3 TRUE<br>

> 8:  2  4 TRUE<br>

> 9:  3  1   NA<br>

>> unique(foo, by = c("V1", "V2"))<br>

>    V1 V2   V3<br>

> 1:  1  3 TRUE<br>

> 2:  1  4 TRUE<br>

> 3:  2  3   NA<br>

> 4:  2  4 TRUE<br>

> 5:  3  1   NA<br>

><br>

><br>

> Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story)<br>

> uniquely identify an observation, but if there's a row where I also have V3,<br>

> I'd prefer to have that row rather than a row where it's missing.  You can<br>

> see that a naive use of unique here gets me the less-preferable 2,3 row.  If<br>

> I only had three columns, this would be easy to solve (sort/setkey first<br>

> would do it).  However, I have more than a dozen additional columns, and<br>

> when I drop duplicates I want to retain the row with the greatest number of<br>

> non-missing values.  Additionally, some columns are more important than<br>

> others.  If (to refer again to the example above) there are no rows that<br>

> have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so I<br>

> can't just condition on !is.na(V3).<br>

><br>

> Does anybody have any insight or techniques for this sort of thing?  I'm<br>

> currently sorting on all columns prior to unique, but I'm quite sure that<br>

> this loses some information.<br>

<br>
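The rule described above (keep the row with the most non-missing values, with some columns counting more than others) can be sketched as a weighted penalty per row. This is only an illustration under stated assumptions: the data, the column names V4 and V5, and the weights are all hypothetical and not taken from the real data set.<br>

```r
library(data.table)

# Hypothetical data: V1 and V2 identify an observation; V3..V5 may be missing.
DT <- data.table(
  V1 = c(1, 1, 1, 2, 2),
  V2 = c(3, 3, 3, 4, 4),
  V3 = c(NA, TRUE, TRUE, NA, NA),
  V4 = c("a", NA, "a", NA, "b"),
  V5 = c(0.5, NA, NA, NA, NA)
)

# Hypothetical weights: a missing V3 costs more than a missing V4 or V5.
weights <- c(V3 = 4, V4 = 1, V5 = 1)

# Penalty = weighted count of missing values in each row (lower is better).
pen <- Reduce(`+`, Map(function(col, w) w * is.na(DT[[col]]),
                       names(weights), weights))
DT[, penalty := pen]

# Sort by identifier and then by penalty; unique() keeps the first row
# of each (V1, V2) group, which is the row with the smallest penalty.
setorder(DT, V1, V2, penalty)
best <- unique(DT, by = c("V1", "V2"))
print(best)
```

Every (V1, V2) pair still yields a row, even when every candidate is missing the heavily weighted column. Ties on the penalty fall back to whichever tied row comes first after sorting.<br>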

</div></div>Append an importance column that ranks each row (lower is<br>

better) and make it the lowest-order component of the key, so the<br>

preferred row sorts first within each (V1, V2) group and unique() keeps it.<br>

<br>

DT[, importance := 0 + is.na(V3)]<br>

setkey(DT, V1, V2, importance)<br>

unique(DT, by = c("V1", "V2"))<br>
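<br>
As a quick check, here are the three lines above applied to the foo example from the question. The table is retyped by hand, so treat the column types as an assumption; the rest is a minimal sketch rather than a tested recipe for the full data set.<br>

```r
library(data.table)

# Example data retyped from the question (column types are an assumption).
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

# importance is 0 when V3 is present and 1 when it is missing.
foo[, importance := 0 + is.na(V3)]

# setkey() sorts ascending, so within each (V1, V2) group the rows
# that have V3 come before the rows that lack it.
setkey(foo, V1, V2, importance)

# unique() keeps the first row of each (V1, V2) group.
best <- unique(foo, by = c("V1", "V2"))
best[, importance := NULL]
print(best)
```

With this data the (2, 3) group keeps its V3 = TRUE row, and the (3, 1) group, which has no row with V3, still keeps its single NA row.<br>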

<span class="HOEnZb"><font color="#888888"><br>

<br>

<br>

--<br>

Statistics & Software Consulting<br>

GKX Group, GKX Associates Inc.<br>

tel: 1-877-GKX-GROUP<br>

email: ggrothendieck at <a href="http://gmail.com" target="_blank">gmail.com</a><br>

</font></span></blockquote></div><br></div>