[datatable-help] rbindlist and unique

Wed May 21 02:20:34 CEST 2014

First, I use rbindlist pretty often, and I've been quite happy with it.
The new use.names and fill features definitely scratch an itch for me; I
wound up using rbind_all from dplyr (which worked well, I'm not
complaining), but I'm looking forward to having a data.table
implementation.  The speed increase is also welcome.  So thank you for the
new features!  I don't personally have a preference with respect to the
use.names and fill defaults, so whatever you guys decide will be fine with
me.

I do have a question regarding unique, which I use very, very frequently,
and often after rbindlist.  I have a fairly large data set (tens of
millions of raw observations), many of which are duplicates.  The
observations come from a variety of sources, but the formats and variable
names are (nearly) identical.

The problem is that many "duplicates" aren't perfect duplicates, and some
rows have more information than others.  A simple example might look like
this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA

Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story)
uniquely identify an observation, but if there's a row where I also have
V3, I'd prefer to keep that row rather than one where it's missing.  You
can see that a naive use of unique here gets me the less preferable (2, 3)
row, the one where V3 is NA.  If I only had three columns, this would be
easy to solve (sort/setkey first would do it).  However, I have more than a
dozen additional columns, and when I drop duplicates I want to retain the
row with the greatest number of non-missing values.  Additionally, some
columns are more important than others.  If, to refer again to the example
above, there are no rows that have V3 for a given V1 and V2 (like 3, 1), I
still need to retain a row, so I can't just filter on !is.na(V3).
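
To make the requirement above concrete, here is a minimal sketch of one
possible approach: score every row by a weighted count of its non-missing
value columns, sort so the highest score comes first within each V1/V2 pair,
and only then call unique.  The weight vector w and the temporary score
column are assumptions introduced for illustration; with more columns, w
would get one entry per column, with larger weights for the more important
ones.

```r
library(data.table)

# The example table from above.
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

key_cols <- c("V1", "V2")
val_cols <- setdiff(names(foo), key_cols)

# Hypothetical per-column weights: a larger weight marks a more important column.
w <- c(V3 = 1)

# Weighted count of non-missing value columns, one number per row.
score <- Reduce(`+`, lapply(val_cols, function(col) {
  w[[col]] * (!is.na(foo[[col]]))
}))

# Work on a copy so foo itself is left untouched.
ranked <- copy(foo)[, score := score]

# Keys ascending, score descending: the most complete row leads each group.
setorderv(ranked, c(key_cols, "score"), order = c(1L, 1L, -1L))

# unique keeps the first row per key, which is now the best-scoring one.
best <- unique(ranked, by = key_cols)[, score := NULL]
print(best)
```

Because every row gets a score, a pair such as (3, 1) that never has V3
still keeps one row; it just keeps a row with a score of zero.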

Does anybody have any insight or techniques for this sort of thing?  I'm
currently sorting on all columns prior to unique, but I'm quite sure that
this loses some information.
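
On the information-loss worry, a different technique (not a variation on
sort-then-unique) is to collapse each V1/V2 group column by column, taking
the first non-missing value of every column.  This is only a sketch: the
helper first_non_na and the extra column V4 are hypothetical, and the
result can combine values from different rows, which may not be acceptable
if the sources disagree with each other.

```r
library(data.table)

# Small illustration; V4 is a hypothetical additional column.
foo <- data.table(
  V1 = c(1, 1, 2, 2),
  V2 = c(3, 3, 4, 4),
  V3 = c(TRUE, NA, NA, NA),
  V4 = c(NA, "a", NA, "b")
)

# First non-missing value of x, or an NA of the same type if all are missing.
first_non_na <- function(x) {
  i <- which(!is.na(x))
  if (length(i) > 0L) x[i[1L]] else x[1L]
}

# One row per V1/V2 pair, each column filled from whichever row has a value.
merged <- foo[, lapply(.SD, first_non_na), by = c("V1", "V2")]
print(merged)
```

Here the (1, 3) pair takes V3 from its first row and V4 from its second, so
nothing that was observed in either row is dropped.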

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/