[datatable-help] rbindlist and unique

Wed May 21 02:34:10 CEST 2014

On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham <npgraham1 at gmail.com> wrote:
> First, I use rbindlist pretty often, and I've been quite happy with it.  The
> new use.names and fill features definitely scratch an itch for me; I wound
> up using rbind_all from dplyr (which worked well, I'm not complaining), but
> I'm looking forward to having a data.table implementation.  The speed
> increase is also welcome.  So thank you for the new features!  I don't
> personally have a preference with respect to the use.names and fill
> defaults, so whatever you guys decide will be fine with me.
>
> I do have a question regarding unique, which I use very, very frequently,
> and often after rbindlist.  I have a fairly large data set (tens of millions
> of raw observations), many of which are duplicates.  The observations come
> from a variety of sources, but the formats and variable names are (nearly)
> identical.
>
> The problem is that many "duplicates" aren't perfect duplicates, and some
> rows have more information than others.  A simple example might look like
> this:
>
>> foo
>    V1 V2   V3
> 1:  1  3 TRUE
> 2:  1  4 TRUE
> 3:  2  3   NA
> 4:  2  4 TRUE
> 5:  1  3 TRUE
> 6:  1  4   NA
> 7:  2  3 TRUE
> 8:  2  4 TRUE
> 9:  3  1   NA
>> unique(foo, by = c("V1", "V2"))
>    V1 V2   V3
> 1:  1  3 TRUE
> 2:  1  4 TRUE
> 3:  2  3   NA
> 4:  2  4 TRUE
> 5:  3  1   NA
>
>
> Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story)
> uniquely identify an observation, but if there's a row where I also have V3,
> I'd prefer to have that row rather than a row where it's missing.  You can
> see that a naive use of unique here gets me the less-preferable 2,3 row.  If
> I only had three columns, this would be easy to solve (sort/setkey first
> would do it).  However, I have more than a dozen additional columns, and
> when I drop duplicates I want to retain the row with the greatest number of
> non-missing values.  Additionally, some columns are more important than
> others.  If (to refer again to the example above), there are no rows that
> have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so I
> can't just condition on !is.na(V3).
>
> Does anybody have any insight or techniques for this sort of thing?  I'm
> currently sorting on all columns prior to unique, but I'm quite sure that
> this loses some information.

Append an importance column which ranks the importance of that row
(lower better) and make importance the low order component of the key.

DT[, importance := 0+is.na(V3)]
setkey(DT, V1, V2, importance)
unique(DT, by = c("V1", "V2"))

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com