[datatable-help] rbindlist and unique
Gabor Grothendieck
ggrothendieck at gmail.com
Wed May 21 02:34:10 CEST 2014
On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham <npgraham1 at gmail.com> wrote:
> First, I use rbindlist pretty often, and I've been quite happy with it. The
> new use.names and fill features definitely scratch an itch for me; I wound
> up using rbind_all from dplyr (which worked well, I'm not complaining), but
> I'm looking forward to having a data.table implementation. The speed
> increase is also welcome. So thank you for the new features! I don't
> personally have a preference with respect to the use.names and fill
> defaults, so whatever you guys decide will be fine with me.
>
> I do have a question regarding unique, which I use very, very frequently,
> and often after rbindlist. I have a fairly large data set (tens of millions
> of raw observations), many of which are duplicates. The observations come
> from a variety of sources, but the formats and variable names are (nearly)
> identical.
>
> The problem is that many "duplicates" aren't perfect duplicates, and some
> rows have more information than others. A simple example might look like
> this:
>
>> foo
> V1 V2 V3
> 1: 1 3 TRUE
> 2: 1 4 TRUE
> 3: 2 3 NA
> 4: 2 4 TRUE
> 5: 1 3 TRUE
> 6: 1 4 NA
> 7: 2 3 TRUE
> 8: 2 4 TRUE
> 9: 3 1 NA
>> unique(foo, by = c("V1", "V2"))
> V1 V2 V3
> 1: 1 3 TRUE
> 2: 1 4 TRUE
> 3: 2 3 NA
> 4: 2 4 TRUE
> 5: 3 1 NA
>
>
> Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my story)
> uniquely identify an observation, but if there's a row where I also have V3,
> I'd prefer to have that row rather than a row where it's missing. You can
> see that a naive use of unique here gets me the less-preferable 2,3 row. If
> I only had three columns, this would be easy to solve (sort/setkey first
> would do it). However, I have more than a dozen additional columns, and
> when I drop duplicates I want to retain the row with the greatest number of
> non-missing values. Additionally, some columns are more important than
> others. If (to refer again to the example above), there are no rows that
> have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so I
> can't just condition on !is.na(V3).
>
> Does anybody have any insight or techniques for this sort of thing? I'm
> currently sorting on all columns prior to unique, but I'm quite sure that
> this loses some information.
Append an importance column which ranks the importance of that row
(lower better) and make importance the low order component of the key.
DT[, importance := 0+is.na(V3)]
setkey(DT, V1, V2, importance)
unique(DT, by = c("V1", "V2"))
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
More information about the datatable-help
mailing list