[datatable-help] rbindlist and unique

Wed May 21 02:45:16 CEST 2014

Thanks!  That's a good idea, and a lot simpler than what I was concocting
in my head.  I'll give that a try.  I think--just for for posterity--you
mean

DT[, importance := 0 - is.na(V3)]

rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/

On Tue, May 20, 2014 at 8:34 PM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:

> On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham <npgraham1 at gmail.com>
> wrote:
> > First, I use rbindlist pretty often, and I've been quite happy with it.
>  The
> > new use.names and fill features definitely scratch an itch for me; I
> wound
> > up using rbind_all from dplyr (which worked well, I'm not complaining),
> but
> > I'm looking forward to having a data.table implementation.  The speed
> > increase is also welcome.  So thank you for the new features!  I don't
> > personally have a preference with respect to the use.names and fill
> > defaults, so whatever you guys decide will be fine with me.
> >
> > I do have a question regarding unique, which I use very, very frequently,
> > and often after rbindlist.  I have a fairly large data set (tens of
> millions
> > of raw observations), many of which are duplicates.  The observations
> come
> > from a variety of sources, but the formats and variable names are
> (nearly)
> > identical.
> >
> > The problem is that many "duplicates" aren't perfect duplicates, and some
> > rows have more information than others.  A simple example might look like
> > this:
> >
> >> foo
> >    V1 V2   V3
> > 1:  1  3 TRUE
> > 2:  1  4 TRUE
> > 3:  2  3   NA
> > 4:  2  4 TRUE
> > 5:  1  3 TRUE
> > 6:  1  4   NA
> > 7:  2  3 TRUE
> > 8:  2  4 TRUE
> > 9:  3  1   NA
> >> unique(foo, by = c("V1", "V2"))
> >    V1 V2   V3
> > 1:  1  3 TRUE
> > 2:  1  4 TRUE
> > 3:  2  3   NA
> > 4:  2  4 TRUE
> > 5:  3  1   NA
> >
> >
> > Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story)
> > uniquely identify an observation, but if there's a row where I also have
> V3,
> > I'd prefer to have that row rather than a row where it's missing.  You
> can
> > see that a naive use of unique here gets me the less-preferable 2,3 row.
>  If
> > I only had three columns, this would be easy to solve (sort/setkey first
> > would do it).  However, I have more than a dozen additional columns, and
> > when I drop duplicates I want to retain the row with the greatest number
> of
> > non-missing values.  Additionally, some columns are more important than
> > others.  If (to refer again to the example above), there are no rows that
> > have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so
> I
> > can't just condition on !is.na(V3).
> >
> > Does anybody have any insight or techniques for this sort of thing?  I'm
> > currently sorting on all columns prior to unique, but I'm quite sure that
> > this loses some information.
>
> Append an importance column which ranks the importance of that row
> (lower better) and make importance the low order component of the key.
>
> DT[, importance := 0+is.na(V3)]
> setkey(DT, V1, V2, importance)
> unique(DT, by = c("V1", "V2"))
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140520/af5ff1bd/attachment-0001.html>