[datatable-help] rbindlist and unique
Nathaniel Graham
npgraham1 at gmail.com
Wed May 21 02:45:16 CEST 2014
Thanks! That's a good idea, and a lot simpler than what I was concocting
in my head. I'll give that a try. I think--just for for posterity--you
mean
DT[, importance := 0 - is.na(V3)]
rather than 0 + is.na(V3), so that rows with V3 are lower than rows without.
-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/
On Tue, May 20, 2014 at 8:34 PM, Gabor Grothendieck <ggrothendieck at gmail.com
> wrote:
> On Tue, May 20, 2014 at 8:20 PM, Nathaniel Graham <npgraham1 at gmail.com>
> wrote:
> > First, I use rbindlist pretty often, and I've been quite happy with it.
> The
> > new use.names and fill features definitely scratch an itch for me; I
> wound
> > up using rbind_all from dplyr (which worked well, I'm not complaining),
> but
> > I'm looking forward to having a data.table implementation. The speed
> > increase is also welcome. So thank you for the new features! I don't
> > personally have a preference with respect to the use.names and fill
> > defaults, so whatever you guys decide will be fine with me.
> >
> > I do have a question regarding unique, which I use very, very frequently,
> > and often after rbindlist. I have a fairly large data set (tens of
> millions
> > of raw observations), many of which are duplicates. The observations
> come
> > from a variety of sources, but the formats and variable names are
> (nearly)
> > identical.
> >
> > The problem is that many "duplicates" aren't perfect duplicates, and some
> > rows have more information than others. A simple example might look like
> > this:
> >
> >> foo
> > V1 V2 V3
> > 1: 1 3 TRUE
> > 2: 1 4 TRUE
> > 3: 2 3 NA
> > 4: 2 4 TRUE
> > 5: 1 3 TRUE
> > 6: 1 4 NA
> > 7: 2 3 TRUE
> > 8: 2 4 TRUE
> > 9: 3 1 NA
> >> unique(foo, by = c("V1", "V2"))
> > V1 V2 V3
> > 1: 1 3 TRUE
> > 2: 1 4 TRUE
> > 3: 2 3 NA
> > 4: 2 4 TRUE
> > 5: 3 1 NA
> >
> >
> > Sometimes V3 is present and sometimes it isn't. V1 and V2 (in my story)
> > uniquely identify an observation, but if there's a row where I also have
> V3,
> > I'd prefer to have that row rather than a row where it's missing. You
> can
> > see that a naive use of unique here gets me the less-preferable 2,3 row.
> If
> > I only had three columns, this would be easy to solve (sort/setkey first
> > would do it). However, I have more than a dozen additional columns, and
> > when I drop duplicates I want to retain the row with the greatest number
> of
> > non-missing values. Additionally, some columns are more important than
> > others. If (to refer again to the example above), there are no rows that
> > have V3 for a given V1 & V2 (like 3,1), I still need to retain a row, so
> I
> > can't just condition on !is.na(V3).
> >
> > Does anybody have any insight or techniques for this sort of thing? I'm
> > currently sorting on all columns prior to unique, but I'm quite sure that
> > this loses some information.
>
> Append an importance column which ranks the importance of that row
> (lower better) and make importance the low order component of the key.
>
> DT[, importance := 0+is.na(V3)]
> setkey(DT, V1, V2, importance)
> unique(DT, by = c("V1", "V2"))
>
>
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140520/af5ff1bd/attachment-0001.html>
More information about the datatable-help
mailing list