[datatable-help] rbindlist and unique

Wed May 21 13:00:56 CEST 2014

Nathaniel, Thanks.

> First, I use rbindlist pretty often, and I've been quite happy with it.  The new use.names and fill features definitely scratch an itch for me; I wound up using rbind_all from dplyr (which worked well, I'm not complaining), but I'm looking forward to having a data.table implementation.
A data.table implementation (in rbind) has existed since the last release (v1.9.0/2). This one just builds on it.
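
For reference, a minimal sketch of what the use.names and fill arguments do in rbindlist, assuming a data.table version that supports both. The tables a and b are made up purely for illustration.

```r
library(data.table)

# Two illustrative tables: the columns are in a different order,
# and only the second table has V3.
a <- data.table(V1 = 1:2, V2 = c(3, 4))
b <- data.table(V2 = 5, V1 = 3L, V3 = TRUE)

# use.names = TRUE matches columns by name rather than by position;
# fill = TRUE pads columns missing from a table with NA.
combined <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
print(combined)
```

Without fill = TRUE the differing column counts would be an error, so the flag is what makes stacking "nearly identical" sources practical.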

Arun

From: Nathaniel Graham npgraham1 at gmail.com
Reply: Nathaniel Graham npgraham1 at gmail.com
Date: May 21, 2014 at 2:20:44 AM
To: data.table source forge datatable-help at lists.r-forge.r-project.org
Subject:  [datatable-help] rbindlist and unique  

First, I use rbindlist pretty often, and I've been quite happy with it.  The new use.names and fill features definitely scratch an itch for me; I wound up using rbind_all from dplyr (which worked well, I'm not complaining), but I'm looking forward to having a data.table implementation.  The speed increase is also welcome.  So thank you for the new features!  I don't personally have a preference with respect to the use.names and fill defaults, so whatever you guys decide will be fine with me.

I do have a question regarding unique, which I use very, very frequently, and often after rbindlist.  I have a fairly large data set (tens of millions of raw observations), many of which are duplicates.  The observations come from a variety of sources, but the formats and variable names are (nearly) identical.

The problem is that many "duplicates" aren't perfect duplicates, and some rows have more information than others.  A simple example might look like this:

> foo
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  1  3 TRUE
6:  1  4   NA
7:  2  3 TRUE
8:  2  4 TRUE
9:  3  1   NA
> unique(foo, by = c("V1", "V2"))
   V1 V2   V3
1:  1  3 TRUE
2:  1  4 TRUE
3:  2  3   NA
4:  2  4 TRUE
5:  3  1   NA

Sometimes V3 is present and sometimes it isn't.  V1 and V2 (in my story) uniquely identify an observation, but if there's a row where I also have V3, I'd prefer to have that row rather than a row where it's missing.  You can see that a naive use of unique here gets me the less-preferable row for V1 = 2, V2 = 3.  If I only had three columns, this would be easy to solve (sort/setkey first would do it).  However, I have more than a dozen additional columns, and when I drop duplicates I want to retain the row with the greatest number of non-missing values.  Additionally, some columns are more important than others.  If (to refer again to the example above) there are no rows that have V3 for a given V1 & V2 (like V1 = 3, V2 = 1), I still need to retain a row, so I can't just condition on !is.na(V3).

Does anybody have any insight or techniques for this sort of thing?  I'm currently sorting on all columns prior to unique, but I'm quite sure that this loses some information.
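
One possible approach, offered as a sketch rather than a settled answer: score each row by a weighted count of its non-missing values, order by the identifying columns and then by descending score, and only then call unique. The weights vector, the row_score variable and the temporary score column are hypothetical names introduced here; equal weights of 1 reduce to "most non-missing values wins".

```r
library(data.table)

# The example table from above.
foo <- data.table(
  V1 = c(1, 1, 2, 2, 1, 1, 2, 2, 3),
  V2 = c(3, 4, 3, 4, 3, 4, 3, 4, 1),
  V3 = c(TRUE, TRUE, NA, TRUE, TRUE, NA, TRUE, TRUE, NA)
)

id_cols    <- c("V1", "V2")
value_cols <- setdiff(names(foo), id_cols)

# Hypothetical importance weights, one per non-identifying column.
# Raise a weight to make that column count for more.
weights <- setNames(rep(1, length(value_cols)), value_cols)

# Weighted count of non-missing values in each row.
row_score <- Reduce(`+`, lapply(value_cols, function(col) {
  weights[[col]] * (!is.na(foo[[col]]))
}))

# Work on a copy so foo itself is left unchanged.
ranked <- copy(foo)[, score := row_score]

# Best-scoring row first within each (V1, V2) group.
ranked <- ranked[order(V1, V2, -score)]

# unique keeps the first row per group, which is now the best one;
# groups with no complete row (V1 = 3, V2 = 1) still keep a row.
best <- unique(ranked, by = id_cols)[, score := NULL]
print(best)
```

Ties within a group keep whichever tied row came first in the original order. A score collapses each row to a single number, so it cannot combine information from two partially filled rows; that would need a per-column aggregation instead.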

-------
Nathaniel Graham
npgraham1 at gmail.com
npgraham1 at uky.edu
https://sites.google.com/site/npgraham1/