[datatable-help] unique, by full row (not just key)

Matthew Dowle mdowle at mdowle.plus.com
Thu Mar 7 18:03:00 CET 2013


 

Hi, 

Are the duplicates next to each other in the table? Or could
duplicates be within each key, separated by other rows? 

If the duplicates are adjacent, calling data.table:::duplist directly
should do it (see the source of data.table:::unique.data.table). It
loops through the rows column by column and works like diff(x)==0 would,
i.e. it looks at the previous row only, but it does compare all columns.
If only a subset of columns is needed, then maybe use
data.table:::shallow followed by removal of the columns you don't need
from that shallow copy (both the shallow copy and the column removal are
instant). That extra step is needed only because duplist doesn't accept
a subset of the list of columns it is passed.
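
To make the "previous row only, all columns" comparison concrete, here is
a minimal base R sketch. It is not the duplist source: adjacent_unique is
a hypothetical helper name, it works on a plain data.frame, and it
assumes there are no NA values in the columns.

```r
# Hypothetical helper: TRUE where a row differs from the row directly
# above it in at least one column. Only adjacent rows are compared, so
# duplicates separated by other rows are NOT detected.
adjacent_unique <- function(df) {
  n <- nrow(df)
  if (n == 0L) return(logical(0))
  changed <- logical(n)
  changed[1L] <- TRUE                 # the first row is always kept
  if (n > 1L) {
    for (col in df) {                 # loop over columns, one at a time
      # row i vs. row i - 1 within this column (assumes no NAs)
      changed[-1L] <- changed[-1L] | (col[-1L] != col[-n])
    }
  }
  changed
}

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 v  = c("a", "a", "b", "c", "c"),
                 stringsAsFactors = FALSE)
keep <- adjacent_unique(df)
df[keep, ]                            # rows 1, 3 and 4 remain
```

Because only neighbouring rows are compared, this gives the same answer
as a full-row unique only when equal rows sit next to each other, which
is why the question about adjacency matters.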

shallow() is on the agenda to be exported for user use (so suggesting it
is an excuse to get you to test it!). I hadn't thought about exporting
duplist, but we could do that too. Both are relied on internally, so they
should be reliable. But as soon as they're exported we can't make
non-backwards-compatible changes to them.

Matthew 

On 07.03.2013 16:45, Ricardo Saporta wrote:

> I have a keyed data.table, DT, with 800k rows, of which about 0.5% are
> duplicates that need to be removed.
> Using unique(DT) of course whittles down the whole table to one row
> per key.
> I would like to get results similar to unique.data.frame(DT).
> Two problems with using unique.data.frame: (1) speed (2) loss of
> key(DT).
> So instead I'm using a wrapper that
> (1) caches key(DT) (2) removes the key (3) calls unique on DT (4) then
> reapplies the key.
> However, this is convoluted (and also requires modifying setkey(.) and
> getdots(.)).
> It occurs to me that I might be overlooking a simpler alternative.
> Any thoughts?
> Thanks,
> Rick
>
> _Here is what I am using_:
> uniqueRows <- function(DT) {
>   # If not keyed (or not a DT), use regular unique(DT)
>   if (!is.data.table(DT) || !haskey(DT))
>     return(unique(DT))
>   .key <- key(DT)
>   setkey(DT, NULL)
>   setkeyE(unique(DT), eval(.key))
> }
> getdotsWithEval <- function() {
>   dots <-
>     as.character(match.call(sys.function(-1), call = sys.call(-1),
>                             expand.dots = FALSE)$...)
>   if (grepl("^eval\\(", dots) && grepl("\\)$", dots))
>     return(eval(parse(text=dots)))
>   return(dots)
> }
> setkeyE <- function(x, ..., verbose = getOption("datatable.verbose")) {
>   # SAME AS setkey(.) WITH ADDITION THAT
>   # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED
>   if (is.character(x))
>     stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.")
>   #** THIS IS THE MODIFIED LINE **#
>   # OLD**: cols = getdots()
>   cols <- getdotsWithEval()
>   if (!length(cols))
>     cols = colnames(x)
>   else if (identical(cols, "NULL"))
>     cols = NULL
>   setkeyv(x, cols, verbose = verbose)
> }
> --
> 
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu [1]
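
The four-step wrapper in the quoted message (cache the key, drop it,
call unique, reapply the key) can be sketched without modifying
setkey(.) or getdots(.), because setkeyv() accepts the cached key as a
character vector. This is a sketch under assumptions, not the method
discussed in this thread: unique_rows_keep_key is a hypothetical name,
and it assumes a data.table version in which unique() accepts a `by`
argument, which was not the case for the version discussed here.

```r
library(data.table)

# Hypothetical helper: de-duplicate on whole rows, then restore the key.
unique_rows_keep_key <- function(DT) {
  stopifnot(is.data.table(DT))
  key_cols <- key(DT)                  # character vector, or NULL if unkeyed
  out <- unique(DT, by = names(DT))    # compare all columns, not just the key
  if (!is.null(key_cols))
    setkeyv(out, key_cols)             # reapply the cached key by reference
  out
}

DT <- data.table(k = c(1, 1, 1, 2),
                 v = c("a", "a", "b", "c"),
                 key = "k")
res <- unique_rows_keep_key(DT)
res                                    # one of the two (1, "a") rows is dropped
key(res)                               # still "k"
```

Passing the key as a character vector to setkeyv() is what removes the
need for the eval(.) workaround in setkeyE.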

 

Links:
------
[1] mailto:saporta at rutgers.edu