[datatable-help] unique, by full row (not just key)

Thu Mar 7 18:07:28 CET 2013

Which means that unique.data.table itself can be improved internally, in
the way I just suggested, using shallow() ...

Most of the time the key will be small, so the copy of the key columns
passed to duplist won't be huge, but it is still a copy. And it could slow
down key-only tables the most, relatively.
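
For illustration only, here is a minimal sketch of full-row uniqueness that keeps the key. It assumes a data.table version in which unique() accepts a `by` argument, which may not be available in the version discussed in this thread. The helper name unique_rows_keep_key is hypothetical, and the example table is made up.

```r
library(data.table)

# Hypothetical helper: drop rows that are duplicated across ALL columns,
# then make sure the original key is set on the result.
unique_rows_keep_key <- function(DT) {
  stopifnot(is.data.table(DT))
  k <- key(DT)                       # cache the key (NULL if unkeyed)
  # Assumption: unique.data.table has a `by` argument in this version.
  res <- unique(DT, by = names(DT))  # compare every column, not just the key
  if (!is.null(k)) setkeyv(res, k)   # (re)apply the cached key
  res
}

# Made-up example: key "id" has repeated values, some rows are full duplicates.
DT <- data.table(id  = c(1L, 1L, 1L, 2L, 2L),
                 val = c("a", "a", "b", "c", "c"),
                 key = "id")
res <- unique_rows_keep_key(DT)
```

Because the key is passed to setkeyv() as a character vector, no eval() wrapping or change to setkey() is needed in this sketch.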

On 07.03.2013 17:03, Matthew Dowle wrote:

> Hi,
>
> Are the duplicates next to each other in the table? Or could duplicates be
> within each key, separated by other rows?
>
> If duplicates are together, calling data.table:::duplist directly should do
> it (see the source of data.table:::unique.data.table). It loops through the
> rows by column and works like diff(x)==0 would, i.e. looking at the previous
> row only, but it does compare all columns. If only a subset of columns is
> needed, then maybe use data.table:::shallow followed by removal of the
> columns you don't need on that shallow copy (the shallow copy and column
> removal being instant). That is only because duplist doesn't accept a subset
> of the list of columns it is passed.
>
> shallow() is on the agenda to be exported for user use (so suggesting it is
> an excuse to get you to test it!). Hadn't thought about exporting duplist,
> but could do that too. They are both relied on internally, so should be
> reliable. But as soon as they're exported we can't make
> non-backwards-compatible changes to them.
>
> Matthew
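
As a rough illustration of the previous-row comparison described in the quoted message, the following base R sketch flags a row as a duplicate when every column equals the row directly above it. It is not data.table's duplist; the function name adjacent_dup and the sample data are hypothetical, and it assumes the columns contain no NA values.

```r
# Hypothetical helper: TRUE where a row equals the immediately preceding row
# in every column (like diff(x) == 0 applied across all columns at once).
# Assumes no NAs, since `==` would propagate them.
adjacent_dup <- function(df) {
  n <- nrow(df)
  if (n < 2L) return(rep(FALSE, n))
  same <- rep(TRUE, n - 1L)
  for (col in df) {                 # iterate column by column
    same <- same & (col[-1L] == col[-n])
  }
  c(FALSE, same)                    # the first row is never a duplicate
}

# Made-up data, already sorted so that duplicates sit next to each other.
df <- data.frame(id  = c(1, 1, 1, 2, 2),
                 val = c("a", "a", "b", "c", "c"),
                 stringsAsFactors = FALSE)
flags   <- adjacent_dup(df)
deduped <- df[!flags, ]
```

This only catches duplicates that are adjacent, which is why the question of whether duplicates sit together matters.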
> 
> On 07.03.2013 16:45, Ricardo Saporta wrote:
>
>> I have a keyed data.table, DT, with 800k rows, of which about 0.5% are
>> duplicates that need to be removed.
>> Using unique(DT) of course whittles down the whole table to one row per key.
>> I would like to get results similar to unique.data.frame(DT).
>> Two problems with using unique.data.frame: (1) speed, (2) loss of key(DT).
>> So instead I'm using a wrapper that
>> (1) caches key(DT), (2) removes the key, (3) calls unique on DT, (4) then
>> reapplies the key.
>> However, this is convoluted (and also requires modifying setkey(.) and
>> getdots(.)).
>> It occurs to me that I might be overlooking a simpler alternative.
>> Any thoughts?
>>
>> Thanks,
>> Rick
>> _Here is what I am using_:
>>
>> uniqueRows <- function(DT) {
>>   # If not keyed (or not a DT), use regular unique(DT)
>>   if (!is.data.table(DT) || !haskey(DT))
>>     return(unique(DT))
>>   .key <- key(DT)
>>   setkey(DT, NULL)
>>   setkeyE(unique(DT), eval(.key))
>> }
>>
>> getdotsWithEval <- function() {
>>   dots <- as.character(match.call(sys.function(-1), call = sys.call(-1),
>>                                   expand.dots = FALSE)$...)
>>   if (grepl("^eval\\(", dots) && grepl("\\)$", dots))
>>     return(eval(parse(text = dots)))
>>   return(dots)
>> }
>>
>> setkeyE <- function(x, ..., verbose = getOption("datatable.verbose")) {
>>   # SAME AS setkey(.) WITH ADDITION THAT
>>   # IF KEY IS WRAPPED IN eval(.) IT WILL BE PARSED
>>   if (is.character(x))
>>     stop("x may no longer be the character name of the data.table. The possibility was undocumented and has been removed.")
>>   #** THIS IS THE MODIFIED LINE **#
>>   # OLD**: cols = getdots()
>>   cols <- getdotsWithEval()
>>   if (!length(cols))
>>     cols = colnames(x)
>>   else if (identical(cols, "NULL"))
>>     cols = NULL
>>   setkeyv(x, cols, verbose = verbose)
>> }
>> --
>>
>> Ricardo Saporta
>> Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu [1]

Links:
------
[1] mailto:saporta at rutgers.edu