[datatable-help] unique.data.frame should create a copy, right?

Steve Lianoglou mailinglist.honeypot at gmail.com
Thu Aug 15 02:30:19 CEST 2013


Hi all,

As I needed this sooner than I had expected, I just committed this
change. It's in svn revision 889.

I chose 'by.columns' as the parameter name -- it seemed to make more
sense to me, and using the shorthand interactively saves a letter,
e.g.: unique(dt, by=c('some', 'columns')) ;-)

Here's the note from the NEWS file:

o  "Uniqueness" tests can now specify arbitrary combinations of
columns to use to test for duplicates. `by.columns` parameter added to
unique.data.table and duplicated.data.table. This allows the user to
test for uniqueness using any combination of columns in the
data.table, where previously the user only had the option to use the
keyed columns (if keyed) or all columns (if not). The default behavior
sets `by.columns=key(dt)` to maintain backward compatibility. See
man/duplicated.Rd and tests 986:991 for more information. Thanks to
Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful
discussions.

Should work as advertised assuming my unit tests weren't too simplistic.
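
A minimal sketch of the behavior described in the NEWS note, assuming a
data.table build that includes this change. The table `dt` and its
column names are made up for illustration. The calls use the `by=`
shorthand mentioned above and always pass the columns explicitly, so
the sketch does not depend on what the default is.

```r
library(data.table)

# Hypothetical example table; the column names are illustrative only.
dt <- data.table(id  = c(1L, 1L, 2L, 2L),
                 grp = c("a", "a", "a", "b"),
                 val = c(10, 11, 20, 30))
setkey(dt, id)

# Uniqueness judged on the key columns only (one row per id).
by_key <- unique(dt, by = key(dt))

# Uniqueness judged on an arbitrary combination of columns.
by_cols <- unique(dt, by = c("id", "grp"))

# Uniqueness judged on every column, as unique.data.frame would do.
by_all <- unique(dt, by = names(dt))

# duplicated() accepts the same argument and flags the repeated rows.
dup <- duplicated(dt, by = c("id", "grp"))

print(by_key)
print(by_cols)
print(by_all)
print(dup)
```

In each case the first row of every group is kept and whole rows are
returned; only the columns used for the comparison change.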

Cheers,

-steve




On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Thanks for the suggestions, folks.
>
> Matthew: do you have a preference?
>
> -steve
>
> On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu> wrote:
>> Steve,
>>
>> I like your suggestion a lot.  I can see putting column specification to
>> good use.
>>
>> As for the argument name, perhaps
>>    'use.columns'
>>
>> And where a value of NULL or FALSE will yield the same results as
>> `unique.data.frame`
>>
>>     use.columns=key(x)   # default behavior
>>     use.columns=c("col1name", "col7name")   #etc
>>     use.columns=NULL
>>
>>
>> Thanks as always,
>> Rick
>>
>>
>>
>> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>> Hi folks,
>>>
>>> I actually want to revisit the fix I made here.
>>>
>>> Instead of having `use.key` in the signature of unique.data.table (and
>>> duplicated.data.table), as in:
>>>
>>> function(x,
>>>              incomparables=FALSE,
>>>              tolerance=.Machine$double.eps ^ 0.5,
>>>              use.key=TRUE, ...)
>>>
>>> How about we switch out use.key for a parameter that specifies the
>>> column names to use in the uniqueness check, which defaults to key(x)
>>> to keep backwards compatibility?
>>>
>>> For argument's sake (like that?), let's call this parameter `columns`
>>> (by.columns? with.columns? whatever) so:
>>>
>>> function(x,
>>>              incomparables=FALSE,
>>>              tolerance=.Machine$double.eps ^ 0.5,
>>>              columns=key(x), ...)
>>>
>>> Then:
>>>
>>> (1) leaving it alone is the backward-compatible behavior;
>>> (2) Perhaps setting it to NULL will use all columns, and make it
>>> equivalent to unique.data.frame (also the same when x has no key); and
>>> (3) setting it to any other combo of columns uses those columns as the
>>> uniqueness key and filters the rows (only) out of x accordingly.
>>>
>>> What do you folks think? Personally I think this is better on all
>>> counts than just specifying to use the key or not, and the only
>>> question in my mind is the name of the argument -- happy to hear other
>>> world views, however, so don't be shy.
>>>
>>> Thanks,
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Computational Biologist
>>> Bioinformatics and Computational Biology
>>> Genentech
>>
>>
>
>
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

