[datatable-help] unique.data.frame should create a copy, right?

Matthew Dowle mdowle at mdowle.plus.com
Sat Sep 28 09:29:40 CEST 2013


Oh, good point.

How about putting 'by' first in those situations:

 > DT = data.table(A=rep(1:3,2),B=1:2)
 > unique(by="A",DT)
    A B
1: 1 1
2: 2 2
3: 3 1
 > unique(by="B",DT)
    A B
1: 1 1
2: 2 2
 >
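
A minimal sketch of why this works, assuming a recent data.table is installed: R matches named arguments by name before it fills in positional ones, so by= can come first and DT is still matched to the first formal of unique(). The nested call and the names by_first, by_last and agg are only illustrative and are not from the thread.

```r
library(data.table)

DT <- data.table(A = rep(1:3, 2), B = 1:2)

# Named arguments are matched by name, so both orderings call
# unique.data.table with the same arguments.
by_first <- unique(by = "A", DT)
by_last  <- unique(DT, by = "A")
stopifnot(isTRUE(all.equal(by_first, by_last)))

# In a nested call, the leading by= belongs to unique(),
# while the trailing by= belongs to `[.data.table`.
agg <- unique(by = "A", DT[, list(B = sum(B)), by = A])
```

Putting by= first keeps unique()'s argument next to the function name, which may make it easier to tell apart from the by= inside the brackets.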

On 27/09/13 20:09, Ricardo Saporta wrote:
> Steve, not to beat a dead horse on the "what to name the new 
> parameter" discussion,  but I'm wondering what your/others' thoughts 
> are on using something other than "by".  Maybe even "uby"
>
> Or perhaps we can have a synonym in the function definition:
>    .. function(........ , by=uby, uby)
>
> The reason I bring this up is that as I begin to use this and I am 
> reading over my own code, I realize that it takes a lot of visual 
> parsing to distinguish when the "by" in a complex call belongs to 
> "[.data.table" and when the "by" belongs to "unique.data.table"
>
> Cheers,
> Rick
>
>
> On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>
>     Last update here :-)
>
>     After more hemming and hawing, I've changed the name of the new
>     parameter added to duplicated.data.table and unique.data.table from
>     `by.columns` to just `by`, as it (more or less) is the same idea as
>     the `by` in dt[x, i,j,by,...]
>
>     Sorry for any inconveniences caused if you've been working off of the
>     development version.
>
>     -steve
>
>
>     On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta
>     <saporta at scarletmail.rutgers.edu> wrote:
>     > Steve, great stuff!!
>     > thanks for making that happen
>     >
>     > Rick
>     >
>     >
>     > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou
>     > <mailinglist.honeypot at gmail.com> wrote:
>     >>
>     >> Hi all,
>     >>
>     >> As I needed this sooner than I had expected, I just committed this
>     >> change. It's in svn revision 889.
>     >>
>     >> I chose 'by.columns' as the parameter name -- seemed to make more
>     >> sense to me, and using the shorthand interactively saves a letter,
>     >> eg: unique(dt, by=c('some', 'columns')) ;-)
>     >>
>     >> Here's the note from the NEWS file:
>     >>
>     >> o  "Uniqueness" tests can now specify arbitrary combinations of
>     >> columns to use to test for duplicates. `by.columns` parameter added
>     >> to unique.data.table and duplicated.data.table. This allows the user
>     >> to test for uniqueness using any combination of columns in the
>     >> data.table, where previously the user only had the option to use the
>     >> keyed columns (if keyed) or all columns (if not). The default behavior
>     >> sets `by.columns=key(dt)` to maintain backward compatibility. See
>     >> man/duplicated.Rd and tests 986:991 for more information. Thanks to
>     >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful
>     >> discussions.
>     >>
>     >> Should work as advertised assuming my unit tests weren't too
>     >> simplistic.
>     >>
>     >> Cheers,
>     >>
>     >> -steve
>     >>
>     >>
>     >>
>     >>
>     >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou
>     >> <mailinglist.honeypot at gmail.com> wrote:
>     >> > Thanks for the suggestions, folks.
>     >> >
>     >> > Matthew: do you have a preference?
>     >> >
>     >> > -steve
>     >> >
>     >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta
>     >> > <saporta at scarletmail.rutgers.edu> wrote:
>     >> >> Steve,
>     >> >>
>     >> >> I like your suggestion a lot.  I can see putting column
>     >> >> specification to good use.
>     >> >>
>     >> >> As for the argument name, perhaps
>     >> >>    'use.columns'
>     >> >>
>     >> >> And a value of NULL or FALSE would yield the same results as
>     >> >> `unique.data.frame`
>     >> >>
>     >> >>     use.columns=key(x)   # default behavior
>     >> >>     use.columns=c("col1name", "col7name")   #etc
>     >> >>     use.columns=NULL
>     >> >>
>     >> >>
>     >> >> Thanks as always,
>     >> >> Rick
>     >> >>
>     >> >>
>     >> >>
>     >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou
>     >> >> <mailinglist.honeypot at gmail.com> wrote:
>     >> >>>
>     >> >>> Hi folks,
>     >> >>>
>     >> >>> I actually want to revisit the fix I made here.
>     >> >>>
>     >> >>> Instead of having `use.key` in the signature of
>     >> >>> unique.data.table (and duplicated.data.table), like so:
>     >> >>>
>     >> >>> function(x,
>     >> >>>          incomparables=FALSE,
>     >> >>>          tolerance=.Machine$double.eps ^ 0.5,
>     >> >>>          use.key=TRUE, ...)
>     >> >>>
>     >> >>> How about we switch out use.key for a parameter that specifies the
>     >> >>> column names to use in the uniqueness check, which defaults to
>     >> >>> key(x) to keep backwards compatibility?
>     >> >>>
>     >> >>> For argument's sake (like that?), let's call this parameter
>     >> >>> `columns` (by.columns? with.columns? whatever) so:
>     >> >>>
>     >> >>> function(x,
>     >> >>>          incomparables=FALSE,
>     >> >>>          tolerance=.Machine$double.eps ^ 0.5,
>     >> >>>          columns=key(x), ...)
>     >> >>>
>     >> >>> Then:
>     >> >>>
>     >> >>> (1) leaving it alone is the backward-compatible behavior;
>     >> >>> (2) perhaps setting it to NULL will use all columns, and make it
>     >> >>> equivalent to unique.data.frame (also the same when x has no
>     >> >>> key); and
>     >> >>> (3) setting it to any other combo of columns uses those columns as
>     >> >>> the uniqueness key and filters the rows (only) out of x
>     >> >>> accordingly.
>     >> >>>
>     >> >>> What do you folks think? Personally I think this is better on all
>     >> >>> counts than just specifying whether to use the key or not, and the
>     >> >>> only question in my mind is the name of the argument -- happy to
>     >> >>> hear other world views, however, so don't be shy.
>     >> >>>
>     >> >>> Thanks,
>     >> >>> -steve
>     >> >>>
>     >> >>> --
>     >> >>> Steve Lianoglou
>     >> >>> Computational Biologist
>     >> >>> Bioinformatics and Computational Biology
>     >> >>> Genentech
>     >> >>
>     >> >>
>     >> >
>     >> >
>     >> >
>     >>
>     >>
>     >>
>     >
>     >
>
>
>
>
>
>
>
