[datatable-help] unique.data.frame should create a copy, right?

Ricardo Saporta saporta at scarletmail.rutgers.edu
Fri Sep 27 21:01:44 CEST 2013


Running some benchmarks at work, I got the following when comparing
unique.data.frame to the new
unique(..., by=...):

> microbenchmark(eval(uDF), eval(uDT))
Unit: milliseconds
      expr      min        lq    median       uq      max neval
 eval(uDF) 28.38505 29.368062 31.705633 33.53874 52.57522   100
 eval(uDT)  6.61314  7.220897  7.597114  9.58860 78.82127   100
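
A comparison of this kind could be set up roughly as follows. This is only a
sketch: the table contents, sizes and column names are made up, and it assumes
that uDF and uDT are quoted calls along the lines shown. It needs the
data.table package, and the timing step is skipped if microbenchmark is not
installed.

```r
library(data.table)

set.seed(1)
n <- 2e4

# Made-up test data (assumption: the tables used in the post are not shown).
DT <- data.table(
  id  = sample(100L, n, replace = TRUE),
  grp = sample(letters, n, replace = TRUE),
  val = sample(10L, n, replace = TRUE)
)
DF <- as.data.frame(DT)

# Quoted calls, so that eval() re-runs the call on every benchmark iteration.
uDF <- quote(unique(DF))                  # plain data.frame -> unique.data.frame
uDT <- quote(unique(DT, by = names(DT)))  # data.table method, all columns via `by`

# Both approaches should keep the same number of rows.
stopifnot(nrow(eval(uDF)) == nrow(eval(uDT)))

# Time both calls if microbenchmark is installed.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(eval(uDF), eval(uDT), times = 100L))
}
```

Absolute timings will of course depend on the data and the machine.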


well done!
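
For readers skimming the thread, the behaviours discussed in the quoted
messages below (key columns, an arbitrary set of columns, or all columns) can
be sketched as follows. The table and its column names are made up for
illustration, and `by` is passed explicitly in every call because its default
may depend on the data.table version installed.

```r
library(data.table)

# Hypothetical example table; names and values are for illustration only.
DT <- data.table(
  a = c(1, 1, 2, 2, 2),
  b = c("x", "x", "y", "y", "z"),
  v = c(10, 20, 30, 30, 40)
)
setkey(DT, a)

by_key  <- unique(DT, by = key(DT))      # one row per value of the key column `a`
by_cols <- unique(DT, by = c("a", "b"))  # one row per (a, b) combination
by_all  <- unique(DT, by = names(DT))    # all columns, like unique.data.frame

# duplicated() accepts the same argument and returns a logical vector.
dup_ab <- duplicated(DT, by = c("a", "b"))
```

Here `by_key` has 2 rows, `by_cols` has 3 and `by_all` has 4; in each case the
first row of every group is the one that is kept.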




On Tue, Aug 27, 2013 at 1:23 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:

> Last update here :-)
>
> After more hemming and hawing, I've changed the name of the new
> parameter added to duplicated.data.table and unique.data.table from
> `by.columns` to just `by`, as it is (more or less) the same idea as
> the `by` in dt[i, j, by, ...]
>
> Sorry for any inconveniences caused if you've been working off of the
> development version.
>
> -steve
>
>
> On Thu, Aug 15, 2013 at 9:35 PM, Ricardo Saporta
> <saporta at scarletmail.rutgers.edu> wrote:
> > Steve, great stuff!!
> > thanks for making that happen
> >
> > Rick
> >
> >
> > On Wed, Aug 14, 2013 at 8:30 PM, Steve Lianoglou
> > <mailinglist.honeypot at gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> As I needed this sooner than I had expected, I just committed this
> >> change. It's in svn revision 889.
> >>
> >> I chose 'by.columns' as the parameter name -- it seemed to make more
> >> sense to me, and R's partial argument matching allows the shorthand
> >> interactively, e.g.: unique(dt, by=c('some', 'columns')) ;-)
> >>
> >> Here's the note from the NEWS file:
> >>
> >> o  "Uniqueness" tests can now specify arbitrary combinations of
> >> columns to use to test for duplicates. `by.columns` parameter added to
> >> unique.data.table and duplicated.data.table. This allows the user to
> >> test for uniqueness using any combination of columns in the
> >> data.table, where previously the user only had the option to use the
> >> keyed columns (if keyed) or all columns (if not). The default behavior
> >> sets `by.columns=key(dt)` to maintain backward compatibility. See
> >> man/duplicated.Rd and tests 986:991 for more information. Thanks to
> >> Arunkumar Srinivasan, Ricardo Saporta, and Frank Erickson for useful
> >> discussions.
> >>
> >> Should work as advertised assuming my unit tests weren't too simplistic.
> >>
> >> Cheers,
> >>
> >> -steve
> >>
> >>
> >>
> >>
> >> On Tue, Aug 13, 2013 at 1:24 PM, Steve Lianoglou
> >> <mailinglist.honeypot at gmail.com> wrote:
> >> > Thanks for the suggestions, folks.
> >> >
> >> > Matthew: do you have a preference?
> >> >
> >> > -steve
> >> >
> >> > On Mon, Aug 12, 2013 at 11:12 AM, Ricardo Saporta
> >> > <saporta at scarletmail.rutgers.edu> wrote:
> >> >> Steve,
> >> >>
> >> >> I like your suggestion a lot.  I can see putting column specification
> >> >> to good use.
> >> >>
> >> >> As for the argument name, perhaps
> >> >>    'use.columns'
> >> >>
> >> >> And where a value of NULL or FALSE will yield the same results as
> >> >> `unique.data.frame`
> >> >>
> >> >>     use.columns=key(x)   # default behavior
> >> >>     use.columns=c("col1name", "col7name")   #etc
> >> >>     use.columns=NULL
> >> >>
> >> >>
> >> >> Thanks as always,
> >> >> Rick
> >> >>
> >> >>
> >> >>
> >> >> On Mon, Aug 12, 2013 at 1:51 PM, Steve Lianoglou
> >> >> <mailinglist.honeypot at gmail.com> wrote:
> >> >>>
> >> >>> Hi folks,
> >> >>>
> >> >>> I actually want to revisit the fix I made here.
> >> >>>
> >> >>> Instead of having `use.key` in the signature of unique.data.table
> >> >>> (and duplicated.data.table), like so:
> >> >>>
> >> >>> function(x,
> >> >>>              incomparables=FALSE,
> >> >>>              tolerance=.Machine$double.eps ^ 0.5,
> >> >>>              use.key=TRUE, ...)
> >> >>>
> >> >>> How about we switch out use.key for a parameter that specifies the
> >> >>> column names to use in the uniqueness check, which defaults to key(x)
> >> >>> to keep backwards compatibility?
> >> >>>
> >> >>> For argument's sake (like that?), let's call this parameter `columns`
> >> >>> (by.columns? with.columns? whatever) so:
> >> >>>
> >> >>> function(x,
> >> >>>              incomparables=FALSE,
> >> >>>              tolerance=.Machine$double.eps ^ 0.5,
> >> >>>              columns=key(x), ...)
> >> >>>
> >> >>> Then:
> >> >>>
> >> >>> (1) leaving it alone is the backward-compatible behavior;
> >> >>> (2) perhaps setting it to NULL will use all columns, and make it
> >> >>> equivalent to unique.data.frame (also the same when x has no key); and
> >> >>> (3) setting it to any other combo of columns uses those columns as the
> >> >>> uniqueness key and filters the rows (only) out of x accordingly.
> >> >>>
> >> >>> What do you folks think? Personally I think this is better on all
> >> >>> accounts than just specifying whether to use the key or not, and the
> >> >>> only question in my mind is the name of the argument -- happy to hear
> >> >>> other world views, however, so don't be shy.
> >> >>>
> >> >>> Thanks,
> >> >>> -steve
> >> >>>
> >> >>> --
> >> >>> Steve Lianoglou
> >> >>> Computational Biologist
> >> >>> Bioinformatics and Computational Biology
> >> >>> Genentech