[datatable-help] Feature Idea

Matthew Dowle mdowle at mdowle.plus.com
Sun Jul 17 02:58:39 CEST 2011


On Mon, 2011-07-11 at 19:42 +0000, Alexander Peterhansl wrote:
> I've found that "by" does not need a key.  For example,
Yes, that's 'ad hoc' by. It can be quite a bit slower than keyed by,
though. It's not just that it has to find the groups, but those groups
may be scattered in the table so it can't copy the data into the SD
environment in bulk (i.e. using memcpy in C). setkey isn't just about
ordering, but also getting the groups together in memory.
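
A minimal sketch of the two cases, assuming a current data.table
release (copy() and key() are used only for illustration, and the
variable names are made up). It runs the same aggregation once ad hoc
and once on a keyed copy, where setkey has physically sorted the rows
so each group sits together in memory:

```r
library(data.table)

temp <- data.table(Index1 = 1:4, Index2 = c(4, 2, 2, 1),
                   Values = c(10, 10, 10, 30))

# Ad hoc by: no key, so the groups have to be found first and their
# rows may be scattered through the table.
adhoc <- temp[, sum(Values), by = Index2]

# Keyed by: setkey() reorders the rows by Index2 (by reference), so
# work on a copy to leave temp untouched.
keyed <- copy(temp)
setkey(keyed, Index2)
keyed_res <- keyed[, sum(Values), by = Index2]
```

Both calls compute the same per-group sums; only the physical layout
of the input differs.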

> > temp <- data.table(Index1=1:4,Index2=c(4,2,2,1),Values=c(10,10,10,30))  # no key set here!
> > temp[,sum(Values),by=Index2,bysameorder=TRUE]
>      Index2 V1
> [1,]      4 10
> [2,]      2 20
> [3,]      1 30
> > temp[,sum(Values),by=Index2,bysameorder=FALSE]
>      Index2 V1
> [1,]      1 30
> [2,]      2 20
> [3,]      4 10
> 
> Nevertheless "bysameorder" changes the initial ordering.

Been looking into this some more. I know I suggested bysameorder but I'm
not sure now it will work for Steve's feature request starting this
thread, because it doesn't handle non-contiguous groups; e.g. :

> DT = data.table(a=c(1,2,1,2),1:4)
> DT
     a V2
[1,] 1  1
[2,] 2  2
[3,] 1  3
[4,] 2  4
> DT[,sum(V2),by=a]  
     a V1
[1,] 1  4
[2,] 2  6   # ok
> DT[,sum(V2),by=a,bysameorder=TRUE]
     a V1
[1,] 1  1
[2,] 2  2
[3,] 1  3
[4,] 2  4   # not ok, not what bysameorder is for. 

bysameorder is more for by expressions where the user knows they are
order preserving. ?data.table is correct on this, and we'd like to
remove the argument.

It also turns out that ad hoc by doesn't set a key on the result; that
wasn't why the groups were being ordered.

So, the simplest solution is just to make data.table preserve the group
appearance order. If the user wants to sort, or add a key to the
results, then they can do that afterwards. Of course a keyed by will
preserve the group order anyway, and have its result keyed too.

Just committed :

o   Ad hoc grouping now returns results in the order in which each
    group first appears in the table, rather than sorting the groups.
    As before, groups do not have to occur contiguously in the
    table; the first row of each group determines the ordering of
    groups. Thanks to Steve Lianoglou for highlighting. The order of
    the rows within each group always has been, and always will be,
    preserved.
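
A small sketch of what that means in practice, assuming a current
data.table release. The group values are deliberately given in the
order 2, 1 (unlike the DT above) so that appearance order and sorted
order differ; the names res, sorted and keyed are just for
illustration:

```r
library(data.table)

DT <- data.table(a = c(2, 1, 2, 1), V2 = 1:4)

# Ad hoc by: groups come back in first-appearance order (2, then 1),
# even though the groups are not contiguous.
res <- DT[, sum(V2), by = a]

# Sorting afterwards, if wanted.
sorted <- res[order(a)]

# Or add a key to the result afterwards. setkey() works by reference,
# so key a copy here to keep res as it was.
keyed <- setkey(copy(res), a)
```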

This isn't set in stone by any means so comments welcome.

> 
> But, more generally, is there a way to attach a key "on the fly" ?
> 
> Suppose I wanted to extract all table values where Index2 is equal to 1.  Is there a better way to do this than:
> >setkey(temp,"Index2")
> > temp[J(1),]
>      Index2 Index1 Values
> [1,]      1      4     30

No, no better way. You do have to set the key, or create a manual
secondary key. It will be nicer when secondary keys are built in. What
we could do then is when you tell the query you want to join to Index2
(syntax to be determined), it could use the secondary key if it exists,
and if not it could calculate it and add it to the table automatically.
Perhaps.
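
For what it's worth, one possible reading of a "manual secondary key"
is sketched below. This is an assumption about the approach, not a
built-in feature: idx and rownum are hypothetical names, and the idea
is just to keep a small keyed lookup table from Index2 to row number
so that temp itself never has to be re-keyed or reordered.

```r
library(data.table)

temp <- data.table(Index1 = 1:4, Index2 = c(4, 2, 2, 1),
                   Values = c(10, 10, 10, 30))

# Hypothetical manual secondary key: Index2 -> row number in temp,
# keyed by Index2 so it can be joined to.
idx <- data.table(Index2 = temp$Index2, rownum = seq_len(nrow(temp)))
setkey(idx, Index2)

# Join to the lookup table to get the matching row numbers, then pull
# those rows from temp, which stays in its original order.
rows <- idx[J(2), rownum]
res <- temp[rows]
```

The cost is that idx has to be rebuilt by hand whenever temp changes,
which is the bookkeeping a built-in secondary key would take over.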

> 
> Thanks,
> Alex
> 
> 
> 
> -----Original Message-----
> From: datatable-help-bounces at r-forge.wu-wien.ac.at [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Matthew Dowle
> Sent: Saturday, July 09, 2011 3:54 AM
> To: Steve Lianoglou
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Feature Idea
> 
> 
> (I think) it already does that. It's just that it sets a key on the result by default (which does the re-ordering of the grouped results at the end). If that's true, then could provide a way to not call setkey at the end. There is also the 'bysameorder' argument which might already be doing something similar.
> 
> Matthew 
> 
> On Fri, 2011-07-08 at 14:29 -0400, Steve Lianoglou wrote:
> > Hi,
> > 
> > I find myself often wanting to use a data.table for its quick 
> > aggregate&summary mojo, but I want to keep the ordering of my data as 
> > I have it, and not as it would be if I set the appropriate keys for my 
> > aggregation/summary.
> > 
> > How would you folks feel if I add a `by` (or dt.by) method for a data.table, eg:
> > 
> > result <- by(some.data.table, would.be.keys, {  ## stuff }, ...)
> > 
> > Which does the aggregate/summary encoded within { ... }, but the 
> > result is returned in the same order as `some.data.table` was in when 
> > it was passed into the function -- if { ... } returned as many rows as 
> > were in the original data.table, then it's 1-for-1, but you are 
> > summarizing groups of rows, the summary would be in the same
> > (appearance) order as it is in `some.data.table`.
> > 
> > The { ... } block would essentially be anything you can put in the `j` 
> > part of a data.table[i, j, ...].
> > 
> > The `...` dots after { ... } maybe extra params that can get passed 
> > into a "normal" data.table[i,j,...] call (haven't thought about that 
> > yet, tho).
> > 
> > If I can get some consensus on whether or not it's worthwhile to put 
> > such a function into the data.table package, I'll go ahead and add an 
> > initial implementation -- otherwise I can just keep it in my personal 
> > utility belt whenever I need to use it.
> > 
> > Thanks,
> > -steve
> > 
> 
> 
