[datatable-help] Idea/feature request

Mon Jun 27 19:54:27 CEST 2011

There was a format error in one of the vignettes, introduced last week
and fixed a few days ago, but the R-Forge daily build hasn't caught up
yet. That is the most likely cause, anyway. If the build on R-Forge
still isn't ok tomorrow, then it's something else.
Matthew

On Mon, 2011-06-27 at 14:03 +0200, Andreas Borg wrote:
> For some reason I am not able to install the latest version, so I cannot 
> test it right now. Anyway, it looks great. Thanks!
> 
> Andreas
> 
> Matthew Dowle schrieb:
> > Andreas, Steve,
> >
> > Committed. Please test and confirm if it satisfies all needs ok?
> >
> > o    A new symbol .BY is available to j. It is a list holding
> >      1 row of the current 'by' variables, i.e. one length-1 item
> >      per 'by' variable. 'by' variables may also be used by name
> >      in j, and they are now length 1 there, too.
> >      This implements FR#1313.
> >      For example :
> >           DT[,sum(x)*.BY[[1]],by=y]
> >           DT[,sum(x)*.BY[[1]],by=eval(byexp)]
> >           DT[,sapply(.SD,sum)*y,by=y]
> >           DT[,sapply(.SD,sum)*.BY[[2]],by=list(y,z)]
> >
> > Matthew
> >
> >
> >
> > On Wed, 2011-05-11 at 10:24 +0200, Andreas Borg wrote:
> >   
> >> Hi Steve,
> >>
> >>     
> >>> Now that you've brought this back up, what do you think you would
> >>> prefer? For example, using my (admittedly contrived) original example:
> >>>
> >>> result <- some.big.data.table[, by=list(colA, colB), {
> >>>  ## Sometimes I want to know what the current values of
> > >>>  ## colA and colB are in here to get some more info. Maybe
> >>>  ## we can have .BY:
> >>>
> >>>  xref <- more.data[J(.BY[1], .BY[2]), mult='all'] ## or something
> >>>  ## ...
> >>> }]
> >>>
> >>> Should it be `J(.BY[1], .BY[2])` or is something like `J(colA, colB)`
> >>> more natural, you think?
> >>>
> >>>   
> >>>       
> >> 'J(colA, colB)' is perfect if you know the column names in advance. This 
> >> is not true in my case. I created a minimal example for a possible 
> >> application for a '.BY' construct:
> >>
> >>  > dt <- data.table(x=c(0,1,0,1), y=c(1,0,1,0))
> >>  > dt
> >>      x y
> >> [1,] 0 1
> >> [2,] 1 0
> >> [3,] 0 1
> >> [4,] 1 0
> >>
> >>  From this table, I want the row sum for each group, i.e. "select x + y 
> >> from dt group by x, y" in SQL. This would be:
> >>
> >>  > setkey(dt, x, y)
> >>  > dt[,sum(x[1], y[1]), by=list(x,y)]
> >>      x y V1
> >> [1,] 0 1  1
> >> [2,] 1 0  1
> >>
> >> But what if dt can have an arbitrary number of (grouping) columns with 
> >> arbitrary names? If the grouping columns are given as
> >>
> >> groupCols <- c("x", "y")
> >>
> > >> then the following is possible:
> >>
> >>  > expr <- parse(text = sprintf("sum(%s)", paste(groupCols, "[1]", 
> >> sep="", collapse=", ")))
> >>  > dt[,eval(expr), by=groupCols]
> >>      x y V1
> >> [1,] 0 1  1
> >> [2,] 1 0  1
> >>
> >> Now, this is certainly uglier than
> >>
> >>  > dt[, sum(.BY), by = groupCols]
> >>
> >> My actual application is that I apply decision tree models (rpart) to a 
> >> large number of binary patterns. In order to save computation time, I 
> >> classify each distinct pattern only once. So what I basically do is to 
> >> group by all attributes and apply the model once to each group.
> >>
> >> Andreas
> >>
> >>     
> >
> >
> >
> >   
> 
>
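
For reference, a minimal sketch of how .BY might cover Andreas's case of
grouping columns that are only known at run time. It assumes a current
data.table release in which 'by' accepts a character vector of column
names; it has not been checked against the 2011 R-Forge build. Because
.BY is a list, sum(.BY) as written in the thread would fail, so the
sketch flattens it with unlist() first. The result column name 'total'
is made up for illustration.

```r
library(data.table)

# Same toy table as in Andreas's message
dt <- data.table(x = c(0, 1, 0, 1), y = c(1, 0, 1, 0))

# Grouping columns supplied at run time, not hard-coded in j
groupCols <- c("x", "y")

# .BY is a list with one length-1 element per 'by' column,
# so unlist() turns it into a plain vector that sum() accepts.
res <- dt[, list(total = sum(unlist(.BY))), by = groupCols]
print(res)
```

do.call(sum, .BY) would be an alternative to sum(unlist(.BY)) if the
grouping columns are all numeric; either way no column name appears in j.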