[datatable-help] Idea/feature request
Matthew Dowle
mdowle at mdowle.plus.com
Tue Jun 21 21:40:02 CEST 2011
Andreas, Steve,
Committed. Please test and confirm whether it satisfies all needs.
o A new symbol .BY is available to j, containing 1 row
of the current 'by' variables, type list. 'by' variables
may also be used by name and they are now length 1, too.
This implements FR#1313.
For example :
DT[,sum(x)*.BY[[1]],by=y]
DT[,sum(x)*.BY[[1]],by=eval(byexp)]
DT[,sapply(.SD,sum)*y,by=y]
DT[,sapply(.SD,sum)*.BY[[2]],by=list(y,z)]
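
A minimal, self-contained sketch of the same idea, in case it helps
with testing. The toy table DT and the result names (by_pos, by_name,
two_keys) are made up for illustration and are not part of the commit;
it assumes a data.table version that includes .BY.

```r
library(data.table)

# Hypothetical toy data; column names follow the examples above.
DT <- data.table(x = 1:6,
                 y = c(1, 1, 2, 2, 3, 3),
                 z = c(10, 10, 10, 20, 20, 20))

# .BY[[1]] is the current group's value of the first 'by' variable.
by_pos <- DT[, sum(x) * .BY[[1]], by = y]

# A 'by' variable used by name is length 1 inside j, so this
# should give the same result as the positional form.
by_name <- DT[, sum(x) * y, by = y]

# With two grouping variables, .BY[[2]] refers to z.
two_keys <- DT[, sum(x) * .BY[[2]], by = list(y, z)]

print(by_pos)
print(two_keys)
```

Using the name (y) reads better when the grouping column is known in
advance; .BY[[i]] is the form to reach for when 'by' is built
programmatically, as in the eval(byexp) example.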
Matthew
On Wed, 2011-05-11 at 10:24 +0200, Andreas Borg wrote:
> Hi Steve,
>
> > Now that you've brought this back up, what do you think you would
> > prefer? For example, using my (admittedly contrived) original example:
> >
> > result <- some.big.data.table[, by=list(colA, colB), {
> > ## Sometimes I want to know what the current values of
> > ## colA and colB are in here to get some more info. Maybe
> > ## we can have .BY:
> >
> > xref <- more.data[J(.BY[1], .BY[2]), mult='all'] ## or something
> > ## ...
> > }]
> >
> > Should it be `J(.BY[1], .BY[2])` or is something like `J(colA, colB)`
> > more natural, you think?
> >
> >
> 'J(colA, colB)' is perfect if you know the column names in advance. This
> is not true in my case. I created a minimal example for a possible
> application for a '.BY' construct:
>
> > dt <- data.table(x=c(0,1,0,1), y=c(1,0,1,0))
> > dt
> x y
> [1,] 0 1
> [2,] 1 0
> [3,] 0 1
> [4,] 1 0
>
> From this table, I want the row sum for each group, i.e. "select x + y
> from dt group by x, y" in SQL. This would be:
>
> > setkey(dt, x, y)
> > dt[,sum(x[1], y[1]), by=list(x,y)]
> x y V1
> [1,] 0 1 1
> [2,] 1 0 1
>
> But what if dt can have an arbitrary number of (grouping) columns with
> arbitrary names? If the grouping columns are given as
>
> groupCols <- c("x", "y")
>
> , the following is possible:
>
> > expr <- parse(text = sprintf("sum(%s)", paste(groupCols, "[1]", sep="", collapse=", ")))
> > dt[,eval(expr), by=groupCols]
> x y V1
> [1,] 0 1 1
> [2,] 1 0 1
>
> Now, this is certainly uglier than
>
> > dt[, sum(.BY), by = groupCols]
>
> My actual application is that I apply decision tree models (rpart) to a
> large number of binary patterns. In order to save computation time, I
> classify each distinct pattern only once. So what I basically do is to
> group by all attributes and apply the model once to each group.
>
> Andreas
>
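
For the arbitrary-grouping-columns case described in the quoted
message, a rough sketch of how .BY might be used is given here. Since
.BY is a list (one length-1 element per 'by' variable), calling sum()
on it directly would not work; the sketch assumes flattening it with
unlist() first is acceptable, which holds when all grouping columns are
numeric. The names dt and groupCols come from the quoted example; res
is made up for illustration.

```r
library(data.table)

# Table and grouping columns from the quoted example.
dt <- data.table(x = c(0, 1, 0, 1), y = c(1, 0, 1, 0))
groupCols <- c("x", "y")

# .BY holds the current group's values as a list, so flatten it
# with unlist() before summing (assumes numeric grouping columns).
res <- dt[, sum(unlist(.BY)), by = groupCols]

print(res)
```

This avoids building the expression with parse()/sprintf(), and the
column names never have to appear in j.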