[datatable-help] columns in .SD with grouping ad-hoc using "by"

Sun May 12 10:12:17 CEST 2013

Hi,  

Suppose you have a data.table, say:

require(data.table)
DT <- data.table(x = 1:5, y = 6:10)

Suppose you want to group by "x %/% 2" (= 0, 1, 1, 2, 2) and then calculate the sum of each column within each group. One would do:

DT[, grp := x %/% 2]
DT[, list(x.sum=sum(x), y.sum=sum(y)), by = grp] # avoid .SD in case of few columns

Now assume that you have many columns, which makes the use of `.SD` sensible:

DT[, lapply(.SD, sum), by = grp]
   grp x  y
1:   0 1  6
2:   1 5 15
3:   2 9 19

The issue is that if you create the grouping column ad-hoc in "by", the column from which it is derived is no longer available to `.SD`. Let me illustrate this:

DT <- data.table(x = 1:5, y = 6:10)

DT[, lapply(.SD, sum), by = (grp=x %/% 2)] # ad-hoc creation of grouping column
   grp  y
1:   0  6
2:   1 15
3:   2 19

I think it would be nice to have that column available to `.SD`, so that we can avoid creating a temporary column, grouping by it and then deleting it. "Technically", the ad-hoc grouping column *is* a new column, so "x" should still be available. Any take on this?
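
For reference, here is a minimal sketch of the temporary-column workaround I mean, using the same DT as above. The name `res` is only illustrative:

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10)

DT[, grp := x %/% 2]                     # 1. add the temporary grouping column
res <- DT[, lapply(.SD, sum), by = grp]  # 2. group; .SD still contains x and y
DT[, grp := NULL]                        # 3. remove the temporary column again

res        # has columns grp, x and y
names(DT)  # back to "x" "y"
```

That is three steps, plus a modification of DT by reference, for what is conceptually a single grouped aggregation.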

Arun
