[datatable-help] columns in .SD with grouping ad-hoc using "by"

Sun May 12 14:45:03 CEST 2013

On 12.05.2013 12:54, Arunkumar Srinivasan wrote: 

> I just realised
that I sent it only to MatthewDowle. So, sending it again. Sorry
@Matthew for the double email. 
> 
> Matthew, 
>>> .BY is available to j
already for that reason, does that work? .BY isn't a column of .SD
because i) it's the same value for every row of .SD i.e. .BY[[1]] is
length 1 and contains this particular group (replicating the same value
would be wasteful) 
> DT[, print(.BY), by = list(grp = x %/% 2)] 
> 
>
$grp 
> [1] 0 
> $grp 
> [1] 1 
> $grp 
> [1] 2 
> 
> DT[, print(.SD),
by = list(grp = x %/% 2)] # no column "x" 
> 
> y 
> 1: 6 
> y 
> 1: 7

> 2: 8 
> y 
> 1: 9 
> 2: 10 
> My question is not as to why the BY
column is not available in .SD. Rather, since .BY does not have column
"x" in it (rather the result of x%/% 2), why does .SD not have "x"? It's
as if grp = x%/%2 is a "new column". So, "x" should be available to .SD
is my point.

Oh I see now. Yes data.table inspects the expressions used
in 'by' and considers any columns used there as grouping columns and
excludes those from .SD. An example is a date column containing daily
observations. DT[, lapply(.SD,sum), by=month(date)] would not wish to
sum() the "date" column. 
In ?data.table I've just changed : 
.SD is a
data.table containing the Subset of x's Data for each group, excluding
the group column(s). 
to 
.SD is a data.table containing the Subset of
x's Data for each group, excluding any columns used in 'by' (or
'keyby'). 
Further answer below ... 

>>> but more significantly ii) it
is often a character group name where running an aggregation function
like sum() would trip up on it. 
> Again, I don't think so because, I am
not asking for .BY columns to be in .SD. 
> DT[, grp := x%/% = 2] 
>
DT[, lapply(.SD, sum), by=grp] 
> must be equal to: 
> DT[, lapply(.SD,
sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as
it's not the grouping column

This makes sense in this case because x
can be sum()-ed, but isn't true in general like the month(date) case
above. 
In these cases you can use .SDcols to include all columns, even
the ones used by by : 

> DT[, lapply(.SD, sum), by=list(grp=x%/%2)]

grp y
1: 0 6
2: 1 15
3: 2 19
> DT[, lapply(.SD, sum),
by=list(grp=x%/%2), .SDcols=names(DT)]
 grp x y
1: 0 1 6
2: 1 5 15
3: 2
9 19
> DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)]

 x
y
1: 1 6
 x y
1: 2 7
2: 3 8
 x y
1: 4 9
2: 5 10

Arun 

> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130512/8a731ca8/attachment.html>