[datatable-help] columns in .SD with grouping ad-hoc using "by"
Matthew Dowle
mdowle at mdowle.plus.com
Sun May 12 14:45:03 CEST 2013
On 12.05.2013 12:54, Arunkumar Srinivasan wrote:
> I just realised
that I sent it only to MatthewDowle. So, sending it again. Sorry
@Matthew for the double email.
>
> Matthew,
>>> .BY is available to j
already for that reason, does that work? .BY isn't a column of .SD
because i) it's the same value for every row of .SD i.e. .BY[[1]] is
length 1 and contains this particular group (replicating the same value
would be wasteful)
> DT[, print(.BY), by = list(grp = x %/% 2)]
>
>
$grp
> [1] 0
> $grp
> [1] 1
> $grp
> [1] 2
>
> DT[, print(.SD),
by = list(grp = x %/% 2)] # no column "x"
>
> y
> 1: 6
> y
> 1: 7
> 2: 8
> y
> 1: 9
> 2: 10
> My question is not as to why the BY
column is not available in .SD. Rather, since .BY does not have column
"x" in it (rather the result of x%/% 2), why does .SD not have "x"? It's
as if grp = x%/%2 is a "new column". So, "x" should be available to .SD
is my point.
Oh I see now. Yes data.table inspects the expressions used
in 'by' and considers any columns used there as grouping columns and
excludes those from .SD. An example is a date column containing daily
observations. DT[, lapply(.SD,sum), by=month(date)] would not wish to
sum() the "date" column.
In ?data.table I've just changed :
.SD is a
data.table containing the Subset of x's Data for each group, excluding
the group column(s).
to
.SD is a data.table containing the Subset of
x's Data for each group, excluding any columns used in 'by' (or
'keyby').
Further answer below ...
>>> but more significantly ii) it
is often a character group name where running an aggregation function
like sum() would trip up on it.
> Again, I don't think so because, I am
not asking for .BY columns to be in .SD.
> DT[, grp := x%/% = 2]
>
DT[, lapply(.SD, sum), by=grp]
> must be equal to:
> DT[, lapply(.SD,
sum), by = list(grp = x%/%2)] # here, "x" should be available to .SD as
it's not the grouping column
This makes sense in this case because x
can be sum()-ed, but isn't true in general like the month(date) case
above.
In these cases you can use .SDcols to include all columns, even
the ones used by by :
> DT[, lapply(.SD, sum), by=list(grp=x%/%2)]
grp y
1: 0 6
2: 1 15
3: 2 19
> DT[, lapply(.SD, sum),
by=list(grp=x%/%2), .SDcols=names(DT)]
grp x y
1: 0 1 6
2: 1 5 15
3: 2
9 19
> DT[, print(.SD), by = list(grp = x %/% 2), .SDcols=names(DT)]
x
y
1: 1 6
x y
1: 2 7
2: 3 8
x y
1: 4 9
2: 5 10
Arun
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130512/8a731ca8/attachment.html>
More information about the datatable-help
mailing list