[datatable-help] Programmatic by clauses
Short, Tom
TShort at epri.com
Mon Aug 30 15:36:58 CEST 2010
Johann, how about the following:
> data <- data.table(fico = rep((1:4)*25L, 5), imonth = rep(1:5,
each=4),
+ balance = runif(20), count = runif(20))
> aggregation.spec <- list(
+ iquarter=function (d) d$imonth %/% 3L,
+ fico.bucket=function (d)
as.integer(25*round(d$fico/25)))
>
> by.factors <- lapply(aggregation.spec, function (f) f(data))
>
> cols.to.sum <- c("balance", "count")
> data.to.sum <- data[,cols.to.sum, with=F]
> aggregate(data.to.sum, by.factors, sum)
iquarter fico.bucket balance count
1 0 25 0.5506797 1.133675
2 1 25 1.7311503 1.210178
3 0 50 1.5175908 0.854553
4 1 50 2.2930775 1.974759
5 0 75 0.4627294 1.171430
6 1 75 1.0477066 1.973119
7 0 100 0.8354870 1.083211
8 1 100 1.4351321 1.501291
>
Here's a data.table version:
> data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
+ by = lapply(aggregation.spec, function (f) f(data))]
iquarter fico.bucket balance count
[1,] 0 25 0.5506797 1.133675
[2,] 0 50 1.5175908 0.854553
[3,] 0 75 0.4627294 1.171430
[4,] 0 100 0.8354870 1.083211
[5,] 1 25 1.7311503 1.210178
[6,] 1 50 2.2930775 1.974759
[7,] 1 75 1.0477066 1.973119
[8,] 1 100 1.4351321 1.501291
>
I think the following should also work, but it doesn't. Note that I
didn't update to the very latest version of data.table, and I know
Matthew has changed some things that might already fix this.
> data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
+ by = by.factors]
Error in `[.data.table`(data, , lapply(.SD[, cols.to.sum, with = FALSE],
:
column or expression 1 of 'by' list is not internally type integer. Do
not quote column names. Example of correct use:
by=list(colA,month(colB),...).
- Tom
> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> On Behalf Of Johann Hibschman
> Sent: Monday, August 30, 2010 09:03
> To: datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] Programmatic by clauses
>
> I know this was discussed in early July, but I found much of
> that to be impenetrable, unfortunately. I'm still very bad
> at understanding R's lazy-evaluation mechanism and quoting.
> (The sad thing is that I have no problem with lisp quotes and
> quasiquotes in macros, scheme hygenic macros, and so on, but
> it just seems harder in R.)
>
> I'm trying to convert an existing function to R.
>
> That function takes a data set and a named list of
> factor-generating functions, then aggregates the data based
> on those results using the 'aggregate' function.
>
> e.g.:
>
> data <- some.big.data.frame
>
> aggregation.spec <- list(
> iquarter=function (d) d$imonth %/% 3,
> fico.bucket=function (d) as.integer(25*round(d$fico/25)))
>
> by.factors <- lapply(aggregation.spec, function (f) f(data))
>
> cols.to.sum <- c("balance", "count", other cols)
> data.to.sum <- data[,cols.to.sum]
> agg <- aggregate(data.to.sum, by.factors, sum)
>
> I'm not sure what the equivalent in data.table would be. I
> can get something that seems to work by something like:
>
> dt <- as.data.table(data[,cols.to.sum])
> for (n in names(aggregation.spec)) {
> dt[[n]] <- aggregation.spec[[n]](data)
> }
> agg <- dt[,
> list(balance=sum(balance), ...(manually construct big
> list of cols to sum)),
> by=paste(names(aggregation.spec), ",")]
>
> This just seems, well, hideous. Problems are:
>
> 1. Constructing the big list of columns to sum is a pain.
> Sure, I can
> do it, but it's a chunk of code that I don't want to maintain.
>
> 2. Passing in the aggregation column names as a
> comma-separated string
> feels like a hack.
>
> What's the "expert" way to do this in data.table?
>
> Thanks,
> Johann
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
atatable-help
>
More information about the datatable-help
mailing list