[datatable-help] Programmatic by clauses

Johann Hibschman jhibschman+r at gmail.com
Mon Aug 30 15:02:41 CEST 2010


I know this was discussed in early July, but unfortunately I found much
of that discussion impenetrable.  I'm still very bad at understanding R's
lazy-evaluation mechanism and quoting.  (The sad thing is that I have no
problem with Lisp quotes and quasiquotes in macros, Scheme hygienic
macros, and so on, but it just seems harder in R.)

I'm trying to convert an existing R function to data.table.

That function takes a data set and a named list of factor-generating
functions, then aggregates the data based on those results using the
'aggregate' function.

e.g.:

  data <- some.big.data.frame

  aggregation.spec <- list(
    iquarter=function (d) d$imonth %/% 3,
    fico.bucket=function (d) as.integer(25*round(d$fico/25)))

  by.factors <- lapply(aggregation.spec, function (f) f(data))

  cols.to.sum <- c("balance", "count")  # ...plus other cols
  data.to.sum <- data[,cols.to.sum]
  agg <- aggregate(data.to.sum, by.factors, sum)
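
For anyone who wants to run this, here is a minimal self-contained version
of the same approach. The toy data frame and its values are made up purely
for illustration (they stand in for some.big.data.frame), and only two
columns are summed:

```r
# Hypothetical toy data standing in for some.big.data.frame
data <- data.frame(
  imonth  = c(1, 2, 4, 5),
  fico    = c(700, 710, 650, 655),
  balance = c(100, 200, 300, 400),
  count   = c(1, 1, 1, 1))

# Named list of factor-generating functions, as in the original
aggregation.spec <- list(
  iquarter    = function(d) d$imonth %/% 3,
  fico.bucket = function(d) as.integer(25 * round(d$fico / 25)))

# Evaluate each function against the data to get the grouping vectors
by.factors <- lapply(aggregation.spec, function(f) f(data))

cols.to.sum <- c("balance", "count")
data.to.sum <- data[, cols.to.sum]

# One row per (iquarter, fico.bucket) combination, with summed columns
agg <- aggregate(data.to.sum, by.factors, sum)
```

The names of aggregation.spec become the grouping column names in agg,
which is the behaviour I'd like to keep.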

I'm not sure what the equivalent in data.table would be.  I can get
something that seems to work with code like:

  dt <- as.data.table(data[,cols.to.sum])
  for (n in names(aggregation.spec)) {
    dt[[n]] <- aggregation.spec[[n]](data)
  }
  agg <- dt[,
    # "..." stands for the manually constructed big list of cols to sum
    list(balance=sum(balance), ...),
    by=paste(names(aggregation.spec), collapse=",")]

This just seems, well, hideous.  Problems are:

  1. Constructing the big list of columns to sum is a pain.  Sure, I can
     do it, but it's a chunk of code that I don't want to maintain.

  2. Passing in the aggregation column names as a comma-separated string
     feels like a hack.
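
To make the two problems concrete, here is a sketch of the kind of call
that would avoid both. It assumes a data.table release that supports .SD,
.SDcols, set(), and a character vector for by; whether this is the
recommended idiom is exactly the open question. The toy data is made up
for illustration.

```r
library(data.table)

# Hypothetical toy data standing in for some.big.data.frame
data <- data.frame(
  imonth  = c(1, 2, 4, 5),
  fico    = c(700, 710, 650, 655),
  balance = c(100, 200, 300, 400),
  count   = c(1, 1, 1, 1))

aggregation.spec <- list(
  iquarter    = function(d) d$imonth %/% 3,
  fico.bucket = function(d) as.integer(25 * round(d$fico / 25)))

cols.to.sum <- c("balance", "count")
by.cols     <- names(aggregation.spec)

# Add one grouping column per factor-generating function
dt <- as.data.table(data[, cols.to.sum])
for (n in by.cols) {
  vals <- aggregation.spec[[n]](data)
  set(dt, j = n, value = vals)
}

# Problem 1: .SDcols picks the columns to sum, so no hand-written list
# Problem 2: by takes the character vector directly, so no pasted string
agg <- dt[, lapply(.SD, sum), by = by.cols, .SDcols = cols.to.sum]
```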

What's the "expert" way to do this in data.table?

Thanks,
Johann


