[datatable-help] Programmatic by clauses

Short, Tom TShort at epri.com
Mon Aug 30 15:36:58 CEST 2010


Johann, how about the following:

>     data <- data.table(fico = rep((1:4)*25L, 5), imonth = rep(1:5,
each=4),
+                        balance = runif(20), count = runif(20))
>     aggregation.spec <- list(
+                              iquarter=function (d) d$imonth %/% 3L,
+                              fico.bucket=function (d)
as.integer(25*round(d$fico/25)))
> 
>     by.factors <- lapply(aggregation.spec, function (f) f(data))
> 
>     cols.to.sum <- c("balance", "count")
>     data.to.sum <- data[,cols.to.sum, with=F]
>     aggregate(data.to.sum, by.factors, sum) 
  iquarter fico.bucket   balance    count
1        0          25 0.5506797 1.133675
2        1          25 1.7311503 1.210178
3        0          50 1.5175908 0.854553
4        1          50 2.2930775 1.974759
5        0          75 0.4627294 1.171430
6        1          75 1.0477066 1.973119
7        0         100 0.8354870 1.083211
8        1         100 1.4351321 1.501291
>

Here's a data.table version:
     
>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
+          by = lapply(aggregation.spec, function (f) f(data))]
     iquarter fico.bucket   balance    count
[1,]        0          25 0.5506797 1.133675
[2,]        0          50 1.5175908 0.854553
[3,]        0          75 0.4627294 1.171430
[4,]        0         100 0.8354870 1.083211
[5,]        1          25 1.7311503 1.210178
[6,]        1          50 2.2930775 1.974759
[7,]        1          75 1.0477066 1.973119
[8,]        1         100 1.4351321 1.501291
> 

I think the following should also work, but it doesn't. Note that I
didn't update to the very latest version of data.table, and I know
Matthew has changed some things that might already fix this.
     

>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
+          by = by.factors]
Error in `[.data.table`(data, , lapply(.SD[, cols.to.sum, with = FALSE],
: 
  column or expression 1 of 'by' list is not internally type integer. Do
not quote column names. Example of correct use:
by=list(colA,month(colB),...).



- Tom

 

> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Johann Hibschman
> Sent: Monday, August 30, 2010 09:03
> To: datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] Programmatic by clauses
> 
> I know this was discussed in early July, but I found much of 
> that to be impenetrable, unfortunately.  I'm still very bad 
> at understanding R's lazy-evaluation mechanism and quoting.  
> (The sad thing is that I have no problem with lisp quotes and 
> quasiquotes in macros, scheme hygenic macros, and so on, but 
> it just seems harder in R.)
> 
> I'm trying to convert an existing function to R.
> 
> That function takes a data set and a named list of 
> factor-generating functions, then aggregates the data based 
> on those results using the 'aggregate' function.
> 
> e.g.:
> 
>   data <- some.big.data.frame
> 
>   aggregation.spec <- list(
>     iquarter=function (d) d$imonth %/% 3,
>     fico.bucket=function (d) as.integer(25*round(d$fico/25)))
> 
>   by.factors <- lapply(aggregation.spec, function (f) f(data))
> 
>   cols.to.sum <- c("balance", "count", other cols)
>   data.to.sum <- data[,cols.to.sum]
>   agg <- aggregate(data.to.sum, by.factors, sum)
> 
> I'm not sure what the equivalent in data.table would be.  I 
> can get something that seems to work by something like:
> 
>   dt <- as.data.table(data[,cols.to.sum])
>   for (n in names(aggregation.spec)) {
>     dt[[n]] <- aggregation.spec[[n]](data)
>   }
>   agg <- dt[,
>     list(balance=sum(balance), ...(manually construct big 
> list of cols to sum)),
>     by=paste(names(aggregation.spec), ",")]
> 
> This just seems, well, hideous.  Problems are:
> 
>   1. Constructing the big list of columns to sum is a pain.  
> Sure, I can
>      do it, but it's a chunk of code that I don't want to maintain.
> 
>   2. Passing in the aggregation column names as a 
> comma-separated string
>      feels like a hack.
> 
> What's the "expert" way to do this in data.table?
> 
> Thanks,
> Johann
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
atatable-help
> 


More information about the datatable-help mailing list