[datatable-help] data.table in functions

Tue Jul 13 01:22:05 CEST 2010

Thanks for the emails, Matthew. I think we should move this conversation to
the data.table discussion list.

So, of all the data manipulation functions I've used, the 'gen' function
(which is similar to Stata's egen function) has been the most useful. I
would be very happy if it were adapted to data.tables.

Here is the version that works with plyr.

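> gen <- function(data, ..., by = NULL) {
  # by = NULL rather than NA: is.na() on several column names returns a
  # vector, which is not a valid if() condition.
  if (!is.null(by)) {
    require(plyr)
    # Apply transform() within each group defined by the 'by' columns.
    out = ddply(data, by, transform, ...)
  } else {
    # No grouping: plain transform() on the whole data frame.
    out = transform(data, ...)
  }
  return(out)
}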

> y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3= rep(c(1,2),4))

> gen(y, xSum= sum(x1), by=c('x2'))

  x1 x2 x3 xSum
1  1  1  1    4
2  1  1  2    4
3  1  1  1    4
4  1  1  2    4
5  1  2  1    4
6  1  2  2    4
7  1  2  1    4
8  1  2  2    4

> gen(y,xSum= sum(x1), by=c('x2','x3') )
  x1 x2 x3 xSum
1  1  1  1    2
2  1  1  1    2
3  1  1  2    2
4  1  1  2    2
5  1  2  1    2
6  1  2  1    2
7  1  2  2    2
8  1  2  2    2

This is the latest version that Matthew and friends came up with for
data.tables:

> require(data.table)
> x = data.table(x1 = as.integer(1), x2 = as.integer(rep(c(1,2), each=2)),
x3= as.integer(rep(c(1,2),4)))

> gen2 <- function(data,by,...) {
eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))
 }

> gen2(x, by='x2', xSum = sum(x1))
     x2 x1 x2.1 x3 xSum
[1,]  1  1    1  1    4
[2,]  1  1    1  2    4
[3,]  1  1    1  1    4
[4,]  1  1    1  2    4
[5,]  2  1    2  1    4
[6,]  2  1    2  2    4
[7,]  2  1    2  1    4
[8,]  2  1    2  2    4

> gen2(x, by=x2, xSum = sum(x1))     # ok

> gen2(x, by=list(x2,x3), xSum = sum(x1)) #fails
Error in `[.data.table`(data, , +gen(.SD, ...), by = list) :
  column 1 of 'by' list does not evaluate to integer e.g. the by should be a
list of expressions. Do not quote column names when using by=list(...).

> class(x$x2)
[1] "integer"

> gen2(x, by=c('x2','x3'), xSum = sum(x1)) #fails
> gen2(x, by=list('x2','x3'), xSum = sum(x1)) #fails

The first issue is that when there are multiple grouping factors, passing
them to the data.table through this function fails (see the last few
examples).
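
A possible explanation for the by=list(x2,x3) failure, sketched in base R
only: paste() on the result of substitute(by) splits a call into its pieces,
so the text handed to parse() becomes several strings, the first ending in
"by= list". That would be consistent with the `by = list` shown in the error
message. The helper names by_pasted and by_deparsed are hypothetical and only
mimic the string-building step of gen2; using deparse() is one assumption
about how the call could be kept in one piece, not a tested fix for gen2.

```r
# Hypothetical helpers that mimic only the string-building step of gen2().
# 'by' is never evaluated, so x2 and x3 do not need to exist.
by_pasted   <- function(by) paste("by=", substitute(by))
by_deparsed <- function(by) paste("by=", deparse(substitute(by)))

by_pasted(x2)               # one string:    "by= x2"
by_pasted(list(x2, x3))     # three strings: "by= list" "by= x2" "by= x3"
by_deparsed(list(x2, x3))   # one string:    "by= list(x2, x3)"
```

The same splitting would happen for by=c('x2','x3'), since that is also a
call when captured by substitute().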

The second issue is less critical, but important still. The data column x2.1
is automatically added and is redundant.  With multiple uses of this
function, these redundant columns add up quickly. I recall that there used
to be a way to suppress the repeating of columns in this type of operation,
but it was removed in later versions of data.table.
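
One way both issues might be avoided, assuming a data.table release that
supports `:=` together with `by` (the version used in this message may not),
is to add each new column by reference on a copy of the table. gen3 is a
hypothetical name, and this is only a sketch of the idea, not something
Matthew has proposed:

```r
library(data.table)

# Hypothetical gen3(): adds grouped columns with `:=` instead of
# transform(.SD, ...). Expressions in '...' must be named; 'by' is a
# character vector of column names, or NULL for no grouping.
gen3 <- function(data, ..., by = NULL) {
  exprs <- as.list(substitute(list(...)))[-1L]  # unevaluated expressions
  out <- copy(data)                             # leave the caller's table untouched
  for (nm in names(exprs)) {
    # Build the same call a user would type by hand, e.g.
    #   out[, "xSum" := sum(x1), by = c("x2", "x3")]
    cl <- if (is.null(by)) {
      substitute(out[, NM := EX], list(NM = nm, EX = exprs[[nm]]))
    } else {
      by_call <- as.call(c(list(as.name("c")), as.list(by)))
      substitute(out[, NM := EX, by = BY],
                 list(NM = nm, EX = exprs[[nm]], BY = by_call))
    }
    eval(cl)
  }
  out[]
}

# Same values as x above, written out at full length.
x <- data.table(x1 = 1L, x2 = rep(1:2, each = 2, times = 2), x3 = rep(1:2, 4))
res <- gen3(x, xSum = sum(x1), by = c("x2", "x3"))
```

Because the column is assigned in place, the grouping columns are not
repeated in the result and the row order of x is kept.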

~~~~~~~~~~~~

There were a few other things Matthew and I were discussing, including
alternative 'SQL inspired' functions such as 'groupby', 'orderby', 'where',
'select', etc. I find that functions are a lot easier to remember than
syntax, for the same reason that the syntax of a natural language is harder
to learn than its words. Another, selfish, reason is that SQL is what I
and many others originally learned with, so the SQL commands would be
easier to translate.
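
To make the idea concrete, here is a rough sketch of what such verbs could
look like for plain data frames, using base R only. The function names come
from the list above, but the signatures and behaviour are my assumptions, not
an agreed design and not part of data.table:

```r
# Hypothetical SQL-style verbs for plain data frames (base R only).
where <- function(data, cond) {
  # Evaluate the condition using the data frame's columns.
  keep <- eval(substitute(cond), data, parent.frame())
  data[!is.na(keep) & keep, , drop = FALSE]
}

select <- function(data, ...) {
  # Turn unquoted column names into strings.
  cols <- vapply(as.list(substitute(list(...)))[-1L], deparse, character(1))
  data[, cols, drop = FALSE]
}

orderby <- function(data, ...) {
  # Evaluate the sort keys using the data frame's columns.
  keys <- eval(substitute(list(...)), data, parent.frame())
  data[do.call(order, keys), , drop = FALSE]
}

y <- data.frame(x1 = 1, x2 = rep(c(1, 2), each = 2), x3 = rep(c(1, 2), 4))

# Roughly: SELECT x2, x3 FROM y WHERE x3 = 2 ORDER BY x2
res <- orderby(select(where(y, x3 == 2), x2, x3), x2)
```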

On Mon, Jul 12, 2010 at 2:28 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> Hi,
>
> There was a thread on r-help where Hadley and I discussed plyr using
> data.table but something technical prevented it at that time as I
> recall. Might be different now. The spelling is Wickham.
>
> There is a link on the homepage to the datatable-help subscription
> page :
>
> http://datatable.r-forge.r-project.org/
>
> Would be great to see you on that. You explained to me once about the
> functional style I think and I mean to come back to it.
>
> Best of luck with your diss,
>
> Matthew
>
>

-- 
Sasha Goodman
Doctoral Candidate, Organizational Behavior