[datatable-help] data.table in functions

Harish harishv_99 at yahoo.com
Wed Jul 14 07:49:40 CEST 2010


Sasha,

You should find code that works for you below along with a workaround to get rid of the ".1" columns.

You ran into trouble because the function treats the expression as text.  When you pass in a list, the pasted text doesn't come out as intended.  I didn't bother trying to figure out how it is actually being interpreted.
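
A rough illustration of the text-based approach, in base R only (no data.table needed).  The helper show_pasted is hypothetical and only mimics the paste() call from the original gen2; it is a sketch of what plausibly goes wrong, not a trace through data.table itself.  The point is that an unevaluated call such as list(x2,x3) is coerced to one string per element, so paste() produces several code strings instead of one.

```r
# Hypothetical helper: builds the code string(s) the same way the
# text-based gen2 does, but only returns them instead of evaluating.
show_pasted <- function(by) {
   paste("data[,transform(.SD,...),by=", substitute(by), "]")
}

one  <- show_pasted(x2)            # a bare symbol gives a single string
many <- show_pasted(list(x2, x3))  # a call gives one string per element

print(one)
print(many)

# parse() then sees several separate expressions rather than one
print(length(parse(text = many)))

# For contrast: deparse() keeps the whole call together as one string
whole <- deparse(quote(list(x2, x3)))
print(whole)
```

The first of the pasted strings ends in "by= list ]", which would be consistent with the "by = list" shown in the error message quoted below.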

I also put in a workaround for the duplicate columns you were getting.  You are right that there used to be a flag to avoid that (based on a discussion I read in the group).  Matthew is thinking about how to resolve this.

Most of the lines in the function are for the workaround of getting rid of duplicate columns.  Maybe there is an easier way.
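
One possibly simpler alternative, sketched here on a plain data.frame rather than a data.table: drop the redundant copies by name after the fact.  The helper drop_dup_cols is hypothetical, and it assumes the redundant copy of each grouping column is always named <name>.1, as in the x2.1 column of the output quoted below.  It has not been tried against data.table itself.

```r
# Hypothetical helper, shown on a plain data.frame: drop the ".1" copy of
# each grouping column. Assumes the redundant copies are named <name>.1.
drop_dup_cols <- function(df, by_names) {
   dup <- paste(by_names, ".1", sep="")
   df[ , !( names(df) %in% dup ), drop=FALSE ]
}

# Toy input with the same column names as the gen2 output in this thread
res <- data.frame(x2=c(1,2), x1=c(1,1), x2.1=c(1,2), x3=c(1,2), xSum=c(4,4))
out <- drop_dup_cols(res, "x2")
print(names(out))
```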

The line...
   DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
is the main line of the function.
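
To see what that line builds before it is evaluated, here is a small sketch in base R.  The helper show_call is hypothetical: it is the same bquote() expression without the eval(), so it runs without data.table and without the data existing.  The .( ) part is replaced by the unevaluated by argument, so the call keeps by as an expression instead of text.

```r
# Hypothetical helper: returns the call that gen2 would evaluate, without
# evaluating it, so neither data.table nor the data is needed here.
show_call <- function(data, by, ...) {
   bquote( data[,transform(.SD,...),by=.(substitute(by)) ] )
}

cl1 <- show_call(x, by=x2, xSum = sum(x1))
cl2 <- show_call(x, by=list(x2,x3), xSum = sum(x1))

print(cl1)
print(cl2)

# The 'by' argument inside each call is still an unevaluated expression
by1 <- as.list(cl1)[["by"]]
by2 <- as.list(cl2)[["by"]]
print(class(by1))   # a symbol ("name")
print(class(by2))   # a call
```

Because the whole list(x2,x3) call is spliced in as one piece, the multi-column case takes the same path as the single-column case.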


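gen2 <- function(data,by,...) {
   # Capture the unevaluated 'by' argument, e.g. x2 or list(x2,x3)
   sby <- substitute( by )

   # Collect the names of the grouping columns as strings.
   # Anything that is not a list(...) call is treated as a single name,
   # so 'lst' is always defined.
   if ( is.call( sby ) && identical( sby[[1]], as.name("list") ) )
      lst <- lapply( as.list( sby )[ -1 ], deparse )
   else
      lst <- list( deparse(sby) )

   # Main line: splice the unevaluated 'by' into the data.table call
   DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )

   # Workaround for the duplicate columns: remove one column per grouping name
   lapply( lst, function(.i) DT[[.i]] <<- NULL )

   return( DT )
}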
gen2(x, by=x2, xSum = sum(x1))
gen2(x, by=list(x2,x3), xSum = sum(x1))



Harish


--- On Mon, 7/12/10, Sasha Goodman <sashag at stanford.edu> wrote:

> From: Sasha Goodman <sashag at stanford.edu>
> Subject: [datatable-help] data.table in functions
> To: datatable-help at lists.r-forge.r-project.org
> Date: Monday, July 12, 2010, 4:22 PM
> Thanks for the emails, Matthew. I think our
> talk should be moved to the data.table discussion list. 
> 
> So, of all the data manipulation functions I've used,
> the 'gen' function (which is similar to Stata's
> egen function) has been the most useful. I would be very
> happy if it were adapted to data.tables. 
> 
> 
> 
> Here is the version that works with plyr.
> 
> > gen <- function(data, ..., by=NA)
> {
>   if(!is.na(by)) {
>     require(plyr)
>     out = ddply(data, by, transform, ...)
>   } else {
>     out = transform(data, ...)
>   }
>   return(out)
> }
> 
> > y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3 = rep(c(1,2),4))
> 
> 
> 
> > gen(y, xSum=sum(x1), by=c('x2'))
>   x1 x2 x3 xSum
> 1  1  1  1    4
> 2  1  1  2    4
> 3  1  1  1    4
> 4  1  1  2    4
> 5  1  2  1    4
> 6  1  2  2    4
> 7  1  2  1    4
> 8  1  2  2    4
> 
> 
> > gen(y, xSum=sum(x1), by=c('x2','x3'))
>   x1 x2 x3 xSum
> 1  1  1  1    2
> 2  1  1  1    2
> 3  1  1  2    2
> 4  1  1  2    2
> 5  1  2  1    2
> 6  1  2  1    2
> 7  1  2  2    2
> 8  1  2  2    2
> 
> This is the latest version that Matthew and friends came up with
> for data.tables:
> 
> 
> 
> > require(data.table)
> > x = data.table(x1 = as.integer(1),
>                  x2 = as.integer(rep(c(1,2), each=2)),
>                  x3 = as.integer(rep(c(1,2),4)))
> 
> 
> 
> > gen2 <- function(data,by,...) {
>   eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))
> }
> 
> 
> > gen2(x, by='x2', xSum = sum(x1))
>      x2 x1 x2.1 x3 xSum
> [1,]  1  1    1  1    4
> [2,]  1  1    1  2    4
> [3,]  1  1    1  1    4
> [4,]  1  1    1  2    4
> [5,]  2  1    2  1    4
> [6,]  2  1    2  2    4
> [7,]  2  1    2  1    4
> [8,]  2  1    2  2    4
> 
> > gen2(x, by=x2, xSum = sum(x1))     # ok
> 
> > gen2(x, by=list(x2,x3), xSum = sum(x1)) #fails
> Error in `[.data.table`(data, , +gen(.SD, ...), by = list) :
>   column 1 of 'by' list does not evaluate to integer e.g. the by should
>   be a list of expressions. Do not quote column names when using
>   by=list(...).
> 
> > class(x$x2)
> [1] "integer"
> 
> > gen2(x, by=c('x2','x3'), xSum = sum(x1)) #fails
> 
> > gen2(x, by=list('x2','x3'), xSum = sum(x1)) #fails
> 
> 
> The first issue is that when there are multiple grouping
> factors, passing those to the data.table through this
> function leads to hangups (see the last few examples).
> 
> 
> The second issue is less critical, but important still. The
> data column x2.1 is automatically added and is redundant. 
> With multiple uses of this function, these redundant columns
> add up quickly. I recall that there used to be a way to
> suppress the repeating of columns in this type of operation,
> but it was removed in later versions of data.table.
> 
> 
> 
> 
> ~~~~~~~~~~~~
> 
> There were a few other things Matthew and I were
> discussing, including alternative 'SQL inspired'
> functions such as 'groupby', 'orderby',
> 'where', 'select', etc. I find that
> functions are a lot easier to remember than syntax, for the
> same reasons natural language syntax is difficult to learn
> compared to learning words. Another selfish reason is that
> SQL was what I and many others learned with originally, and
> the SQL commands would be easier to translate.
> 
> 
> 
> On Mon, Jul 12, 2010 at 2:28 PM,
> Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
> 
> 
> Hi,
> 
> There was a thread on r-help where Hadley and I discussed plyr using
> data.table but something technical prevented it at that time as I
> recall. Might be different now. The spelling is Wickham.
> 
> There is a link on the homepage to the datatable-help subscription
> page :
> 
> http://datatable.r-forge.r-project.org/
> 
> Would be great to see you on that. You explained to me once about the
> functional style I think and I mean to come back to it.
> 
> Best of luck with your diss,
> 
> Matthew
> 
> -- 
> Sasha Goodman
> Doctoral Candidate, Organizational Behavior
> 
> 


      

