[datatable-help] data.table in functions

Wed Jul 14 08:14:21 CEST 2010

Small update.  The prior function did not handle the case where "by" is missing.  Also, I thought of a "better" way to remove duplicate columns...

=================

gen2 <- function(data,by,...) {
   if ( ! missing( by ) ) {
      DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )

      # Remove duplicate columns
      strCols <- names( DT )
      strDup <- strCols[ duplicated( strCols ) ]
      for ( str in strDup )
         DT[[ str ]] <- NULL
   }
   else {
      DT <- as.data.table( transform( data, ... ) )
   }

   return( DT )
}

gen2(x, by=x2, xSum = sum(x1))
gen2(x, by=list(x2,x3), xSum = sum(x1))
gen2(x, xSum=sum(x1))

=================
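
For anyone puzzling over the eval( bquote( ... ) ) line, here is a minimal sketch of that idiom on its own. It only builds the call and prints it, so it needs nothing beyond base R. The helper name show_call is made up for illustration and is not part of data.table.

```r
# show_call is a hypothetical helper (not in data.table or base R).
# It builds, but does not evaluate, a call shaped like the one gen2
# passes to eval().
show_call <- function(data, by) {
   # substitute(by) returns the expression the caller typed for `by`,
   # unevaluated; bquote() splices it in wherever .( ) appears.
   bquote( data[, transform(.SD), by = .(substitute(by)) ] )
}

# x, x2 and x3 do not need to exist here: nothing is evaluated.
cl1 <- show_call(x, by = x2)
cl2 <- show_call(x, by = list(x2, x3))

print(cl1)
print(cl2)
```

The printed calls should show the "by" expression exactly as it was typed, which is why both by=x2 and by=list(x2,x3) can go through the same function. Wrapping the bquote() result in eval() inside the function is what actually runs the query.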

Please let me know if you find bugs in it or find a better way to do this.

Regards,
Harish

--- On Tue, 7/13/10, Harish <harishv_99 at yahoo.com> wrote:

> From: Harish <harishv_99 at yahoo.com>
> Subject: Re: [datatable-help] data.table in functions
> To: datatable-help at lists.r-forge.r-project.org, "Sasha Goodman" <sashag at stanford.edu>
> Date: Tuesday, July 13, 2010, 10:49 PM
> Sasha,
> 
> You should find code that works for you below along with a
> workaround to get rid of the ".1" columns.
> 
> You were having an issue because you are treating the
> formula as text.  When you pass in a list, things
> aren't quite working out.  I didn't bother trying to
> figure out how it is actually interpreting it.
> 
> I also put in a workaround for the duplicate columns you
> were getting.  You are right in that there was a flag
> to avoid that before (based on a discussion I read in the
> group).  Matthew is thinking about how to resolve
> this.
> 
> Most of the lines in the function are for the workaround of
> getting rid of duplicate columns.  Maybe there is an
> easier way.
> 
> The line...
>    DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
> is the main line of the function.
> 
> 
> gen2 <- function(data,by,...) {
>    sby <- substitute( by )
>    if ( length( sby ) > 1 ) {
>       if ( sby[[1]] == "list" )
>          lst <- as.list( sby )[ -1 ]
>    }
>    else
>       lst <- list( deparse(sby) )
> 
>    DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
>    lapply( lst, function(.i) DT[[.i]] <<- NULL )
> 
>    return( DT )
> }
> gen2(x, by=x2, xSum = sum(x1))
> gen2(x, by=list(x2,x3), xSum = sum(x1))
> 
> 
> 
> Harish
> 
> 
> --- On Mon, 7/12/10, Sasha Goodman <sashag at stanford.edu> wrote:
> 
> > From: Sasha Goodman <sashag at stanford.edu>
> > Subject: [datatable-help] data.table in functions
> > To: datatable-help at lists.r-forge.r-project.org
> > Date: Monday, July 12, 2010, 4:22 PM
> > Thanks for the emails, Matthew. I think our talk should be moved to the
> > data.table discussion list.
> > 
> > So, of all the data manipulation functions I've used, the 'gen' function
> > (which is similar to Stata's egen function) has been the most useful. I
> > would be very happy if it were adapted to data.tables.
> > 
> > Here is the version that works with plyr.
> > 
> > > gen <- function(data,  ... ,by=NA) {
> >   if(!is.na(by)) {
> >     require(plyr)
> >     out = ddply(data,by, transform, ... )
> >   } else {
> >     out = transform(data,...)
> >   }
> >   return(out)
> > }
> > 
> > > y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3= rep(c(1,2),4))
> > 
> > > gen(y, xSum= sum(x1), by=c('x2'))
> >   x1 x2 x3 xSum
> > 1  1  1  1    4
> > 2  1  1  2    4
> > 3  1  1  1    4
> > 4  1  1  2    4
> > 5  1  2  1    4
> > 6  1  2  2    4
> > 7  1  2  1    4
> > 8  1  2  2    4
> > 
> > 
> > > gen(y,xSum= sum(x1), by=c('x2','x3') )
> >   x1 x2 x3 xSum
> > 1  1  1  1    2
> > 2  1  1  1    2
> > 3  1  1  2    2
> > 4  1  1  2    2
> > 5  1  2  1    2
> > 6  1  2  1    2
> > 7  1  2  2    2
> > 8  1  2  2    2
> > 
> > This is the latest version Matthew and friends came up with for
> > data.tables:
> > 
> > 
> > 
> > > require(data.table)
> > 
> > > x = data.table(x1 = as.integer(1), x2 = as.integer(rep(c(1,2), each=2)),
> >                  x3 = as.integer(rep(c(1,2),4)))
> > 
> > > gen2 <- function(data,by,...) {
> >     eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))
> >   }
> > 
> > 
> > > gen2(x, by='x2', xSum = sum(x1))
> >      x2 x1 x2.1 x3 xSum
> > [1,]  1  1    1  1    4
> > [2,]  1  1    1  2    4
> > [3,]  1  1    1  1    4
> > [4,]  1  1    1  2    4
> > [5,]  2  1    2  1    4
> > [6,]  2  1    2  2    4
> > [7,]  2  1    2  1    4
> > [8,]  2  1    2  2    4
> > 
> > > gen2(x, by=x2, xSum = sum(x1))     # ok
> > 
> > > gen2(x, by=list(x2,x3), xSum = sum(x1)) #fails
> > Error in `[.data.table`(data, , +gen(.SD, ...), by = list) : 
> >   column 1 of 'by' list does not evaluate to integer e.g. the by should be a list of expressions. Do not quote column names when using by=list(...).
> > 
> > > class(x$x2)
> > [1] "integer"
> > 
> > > gen2(x, by=c('x2','x3'), xSum = sum(x1)) #fails
> > 
> > > gen2(x, by=list('x2','x3'), xSum = sum(x1)) #fails
> > 
> > 
> > The first issue is that when there are multiple grouping factors, passing
> > those to the data.table through this function leads to hangups (see the
> > last few examples).
> > 
> > The second issue is less critical, but still important. The data column
> > x2.1 is automatically added and is redundant. With multiple uses of this
> > function, these redundant columns add up quickly. I recall that there used
> > to be a way to suppress the repeating of columns in this type of
> > operation, but it was removed in later versions of data.table.
> > 
> > 
> > 
> > 
> > ~~~~~~~~~~~~
> > 
> > There were a few other things Matthew and I were discussing, including
> > alternative 'SQL inspired' functions such as 'groupby', 'orderby',
> > 'where', 'select', etc. I find that functions are a lot easier to
> > remember than syntax, for the same reasons natural language syntax is
> > difficult to learn compared to learning words. Another selfish reason is
> > that SQL was what I and many others learned with originally, and the SQL
> > commands would be easier to translate.
> > 
> > 
> > 
> > On Mon, Jul 12, 2010 at 2:28 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > 
> > Hi,
> > 
> > There was a thread on r-help where Hadley and I discussed plyr using
> > data.table but something technical prevented it at that time as I
> > recall. Might be different now. The spelling is Wickham.
> > 
> > There is a link on the homepage to the datatable-help subscription
> > page:
> > 
> > http://datatable.r-forge.r-project.org/
> > 
> > Would be great to see you on that. You explained to me once about the
> > functional style I think and I mean to come back to it.
> > 
> > Best of luck with your diss,
> > 
> > Matthew
> > 
> > -- 
> > Sasha Goodman
> > Doctoral Candidate, Organizational Behavior
> > 
> > 
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > 
> 
> 