[datatable-help] data.table in functions

Sasha Goodman sashag at stanford.edu
Wed Jul 14 23:22:07 CEST 2010


Thanks Harish! This works great.
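
For anyone reading this in the archive, here is a minimal base-R sketch of why the bquote() line behaves differently from my paste()/parse() version. The helper names by_as_text and by_as_call are made up for illustration, and both only build the call without evaluating it, so data.table is not needed to run this:

```r
# Hypothetical helpers for illustration only; neither one evaluates the
# data.table call, they just show what gets built from the 'by' argument.

by_as_text <- function(by) {
  # paste() coerces the captured call with as.character(), which yields one
  # string per element of the call: "list", "x2", "x3"
  paste("data[, transform(.SD), by=", substitute(by), "]")
}

by_as_call <- function(by) {
  # bquote() splices the unevaluated 'by' expression into the call as a whole
  bquote(data[, transform(.SD), by = .(substitute(by))])
}

by_as_text(list(x2, x3))  # three strings; the first ends in "by= list ]"
by_as_call(list(x2, x3))  # one call with by = list(x2, x3) intact
```

As far as I can tell, this is why my original version stopped at `by = list`: the text version splits list(x2, x3) into separate pieces, while bquote() keeps it as one expression.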

On Tue, Jul 13, 2010 at 11:14 PM, Harish <harishv_99 at yahoo.com> wrote:

> Small update.  The prior function did not handle the case where "by" is
> missing.  Also, I thought of a "better" way to remove duplicate columns...
>
> =================
>
> gen2 <- function(data,by,...) {
>   if ( ! missing( by ) ) {
>      DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
>
>      # Remove duplicate columns
>      strCols <- names( DT )
>      strDup <- strCols[ duplicated( strCols ) ]
>      for ( str in strDup )
>         DT[[ str ]] <- NULL
>   }
>   else {
>      DT <- as.data.table( transform( data, ... ) )
>   }
>
>   return( DT )
> }
>
>
> gen2(x, by=x2, xSum = sum(x1))
> gen2(x, by=list(x2,x3), xSum = sum(x1))
> gen2(x, xSum=sum(x1))
>
> =================
>
> Please let me know if you find bugs in it or find a better way to do this.
>
>
> Regards,
> Harish
>
>
> --- On Tue, 7/13/10, Harish <harishv_99 at yahoo.com> wrote:
>
> > From: Harish <harishv_99 at yahoo.com>
> > Subject: Re: [datatable-help] data.table in functions
> > To: datatable-help at lists.r-forge.r-project.org, "Sasha Goodman" <sashag at stanford.edu>
> > Date: Tuesday, July 13, 2010, 10:49 PM
> > Sasha,
> >
> > You should find code that works for you below along with a
> > workaround to get rid of the ".1" columns.
> >
> > You were having an issue because you are treating the
> > formula as text.  When you pass in a list, things
> > aren't quite working out.  I didn't bother trying to
> > figure out how it is actually interpreting it.
> >
> > I also put in a workaround for the duplicate columns you
> > were getting.  You are right in that there was a flag
> > to avoid that before (based on a discussion I read in the
> > group).  Matthew is thinking about how to resolve
> > this.
> >
> > Most of the lines in the function are for the workaround of
> > getting rid of duplicate columns.  Maybe there is an
> > easier way.
> >
> > The line...
> >    DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
> > is the main line of the function.
> >
> >
> > gen2 <- function(data,by,...) {
> >    sby <- substitute( by )
> >    if ( length( sby ) > 1 ) {
> >       if ( sby[[1]] == "list" )
> >          lst <- as.list( sby )[ -1 ]
> >    }
> >    else
> >       lst <- list( deparse(sby) )
> >
> >    DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
> >    lapply( lst, function(.i) DT[[.i]] <<- NULL )
> >
> >    return( DT )
> > }
> > gen2(x, by=x2, xSum = sum(x1))
> > gen2(x, by=list(x2,x3), xSum = sum(x1))
> >
> >
> >
> > Harish
> >
> >
> > --- On Mon, 7/12/10, Sasha Goodman <sashag at stanford.edu> wrote:
> >
> > > From: Sasha Goodman <sashag at stanford.edu>
> > > Subject: [datatable-help] data.table in functions
> > > To: datatable-help at lists.r-forge.r-project.org
> > > Date: Monday, July 12, 2010, 4:22 PM
> > > Thanks for the emails, Matthew. I think our talk should be moved to the
> > > data.table discussion list.
> > >
> > > So, of all the data manipulation functions I've used, the 'gen' function
> > > (which is similar to Stata's egen function) has been the most useful. I
> > > would be very happy if it were adapted to data.tables.
> > >
> > > Here is the version that works with plyr.
> > >
> > > > gen <- function(data, ..., by=NA) {
> > >   if(!is.na(by)) {
> > >     require(plyr)
> > >     out = ddply(data, by, transform, ... )
> > >   } else {
> > >     out = transform(data,...)
> > >   }
> > >   return(out)
> > > }
> > >
> > > > y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3= rep(c(1,2),4))
> > >
> > > > gen(y, xSum= sum(x1), by=c('x2'))
> > >   x1 x2 x3 xSum
> > > 1  1  1  1    4
> > > 2  1  1  2    4
> > > 3  1  1  1    4
> > > 4  1  1  2    4
> > > 5  1  2  1    4
> > > 6  1  2  2    4
> > > 7  1  2  1    4
> > > 8  1  2  2    4
> > >
> > > > gen(y,xSum= sum(x1), by=c('x2','x3') )
> > >   x1 x2 x3 xSum
> > > 1  1  1  1    2
> > > 2  1  1  1    2
> > > 3  1  1  2    2
> > > 4  1  1  2    2
> > > 5  1  2  1    2
> > > 6  1  2  1    2
> > > 7  1  2  2    2
> > > 8  1  2  2    2
> > >
> > > This is the latest version Matthew and friends came up with for
> > > data.tables:
> > >
> > >
> > >
> > > > require(data.table)
> > >
> > > > x = data.table(x1 = as.integer(1),
> > >     x2 = as.integer(rep(c(1,2), each=2)), x3= as.integer(rep(c(1,2),4)))
> > >
> > > > gen2 <- function(data,by,...) {
> > >    eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))
> > >  }
> > >
> > >
> > > > gen2(x, by='x2', xSum = sum(x1))
> > >      x2 x1 x2.1 x3 xSum
> > > [1,]  1  1    1  1    4
> > > [2,]  1  1    1  2    4
> > > [3,]  1  1    1  1    4
> > > [4,]  1  1    1  2    4
> > > [5,]  2  1    2  1    4
> > > [6,]  2  1    2  2    4
> > > [7,]  2  1    2  1    4
> > > [8,]  2  1    2  2    4
> > >
> > > > gen2(x, by=x2, xSum = sum(x1))     # ok
> > >
> > > > gen2(x, by=list(x2,x3), xSum = sum(x1)) #fails
> > > Error in `[.data.table`(data, , +gen(.SD, ...), by = list) :
> > >   column 1 of 'by' list does not evaluate to integer e.g. the by should
> > >   be a list of expressions. Do not quote column names when using
> > >   by=list(...).
> > >
> > > > class(x$x2)
> > > [1] "integer"
> > >
> > > > gen2(x, by=c('x2','x3'), xSum = sum(x1))      #fails
> > > > gen2(x, by=list('x2','x3'), xSum = sum(x1))   #fails
> > >
> > >
> > > The first issue is that when there are multiple grouping factors,
> > > passing those to the data.table through this function leads to hangups
> > > (see the last few examples).
> > >
> > > The second issue is less critical, but still important. The data column
> > > x2.1 is automatically added and is redundant. With multiple uses of this
> > > function, these redundant columns add up quickly. I recall that there
> > > used to be a way to suppress the repeating of columns in this type of
> > > operation, but it was removed in later versions of data.table.
> > >
> > >
> > >
> > >
> > > ~~~~~~~~~~~~
> > >
> > > There were a few other things Matthew and I were discussing, including
> > > alternative 'SQL inspired' functions such as 'groupby', 'orderby',
> > > 'where', 'select', etc. I find that functions are a lot easier to
> > > remember than syntax, for the same reasons natural language syntax is
> > > difficult to learn compared to learning words. Another selfish reason is
> > > that SQL was what I and many others learned with originally, and the SQL
> > > commands would be easier to translate.
> > >
> > >
> > >
> > > On Mon, Jul 12, 2010 at 2:28 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > >
> > > Hi,
> > >
> > > There was a thread on r-help where Hadley and I discussed plyr using
> > > data.table but something technical prevented it at that time as I
> > > recall. Might be different now. The spelling is Wickham.
> > >
> > > There is a link on the homepage to the datatable-help subscription
> > > page :
> > >
> > > http://datatable.r-forge.r-project.org/
> > >
> > > Would be great to see you on that. You explained to me once about the
> > > functional style I think and I mean to come back to it.
> > >
> > > Best of luck with your diss,
> > >
> > > Matthew
> > >
> > > --
> > > Sasha Goodman
> > > Doctoral Candidate, Organizational Behavior
> > >
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
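
On the duplicate-column cleanup quoted above: a vectorized alternative to the for loop might look like the sketch below. It uses a plain data.frame as a stand-in (the toy data and the name dedup are assumptions for illustration), and it only applies when the repeated column keeps the same name rather than being renamed to something like x2.1:

```r
# Toy stand-in for a grouped result with a repeated "x2" column.
# check.names = FALSE keeps the duplicate name as-is instead of renaming it.
df <- data.frame(x2 = c(1L, 2L), x1 = c(1L, 1L), x2 = c(1L, 2L),
                 xSum = c(4L, 4L), check.names = FALSE)

# Keep the first occurrence of each column name and drop later repeats.
dedup <- df[, !duplicated(names(df)), drop = FALSE]
names(dedup)
```

I have not checked how this behaves on a data.table itself, so treat it as a starting point only.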


-- 
Sasha Goodman
Doctoral Candidate, Organizational Behavior