[datatable-help] data.table in functions
Harish
harishv_99 at yahoo.com
Wed Jul 14 08:14:21 CEST 2010
Small update. The prior function did not handle the case where "by" is missing. I also thought of a "better" way to remove duplicate columns...
=================
gen2 <- function(data,by,...) {
   if ( ! missing( by ) ) {
      DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
      # Remove duplicate columns
      strCols <- names( DT )
      strDup <- strCols[ duplicated( strCols ) ]
      for ( str in strDup )
         DT[[ str ]] <- NULL
   }
   else {
      DT <- as.data.table( transform( data, ... ) )
   }
   return( DT )
}
gen2(x, by=x2, xSum = sum(x1))
gen2(x, by=list(x2,x3), xSum = sum(x1))
gen2(x, xSum=sum(x1))
=================
Please let me know if you find bugs in it or find a better way to do this.
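
One possible alternative to the loop, sketched here on a plain data.frame rather than a data.table: index the columns with !duplicated(names(...)) so that only the first occurrence of each name is kept. The example data and the names df, keep and df_unique are made up for illustration, and whether the same indexing behaves identically on a data.table is an assumption that would need checking.

```r
# Hypothetical data.frame with a repeated column name (x2 appears twice).
# check.names = FALSE stops data.frame() from renaming the second x2.
df <- data.frame(x2 = 1:2, x1 = 3:4, x2 = 1:2, check.names = FALSE)

# duplicated() marks the second and later occurrences of each name
keep <- !duplicated(names(df))

# Keep only the first occurrence of each column name
df_unique <- df[, keep, drop = FALSE]
print(names(df_unique))
```

This only catches columns whose names are exactly repeated; a column that has already been renamed (for example with a ".1" suffix) would not be flagged.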
Regards,
Harish
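
As background on the main line of gen2, the difference between pasting the "by" argument into a string and splicing it into a call with bquote can be sketched in base R alone. The names by_expr, as_text and build_call are made up for this illustration, and the call is only constructed and printed, never evaluated, so data.table is not needed to run it.

```r
# Text approach: pasting a captured call coerces it to a character vector,
# one element per part of the call, so the list(...) is split up.
by_expr <- quote(list(x2, x3))
as_text <- paste("by=", by_expr)
print(as_text)   # three strings, the first of which is "by= list"

# Expression approach: .() inside bquote() splices the unevaluated
# "by" argument into the call as a single expression.
build_call <- function(data, by, ...) {
  bquote(data[, transform(.SD, ...), by = .(substitute(by))])
}

# x, x2, x3 and x1 are never evaluated here; only the call is built.
cl <- build_call(x, by = list(x2, x3), xSum = sum(x1))
print(cl)
```

The spliced call keeps the whole list(x2, x3) expression as its by argument, and that call is what eval() then runs inside gen2.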
--- On Tue, 7/13/10, Harish <harishv_99 at yahoo.com> wrote:
> From: Harish <harishv_99 at yahoo.com>
> Subject: Re: [datatable-help] data.table in functions
> To: datatable-help at lists.r-forge.r-project.org, "Sasha Goodman" <sashag at stanford.edu>
> Date: Tuesday, July 13, 2010, 10:49 PM
> Sasha,
>
> You should find code that works for you below along with a
> workaround to get rid of the ".1" columns.
>
> You were having an issue because you are treating the
> formula as text. When you pass in a list, things
> aren't quite working out. I didn't bother trying to
> figure out how it is actually interpreting it.
>
> I also put in a workaround for the duplicate columns you
> were getting. You are right in that there was a flag
> to avoid that before (based on a discussion I read in the
> group). Matthew is thinking about how to resolve
> this.
>
> Most of the lines in the function are for the workaround of
> getting rid of duplicate columns. Maybe there is an
> easier way.
>
> The line...
>    DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
> is the main line of the function.
>
>
> gen2 <- function(data,by,...) {
>    sby <- substitute( by )
>    if ( length( sby ) > 1 ) {
>       if ( sby[[1]] == "list" )
>          lst <- as.list( sby )[ -1 ]
>    }
>    else
>       lst <- list( deparse(sby) )
>
>    DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
>    lapply( lst, function(.i) DT[[.i]] <<- NULL )
>
>    return( DT )
> }
> gen2(x, by=x2, xSum = sum(x1))
> gen2(x, by=list(x2,x3), xSum = sum(x1))
>
>
>
> Harish
>
>
> --- On Mon, 7/12/10, Sasha Goodman <sashag at stanford.edu> wrote:
>
> > From: Sasha Goodman <sashag at stanford.edu>
> > Subject: [datatable-help] data.table in functions
> > To: datatable-help at lists.r-forge.r-project.org
> > Date: Monday, July 12, 2010, 4:22 PM
> > Thanks for the emails, Matthew. I think our
> > talk should be moved to the data.table discussion list.
> >
> > So, of all the data manipulation functions I've used,
> > the 'gen' function (which is similar to Stata's
> > egen function) has been the most useful. I would be very
> > happy if it were adapted to data.tables.
> >
> >
> >
> > Here is the version that works with plyr.
> >
> > > gen <- function(data, ... ,by=NA)
> > {
> >    if(!is.na(by)) {
> >       require(plyr)
> >       out = ddply(data,by, transform, ... )
> >    } else {
> >       out = transform(data,...)
> >    }
> >    return(out)
> > }
> >
> > > y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3= rep(c(1,2),4))
> >
> > > gen(y, xSum= sum(x1), by=c('x2'))
> >   x1 x2 x3 xSum
> > 1  1  1  1    4
> > 2  1  1  2    4
> > 3  1  1  1    4
> > 4  1  1  2    4
> > 5  1  2  1    4
> > 6  1  2  2    4
> > 7  1  2  1    4
> > 8  1  2  2    4
> >
> > > gen(y,xSum= sum(x1), by=c('x2','x3') )
> >   x1 x2 x3 xSum
> > 1  1  1  1    2
> > 2  1  1  1    2
> > 3  1  1  2    2
> > 4  1  1  2    2
> > 5  1  2  1    2
> > 6  1  2  1    2
> > 7  1  2  2    2
> > 8  1  2  2    2
> >
> > This is the last Matthew and friends came up with
> > for data.tables:
> >
> > > require(data.table)
> >
> > > x = data.table(x1 = as.integer(1), x2 = as.integer(rep(c(1,2), each=2)), x3= as.integer(rep(c(1,2),4)))
> >
> > > gen2 <- function(data,by,...) {
> >      eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))
> > }
> >
> >
> > > gen2(x, by='x2', xSum = sum(x1))
> >      x2 x1 x2.1 x3 xSum
> > [1,]  1  1    1  1    4
> > [2,]  1  1    1  2    4
> > [3,]  1  1    1  1    4
> > [4,]  1  1    1  2    4
> > [5,]  2  1    2  1    4
> > [6,]  2  1    2  2    4
> > [7,]  2  1    2  1    4
> > [8,]  2  1    2  2    4
> >
> > > gen2(x, by=x2, xSum = sum(x1)) # ok
> >
> > > gen2(x, by=list(x2,x3), xSum = sum(x1)) #fails
> > Error in `[.data.table`(data, , +gen(.SD, ...), by = list) :
> >   column 1 of 'by' list does not evaluate to integer e.g. the by should be a list of expressions. Do not quote column names when using by=list(...).
> >
> > > class(x$x2)
> > [1] "integer"
> >
> > > gen2(x, by=c('x2','x3'), xSum = sum(x1)) #fails
> >
> > > gen2(x, by=list('x2','x3'), xSum = sum(x1)) #fails
> >
> >
> > The first issue is that when there are multiple grouping
> > factors, passing those to the data.table through this
> > function leads to hangups (see the last few examples).
> >
> > The second issue is less critical, but still important. The
> > data column x2.1 is automatically added and is redundant.
> > With multiple uses of this function, these redundant columns
> > add up quickly. I recall that there used to be a way to
> > suppress the repeating of columns in this type of operation,
> > but it was removed in later versions of data.table.
> >
> >
> >
> >
> > ~~~~~~~~~~~~
> >
> > There were a few other things Matthew and I were
> > discussing, including alternative 'SQL inspired'
> > functions such as 'groupby', 'orderby',
> > 'where', 'select', etc. I find that
> > functions are a lot easier to remember than syntax, for the
> > same reasons natural language syntax is difficult to learn
> > compared to learning words. Another selfish reason is that
> > SQL was what I and many others learned with originally, and
> > the SQL commands would be easier to translate.
> >
> >
> >
> > On Mon, Jul 12, 2010 at 2:28 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> >
> > Hi,
> >
> > There was a thread on r-help where Hadley and I discussed plyr using
> > data.table but something technical prevented it at that time as I
> > recall. Might be different now. The spelling is Wickham.
> >
> > There is a link on the homepage to the datatable-help subscription
> > page :
> >
> > http://datatable.r-forge.r-project.org/
> >
> > Would be great to see you on that. You explained to me once about the
> > functional style I think and I mean to come back to it.
> >
> > Best of luck with your diss,
> >
> > Matthew
> >
> > --
> > Sasha Goodman
> > Doctoral Candidate, Organizational Behavior
> >
> >
> > -----Inline Attachment Follows-----
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
>
>
>
>