[datatable-help] data.table in functions
Matthew Dowle
mdowle at mdowle.plus.com
Sun Jul 18 12:37:28 CEST 2010
Thanks Harish too. Yes I'll come back to the .1 column names and a
neater/faster way to do this in general hopefully.
It seems related to feature request #978 and I just added a comment
there to reference this thread.
Thanks Sasha for raising it. This one likely benefited from the fixes
for the scoping bugs that Harish found recently, which came after the
original workaround proposed to Sasha using eval(parse(text=...)). I
didn't realise back then that there was a problem.
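
For anyone following along, here is a minimal base-R sketch of why the text-based workaround seems to break on by=list(x2,x3) while the bquote() version does not. The helper names by_as_text and by_as_call are hypothetical, made up for this illustration; no data.table is needed because the calls are only built, never evaluated.

```r
# Hypothetical helper: builds the query as text, as in the original workaround.
by_as_text <- function(by) {
  # For by=list(x2, x3), substitute(by) is a call. paste() coerces it to
  # character, which splits the call into its elements: "list", "x2", "x3".
  paste("data[, transform(.SD), by=", substitute(by), "]")
}

# Hypothetical helper: splices the unevaluated expression into the call.
by_as_call <- function(by) {
  bquote(data[, transform(.SD), by = .(substitute(by))])
}

text_single <- by_as_text(x2)            # one string, ending in by= x2 ]
text_list   <- by_as_text(list(x2, x3))  # three strings: by= list, by= x2, by= x3
call_list   <- by_as_call(list(x2, x3))  # one call with by = list(x2, x3) intact

print(text_list)
print(call_list)
```

If that reading is right, parse() turns the three strings into three expressions, and evaluating the first one (by= list) is what gives the "by = list" error quoted further down the thread.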
Matthew
On Wed, 2010-07-14 at 14:22 -0700, Sasha Goodman wrote:
> Thanks Harish! This works great.
>
> On Tue, Jul 13, 2010 at 11:14 PM, Harish <harishv_99 at yahoo.com> wrote:
> Small update. The prior function did not handle when "by" was
> missing. Also, I thought of a "better" way to remove
> duplicate columns...
>
> =================
>
> gen2 <- function(data,by,...) {
>   if ( ! missing( by ) ) {
>     DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
>
>     # Remove duplicate columns
>     strCols <- names( DT )
>     strDup <- strCols[ duplicated( strCols ) ]
>     for ( str in strDup )
>       DT[[ str ]] <- NULL
>   } else {
>     DT <- as.data.table( transform( data, ... ) )
>   }
>
>   return( DT )
> }
>
>
> gen2(x, by=x2, xSum = sum(x1))
> gen2(x, by=list(x2,x3), xSum = sum(x1))
>
> gen2(x, xSum=sum(x1))
>
> =================
>
> Please let me know if you find bugs in it or find a better way
> to do this.
>
>
> Regards,
> Harish
>
>
> --- On Tue, 7/13/10, Harish <harishv_99 at yahoo.com> wrote:
>
> > From: Harish <harishv_99 at yahoo.com>
> > Subject: Re: [datatable-help] data.table in functions
> > To: datatable-help at lists.r-forge.r-project.org, "Sasha Goodman" <sashag at stanford.edu>
> > Date: Tuesday, July 13, 2010, 10:49 PM
>
>
> > Sasha,
> >
> > You should find code that works for you below along with a
> > workaround to get rid of the ".1" columns.
> >
> > You were having an issue because you are treating the
> > formula as text. When you pass in a list, things
> > aren't quite working out. I didn't bother trying to
> > figure out how it is actually interpreting it.
> >
> > I also put in a workaround for the duplicate columns you
> > were getting. You are right in that there was a flag
> > to avoid that before (based on a discussion I read in the
> > group). Matthew is thinking about how to resolve
> > this.
> >
> > Most of the lines in the function are for the workaround of
> > getting rid of duplicate columns. Maybe there is an
> > easier way.
> >
> > The line...
> > DT <- eval( bquote(
> > data[,transform(.SD,...),by=.(substitute(by)) ] ) )
> > is the main line of the function.
> >
> >
> > gen2 <- function(data,by,...) {
> >   sby <- substitute( by )
> >   if ( length( sby ) > 1 ) {
> >     if ( sby[[1]] == "list" )
> >       lst <- as.list( sby )[ -1 ]
> >   }
> >   else
> >     lst <- list( deparse(sby) )
> >
> >   DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )
> >   lapply( lst, function(.i) DT[[.i]] <<- NULL )
> >
> >   return( DT )
> > }
> > gen2(x, by=x2, xSum = sum(x1))
> > gen2(x, by=list(x2,x3), xSum = sum(x1))
> >
> >
> >
> > Harish
> >
> >
> > --- On Mon, 7/12/10, Sasha Goodman <sashag at stanford.edu> wrote:
> >
> > > From: Sasha Goodman <sashag at stanford.edu>
> > > Subject: [datatable-help] data.table in functions
> > > To: datatable-help at lists.r-forge.r-project.org
> > > Date: Monday, July 12, 2010, 4:22 PM
> > > Thanks for the emails, Matthew. I think our talk should be moved
> > > to the data.table discussion list.
> > >
> > > So, of all the data manipulation functions I've used, the 'gen'
> > > function (which is similar to Stata's egen function) has been the
> > > most useful. I would be very happy if it were adapted to
> > > data.tables.
> > >
> > >
> > >
> > > Here is the version that works with plyr.
> > >
> > > > gen <- function(data, ..., by=NA) {
> > >     # all() keeps the condition length-one when 'by' names several columns
> > >     if (!all(is.na(by))) {
> > >         require(plyr)
> > >         out = ddply(data, by, transform, ...)
> > >     } else {
> > >         out = transform(data, ...)
> > >     }
> > >     return(out)
> > > }
> > >
> > > > y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3 = rep(c(1,2),4))
> > >
> > > > gen(y, xSum = sum(x1), by = c('x2'))
> > >   x1 x2 x3 xSum
> > > 1  1  1  1    4
> > > 2  1  1  2    4
> > > 3  1  1  1    4
> > > 4  1  1  2    4
> > > 5  1  2  1    4
> > > 6  1  2  2    4
> > > 7  1  2  1    4
> > > 8  1  2  2    4
> > >
> > > > gen(y, xSum = sum(x1), by = c('x2','x3'))
> > >   x1 x2 x3 xSum
> > > 1  1  1  1    2
> > > 2  1  1  1    2
> > > 3  1  1  2    2
> > > 4  1  1  2    2
> > > 5  1  2  1    2
> > > 6  1  2  1    2
> > > 7  1  2  2    2
> > > 8  1  2  2    2
> > >
> > > This is the latest version that Matthew and friends came up with
> > > for data.tables:
> > >
> > >
> > >
> > > > require(data.table)
> > >
> > > > x = data.table(x1 = as.integer(1),
> > >                  x2 = as.integer(rep(c(1,2), each=2)),
> > >                  x3 = as.integer(rep(c(1,2),4)))
> > >
> > > > gen2 <- function(data,by,...) {
> > >     eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))
> > > }
> > >
> > > > gen2(x, by='x2', xSum = sum(x1))
> > >      x2 x1 x2.1 x3 xSum
> > > [1,]  1  1    1  1    4
> > > [2,]  1  1    1  2    4
> > > [3,]  1  1    1  1    4
> > > [4,]  1  1    1  2    4
> > > [5,]  2  1    2  1    4
> > > [6,]  2  1    2  2    4
> > > [7,]  2  1    2  1    4
> > > [8,]  2  1    2  2    4
> > >
> > > > gen2(x, by=x2, xSum = sum(x1))  # ok
> > >
> > > > gen2(x, by=list(x2,x3), xSum = sum(x1))  # fails
> > > Error in `[.data.table`(data, , +gen(.SD, ...), by = list) :
> > >   column 1 of 'by' list does not evaluate to integer e.g. the by
> > >   should be a list of expressions. Do not quote column names when
> > >   using by=list(...).
> > >
> > > > class(x$x2)
> > > [1] "integer"
> > >
> > > > gen2(x, by=c('x2','x3'), xSum = sum(x1))  # fails
> > >
> > > > gen2(x, by=list('x2','x3'), xSum = sum(x1))  # fails
> > >
> > >
> > > The first issue is that when there are multiple grouping factors,
> > > passing them to the data.table through this function fails (see
> > > the last few examples).
> > >
> > > The second issue is less critical, but still important. The data
> > > column x2.1 is added automatically and is redundant. With repeated
> > > use of this function, these redundant columns add up quickly. I
> > > recall that there used to be a way to suppress the repeated columns
> > > in this type of operation, but it was removed in later versions of
> > > data.table.
> > >
> > >
> > >
> > >
> > > ~~~~~~~~~~~~
> > >
> > > There were a few other things Matthew and I were discussing,
> > > including alternative 'SQL inspired' functions such as 'groupby',
> > > 'orderby', 'where', 'select', etc. I find that functions are a lot
> > > easier to remember than syntax, for the same reasons natural
> > > language syntax is difficult to learn compared to learning words.
> > > Another selfish reason is that SQL was what I and many others
> > > learned with originally, and the SQL commands would be easier to
> > > translate.
> > >
> > >
> > >
> > > On Mon, Jul 12, 2010 at 2:28 PM, Matthew Dowle
> > > <mdowle at mdowle.plus.com> wrote:
> > >
> > > Hi,
> > >
> > > There was a thread on r-help where Hadley and I discussed plyr
> > > using data.table but something technical prevented it at that time
> > > as I recall. Might be different now. The spelling is Wickham.
> > >
> > > There is a link on the homepage to the datatable-help subscription
> > > page :
> > >
> > > http://datatable.r-forge.r-project.org/
> > >
> > > Would be great to see you on that. You explained to me once about
> > > the functional style I think and I mean to come back to it.
> > >
> > > Best of luck with your diss,
> > >
> > > Matthew
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Sasha Goodman
> > > Doctoral Candidate, Organizational Behavior
> > >
> > >
> > > -----Inline Attachment Follows-----
> > >
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> --
> Sasha Goodman
> Doctoral Candidate, Organizational Behavior