Thanks Harish! This works great.<br><br><div class="gmail_quote">On Tue, Jul 13, 2010 at 11:14 PM, Harish <span dir="ltr"><<a href="mailto:harishv_99@yahoo.com">harishv_99@yahoo.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">Small update. The prior function did not handle the case where "by" is missing. Also, I thought of a "better" way to remove duplicate columns...<br>
<br>
=================<br>
<br>
gen2 <- function(data,by,...) {<br>
if ( ! missing( by ) ) {<br>
</div><div class="im"> DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )<br>
<br>
</div><div class="im"> # Remove duplicate columns<br>
strCols <- names( DT )<br>
strDup <- strCols[ duplicated( strCols ) ]<br>
for ( str in strDup )<br>
DT[[ str ]] <- NULL<br>
}<br>
else {<br>
DT <- as.data.table( transform( data, ... ) )<br>
}<br>
<br>
</div><div class="im"> return( DT )<br>
}<br>
<br>
<br>
gen2(x, by=x2, xSum = sum(x1))<br>
gen2(x, by=list(x2,x3), xSum = sum(x1))<br>
</div><div class="im">gen2(x, xSum=sum(x1))<br>
<br>
=================<br>
<br>
Please let me know if you find bugs in it or find a better way to do this.<br>
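<br>
In case it helps, the duplicate-column step can also be written without a loop by calling duplicated() on the column names. Below is a minimal sketch on a plain data.frame; the names df, keep and dedup are made up for illustration, and the column-subsetting syntax on a data.table may differ, so treat that part as an assumption. It keeps the first occurrence of each name.<br>
<pre>
```r
# Hypothetical result with a repeated "x2" column, like the one in the thread
df <- data.frame(x2 = c(1L, 2L), x1 = c(1L, 1L), x2 = c(1L, 2L),
                 xSum = c(4L, 4L), check.names = FALSE)

# TRUE for the first occurrence of each column name, FALSE for repeats
keep <- !duplicated(names(df))

# List-style subsetting keeps only the flagged columns
dedup <- df[keep]
```
</pre>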
<br>
<br>
Regards,<br>
Harish<br>
<br>
<br>
--- On Tue, 7/13/10, Harish <<a href="mailto:harishv_99@yahoo.com">harishv_99@yahoo.com</a>> wrote:<br>
<br>
> From: Harish <<a href="mailto:harishv_99@yahoo.com">harishv_99@yahoo.com</a>><br>
> Subject: Re: [datatable-help] data.table in functions<br>
> To: <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a>, "Sasha Goodman" <<a href="mailto:sashag@stanford.edu">sashag@stanford.edu</a>><br>
> Date: Tuesday, July 13, 2010, 10:49 PM<br>
</div><div><div></div><div class="h5">> Sasha,<br>
><br>
> You should find code that works for you below along with a<br>
> workaround to get rid of the ".1" columns.<br>
><br>
> You were having an issue because you are treating the<br>
> formula as text. When you pass in a list, things<br>
> aren't quite working out. I didn't bother trying to<br>
> figure out how it is actually interpreting it.<br>
><br>
> I also put in a workaround for the duplicate columns you<br>
> were getting. You are right in that there was a flag<br>
> to avoid that before (based on a discussion I read in the<br>
> group). Matthew is thinking about how to resolve<br>
> this.<br>
><br>
> Most of the lines in the function are for the workaround of<br>
> getting rid of duplicate columns. Maybe there is an<br>
> easier way.<br>
><br>
> The line...<br>
> DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )<br>
> is the main line of the function.<br>
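<br>
A minimal base-R sketch of why the text-based version trips on by=list(x2,x3) while the bquote() version does not. It relies on paste() coercing the captured call to one string per element; the names by_expr, txt and cl are made up for illustration, and j is only a placeholder for the transform(.SD,...) part.<br>
<pre>
```r
# Hypothetical stand-in for what substitute(by) captures inside gen2
by_expr <- quote(list(x2, x3))

# paste() coerces the call to one string per element ("list", "x2", "x3"),
# so three separate code strings come out instead of one
txt <- paste("data[, j, by=", by_expr, "]")

# bquote() splices the call in unevaluated, giving a single call object
cl <- bquote(data[, j, by = .(by_expr)])
```
</pre>
The first element of txt ends in "by= list ]", which matches the "by = list" shown in the error message from the original post.<br>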
><br>
><br>
> gen2 <- function(data,by,...) {<br>
>   sby <- substitute( by )<br>
>   if ( length( sby ) > 1 ) {<br>
>     if ( sby[[1]] == "list" )<br>
>       lst <- as.list( sby )[ -1 ]<br>
>   }<br>
>   else<br>
>     lst <- list( deparse(sby) )<br>
><br>
>   DT <- eval( bquote( data[,transform(.SD,...),by=.(substitute(by)) ] ) )<br>
>   lapply( lst, function(.i) DT[[.i]] <<- NULL )<br>
><br>
>   return( DT )<br>
> }<br>
> gen2(x, by=x2, xSum = sum(x1))<br>
> gen2(x, by=list(x2,x3), xSum = sum(x1))<br>
><br>
><br>
><br>
> Harish<br>
><br>
><br>
> --- On Mon, 7/12/10, Sasha Goodman <<a href="mailto:sashag@stanford.edu">sashag@stanford.edu</a>> wrote:<br>
><br>
> > From: Sasha Goodman <<a href="mailto:sashag@stanford.edu">sashag@stanford.edu</a>><br>
> > Subject: [datatable-help] data.table in functions<br>
> > To: <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> > Date: Monday, July 12, 2010, 4:22 PM<br>
> > Thanks for the emails, Matthew. I think our talk should be<br>
> > moved to the data.table discussion list.<br>
> ><br>
> > So, of all the data manipulation functions I've used,<br>
> > the 'gen' function (which is similar to Stata's<br>
> > egen function) has been the most useful. I would be very<br>
> > happy if it were adapted to data.tables.<br>
> ><br>
> ><br>
> ><br>
> > Here is the version that works with plyr.<br>
> ><br>
> > > gen <- function(data, ... ,by=NA) {<br>
> >   if(!is.na(by)) {<br>
> >     require(plyr)<br>
> >     out = ddply(data,by, transform, ... )<br>
> >   } else {<br>
> >     out = transform(data,...)<br>
> >   }<br>
> >   return(out)<br>
> > }<br>
> ><br>
> > > y = data.frame(x1 = 1, x2 = rep(c(1,2), each=2), x3= rep(c(1,2),4))<br>
> ><br>
> ><br>
> ><br>
> > > gen(y, xSum= sum(x1), by=c('x2'))<br>
> >   x1 x2 x3 xSum<br>
> > 1  1  1  1    4<br>
> > 2  1  1  2    4<br>
> > 3  1  1  1    4<br>
> > 4  1  1  2    4<br>
> > 5  1  2  1    4<br>
> > 6  1  2  2    4<br>
> > 7  1  2  1    4<br>
> > 8  1  2  2    4<br>
> ><br>
> ><br>
> > > gen(y,xSum= sum(x1), by=c('x2','x3') )<br>
> >   x1 x2 x3 xSum<br>
> > 1  1  1  1    2<br>
> > 2  1  1  1    2<br>
> > 3  1  1  2    2<br>
> > 4  1  1  2    2<br>
> > 5  1  2  1    2<br>
> > 6  1  2  1    2<br>
> > 7  1  2  2    2<br>
> > 8  1  2  2    2<br>
> ><br>
> > This is the latest version that Matthew and friends came up with<br>
> > for data.tables:<br>
> ><br>
> ><br>
> ><br>
> > > require(data.table)<br>
> ><br>
> > > x = data.table(x1 = as.integer(1), x2 = as.integer(rep(c(1,2), each=2)), x3= as.integer(rep(c(1,2),4)))<br>
> ><br>
> > > gen2 <- function(data,by,...) {<br>
> >   eval(parse(text=paste("data[,transform(.SD,...),by=",substitute(by),"]")))<br>
> > }<br>
> ><br>
> ><br>
> > > gen2(x, by='x2', xSum = sum(x1))<br>
> >      x2 x1 x2.1 x3 xSum<br>
> > [1,]  1  1    1  1    4<br>
> > [2,]  1  1    1  2    4<br>
> > [3,]  1  1    1  1    4<br>
> > [4,]  1  1    1  2    4<br>
> > [5,]  2  1    2  1    4<br>
> > [6,]  2  1    2  2    4<br>
> > [7,]  2  1    2  1    4<br>
> > [8,]  2  1    2  2    4<br>
> ><br>
> > > gen2(x, by=x2, xSum = sum(x1)) # ok<br>
> ><br>
> > > gen2(x, by=list(x2,x3), xSum = sum(x1)) #fails<br>
> ><br>
> > Error in `[.data.table`(data, , +gen(.SD, ...), by = list) :<br>
> > column 1 of 'by' list does not evaluate to integer e.g. the by should be<br>
> > a list of expressions. Do not quote column names when using by=list(...).<br>
> ><br>
> > > class(x$x2)<br>
> > [1] "integer"<br>
> ><br>
> > > gen2(x, by=c('x2','x3'), xSum = sum(x1)) #fails<br>
> ><br>
> > > gen2(x, by=list('x2','x3'), xSum = sum(x1)) #fails<br>
> ><br>
> ><br>
> > The first issue is that when there are multiple grouping<br>
> > factors, passing those to the data.table through this<br>
> > function leads to errors (see the last few examples).<br>
> ><br>
> ><br>
> > The second issue is less critical, but still important. The<br>
> > data column x2.1 is automatically added and is redundant.<br>
> > With multiple uses of this function, these redundant columns<br>
> > add up quickly. I recall that there used to be a way to<br>
> > suppress the repeating of columns in this type of operation,<br>
> > but it was removed in later versions of data.table.<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > ~~~~~~~~~~~~<br>
> ><br>
> > There were a few other things Matthew and I were<br>
> > discussing, including alternative 'SQL inspired'<br>
> > functions such as 'groupby', 'orderby',<br>
> > 'where', 'select', etc. I find that<br>
> > functions are a lot easier to remember than syntax, for the<br>
> > same reasons natural language syntax is difficult to learn<br>
> > compared to learning words. Another selfish reason is that<br>
> > SQL was what I and many others learned with originally, and<br>
> > the SQL commands would be easier to translate.<br>
> ><br>
> ><br>
> ><br>
> > On Mon, Jul 12, 2010 at 2:28 PM, Matthew Dowle <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>> wrote:<br>
> ><br>
> ><br>
> > Hi,<br>
> ><br>
> > There was a thread on r-help where Hadley and I discussed plyr using<br>
> > data.table but something technical prevented it at that time as I<br>
> > recall. Might be different now. The spelling is Wickham.<br>
> ><br>
> > There is a link on the homepage to the datatable-help subscription<br>
> > page:<br>
> ><br>
> > <a href="http://datatable.r-forge.r-project.org/" target="_blank">http://datatable.r-forge.r-project.org/</a><br>
> ><br>
> > Would be great to see you on that. You explained to me once about the<br>
> > functional style I think and I mean to come back to it.<br>
> ><br>
> > Best of luck with your diss,<br>
> ><br>
> > Matthew<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > --<br>
> > Sasha Goodman<br>
> > Doctoral Candidate, Organizational Behavior<br>
> ><br>
> ><br>
> > -----Inline Attachment Follows-----<br>
> ><br>
> > _______________________________________________<br>
> > datatable-help mailing list<br>
> > <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> > <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
> ><br>
><br>
><br>
> <br>
><br>
<br>
<br>
<br>
</div></div></blockquote></div><br><br clear="all"><br>-- <br>Sasha Goodman<br>Doctoral Candidate, Organizational Behavior<br>