[datatable-help] data.table by versus apply

Matthew Dowle mdowle at mdowle.plus.com
Tue Apr 19 00:00:51 CEST 2011


Damian,
Bug #1301 fixed. Workaround below no longer needed.
Matthew
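
For reference, a minimal sketch of the call as it can be written once #1301 is fixed, with no need to mention fn explicitly in j. The seed and the names by_row and vec are assumptions made for illustration. The pmax/pmin variant is not from this thread: it is one vectorized way to get the same result without grouping by ID, assuming fn only ever takes the values 1 and 2.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the example is reproducible

fns <- c(max, min)  # a list of two functions: fns[[1]] is max, fns[[2]] is min
test.dt <- data.table(ID = 1:10,
                      SCORE_1 = rnorm(10),
                      SCORE_2 = rnorm(10),
                      SCORE_3 = rnorm(10),
                      fn = c(rep(1, 5), rep(2, 5)))

# Per-row dispatch as in the quoted message: one group per ID, so fn has
# length 1 inside each group and selects a single function from fns.
by_row <- test.dt[, fns[[fn]](SCORE_1, SCORE_2, SCORE_3), by = ID]

# Vectorized alternative (an assumption, not from the thread): compute the
# row-wise max and min for every row, then pick one per row according to fn.
vec <- test.dt[, ifelse(fn == 1,
                        pmax(SCORE_1, SCORE_2, SCORE_3),
                        pmin(SCORE_1, SCORE_2, SCORE_3))]

# Both approaches should agree row by row.
stopifnot(isTRUE(all.equal(by_row$V1, vec)))
```

The grouped form keeps the lookup-by-function idea from the thread. The vectorized form does twice the arithmetic but avoids one group per row, which is usually the more expensive part on large tables.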

On Sun, 2011-02-27 at 17:14 +0000, Matthew Dowle wrote:
> Hi,
> 
> How about this:
> 
> fns=c(max,min)
> test.dt <- data.table(ID=1:10, SCORE_1=rnorm(10), SCORE_2=rnorm(10),
> SCORE_3=rnorm(10), fn=c(rep(1, 5), rep(2, 5)))
> 
> test.dt[,fns[[fn]](SCORE_1,SCORE_2,SCORE_3),by=ID]  # bug #1301 raised
> 
> test.dt[,{fn;fns[[fn]](SCORE_1,SCORE_2,SCORE_3)},by=ID]  # workaround
>       ID         V1
>  [1,]  1 -1.6788065
>  [2,]  2 -1.4021021
>  [3,]  3 -1.0469943
>  [4,]  4 -1.2663419
>  [5,]  5 -0.2765518
>  [6,]  6  0.3511581
>  [7,]  7  1.1809315
>  [8,]  8  0.3570631
>  [9,]  9  0.9680948
> [10,] 10  1.3025652
> 
> The bug occurs because the variable 'fn' is (incorrectly) not detected as
> being used by j, so it isn't subset by group. Perhaps that is because it
> appears inside the [[]]. Using fn explicitly in the workaround gets around
> that. Raised bug #1301 to fix it.
> 
> Also, data.table could be enhanced to allow a column to contain a list
> of functions directly, rather than a lookup. That should be ok provided
> the column held pointers to functions rather than the functions themselves
> repeated over and over. It might be quite useful. FR#1302 raised to do
> that. You can probably already create a data.frame or data.table with a
> list column containing functions, but I doubt that operations on those
> columns work. It might not be very difficult to do, though.
> 
> Thanks for helping to discover a new bug and a new FR!
> 
> Matthew
> 
> 
> On Sat, 2011-02-26 at 16:55 -0600, Damian Betebenner wrote:
> > All,
> >
> > I’m curious, from a speed perspective, what the analog of apply is in
> > data.table, as I have a problem where, for each row, I want to take
> > either the min or the max of several columns depending upon the value
> > of another column:
> > 
> > For example:
> > 
> > test.dt <- data.table(ID=1:10, SCORE_1=rnorm(10), SCORE_2=rnorm(10),
> > SCORE_3=rnorm(10), MAX_OR_MIN=c(rep("Max", 5), rep("Min", 5)))
> > 
> > For each row I’d like to get the max of SCORE_1, SCORE_2, and SCORE_3
> > if the MAX_OR_MIN value is "Max" and the min of SCORE_1, SCORE_2, and
> > SCORE_3 if the MAX_OR_MIN value is "Min".
> > 
> > It isn’t too difficult to come up with a “bulky” and slow solution,
> > but I’m wondering if I’m missing a way in which data.table would make
> > such an effort elegant and quick.
> >
> > Any help greatly appreciated.  
> > 
> > Damian Betebenner
> > Center for Assessment
> > PO Box 351
> > Dover, NH 03821-0351
> > Phone (office): (603) 516-7900
> > Phone (cell): (857) 234-2474
> > Fax: (603) 516-7910
> > dbetebenner at nciea.org
> > www.nciea.org
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help