[datatable-help] Expressions in "by" criteria (again)

Matthew Dowle mdowle at mdowle.plus.com
Sun Jul 4 10:50:01 CEST 2010


Great thread.

First, to step back one post, to where Mike talks about collections of
fields...

There are two places where lists are passed in: one is 'j', the other is 'by'.

with=FALSE allows j to be a vector of column names (character) or column
numbers (integer), just like data.frame. I agree with Mike's comments, and
that's why the 'with' argument of [.data.table exists, if I understand correctly.

   my.fields=c("colA","ColB",...)
   DT[,my.fields,with=FALSE]  # returns the columns

   my.fields=c(1,25,34,808)
   DT[,my.fields,with=FALSE]  # returns the columns
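
To make that concrete, here is a small self-contained sketch. The table and its column names (colA, colB, colC) are made up for illustration, and it assumes a data.table version where with=FALSE accepts a variable holding the names or positions:

```r
library(data.table)

# Hypothetical example table; the column names are illustrative only.
DT <- data.table(colA = c("x", "y", "z"),
                 colB = c(1, 2, 3),
                 colC = c(10, 20, 30))

# Columns chosen by name, held in an ordinary character vector.
my.fields <- c("colA", "colC")
by.name <- DT[, my.fields, with = FALSE]

# The same columns chosen by position.
my.positions <- c(1, 3)
by.position <- DT[, my.positions, with = FALSE]
```

Both calls return a data.table holding just the requested columns, so the same vector can be reused wherever a data.frame-style column selection is wanted.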

c.listquote seems very similar to with=FALSE?  Mike originally
asked about the 'by' argument, though, which is a different argument that
happens to accept a list of expressions of column names too.
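
For the 'by' case, a minimal sketch of the quote/eval route mentioned earlier in the thread might look like the following. The table, the column names and the variable my.by are invented for the example, and it assumes a data.table version in which by=eval(...) is supported:

```r
library(data.table)

# Hypothetical example table; names are illustrative only.
DT <- data.table(colA  = c("a", "a", "b", "b"),
                 colB  = c("p", "q", "p", "p"),
                 value = c(1, 2, 3, 4))

# Store the grouping columns once, as a quoted list() call ...
my.by <- quote(list(colA, colB))

# ... and reuse it wherever a 'by' is needed.
res <- DT[, sum(value), by = eval(my.by)]
```

Here res has one row per distinct (colA, colB) pair, with the group sums in the unnamed result column V1.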

For Harish's last post, one way is to use lapply in the usual way
(taking minmax here to be a function of a column's values rather than
of its name):

   lapply(DT,minmax)

or on just a subset of columns

   lapply(DT[,12:300,with=FALSE], minmax)

   lapply(DT[,my.fields,with=FALSE], minmax)
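
As a rough sketch of the subset-of-columns case: the minmax below is a hypothetical stand-in that takes a column vector and returns its minimum and maximum (it is not Harish's string-building function of the same name), and the table and column names are made up:

```r
library(data.table)

# Hypothetical helper: smallest and largest value of one column.
minmax <- function(x) c(min(x), max(x))

# Hypothetical example table.
DT <- data.table(colA = c("x", "y", "z"),
                 colB = c(3, 1, 2),
                 colC = c(10, 30, 20))

# Apply the helper to a chosen subset of (numeric) columns only.
my.fields <- c("colB", "colC")
res <- lapply(DT[, my.fields, with = FALSE], minmax)
```

The result is an ordinary named list with one element per selected column, each holding that column's min and max.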

or within groups:

   DT[,lapply(.SD,minmax),by=list(colA,colB)]

Something like that, anyway.
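
A small worked version of the grouped form, again with a hypothetical minmax that takes a column vector, and with invented data, could be:

```r
library(data.table)

# Hypothetical helper: smallest and largest value of one column.
minmax <- function(x) c(min(x), max(x))

# Hypothetical example table: two grouping columns, two numeric columns.
DT <- data.table(colA = c("a", "a", "b", "b"),
                 colB = c("p", "p", "q", "q"),
                 x    = c(5, 1, 8, 2),
                 y    = c(30, 10, 40, 20))

# .SD holds the non-grouping columns of each group, so minmax is
# applied to x and y separately within every (colA, colB) group.
res <- DT[, lapply(.SD, minmax), by = list(colA, colB)]
```

Because minmax returns two values per column, each group contributes two rows here: the first carries the minima and the second the maxima.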

Matthew


On Fri, 2010-07-02 at 19:22 -0700, Harish wrote:
> Mike, I am glad that you are well on your way.
> 
> Matthew, I am intrigued by your view that c.listquote() is complex. I agree that it is, but I had to come up with it to solve a slightly different problem.  Maybe you could share some of your thoughts on how else I could do similar things.
> 
> For example, I had a situation where I had to execute...
> 
> DT[ , list( A_min=min(A), A_max=max(A), B_min=min(B), B_max =max(B), blah blah, Other1, Other2 ) ]
> 
> I wanted to avoid typing in the whole list because the code would become a nightmare.  The list had to keep changing a little based on the situation.  And there was no easy way that I could find for me to concatenate items to a quoted list like I can to a vector of strings using c().
> 
> If I were to use c.listquote(), then I can do something as follows:
> 
> minmax <- function( name ) {
>    strMin <- paste( name, "_min=min(", name, ")", sep="" )
>    strMax <- paste( name, "_max=max(", name, ")", sep="" )
>    return( c( strMin, strMax ) )
> }
> 
> longlist <- function() {
>    c( minmax( "A" ), minmax( "B" ), minmax( "C" ) )
> }
> 
> DT[ , eval( c.listquote( longlist(), list( Other1, Other2 ) ) ) ]
> DT[ , eval( c.listquote( longlist(), list( Other3 ) ) ) ]
> 
> Essentially, I wanted to be able to create the query based on other arguments or parameters.  This function allows me to have that flexibility.
> 
> How would you recommend that I deal with a situation like this?
> 
> 
> Regards,
> Harish
> 
> 
> --- On Fri, 7/2/10, Mike Sandfort <cute_moniker at yahoo.com> wrote:
> 
> > From: Mike Sandfort <cute_moniker at yahoo.com>
> > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Date: Friday, July 2, 2010, 6:38 PM
> > Yikes. Seems like this fruit salad
> > has gone a bit too far.
> > 
> > c.listquote might look complicated but, as I emailed Harish
> > off-list (sorry),
> > it's exactly what I was looking for. The point you raise is
> > an excellent one -- 
> > the ability to use expressions in "by" does make the tool
> > much more flexible in
> > ways I hadn't thought about. My point was only that many
> > commonly-used R functions
> > encourage the user to keep collections of fields stored in
> > vectors and lists. The fact
> > that I didn't have a tool to shoehorn those vectors and
> > lists into an expression (without
> > a bunch of repetitive typing) was why I emailed the list.
> > Harish's code does exactly that
> > vector/list -> expression conversion that I needed.
> > 
> > As far as my need for lists of variables, there are lots of
> > reasons to keep them around.
> > If you need to bulk-convert a set of character fields to
> > factors, for example, it's handy to be able to
> > say
> > 
> > my.factors = <Vector of Field Names>
> > idx <- match(my.factors,names(df))
> > df[,idx] <- lapply(df[,idx],as.factor)
> > 
> > One may also have sets of factors which are relevant to
> > different kinds of analysis.
> > Working with invoice records, one might have
> > customer-related fields, product-related
> > fields, business-unit related fields, etc. Depending on the
> > sort of analysis one wants to
> > perform, one might only have need to aggregate across a
> > particular subset of factors.
> > Having my.geo.factors, my.cust.factors, my.prod.factors,
> > etc. reduces typing and makes
> > the code easier to debug -- particularly when the number of
> > field names becomes very large.
> > 
> > Thanks to both of you for your help in working this out.
> > And from now on I'll stick to "A","B","C" for my field
> > names when I email.
> > 
> > Mike
> > 
> > 
> > 
> > 
> > ----- Original Message ----
> > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > To: Harish <harishv_99 at yahoo.com>
> > Cc: Mike Sandfort <cute_moniker at yahoo.com>;
> > datatable-help at lists.r-forge.r-project.org
> > Sent: Fri, July 2, 2010 7:58:33 PM
> > Subject: Re: [datatable-help] Expressions in "by" criteria
> > (again)
> > 
> > 
> > c.listquote looks very complicated.  Mike shouldn't need to do that at
> > this stage. My gut tells me there is some fundamental misunderstanding
> > somewhere. Maybe it's all the fruit.
> > 
> > Is Mike's data really _sorted_ by Apples column, then by Bananas column,
> > then by Kiwi column, then by Pineapples column, then by Prunes
> > column, ...?  What's in those columns?  I can't see any data or anything
> > reproducible.  When we 'by' we aim for that 'by' to be in the same order
> > as the key.  That implies a key 20+ columns long.  Doesn't seem
> > right.  I've never needed a key that long.
> > 
> > Surely Mike needs _one_ fruit column, which will likely be the 2nd
> > column of a key, then a 3rd column which is "yield" or some
> > measurement.
> > 
> > To add more fruit, you add more rows, not more columns.
> > Like a database.
> > 
> > Matthew
> > 
> > 
> > On Fri, 2010-07-02 at 10:13 -0700, Harish wrote:
> > > Mike,
> > > 
> > > Matthew is right.  Here is a function that might help you transition from your current state to where you need to get to quickly.
> > > 
> > > I started creating a function for other purposes that might be useful to you.  I described the usage below.
> > > 
> > > ========================
> > > 
> > > # Concatenate all given arguments into a quote of a list()
> > > # Arguments can be any of:
> > > #    1) an expression that returns a valid value when evaluated in calling
> > > #       environment.
> > > #    2) a character vector which will be treated as text inside list(...)
> > > #    3) a quote of a list
> > > #    4) a list() directly given in the argument
> > > # Returns a quote of a list
> > > c.listquote <- function( ... ) {
> > >    
> > >    args <- as.list( match.call()[ -1 ] )
> > >    lstquote <- list( as.symbol( "list" ) );
> > >    for ( i in args ) {
> > >       # Evaluate expression in parent environment to see what it refers to
> > >       if ( class( i ) == "name" || ( class( i ) == "call" && i[[1]] != "list" ) ) {
> > >          i <- eval( substitute( i ), sys.frame( sys.parent() ) )
> > >       }
> > >       if ( class( i ) == "call" && i[[1]] == "list" ) {
> > >          lstquote <- c( lstquote, as.list( i )[ -1 ] )
> > >       }
> > >       else if ( class( i ) == "character" )
> > >       {
> > >          for ( chr in i ) {
> > >             lstquote <- c( lstquote, list( parse( text=chr )[[1]] ) )
> > >          }
> > >       }
> > >       else
> > >          stop( paste( "[", deparse( substitute( i ) ), "] Unknown class [", class( i ), "] or is not a list()", sep="" ) )
> > >    }
> > >    return( as.call( lstquote ) )
> > > }
> > > 
> > > ========================
> > > 
> > > IMPORTANT: If you find any bugs in this or find ways to improve it, please let me know.
> > > 
> > > The usage is as follows:
> > > 
> > > my.fields <- c("Apples","Bananas","Coconuts","Dragonfruits","Pomelos")
> > > q <- c.listquote( my.fields )
> > > DT[ , Col1, by=eval( q ) ]
> > > DT[ , q ]
> > > 
> > > The advantage of the function is that you can also easily add fields through a variety of ways...
> > > 
> > > foo <- function() {
> > >    return( quote( list( Orange ) ) )
> > > }
> > > DT[ , eval( c.listquote( q, foo(), list( Pear ), "Peach", c( "New1", "New2=form" ) ) ) ]
> > > 
> > > 
> > > Hope this helps.
> > > 
> > > 
> > > Regards,
> > > Harish
> > > 
> > > 
> > > --- On Fri, 7/2/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > 
> > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > > > To: "Mike Sandfort" <cute_moniker at yahoo.com>
> > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > Date: Friday, July 2, 2010, 9:49 AM
> > > > Quick answer is it needs to be this way:
> > > > 
> > > >    my.fields = quote(list(Apples,Bananas,...))
> > > >    DT[,sum(NumericField),by=eval(my.fields)]
> > > > 
> > > > Also some bugs were just fixed in this area so you may need the latest 1.5
> > > > from r-forge for this.
> > > > 
> > > > Having said that, it's sometimes easier coding to use a flat format (i.e.
> > > > have a single column 'fruit') then "[,...,by=fruit]". There was another
> > > > thread showing examples of long to wide taking care of NAs etc, search for
> > > > 'wide'.
> > > > 
> > > > HTH, thanks for the interest,
> > > > 
> > > > Matthew
> > > > 
> > > > 
> > > > > Hi,
> > > > >
> > > > > I suspect my question is similar to Harish's "Question #2" from 6/18.
> > > > > Suppose I have a data.table with many fields and have a large subset of
> > > > > fields I need to include in several expressions. Ordinarily, I would
> > > > > create (once) a vector of names of the fields in my subset:
> > > > > my.fields <- c("Apples","Bananas","Coconuts","Dragonfruits",...,"Pomelos")
> > > > >     [where the whole data frame has many more fields, including
> > > > > "Broccoli","Cabbages",...]
> > > > >
> > > > > Then I can re-use the my.fields vector when extracting subsets, creating
> > > > > plots, aggregating with ddply(), etc. The problem is that I can't figure
> > > > > out how to (re)use my.fields to aggregate a data.table.
> > > > >
> > > > > DT[,sum(NumericField),by=(Apples,Bananas,Coconuts,Dragonfruits,...,Pomelos)]
> > > > > will work.
> > > > > However,
> > > > > DT[,sum(NumericField),by=my.fields]
> > > > > won't work, nor will any combination of paste(), list(), eval(), quote(),
> > > > > deparse(), etc. applied to my.fields (at least I haven't found one yet).
> > > > >
> > > > > I know this is probably more an R-language issue, but since it's come up
> > > > > in my work with the (excellent) data.table package, I thought I would
> > > > > ask here.
> > > > >
> > > > > Thanks!
> > > > > Mike S.
> > > > >
> > > > >
> > > > >
> > > > >
> > _______________________________________________
> > > > > datatable-help mailing list
> > > > > datatable-help at lists.r-forge.r-project.org
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > >
> > > > 
> > > > 



