[datatable-help] Expressions in "by" criteria (again)

Harish harishv_99 at yahoo.com
Sun Jul 4 20:51:23 CEST 2010


Matthew,

Thanks for the tip on using with=FALSE and lapply().  I think I need to play with the apply() family of functions and get comfortable with them.  I had ignored those functions, thinking that data.table would be a complete replacement, since I mostly use data.frame-type objects.  I suppose they still have their place.
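
For anyone reading this in the archive, here is a minimal sketch of the two idioms from Matthew's reply (quoted below): selecting columns with with=FALSE, and lapply() over .SD within groups. The table DT, its column names, and this minmax() helper are made up for illustration. This minmax() takes a column vector, unlike the string-building minmax() in the earlier post quoted further down.

```r
library(data.table)

# Hypothetical data: two grouping columns and two numeric columns.
DT <- data.table(
  colA = c("x", "x", "y", "y"),
  colB = c("p", "p", "q", "q"),
  A    = c(1, 5, 2, 8),
  B    = c(10, 30, 20, 40)
)

# Hypothetical helper: minimum and maximum of one column vector.
minmax <- function(x) c(min(x), max(x))

# Select columns by name with with=FALSE, then apply minmax to each column.
my.fields <- c("A", "B")
whole <- lapply(DT[, my.fields, with = FALSE], minmax)

# Apply minmax to every non-grouping column within each (colA, colB) group.
# .SD is the subset of the data belonging to the current group.
grouped <- DT[, lapply(.SD, minmax), by = list(colA, colB)]
```

Because minmax() returns two values, the grouped result has two rows per group (the minimum row, then the maximum row).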


Harish
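
On the original question in this thread (turning a character vector of field names into something 'by' accepts), the quoted list can also be built with base R alone. This is only a rough sketch, assuming hypothetical field names and a small made-up data.frame. It does not replace c.listquote(), which handles more kinds of input.

```r
# Hypothetical field names and data, standing in for a real table.
my.fields <- c("Apples", "Bananas")
df <- data.frame(Apples = c(1, 1, 2),
                 Bananas = c("a", "b", "b"),
                 NumericField = c(10, 20, 30))

# Turn the character vector into the unevaluated call list(Apples, Bananas).
# as.name() makes each string a symbol; as.call() assembles the call.
by.quoted <- as.call(c(list(as.name("list")), lapply(my.fields, as.name)))

# Check that this matches the hand-written form used in the thread.
same <- identical(by.quoted, quote(list(Apples, Bananas)))

# Evaluating the call inside the data returns the named columns as a list.
groups <- eval(by.quoted, df)
```

With a data.table, the resulting call could then be passed as by=eval(by.quoted), following the by=eval(my.fields) form in Matthew's reply quoted below.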


--- On Sun, 7/4/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> From: Matthew Dowle <mdowle at mdowle.plus.com>
> Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> To: "Harish" <harishv_99 at yahoo.com>
> Cc: "Mike Sandfort" <cute_moniker at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> Date: Sunday, July 4, 2010, 1:50 AM
> Great thread.
> 
> Just to first step back one post, where Mike talks about collections of
> fields...
> 
> There are two places lists are passed in: one is to 'j', the other is to
> 'by'.
> 
> with=FALSE allows j to be a list of column names, given as character or
> integer, just like data.frame. I agree with Mike's comments, and that's why
> the 'with' argument of [.data.table exists, if I understand correctly.
> 
>    my.fields=c("colA","ColB",...)
>    DT[,my.fields,with=FALSE]  # returns the columns
> 
>    my.fields=c(1,25,34,808)
>    DT[,my.fields,with=FALSE]  # returns the columns
> 
> c.listquote seems to be very similar to with=FALSE?  Mike originally
> asked about the 'by' argument though, which is a different argument that
> happens to accept a list of expressions of column names too.
> 
> For Harish's last post, one way is to use lapply in the usual way
> 
>    lapply(DT,minmax)
> 
> or on just a subset of columns
> 
>    lapply(DT[,12:300,with=FALSE], minmax)
> 
>    lapply(DT[,my.fields,with=FALSE], minmax)
> 
> or inside subsets :
> 
>    DT[,lapply(.SD,minmax),by=list(colA,colB)]
> 
> something like that anyway.
> 
> Matthew
> 
> 
> On Fri, 2010-07-02 at 19:22 -0700, Harish wrote:
> > Mike, I am glad that you are well on your way.
> > 
> > Matthew, I am intrigued by your view that c.listquote() is complex. I agree that it is, but I had to come up with it to solve a slightly different problem.  Maybe you could share some of your thoughts on how else I could do similar things.
> > 
> > For example, I had a situation where I had to execute...
> > 
> > DT[ , list( A_min=min(A), A_max=max(A), B_min=min(B), B_max=max(B), blah blah, Other1, Other2 ) ]
> > 
> > I wanted to avoid typing in the whole list because the code would become a nightmare.  The list had to keep changing a little based on the situation.  And there was no easy way that I could find for me to concatenate items to a quoted list like I can to a vector of strings using c().
> > 
> > If I were to use c.listquote(), then I can do something as follows:
> > 
> > minmax <- function( name ) {
> >    strMin <- paste( name, "_min=min(", name, ")", sep="" )
> >    strMax <- paste( name, "_max=max(", name, ")", sep="" )
> >    return( c( strMin, strMax ) )
> > }
> > 
> > longlist <- function() {
> >    c( minmax( "A" ), minmax( "B" ), minmax( "C" ) )
> > }
> > 
> > DT[ , eval( c.listquote( longlist(), list( Other1, Other2 ) ) ) ]
> > DT[ , eval( c.listquote( longlist(), list( Other3 ) ) ) ]
> > 
> > Essentially, I wanted to be able to create the query based on other arguments or parameters.  This function allows me to have that flexibility.
> > 
> > How would you recommend that I deal with a situation like this?
> > 
> > 
> > Regards,
> > Harish
> > 
> > 
> > --- On Fri, 7/2/10, Mike Sandfort <cute_moniker at yahoo.com> wrote:
> > 
> > > From: Mike Sandfort <cute_moniker at yahoo.com>
> > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > > To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Date: Friday, July 2, 2010, 6:38 PM
> > > Yikes. Seems like this fruit salad has gone a bit too far.
> > > 
> > > c.listquote might look complicated but, as I emailed Harish off-list
> > > (sorry), it's exactly what I was looking for. The point you raise is an
> > > excellent one -- the ability to use expressions in "by" does make the
> > > tool much more flexible in ways I hadn't thought about. My point was only
> > > that many commonly-used R functions encourage the user to keep collections
> > > of fields stored in vectors and lists. The fact that I didn't have a tool
> > > to shoehorn those vectors and lists into an expression (without a bunch of
> > > repetitive typing) was why I emailed the list. Harish's code does exactly
> > > that vector/list -> expression conversion that I needed.
> > > 
> > > As far as my need for lists of variables, there are lots of reasons to
> > > keep them around. If you need to bulk-convert a set of character fields to
> > > factors, for example, it's handy to be able to say
> > > 
> > > my.factors = <Vector of Field Names>
> > > idx <- match(my.factors,names(df))
> > > df[,idx] <- lapply(df[,idx],as.factor)
> > > 
> > > One may also have sets of factors which are relevant to different kinds
> > > of analysis. Working with invoice records, one might have
> > > customer-related fields, product-related fields, business-unit related
> > > fields, etc. Depending on the sort of analysis one wants to perform, one
> > > might only have need to aggregate across a particular subset of factors.
> > > Having my.geo.factors, my.cust.factors, my.prod.factors, etc. reduces
> > > typing and makes the code easier to debug -- particularly when the number
> > > of field names becomes very large.
> > > 
> > > Thanks to both of you for your help in working this out. And from now on
> > > I'll stick to "A","B","C" for my field names when I email.
> > > 
> > > Mike
> > > 
> > > 
> > > 
> > > 
> > > ----- Original Message ----
> > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > To: Harish <harishv_99 at yahoo.com>
> > > Cc: Mike Sandfort <cute_moniker at yahoo.com>; datatable-help at lists.r-forge.r-project.org
> > > Sent: Fri, July 2, 2010 7:58:33 PM
> > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > > 
> > > 
> > > c.listquote looks very complicated.  Mike shouldn't need to do that at
> > > this stage. My gut tells me there's some fundamental misunderstanding,
> > > somewhere. Maybe it's all the fruit.
> > > 
> > > Is Mike's data really _sorted_ by Apples column, then by Bananas column,
> > > then by Kiwi column then by Pineapples column then by Prunes
> > > column, ...?  What's in those columns?  I can't see any data or anything
> > > reproducible.  When we 'by' we aim for that 'by' to be in the same order
> > > as the key.  That implies a key of 20+ columns long.  Doesn't seem
> > > right.  I've never needed a key that long.
> > > 
> > > Surely Mike needs _one_ fruit column, which will likely be the 2nd
> > > column of a key, then a 3rd column which is "yield" or some
> > > measurement.
> > > 
> > > To add more fruit, you add more rows, not more columns. Like a database.
> > > 
> > > Matthew
> > > 
> > > 
> > > On Fri, 2010-07-02 at 10:13 -0700, Harish wrote:
> > > > Mike,
> > > > 
> > > > Matthew is right.  Here is a function that might help you transition from your current state to where you need to get to quickly.
> > > > 
> > > > I started creating a function for other purposes that might be useful to you.  I described the usage below.
> > > > 
> > > > ========================
> > > > 
> > > > # Concatenate all given arguments into a quote of a list()
> > > > # Arguments can be any of:
> > > > #    1) an expression that returns a valid value when evaluated in calling
> > > > #       environment.
> > > > #    2) a character vector which will be treated as text inside list(...)
> > > > #    3) a quote of a list
> > > > #    4) a list() directly given in the argument
> > > > # Returns a quote of a list
> > > > c.listquote <- function( ... ) {
> > > >    
> > > >    args <- as.list( match.call()[ -1 ] )
> > > >    lstquote <- list( as.symbol( "list" ) );
> > > >    for ( i in args ) {
> > > >       # Evaluate expression in parent environment to see what it refers to
> > > >       if ( class( i ) == "name" || ( class( i ) == "call" && i[[1]] != "list" ) ) {
> > > >          i <- eval( substitute( i ), sys.frame( sys.parent() ) )
> > > >       }
> > > >       if ( class( i ) == "call" && i[[1]] == "list" ) {
> > > >          lstquote <- c( lstquote, as.list( i )[ -1 ] )
> > > >       }
> > > >       else if ( class( i ) == "character" )
> > > >       {
> > > >          for ( chr in i ) {
> > > >             lstquote <- c( lstquote, list( parse( text=chr )[[1]] ) )
> > > >          }
> > > >       }
> > > >       else
> > > >          stop( paste( "[", deparse( substitute( i ) ), "] Unknown class [", class( i ), "] or is not a list()", sep="" ) )
> > > >    }
> > > >    return( as.call( lstquote ) )
> > > > }
> > > > 
> > > > ========================
> > > > 
> > > > IMPORTANT: If you find any bugs in this or find ways to improve it, please let me know.
> > > > 
> > > > The usage is as follows:
> > > > 
> > > > my.fields <- c("Apples","Bananas","Coconuts","Dragonfruits","Pomelos")
> > > > q <- c.listquote( my.fields )
> > > > DT[ , Col1, by=eval( q ) ]
> > > > DT[ , q ]
> > > > 
> > > > The advantage of the function is that you can also easily add fields through a variety of ways...
> > > > 
> > > > foo <- function() {
> > > >    return( quote( list( Orange ) ) )
> > > > }
> > > > DT[ , eval( c.listquote( q, foo(), list( Pear ), "Peach", c( "New1", "New2=form" ) ) ) ]
> > > > 
> > > > 
> > > > Hope this helps.
> > > > 
> > > > 
> > > > Regards,
> > > > Harish
> > > > 
> > > > 
> > > > --- On Fri, 7/2/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > > 
> > > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > > > > To: "Mike Sandfort" <cute_moniker at yahoo.com>
> > > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > > Date: Friday, July 2, 2010, 9:49 AM
> > > > > Quick answer is it needs to be this way :
> > > > > 
> > > > >    my.fields = quote(list(Apples,Bananas,...))
> > > > >    DT[,sum(NumericField),by=eval(my.fields)]
> > > > > 
> > > > > Also some bugs were just fixed in this area so you may need latest 1.5
> > > > > from r-forge for this.
> > > > > 
> > > > > Having said that, it's sometimes easier coding to use a flat format (i.e.
> > > > > have a single column 'fruit') then "[,...,by=fruit]". There was another
> > > > > thread showing examples of long to wide taking care of NAs etc, search for
> > > > > 'wide'.
> > > > > 
> > > > > HTH, thanks for the interest,
> > > > > 
> > > > > Matthew
> > > > > 
> > > > > 
> > > > > > Hi,
> > > > > >
> > > > > > I suspect my question is similar to Harish's "Question #2" from 6/18.
> > > > > > Suppose I have a data.table with many fields and have a large subset of
> > > > > > fields I need to include in several expressions. Ordinarily, I would
> > > > > > create (once) a vector of names of the fields in my subset:
> > > > > > my.fields <- c("Apples","Bananas","Coconuts","Dragonfruits",...,"Pomelos")
> > > > > >     [where the whole data frame has many more fields, including
> > > > > > "Broccoli","Cabbages",...]
> > > > > >
> > > > > > Then I can re-use the my.fields vector when extracting subsets, creating
> > > > > > plots, aggregating with ddply(), etc. The problem is that I can't figure
> > > > > > out how to (re)use my.fields to aggregate a data.table.
> > > > > >
> > > > > > DT[,sum(NumericField),by=(Apples,Bananas,Coconuts,Dragonfruits,...,Pomelos)]
> > > > > > will work.
> > > > > > However,
> > > > > > DT[,sum(NumericField),by=my.fields]
> > > > > > won't work, nor will any combination of paste(), list(), eval(), quote(),
> > > > > > deparse(), etc. applied to my.fields (at least I haven't found one yet).
> > > > > >
> > > > > > I know this is probably more an R-language issue, but since it's come up
> > > > > > in my work with the (excellent) data.table package, I thought I would ask
> > > > > > here.
> > > > > >
> > > > > > Thanks!
> > > > > > Mike S.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > datatable-help mailing list
> > > > > > datatable-help at lists.r-forge.r-project.org
> > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > >
> > > > > 
> > > > > 
> > > > >
> > > > > 
> > > > 
> > > > 
> > > >      
> > > 
> > > 
> > >       
> > > 
> > 
> > 
> >       
> 
> 
> 


      


More information about the datatable-help mailing list