[datatable-help] Expressions in "by" criteria (again)
Harish
harishv_99 at yahoo.com
Sun Jul 4 20:51:23 CEST 2010
Matthew,
Thanks for the tip on using with=FALSE and lapply(). I think I need to play with the apply() family of functions and get comfortable with them. I had ignored those functions, thinking that data.table would be a complete replacement, since I mostly use data.frame-type objects. I suppose they still have their place.
Harish
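
For readers following along, a minimal sketch of the two tips mentioned above (with=FALSE and lapply()) might look like the following. It assumes the data.table package is installed; the table DT and its columns grp, x and y are made up for illustration and do not come from the thread.

```r
library(data.table)  # assumption: the data.table package is installed

# Hypothetical example data; column names are invented for illustration.
DT <- data.table(grp = c("a", "a", "b", "b"),
                 x   = c(1, 5, 2, 8),
                 y   = c(10, 30, 20, 40))

my.fields <- c("x", "y")

# with=FALSE makes j behave like a data.frame column selector,
# so a character vector of names picks out those columns.
sub <- DT[, my.fields, with = FALSE]

# lapply() visits one column at a time; range() returns c(min, max).
ranges <- lapply(sub, range)

# The same idea within groups: .SD holds the non-grouping columns
# of the current group.
by.grp <- DT[, lapply(.SD, max), by = grp]
```

Here range() and max() stand in for whatever per-column summary is needed.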
--- On Sun, 7/4/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> From: Matthew Dowle <mdowle at mdowle.plus.com>
> Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> To: "Harish" <harishv_99 at yahoo.com>
> Cc: "Mike Sandfort" <cute_moniker at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> Date: Sunday, July 4, 2010, 1:50 AM
> Great thread.
>
> Just to first step back one post, where Mike talks about collections of
> fields...
>
> There are two places lists are passed in: one is to 'j', one is to 'by'.
>
> with=FALSE allows j to be a list of column names as character or integer,
> just like data.frame. I agree with Mike's comments, and that's why the
> 'with' argument of [.data.table exists, if I understand correctly.
>
> my.fields=c("colA","ColB",...)
> DT[,my.fields,with=FALSE]   # returns the columns
>
> my.fields=c(1,25,34,808)
> DT[,my.fields,with=FALSE]   # returns the columns
>
> c.listquote seems to be very similar to with=FALSE? Mike originally
> asked about the 'by' argument though, which is a different argument that
> happens to accept a list of expressions of column names too.
>
> For Harish's last post, one way is to use lapply in the usual way
>
> lapply(DT,minmax)
>
> or on just a subset of columns
>
> lapply(DT[,12:300,with=FALSE], minmax)
>
> lapply(DT[,my.fields,with=FALSE], minmax)
>
> or inside subsets:
>
> DT[,lapply(.SD,minmax),by=list(colA,colB)]
>
> something like that anyway.
>
> Matthew
>
>
> On Fri, 2010-07-02 at 19:22 -0700, Harish wrote:
> > Mike, I am glad that you are well on your way.
> >
> > Matthew, I am intrigued by your view that c.listquote() is complex. I agree
> > that it is, but I had to come up with it to solve a slightly different
> > problem. Maybe you could share some of your thoughts on how else I could do
> > similar things.
> >
> > For example, I had a situation where I had to execute...
> >
> > DT[ , list( A_min=min(A), A_max=max(A), B_min=min(B), B_max=max(B), blah blah, Other1, Other2 ) ]
> >
> > I wanted to avoid typing in the whole list because the code would become a
> > nightmare. The list had to keep changing a little based on the situation.
> > And there was no easy way that I could find for me to concatenate items to
> > a quoted list like I can to a vector of strings using c().
> >
> > If I were to use c.listquote(), then I can do something as follows:
> >
> > minmax <- function( name ) {
> >     strMin <- paste( name, "_min=min(", name, ")", sep="" )
> >     strMax <- paste( name, "_max=max(", name, ")", sep="" )
> >     return( c( strMin, strMax ) )
> > }
> >
> > longlist <- function() {
> >     c( minmax( "A" ), minmax( "B" ), minmax( "C" ) )
> > }
> >
> > DT[ , eval( c.listquote( longlist(), list( Other1, Other2 ) ) ) ]
> > DT[ , eval( c.listquote( longlist(), list( Other3 ) ) ) ]
> >
> > Essentially, I wanted to be able to create the query based on other
> > arguments or parameters. This function allows me to have that flexibility.
> >
> > How would you recommend that I deal with a situation like this?
> >
> >
> > Regards,
> > Harish
> >
> >
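
One possible way to build a list like the one described above without pasting strings is to construct the call directly from the column names. This is only a sketch in base R: minmax.calls() is a hypothetical helper, not part of data.table or of the code posted in this thread, and the data frame is invented for the example.

```r
# Hypothetical helper: build quote(list(A_min = min(A), A_max = max(A), ...))
# from a character vector of column names.
minmax.calls <- function(cols) {
  out <- list(as.name("list"))          # first element is the function name
  for (nm in cols) {
    sym <- as.name(nm)                  # "A" becomes the symbol A
    out[[paste(nm, "min", sep = "_")]] <- call("min", sym)
    out[[paste(nm, "max", sep = "_")]] <- call("max", sym)
  }
  as.call(out)                          # list of pieces becomes one call
}

q <- minmax.calls(c("A", "B"))

# Evaluated here against a plain data.frame (invented data). With a
# data.table the same quoted list could be passed as DT[ , eval( q ) ].
df <- data.frame(A = c(3, 1, 2), B = c(10, 30, 20))
res <- eval(q, df)
```

Because the pieces are language objects rather than text, nothing has to go through parse(), and the element names are carried as ordinary argument names of the call.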
> > --- On Fri, 7/2/10, Mike Sandfort <cute_moniker at yahoo.com> wrote:
> >
> > > From: Mike Sandfort <cute_moniker at yahoo.com>
> > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > > To: mdowle at mdowle.plus.com, "Harish" <harishv_99 at yahoo.com>
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Date: Friday, July 2, 2010, 6:38 PM
> > > Yikes. Seems like this fruit salad has gone a bit too far.
> > >
> > > c.listquote might look complicated but, as I emailed Harish off-list
> > > (sorry), it's exactly what I was looking for. The point you raise is an
> > > excellent one -- the ability to use expressions in "by" does make the
> > > tool much more flexible in ways I hadn't thought about. My point was only
> > > that many commonly-used R functions encourage the user to keep
> > > collections of fields stored in vectors and lists. The fact that I didn't
> > > have a tool to shoehorn those vectors and lists into an expression
> > > (without a bunch of repetitive typing) was why I emailed the list.
> > > Harish's code does exactly that vector/list -> expression conversion that
> > > I needed.
> > >
> > > As far as my need for lists of variables, there are lots of reasons to
> > > keep them around. If you need to bulk-convert a set of character fields
> > > to factors, for example, it's handy to be able to say
> > >
> > > my.factors = <Vector of Field Names>
> > > idx <- match(my.factors,names(df))
> > > df[,idx] <- lapply(df[,idx],as.factor)
> > >
> > > One may also have sets of factors which are relevant to different kinds
> > > of analysis. Working with invoice records, one might have
> > > customer-related fields, product-related fields, business-unit related
> > > fields, etc. Depending on the sort of analysis one wants to perform, one
> > > might only have need to aggregate across a particular subset of factors.
> > > Having my.geo.factors, my.cust.factors, my.prod.factors, etc. reduces
> > > typing and makes the code easier to debug -- particularly when the number
> > > of field names becomes very large.
> > >
> > > Thanks to both of you for your help in working this out. And from now on
> > > I'll stick to "A","B","C" for my field names when I email.
> > >
> > > Mike
> > >
> > >
> > >
> > >
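
The bulk conversion described above can be tried on a small data frame. The following is a sketch in base R with invented invoice-like data; it indexes by name with df[my.factors] rather than going through match(), which is an illustrative choice and not the code from the message.

```r
# Hypothetical invoice-like data; stringsAsFactors = FALSE keeps the
# text columns as character so the conversion below has something to do.
df <- data.frame(customer = c("c1", "c2", "c1"),
                 product  = c("p1", "p1", "p2"),
                 amount   = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# The reusable collection of field names.
my.factors <- c("customer", "product")

# df[my.factors] is list-style indexing, so lapply() sees one column
# at a time and the result can be assigned back to the same columns.
df[my.factors] <- lapply(df[my.factors], as.factor)
```

Columns not named in my.factors, such as amount here, are left untouched.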
> > > ----- Original Message ----
> > > From: Matthew Dowle <mdowle at mdowle.plus.com>
> > > To: Harish <harishv_99 at yahoo.com>
> > > Cc: Mike Sandfort <cute_moniker at yahoo.com>; datatable-help at lists.r-forge.r-project.org
> > > Sent: Fri, July 2, 2010 7:58:33 PM
> > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > >
> > >
> > > c.listquote looks very complicated. Mike shouldn't need to do that at
> > > this stage. My gut tells me there's some fundamental misunderstanding
> > > somewhere. Maybe it's all the fruit.
> > >
> > > Is Mike's data really _sorted_ by Apples column, then by Bananas column,
> > > then by Kiwi column, then by Pineapples column, then by Prunes
> > > column, ...? What's in those columns? I can't see any data or anything
> > > reproducible. When we 'by' we aim for that 'by' to be in the same order
> > > as the key. That implies a key 20+ columns long. Doesn't seem
> > > right. I've never needed a key that long.
> > >
> > > Surely Mike needs _one_ fruit column, which will likely be the 2nd
> > > column of a key, then a 3rd column which is "yield" or some
> > > measurement.
> > >
> > > To add more fruit, you add more rows, not more columns. Like a database.
> > >
> > > Matthew
> > >
> > >
> > > On Fri, 2010-07-02 at 10:13 -0700, Harish wrote:
> > > > Mike,
> > > >
> > > > Matthew is right. Here is a function that might help you transition
> > > > quickly from your current state to where you need to get to.
> > > >
> > > > I started creating a function for other purposes that might be useful
> > > > to you. I describe the usage below.
> > > >
> > > > ========================
> > > >
> > > > # Concatenate all given arguments into a quote of a list()
> > > > # Arguments can be any of:
> > > > #    1) an expression that returns a valid value when evaluated in the
> > > > #       calling environment
> > > > #    2) a character vector which will be treated as text inside list(...)
> > > > #    3) a quote of a list
> > > > #    4) a list() directly given in the argument
> > > > # Returns a quote of a list
> > > > c.listquote <- function( ... ) {
> > > >
> > > >     args <- as.list( match.call()[ -1 ] )
> > > >     lstquote <- list( as.symbol( "list" ) )
> > > >     for ( i in args ) {
> > > >         # Evaluate expression in parent environment to see what it refers to
> > > >         if ( class( i ) == "name" || ( class( i ) == "call" && i[[1]] != "list" ) ) {
> > > >             i <- eval( substitute( i ), sys.frame( sys.parent() ) )
> > > >         }
> > > >         if ( class( i ) == "call" && i[[1]] == "list" ) {
> > > >             lstquote <- c( lstquote, as.list( i )[ -1 ] )
> > > >         }
> > > >         else if ( class( i ) == "character" )
> > > >         {
> > > >             for ( chr in i ) {
> > > >                 lstquote <- c( lstquote, list( parse( text=chr )[[1]] ) )
> > > >             }
> > > >         }
> > > >         else
> > > >             stop( paste( "[", deparse( substitute( i ) ), "] Unknown class [", class( i ), "] or is not a list()", sep="" ) )
> > > >     }
> > > >     return( as.call( lstquote ) )
> > > > }
> > > >
> > > > ========================
> > > >
> > > > IMPORTANT: If you find any bugs in this or find ways to improve it,
> > > > please let me know.
> > > >
> > > > The usage is as follows:
> > > >
> > > > my.fields <- c("Apples","Bananas","Coconuts","Dragonfruits","Pomelos")
> > > > q <- c.listquote( my.fields )
> > > > DT[ , Col1, by=eval( q ) ]
> > > > DT[ , eval( q ) ]
> > > >
> > > > The advantage of the function is that you can also easily add fields
> > > > through a variety of ways...
> > > >
> > > > foo <- function() {
> > > >     return( quote( list( Orange ) ) )
> > > > }
> > > > DT[ , eval( c.listquote( q, foo(), list( Pear ), "Peach", c( "New1", "New2=form" ) ) ) ]
> > > >
> > > >
> > > > Hope this helps.
> > > >
> > > >
> > > > Regards,
> > > > Harish
> > > >
> > > >
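
For the simpler case where the character vector holds only plain column names, the quoted list can also be built in a couple of lines of base R. The sketch below is an illustration, not a replacement for c.listquote(); fields.to.listquote() is a hypothetical name that does not appear elsewhere in this thread.

```r
# Hypothetical helper: turn c("Apples", "Bananas") into
# quote(list(Apples, Bananas)).
fields.to.listquote <- function(fields) {
  # as.name() converts each string to a symbol; as.call() treats the
  # first element of the list as the function to call.
  as.call(c(list(as.name("list")), lapply(fields, as.name)))
}

my.fields <- c("Apples", "Bananas")
q <- fields.to.listquote(my.fields)

# q can then be used where a quoted list is expected, for example
#   DT[ , sum(NumericField), by = eval(q) ]
# (DT and NumericField as in the messages above).
```

Unlike c.listquote(), this sketch does not parse strings such as "New2=form" and does not merge several kinds of arguments. Recent versions of data.table also accept a character vector of column names directly in by.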
> > > > --- On Fri, 7/2/10, mdowle at mdowle.plus.com <mdowle at mdowle.plus.com> wrote:
> > > >
> > > > > From: mdowle at mdowle.plus.com <mdowle at mdowle.plus.com>
> > > > > Subject: Re: [datatable-help] Expressions in "by" criteria (again)
> > > > > To: "Mike Sandfort" <cute_moniker at yahoo.com>
> > > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > > Date: Friday, July 2, 2010, 9:49 AM
> > > > > Quick answer is it needs to be this way:
> > > > >
> > > > > my.fields = quote(list(Apples,Bananas,...))
> > > > > DT[,sum(NumericField),by=eval(my.fields)]
> > > > >
> > > > > Also, some bugs were just fixed in this area, so you may need the
> > > > > latest 1.5 from r-forge for this.
> > > > >
> > > > > Having said that, it's sometimes easier coding to use a flat format
> > > > > (i.e. have a single column 'fruit') then "[,...,by=fruit]". There was
> > > > > another thread showing examples of long to wide taking care of NAs
> > > > > etc.; search for 'wide'.
> > > > >
> > > > > HTH, thanks for the interest,
> > > > >
> > > > > Matthew
> > > > >
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I suspect my question is similar to Harish's "Question #2" from 6/18.
> > > > > > Suppose I have a data.table with many fields and have a large subset
> > > > > > of fields I need to include in several expressions. Ordinarily, I
> > > > > > would create (once) a vector of names of the fields in my subset:
> > > > > >     my.fields <- c("Apples","Bananas","Coconuts","Dragonfruits",...,"Pomelos")
> > > > > >     [where the whole data frame has many more fields, including
> > > > > >     "Broccoli","Cabbages",...]
> > > > > >
> > > > > > Then I can re-use the my.fields vector when extracting subsets,
> > > > > > creating plots, aggregating with ddply(), etc. The problem is that I
> > > > > > can't figure out how to (re)use my.fields to aggregate a data.table.
> > > > > >
> > > > > >     DT[,sum(NumericField),by=list(Apples,Bananas,Coconuts,Dragonfruits,...,Pomelos)]
> > > > > > will work. However,
> > > > > >     DT[,sum(NumericField),by=my.fields]
> > > > > > won't work, nor will any combination of paste(), list(), eval(),
> > > > > > quote(), deparse(), etc. applied to my.fields (at least I haven't
> > > > > > found one yet).
> > > > > >
> > > > > > I know this is probably more an R-language issue, but since it's come
> > > > > > up in my work with the (excellent) data.table package, I thought I
> > > > > > would ask here.
> > > > > >
> > > > > > Thanks!
> > > > > > Mike S.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > _______________________________________________
> > > > > > datatable-help mailing list
> > > > > > datatable-help at lists.r-forge.r-project.org
> > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help