[datatable-help] Programmatic by clauses

Thu Sep 2 23:10:01 CEST 2010

Johann, I think that "as.list" works because you need something other
than a single variable. Single variables are treated differently.
Wrapping it in brackets also works:

    data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
         by = {by.factors}]

What it tries to do with a single variable is turn it into
list(by.factors). I think that's unintended, but we need to check with
Matthew.
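
A minimal sketch, in base R only and outside data.table, of why the
single-variable case behaves differently. It assumes by.factors is a list
of grouping vectors (the values below are made up for illustration) and
simply replays the list( wrapping quoted further down; it is not the
actual data.table code path.

```r
# Hypothetical stand-in for by.factors: a list of grouping vectors.
by.factors <- list(iquarter    = c(0L, 0L, 1L, 1L),
                   fico.bucket = c(25L, 50L, 25L, 50L))

# How the three spellings of 'by' look once captured unevaluated.
mode(quote(by.factors))           # "name": takes the single-variable branch
mode(quote({by.factors}))         # "call": left alone
mode(quote(as.list(by.factors)))  # "call": left alone

# The single-variable branch wraps the name in list(...).
wrapped <- parse(text = paste("list(", "by.factors", ")", sep = ""))[[1]]
res <- eval(wrapped)

length(res)                        # 1: a single element ...
is.list(res[[1]])                  # TRUE: ... which is the whole list
length(eval(quote({by.factors})))  # 2: the two grouping vectors
```

So the wrapped form hands 'by' a one-element list whose only element is
itself a list rather than a grouping vector, which would be consistent
with the "not internally type integer" error quoted below.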

Matthew, in the following lines of [.data.table, changing "list(" to
"as.list(" would fix the problem above, but it won't work if the variable
is a vector, since as.list would split the vector into one list element
per value.

                if (mode(bysub) %in% c("name","character")) {
                    # name : j may be a single unquoted column name but
                    # it must evaluate to list so this is a convenience to users
                    # character: for backwards compatibility with v1.2
                    # syntax passing single character to 'by' rather than list()
                    bysub = parse(text=paste("list(",bysub,")",sep=""))[[1]]
                }

This seems to work, but it feels a little kludgy:

                if (mode(bysub) == "character") {
                    # character: for backwards compatibility with v1.2
                    # syntax passing single character to 'by' rather than list()
                    bysub = parse(text=paste("list(",bysub,")",sep=""))[[1]]
                }
                if (mode(bysub) == "name") {
                    # name : j may be a single unquoted column name but
                    # it must evaluate to list so this is a convenience to users
                    bysub = parse(text=paste("if (is.list(",bysub,")) ",
                                             bysub, " else list(", bysub, ")", sep=""))[[1]]
                }
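
To see what that generated expression does, here is a small sketch in
base R, outside data.table. The helper wrap_name and the sample values
are hypothetical, introduced only for illustration; the pasted text is
the same as in the "name" branch above.

```r
# Hypothetical helper that builds the same expression as the "name" branch.
wrap_name <- function(bysub) {
  nm <- as.character(bysub)
  parse(text = paste("if (is.list(", nm, ")) ", nm,
                     " else list(", nm, ")", sep = ""))[[1]]
}

# Made-up inputs: a list of grouping vectors and a single grouping vector.
by.factors <- list(iquarter = c(0L, 0L, 1L), fico.bucket = c(25L, 50L, 25L))
iquarter   <- c(0L, 0L, 1L)

from.list <- eval(wrap_name(quote(by.factors)))  # list passed through as is
from.vec  <- eval(wrap_name(quote(iquarter)))    # vector wrapped in list()

identical(from.list, by.factors)      # TRUE
identical(from.vec, list(iquarter))   # TRUE

# For comparison, as.list() on a vector splits it element by element,
# which is why a blanket switch to as.list( would not work for vectors.
length(as.list(iquarter))             # 3
```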

- Tom

> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Johann Hibschman
> Sent: Tuesday, August 31, 2010 11:10
> To: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Programmatic by clauses
> 
> "Short, Tom" <TShort at epri.com> writes:
> 
> > This seems to work ("data" is different than before, so the balance 
> > and count columns are different):
> >
> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> > +          by = as.list(by.factors)]
> >      iquarter fico.bucket   balance     count
> > [1,]        0          25 0.1427648 1.0449715
> > [2,]        0          50 0.8598616 0.7946641
> > [3,]        0          75 0.7799311 0.6733977
> > [4,]        0         100 1.1240393 1.3415721
> > [5,]        1          25 1.6179294 1.9870932
> > [6,]        1          50 1.4562150 2.0651700
> > [7,]        1          75 1.8457541 1.6337161
> > [8,]        1         100 2.0330688 0.8113971
> 
> Using as.list works for me as well, thanks.
> 
> I had to change my summary function to return NA_real_ rather 
> than just plain NA, but once I did that, everything seems to work.
> 
> I'm impressed.  It looks to be about 10 times faster, all 
> considered. The actual aggregation step is something like 40 
> times faster, but I have to do some extra work to get it into 
> a format suitable for data.table.
> 
> I would still prefer there to be a more "plain vanilla" 
> interface to all this.  I have no idea why using "as.list" 
> works, and that makes me uncomfortable.
> 
> Regards,
> Johann
> 
> >
> >  
> >
> >> -----Original Message-----
> >> From: datatable-help-bounces at lists.r-forge.r-project.org
> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> On Behalf Of Johann Hibschman
> >> Sent: Monday, August 30, 2010 16:03
> >> To: datatable-help at lists.r-forge.r-project.org
> >> Subject: Re: [datatable-help] Programmatic by clauses
> >> 
> >> "Short, Tom" <TShort at epri.com> writes:
> >> 
> >> > Johann, how about the following:
> >> > [snip example]
> >> 
> >> That's a good example; thanks.
> >> 
> >> > Here's a data.table version:
> >> >      
> >> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> >> > +          by = lapply(aggregation.spec, function (f) f(data))]
> >> >      iquarter fico.bucket   balance    count
> >> > [1,]        0          25 0.5506797 1.133675
> >> > [2,]        0          50 1.5175908 0.854553
> >> > [3,]        0          75 0.4627294 1.171430
> >> > [4,]        0         100 0.8354870 1.083211
> >> > [5,]        1          25 1.7311503 1.210178
> >> > [6,]        1          50 2.2930775 1.974759
> >> > [7,]        1          75 1.0477066 1.973119
> >> > [8,]        1         100 1.4351321 1.501291
> >> 
> >> I hadn't understood .SD before; that's a very good thing to know.
> >> 
> >> > I think the following should also work, but it doesn't. Note that I
> >> > didn't update to the very latest version of data.table, and I know
> >> > Matthew has changed some things that might already fix this.
> >> >      
> >> >
> >> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> >> > +          by = by.factors]
> >> > Error in `[.data.table`(data, , lapply(.SD[, cols.to.sum, with = FALSE],  :
> >> >   column or expression 1 of 'by' list is not internally type integer.
> >> >   Do not quote column names. Example of correct use:
> >> >   by=list(colA,month(colB),...).
> >> 
> >> It still doesn't work.  Unfortunately, if I want to have a drop-in 
> >> replacement, I have to operate on the equivalent by.factors.
> >> 
> >> I tried the following:
> >> 
> >>   dt.tmp <- cbind(data[, cols.to.sum, with=FALSE],
> >>     data.table(by.factors))
> >>   dt.agg <- dt.tmp[, lapply(.SD, sum), by=paste(names(by.factor),
> >>     collapse=",")]
> >> 
> >> but I got:
> >> 
> >>   Error in `[.data.table`(dt.tmp, , lapply(.SD, sum.na), by = paste(names(by),  :
> >>     by must evaluate to list
> >> 
> >> I tried
> >> 
> >>   by.names <- paste(names(by.factor), collapse=",")
> >>   dt.agg <- dt.tmp[, lapply(.SD, sum), by=by.names]
> >> 
> >> but I got the same error.  Randomly wrapping things in eval or evalq
> >> didn't seem to work either.
> >> 
> >> Is there any chance that we could get a "less magic" version of the
> >> data.table extract that doesn't do anything fancy?  Or maybe a
> >> by.with=FALSE option?
> >> 
> >> I periodically try data.table, but I always run into this wall where
> >> I waste a few hours trying to guess how to make extract do what I
> >> want it to and finally give up.  It's frustrating; it seems that if
> >> only data.table were trying to be less clever, it would be very
> >> useful to me.
> >> 
> >> 
> >> - Johann
> >> 
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >> 