[datatable-help] Programmatic by clauses

Matthew Dowle mdowle at mdowle.plus.com
Tue Sep 7 23:33:00 CEST 2010


Fixed now (bug #1060) with 3 tests added. As Tom said it wasn't intended
and Johann was right to think as.list() shouldn't be needed. The fix is
what Tom suggested with use of is.atomic() to clean up the kludge.

DT = data.table(a=1:2,b=rnorm(10))
byfact = DT[,list(a)]  
DT[,mean(b),by=byfact]   # ok now, without as.list(byfact) or {byfact}

With the example in this thread ..

> data <- data.table(fico = rep((1:4)*25L, 5), imonth = rep(1:5,each=4),
balance = runif(20), count = runif(20))
> aggregation.spec <- list(iquarter=function (d) d$imonth %/%
3L,fico.bucket=function (d)as.integer(25*round(d$fico/25)))
> by.factors <- lapply(aggregation.spec, function (f) f(data))
> cols.to.sum <- c("balance", "count")

> data[, lapply(.SD[, cols.to.sum, with = FALSE], sum), by = by.factors]
# ok now
     iquarter fico.bucket   balance     count
[1,]        0          25 0.7181843 1.6976771
[2,]        0          50 1.4145608 1.7594647
[3,]        0          75 1.7662068 1.1061450
[4,]        0         100 0.7172427 1.7047801
[5,]        1          25 1.5622786 1.8164999
[6,]        1          50 2.2023567 1.6930506
[7,]        1          75 1.4455576 1.2438718
[8,]        1         100 2.0601978 0.6195285
> data[, lapply(.SD[, cols.to.sum, with = FALSE], sum), by =
as.list(by.factors)]    # still ok
     iquarter fico.bucket   balance     count
[1,]        0          25 0.7181843 1.6976771
[2,]        0          50 1.4145608 1.7594647
[3,]        0          75 1.7662068 1.1061450
[4,]        0         100 0.7172427 1.7047801
[5,]        1          25 1.5622786 1.8164999
[6,]        1          50 2.2023567 1.6930506
[7,]        1          75 1.4455576 1.2438718
[8,]        1         100 2.0601978 0.6195285
> data[, lapply(.SD[, cols.to.sum, with = FALSE], sum), by =
{by.factors}]    # still ok
     iquarter fico.bucket   balance     count
[1,]        0          25 0.7181843 1.6976771
[2,]        0          50 1.4145608 1.7594647
[3,]        0          75 1.7662068 1.1061450
[4,]        0         100 0.7172427 1.7047801
[5,]        1          25 1.5622786 1.8164999
[6,]        1          50 2.2023567 1.6930506
[7,]        1          75 1.4455576 1.2438718
[8,]        1         100 2.0601978 0.6195285
> 

Matthew


On Thu, 2010-09-02 at 14:10 -0700, Short, Tom wrote:
> Johann, I think that "as.list" works because you need something other
> than a single variable. Single variables are treated differently.
> Wrapping it in brackets also works:
> 
>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
>          by = {by.factors}]
> 
> What it tries to do with a single variable is turn it into
> list(by.factors). I think that's unintended, but we need to check with
> Matthew.
> 
> Matthew, in the following lines of [.data.table, changing "list(" to
> "as.list(" would fix the problem above, but if something's a vector, it
> won't work.
> 
>                 if (mode(bysub) %in% c("name","character")) {
>                     # name : j may be a single unquoted column name but
> it must evaluate to list so this is a convenience to users
>                     # character: for backwards compatibility with v1.2
> syntax passing single character to 'by' rather than list()
>                     bysub =
> parse(text=paste("list(",bysub,")",sep=""))[[1]]
>                 }
> 
> This seems to work, but it feels a little kludgy:
> 
>                 if (mode(bysub) == "character") {
>                     # character: for backwards compatibility with v1.2
> syntax passing single character to 'by' rather than list()
>                     bysub = parse(text=paste("list(",bysub,")",
> sep=""))[[1]]
>                 }
>                 if (mode(bysub) == "name") {
>                     # name : j may be a single unquoted column name but
> it must evaluate to list so this is a convenience to users
>                     bysub = parse(text=paste("if (is.list(",bysub,")) ",
> bysub, " else list(", bysub, ")", sep=""))[[1]]
>                 }
> 
> - Tom
> 
> 
> > -----Original Message-----
> > From: datatable-help-bounces at lists.r-forge.r-project.org 
> > [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> > On Behalf Of Johann Hibschman
> > Sent: Tuesday, August 31, 2010 11:10
> > To: datatable-help at lists.r-forge.r-project.org
> > Subject: Re: [datatable-help] Programmatic by clauses
> > 
> > "Short, Tom" <TShort at epri.com> writes:
> > 
> > > This seems to work ("data" is different than before, so the balance 
> > > and count columns are different):
> > >
> > >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> > > +          by = as.list(by.factors)]
> > >      iquarter fico.bucket   balance     count
> > > [1,]        0          25 0.1427648 1.0449715
> > > [2,]        0          50 0.8598616 0.7946641
> > > [3,]        0          75 0.7799311 0.6733977
> > > [4,]        0         100 1.1240393 1.3415721
> > > [5,]        1          25 1.6179294 1.9870932
> > > [6,]        1          50 1.4562150 2.0651700
> > > [7,]        1          75 1.8457541 1.6337161
> > > [8,]        1         100 2.0330688 0.8113971
> > 
> > Using as.list works for me as well, thanks.
> > 
> > I had to change my summary function to return NA_real_ rather 
> > than just plain NA, but once I did that, everything seems to work.
> > 
> > I'm impressed.  It looks to be about 10 times faster, all 
> > considered. The actual aggregation step is something like 40 
> > times faster, but I have to do some extra work to get it into 
> > a format suitable for data.table.
> > 
> > I would still prefer there to be a more "plain vanilla" 
> > interface to all this.  I have no idea why using "as.list" 
> > works, and that makes me uncomfortable.
> > 
> > Regards,
> > Johann
> > 
> > >
> > >  
> > >
> > >> -----Original Message-----
> > >> From: datatable-help-bounces at lists.r-forge.r-project.org
> > >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > >> On Behalf Of Johann Hibschman
> > >> Sent: Monday, August 30, 2010 16:03
> > >> To: datatable-help at lists.r-forge.r-project.org
> > >> Subject: Re: [datatable-help] Programmatic by clauses
> > >> 
> > >> "Short, Tom" <TShort at epri.com> writes:
> > >> 
> > >> > Johann, how about the following:
> > >> > [snip example]
> > >> 
> > >> That's a good example; thanks.
> > >> 
> > >> > Here's a data.table version:
> > >> >      
> > >> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> > >> > +          by = lapply(aggregation.spec, function (f) f(data))]
> > >> >      iquarter fico.bucket   balance    count
> > >> > [1,]        0          25 0.5506797 1.133675
> > >> > [2,]        0          50 1.5175908 0.854553
> > >> > [3,]        0          75 0.4627294 1.171430
> > >> > [4,]        0         100 0.8354870 1.083211
> > >> > [5,]        1          25 1.7311503 1.210178
> > >> > [6,]        1          50 2.2930775 1.974759
> > >> > [7,]        1          75 1.0477066 1.973119
> > >> > [8,]        1         100 1.4351321 1.501291
> > >> 
> > >> I hadn't understood .SD before; that's a very good thing to know.
> > >> 
> > >> > I think the following should also work, but it doesn't. 
> > Note that I 
> > >> > didn't update to the very latest version of data.table, 
> > and I know 
> > >> > Matthew has changed some things that might already fix this.
> > >> >      
> > >> >
> > >> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> > >> > +          by = by.factors]
> > >> > Error in `[.data.table`(data, , lapply(.SD[, cols.to.sum, with = 
> > >> > FALSE],
> > >> > : 
> > >> >   column or expression 1 of 'by' list is not internally
> > >> type integer. 
> > >> > Do not quote column names. Example of correct use:
> > >> > by=list(colA,month(colB),...).
> > >> 
> > >> It still doesn't work.  Unfortunately, if I want to have a drop-in 
> > >> replacement, I have to operate on the equivalent by.factors.
> > >> 
> > >> I tried the following:
> > >> 
> > >>   dt.tmp <- cbind(data[, cols.to.sum, with=FALSE],
> > >>     data.table(by.factors))
> > >>   dt.agg <- dt.tmp[, lapply(.SD, sum), by=paste(names(by.factor),
> > >>     collapse=",")]
> > >> 
> > >> but I got:
> > >> 
> > >>   Error in `[.data.table`(dt.tmp, , lapply(.SD, sum.na), by = 
> > >> paste(names(by),  :
> > >>     by must evaluate to list
> > >> 
> > >> I tried
> > >> 
> > >>   by.names <- paste(names(by.factor), collapse=",")
> > >>   dt.agg <- dt.tmp[, lapply(.SD, sum), by=by.names]
> > >> 
> > >> but I got the same error.  Randomly wrapping things in 
> > eval or evalq 
> > >> didn't seem to work either.
> > >> 
> > >> Is there any chance that we could get a "less magic" 
> > version of the 
> > >> data.table extract that doesn't do anything fancy?  Or maybe a 
> > >> by.with=FALSE option?
> > >> 
> > >> I periodically try data.table, but I always run into this 
> > wall where 
> > >> I waste a few hours trying to guess how to make extract do what I 
> > >> want it to and finally give up.  It's frustrating, it seems as if 
> > >> only data.table were trying to be less clever, it would be very 
> > >> useful to me.
> > >> 
> > >> 
> > >> - Johann
> > >> 
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
> > > atatable-help
> > >> 
> > 
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
> atatable-help
> > 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list