[datatable-help] Programmatic by clauses

Tue Aug 31 17:10:12 CEST 2010

"Short, Tom" <TShort at epri.com> writes:

> This seems to work ("data" is different than before, so the balance and
> count columns are different):
>
>>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
> +          by = as.list(by.factors)]
>      iquarter fico.bucket   balance     count
> [1,]        0          25 0.1427648 1.0449715
> [2,]        0          50 0.8598616 0.7946641
> [3,]        0          75 0.7799311 0.6733977
> [4,]        0         100 1.1240393 1.3415721
> [5,]        1          25 1.6179294 1.9870932
> [6,]        1          50 1.4562150 2.0651700
> [7,]        1          75 1.8457541 1.6337161
> [8,]        1         100 2.0330688 0.8113971

Using as.list works for me as well, thanks.

I had to change my summary function to return NA_real_ rather than just
plain NA, but once I did that, everything seems to work.
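
The NA_real_ change presumably matters because a bare NA is a logical value, so a summary function that returns NA for one group and a number for another produces results of different types across groups. A minimal base R sketch of the type difference; sum.or.na is a hypothetical stand-in, since the actual summary function is not shown in this thread:

```r
# Hypothetical summary function (not the one from this thread):
# returns a missing value for empty input, otherwise the sum.
sum.or.na <- function(x) {
  if (length(x) == 0L) NA_real_ else sum(x)
}

# A bare NA is logical; NA_real_ is a double, like the result of
# sum() on a double vector.
na.plain.type <- typeof(NA)
na.real.type  <- typeof(NA_real_)

# Both branches of the function now return the same type.
empty.type <- typeof(sum.or.na(numeric(0)))
full.type  <- typeof(sum.or.na(c(1.5, 2.5)))

print(c(na.plain.type, na.real.type, empty.type, full.type))
```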

I'm impressed.  It looks to be about 10 times faster, all things
considered.  The actual aggregation step is something like 40 times
faster, but I have to do some extra work to get the data into a
format suitable for data.table.

I would still prefer there to be a more "plain vanilla" interface to all
this.  I have no idea why using "as.list" works, and that makes me
uncomfortable.
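
One plausible reading, offered as an assumption rather than a statement about data.table's internals: if by.factors is a data.frame, as.list strips the data.frame class and leaves a plain named list with one vector per column, which is the shape that the error message's example, by=list(colA,month(colB),...), asks for. A base R sketch with made-up values:

```r
# Made-up stand-in for by.factors; assumed here to be a data.frame.
by.factors <- data.frame(
  iquarter    = c(0, 0, 1, 1),
  fico.bucket = c(25, 50, 25, 50)
)

by.list <- as.list(by.factors)

# The data.frame class is gone; what remains is a plain named list
# holding one vector per column, each as long as the original row count.
print(class(by.factors))
print(class(by.list))
print(names(by.list))
print(lengths(by.list))
```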

Regards,
Johann

>> -----Original Message-----
>> From: datatable-help-bounces at lists.r-forge.r-project.org 
>> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
>> On Behalf Of Johann Hibschman
>> Sent: Monday, August 30, 2010 16:03
>> To: datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Programmatic by clauses
>> 
>> "Short, Tom" <TShort at epri.com> writes:
>> 
>> > Johann, how about the following:
>> > [snip example]
>> 
>> That's a good example; thanks.
>> 
>> > Here's a data.table version:
>> >      
>> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
>> > +          by = lapply(aggregation.spec, function (f) f(data))]
>> >      iquarter fico.bucket   balance    count
>> > [1,]        0          25 0.5506797 1.133675
>> > [2,]        0          50 1.5175908 0.854553
>> > [3,]        0          75 0.4627294 1.171430
>> > [4,]        0         100 0.8354870 1.083211
>> > [5,]        1          25 1.7311503 1.210178
>> > [6,]        1          50 2.2930775 1.974759
>> > [7,]        1          75 1.0477066 1.973119
>> > [8,]        1         100 1.4351321 1.501291
>> 
>> I hadn't understood .SD before; that's a very good thing to know.
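
For readers in the same position: in the calls quoted here, .SD stands for the subset of data belonging to the current group, with the grouping columns left out. A small sketch with made-up values (the table dt and its contents are illustrative, not the data from this thread), assuming the data.table package is installed:

```r
library(data.table)

# Illustrative data only.
dt <- data.table(
  iquarter = c(0, 0, 1, 1),
  balance  = c(1, 2, 3, 4),
  count    = c(10, 20, 30, 40)
)

# Within each group, .SD holds the non-grouping columns (balance, count),
# so lapply(.SD, sum) sums each of them per value of iquarter.
agg <- dt[, lapply(.SD, sum), by = iquarter]
print(agg)
```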
>> 
>> > I think the following should also work, but it doesn't. Note that I 
>> > didn't update to the very latest version of data.table, and I know 
>> > Matthew has changed some things that might already fix this.
>> >      
>> >
>> >>     data[, lapply(.SD[, cols.to.sum, with = FALSE], sum),
>> > +          by = by.factors]
>> > Error in `[.data.table`(data, , lapply(.SD[, cols.to.sum, with = FALSE], :
>> >   column or expression 1 of 'by' list is not internally type integer.
>> > Do not quote column names. Example of correct use:
>> > by=list(colA,month(colB),...).
>> 
>> It still doesn't work.  Unfortunately, if I want to have a 
>> drop-in replacement, I have to operate on the equivalent by.factors.
>> 
>> I tried the following:
>> 
>>   dt.tmp <- cbind(data[, cols.to.sum, with=FALSE],
>>     data.table(by.factors))
>>   dt.agg <- dt.tmp[, lapply(.SD, sum), by=paste(names(by.factor),
>>     collapse=",")]
>> 
>> but I got:
>> 
>>   Error in `[.data.table`(dt.tmp, , lapply(.SD, sum.na), by = paste(names(by),  :
>>     by must evaluate to list
>> 
>> I tried
>> 
>>   by.names <- paste(names(by.factor), collapse=",")
>>   dt.agg <- dt.tmp[, lapply(.SD, sum), by=by.names]
>> 
>> but I got the same error.  Randomly wrapping things in eval 
>> or evalq didn't seem to work either.
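
For what it is worth, the documentation of later data.table releases describes by as also accepting a character vector of column names, and .SDcols as selecting the columns that go into .SD. Together they cover the programmatic case without pasting names into one string. A sketch under that assumption (a reasonably recent data.table; all values made up, and the "other" column exists only to show that it is left out):

```r
library(data.table)

# Illustrative data only.
dt <- data.table(
  iquarter    = c(0, 0, 1, 1, 1),
  fico.bucket = c(25, 25, 25, 50, 50),
  balance     = c(1, 2, 3, 4, 5),
  count       = c(1, 1, 1, 1, 1),
  other       = letters[1:5]
)

by.names    <- c("iquarter", "fico.bucket")  # grouping columns, as strings
cols.to.sum <- c("balance", "count")         # columns to aggregate

# by takes the character vector directly; .SDcols restricts .SD to the
# columns that should be summed, so "other" is ignored.
dt.agg <- dt[, lapply(.SD, sum), by = by.names, .SDcols = cols.to.sum]
print(dt.agg)
```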
>> 
>> Is there any chance that we could get a "less magic" version 
>> of the data.table extract that doesn't do anything fancy?  Or 
>> maybe a by.with=FALSE option?
>> 
>> I periodically try data.table, but I always run into this 
>> wall where I waste a few hours trying to guess how to make 
>> extract do what I want it to and finally give up.  It's
>> frustrating; it seems that if only data.table tried to be
>> less clever, it would be very useful to me.
>> 
>> 
>> - Johann
>> 
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>