[datatable-help] Summing over many variables

Matthew Dowle mdowle at mdowle.plus.com
Thu Dec 23 22:33:03 CET 2010


Yes that's one way. We aren't that happy with using lapply in j as it
loses the benefit of data.table.

I tend to 'flatten' tables like this. Try to have few columns. In this
case it would be either a 3 column table (grp,colname,value) or maybe a
4 column table if you ever want to group by "A" or
"B" (grp,letter,number,value). The query would then be
DT[,sum(value),by=list(grp,letter,number)]. You can then do pattern
matches and filters etc in the i rather than in the j e.g.
DT[letter=="A",sum(value),by=group] for just the "A"s. The answer comes
out in 'flat' format but you can always 'unflatten' the result to make
it look pretty or easier to read.  [Note that I sinned by using '==' in
the i just then invoking vector scan, so to avoid that for speed you
would setkey(letter,group) then DT["A",sum(value),by=group]], or getting
fancy if you only wanted some groups (say 1 and 3) then 'by without by'
e.g. DT[list("A",c(1,3)),sum(value)].

'flat' is a common way to use data.table to store higher dimensional
data, and especially sparse higher dimensional data.

The 'grp.1' repetition is a problem I'd like to remove. It's related to
this feature request (but is almost a bug). At the moment you have to
remove the grp.1 afterwards.
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=978&group_id=240&atid=978

Matthew

On Thu, 2010-12-23 at 14:51 -0500, Steve Lianoglou wrote:
> Hi,
> 
> On Thu, Dec 23, 2010 at 2:35 PM, Joseph Voelkel <jgvcqa at rit.edu> wrote:
> > Consider this set of code:
> >
> > DT1<-data.table(A1=1:100,A2=1:100,A3=1:100,B1=101:200,B2=101:200,B3=101:200,C1=301:400,D1=301:400,grp=rep(1:5,each=20))
> >
> > setkey(DT1,grp)
> >
> > (DT2<-DT1[,lapply(.SD,sum),by=grp]) # from data.table FAQ
> >
> > I have two questions:
> >
> > 1. I have many columns like C1 and D1 that I don't want to include in the
> > new data.table (nor do I want grp.1 in it). How can I nicely have these not
> > be part of my result? (If it helps, I know the indices for the A and the B
> > columns)
> 
> Since you already know the indices, it might be faster to just pull
> them out of DT2 after you're done processing it. Barring that, one way
> you could do it "inline" is this:
> 
> R> skip <- match(c("C1", "D1", "grp"), colnames(DT1))
> R> dt2 <- DT1[, {
>   lapply(.SD[, -skip, with=FALSE], sum)
> }, by='grp']
> 
> Perhaps there are more elegant ways ...
> 
> -steve
> 




More information about the datatable-help mailing list