[datatable-help] Summing over many variables

Mon Dec 27 19:23:26 CET 2010

I like Matthew's idea of flattening tables. But, as usual, I did not tell the whole story in my first post. I will probably want to look at many expressions, for example,

sum(A1+A2+A3+A4+A5)
sum(A2+A3+A4+A5+A6)
sum(A3+A4+A5+A6+A7)
sum((A1+A2)/2 - (A3+A4)/2)

To be able to investigate a sequence of these easily, I found (after some trial and error, and then thinking about it a bit more to try to make my problem look like one from the datatable-faq) that this will do the trick:

library(data.table)

# create data table
DT1<-data.table(A1=1:1000000,A2=1:1000000,A3=1:1000000,A4=1:1000000,A5=1:1000000,grp=rep(1:50000,each=20))
setkey(DT1,grp)

# Say I want DT1[,sum(A1+A2+A3+A4+A5),by=grp]

# First, create expression of interest, and convert it to data-table-useful form
ASumExpr<-parse(text=paste("quote(sum(",paste("A",1:5,sep="",collapse="+"),"))",sep=""))
# (Next few lines: to help me and maybe you see what this looks like...)
ASumExpr
str(ASumExpr)
eval(ASumExpr)
str(eval(ASumExpr))
str(quote(mean(x))) # from example in datatable-faq.pdf. So eval(ASumExpr) looks good

# long-hand typing method. OK for one or two, but not in general
system.time(dt2a<-DT1[,sum(A1+A2+A3+A4+A5),by=grp])
# formula method. This will be useful.
system.time(dt2b<-DT1[,eval(eval(ASumExpr)),by=grp])

identical(dt2a, dt2b)

# Fast and easy to write. Just what I wanted. Thanks again for the ideas that lead to this useful solution.

Joe V.

-----Original Message-----
From: Matthew Dowle [mailto:mdowlenoreply at virginmedia.com] On Behalf Of Matthew Dowle
Sent: Thursday, December 23, 2010 4:33 PM
To: Joseph Voelkel
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Summing over many variables

Yes that's one way. We aren't that happy with using lapply in j as it
loses the benefit of data.table.

I tend to 'flatten' tables like this. Try to have few columns. In this
case it would be either a 3 column table (grp,colname,value) or maybe a
4 column table if you ever want to group by "A" or
"B" (grp,letter,number,value). The query would then be
DT[,sum(value),by=list(grp,letter,number)]. You can then do pattern
matches and filters etc in the i rather than in the j e.g.
DT[letter=="A",sum(value),by=group] for just the "A"s. The answer comes
out in 'flat' format but you can always 'unflatten' the result to make
it look pretty or easier to read.  [Note that I sinned by using '==' in
the i just then invoking vector scan, so to avoid that for speed you
would setkey(letter,group) then DT["A",sum(value),by=group]], or getting
fancy if you only wanted some groups (say 1 and 3) then 'by without by'
e.g. DT[list("A",c(1,3)),sum(value)].

'flat' is a common way to use data.table to store higher dimensional
data, and especially sparse higher dimensional data.

The 'grp.1' repetition is a problem I'd like to remove. It's related to
this feature request (but is almost a bug). At the moment you have to
remove the grp.1 afterwards.
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=978&group_id=240&atid=978

Matthew