[datatable-help] Summing over many variables
Matthew Dowle
mdowle at mdowle.plus.com
Sat Jan 8 09:48:52 CET 2011
Hi Joseph,
Fixed in 1.5.2, i.e. the ugly double eval is no longer needed. Just eval()
the parse()-ed expression; there is no need for the quote().
Thanks for finding this one.
Matthew
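
For readers following the thread, the difference between the two forms can be seen in base R alone. The sketch below is illustrative only: it uses a plain list named `cols` (a hypothetical stand-in for the columns of one group) in place of a data.table, so it shows how parse(), quote() and eval() interact, not how data.table handles j.

```r
# Hypothetical stand-in for the columns of one group (not a data.table).
cols <- list(A1 = 1:100)

# With quote() inside the text, parse() yields an expression whose single
# element is the call quote(sum(A1)); the first eval() only unwraps it.
ASumExpr <- parse(text = "quote(sum(A1))")
unwrapped <- eval(ASumExpr)           # the unevaluated call sum(A1)
double_eval <- eval(unwrapped, cols)  # the second eval() computes the sum

# Without quote(), one eval() of the parsed expression is enough.
ASumExprNoQ <- parse(text = "sum(A1)")
single_eval <- eval(ASumExprNoQ, cols)

print(is.call(unwrapped))
print(c(double_eval, single_eval))
```

Both routes give the same number here; the thread concerns whether data.table's j accepts the single-eval form, which the message above reports as fixed in 1.5.2.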
On Sun, 2011-01-02 at 08:21 +0000, Matthew Dowle wrote:
> Thanks again. Yes, looks like a bug. Added here:
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1243&group_id=240&atid=975
> Matthew
>
> On Fri, 2010-12-31 at 17:10 -0500, Joseph Voelkel wrote:
> > Matthew, just to address point 5 (double eval of quoted works, single eval on unquoted does not):
> >
> > > dt1key<-data.table(A1=1:100,onekey=rep(1:2,each=50))
> > > setkey(dt1key,onekey)
> > > ASumExpr<-parse(text="quote(sum(A1))")
> > > ASumExpr
> > expression(quote(sum(A1)))
> > attr(,"srcfile")
> > <text>
> > > str(eval(ASumExpr))
> > language sum(A1)
> > >
> > > ASumExprNoQ<-parse(text="sum(A1)")
> > > ASumExprNoQ
> > expression(sum(A1))
> > attr(,"srcfile")
> > <text>
> > > str(ASumExprNoQ) # not the same as eval(ASumExpr)
> > length 1 expression(sum(A1))
> > - attr(*, "srcref")=List of 1
> > ..$ :Class 'srcref' atomic [1:6] 1 1 1 7 1 7
> > .. .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x060cef18>
> > - attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x060cef18>
> > >
> > > # test of 3 ways to obtain results
> > > dt1key[,sum(A1),by=onekey]
> > >      onekey   V1
> > > [1,]      1 1275
> > > [2,]      2 3775
> > > dt1key[,eval(eval(ASumExpr)),by=onekey]
> > >      onekey   V1
> > > [1,]      1 1275
> > > [2,]      2 3775
> > > dt1key[,eval(ASumExprNoQ),by=onekey]
> > Error in `[.data.table`(dt1key, , eval(ASumExprNoQ), by = onekey) :
> > columns of j don't evaluate to consistent types for each group
> > In addition: Warning message:
> > In if (as.character(jsub[[1]]) %in% c("list", "DT")) { :
> > the condition has length > 1 and only the first element will be used
> > >
> > >
> >
> > -----Original Message-----
> > From: Matthew Dowle [mailto:mdowlenoreply at virginmedia.com] On Behalf Of Matthew Dowle
> > Sent: Friday, December 31, 2010 9:13 AM
> > To: Joseph Voelkel
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Subject: RE: [datatable-help] Summing over many variables
> >
> > Hi,
> >
> > I don't quite follow all of that e.g. I'm thinking secondary keys at
> > points (manual now, or 'built-in' feature request). Anyway, sounds like
> > it's working. On point 3 I'm not sure that's really data.table, rather
> > the difference between a vectorized sum and apply()-ed sum. You should
> > see the same difference with a data.frame.
> >
> > Where it isn't working (point 5) was there an error message or was the
> > result incorrect? Might be a clue that reveals a bug.
> >
> > Matthew
> >
> > On Wed, 2010-12-29 at 12:57 -0500, Joseph Voelkel wrote:
> > > Thanks, Matthew.
> > >
> > > 1. Yes, you have the subsetting of j on your faq.
> > > 2. The double eval appears to handle this subsetting. In my smaller problem, which I am working on first, I have about 55K records and 250 variables. Using either A1+A2+A3+A4+A5 or eval(eval()) takes about 0.22 sec of user time. So, from this indirect measure, the two are equally efficient.
> > > 3. By the way, for this example, I used 6 keys, which happened for this problem to correspond to 1 record for each unique key. That is, my output also contained 55K records. I originally solved the problem by using apply with sum on the five columns that contained the A1 through A5 value, e.g. apply(DT1[,11:15,with=FALSE],1,sum). This operation took about 0.62 sec. So, even here, data.table is 3x faster than apply.
> > > 4. Of course, no key is really needed here, so if I just want to return the sum along with the key vars, I can just use DT1[,list(key1,key2,key3,key4,key5,key6,sum=A1+A2+A3+A4+A5)] which runs in under 0.01 seconds.
> > > 5. Finally, I tried your idea of removing the quote and just trying the one eval(). It worked with a simple contrived example, but not for my more complex one--I have no idea why not, because the two seem analogous...
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Matthew Dowle [mailto:mdowlenoreply at virginmedia.com] On Behalf Of Matthew Dowle
> > > Sent: Tuesday, December 28, 2010 12:22 PM
> > > To: Joseph Voelkel
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Subject: Re: [datatable-help] Summing over many variables
> > >
> > > Glad that works. Thanks for posting back. One thing with that approach
> > > is that data.table inspects the j expression to see which columns it
> > > uses. It only subsets the ones that are used, for efficiency. There's a
> > > faq on that I think. If the expression is wrapped up inside an eval I
> > > think it still inspects the j but I can't quite remember. I'd be
> > > surprised if that works with the double eval like that. If A runs from 1
> > > to 100 in your real data and you're taking many sub-sums of 5, then this
> > > could make a big difference. Try timing sum(A1) vs sum(A2+A3+A4+A5) with
> > > and without the eval(eval()). That should reveal whether the j is being
> > > inspected ok.
> > > Also looking at it again, you shouldn't need the quote() inside the text
> > > passed to parse. Then it's just a single eval and j inspection should be
> > > ok I think i.e. DT1[,eval(ASumExpr),by=grp] rather than
> > > DT1[,eval(eval(ASumExpr)),by=grp]
> > >
> > > Matthew
> > >
> > >
> > > On Mon, 2010-12-27 at 13:23 -0500, Joseph Voelkel wrote:
> > > > I like Matthew's idea of flattening tables. But, as usual, I did not tell the whole story in my first post. I will probably want to look at many expressions, for example,
> > > >
> > > > sum(A1+A2+A3+A4+A5)
> > > > sum(A2+A3+A4+A5+A6)
> > > > sum(A3+A4+A5+A6+A7)
> > > > sum((A1+A2)/2 - (A3+A4)/2)
> > > >
> > > > To be able to investigate a sequence of these easily, I found (after some trial and error, and then thinking about it a bit more to try to make my problem look like one from the datatable-faq) that this will do the trick:
> > > >
> > > > library(data.table)
> > > >
> > > > # create data table
> > > > DT1<-data.table(A1=1:1000000,A2=1:1000000,A3=1:1000000,A4=1:1000000,A5=1:1000000,grp=rep(1:50000,each=20))
> > > > setkey(DT1,grp)
> > > >
> > > > # Say I want DT1[,sum(A1+A2+A3+A4+A5),by=grp]
> > > >
> > > > # First, create expression of interest, and convert it to data-table-useful form
> > > > ASumExpr<-parse(text=paste("quote(sum(",paste("A",1:5,sep="",collapse="+"),"))",sep=""))
> > > > # (Next few lines: to help me and maybe you see what this looks like...)
> > > > ASumExpr
> > > > str(ASumExpr)
> > > > eval(ASumExpr)
> > > > str(eval(ASumExpr))
> > > > str(quote(mean(x))) # from example in datatable-faq.pdf. So eval(ASumExpr) looks good
> > > >
> > > > # long-hand typing method. OK for one or two, but not in general
> > > > system.time(dt2a<-DT1[,sum(A1+A2+A3+A4+A5),by=grp])
> > > > # formula method. This will be useful.
> > > > system.time(dt2b<-DT1[,eval(eval(ASumExpr)),by=grp])
> > > >
> > > > identical(dt2a, dt2b)
> > > >
> > > > # Fast and easy to write. Just what I wanted. Thanks again for the ideas that led to this useful solution.
> > > >
> > > > Joe V.
> > > >
> > > > -----Original Message-----
> > > > From: Matthew Dowle [mailto:mdowlenoreply at virginmedia.com] On Behalf Of Matthew Dowle
> > > > Sent: Thursday, December 23, 2010 4:33 PM
> > > > To: Joseph Voelkel
> > > > Cc: datatable-help at lists.r-forge.r-project.org
> > > > Subject: Re: [datatable-help] Summing over many variables
> > > >
> > > >
> > > > Yes that's one way. We aren't that happy with using lapply in j as it
> > > > loses the benefit of data.table.
> > > >
> > > > I tend to 'flatten' tables like this. Try to have few columns. In this
> > > > case it would be either a 3 column table (grp,colname,value) or maybe a
> > > > 4 column table if you ever want to group by "A" or
> > > > "B" (grp,letter,number,value). The query would then be
> > > > DT[,sum(value),by=list(grp,letter,number)]. You can then do pattern
> > > > matches and filters etc in the i rather than in the j e.g.
> > > > DT[letter=="A",sum(value),by=grp] for just the "A"s. The answer comes
> > > > out in 'flat' format but you can always 'unflatten' the result to make
> > > > it look pretty or easier to read. [Note that I sinned by using '==' in
> > > > the i just then, which invokes a vector scan. To avoid that for speed you
> > > > would setkey(DT,letter,grp) then DT["A",sum(value),by=grp], or getting
> > > > fancy, if you only wanted some groups (say 1 and 3), then 'by without by'
> > > > e.g. DT[list("A",c(1,3)),sum(value)].]
> > > >
> > > > 'flat' is a common way to use data.table to store higher dimensional
> > > > data, and especially sparse higher dimensional data.
> > > >
> > > > The 'grp.1' repetition is a problem I'd like to remove. It's related to
> > > > this feature request (but is almost a bug). At the moment you have to
> > > > remove the grp.1 afterwards.
> > > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=978&group_id=240&atid=978
> > > >
> > > > Matthew
> > > >
> > > > _______________________________________________
> > > > datatable-help mailing list
> > > > datatable-help at lists.r-forge.r-project.org
> > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > >
> >
> >
>
>
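
The window sums Joseph lists earlier in the thread (A1 to A5, A2 to A6, and so on) can be generated rather than typed. This is a minimal sketch in base R, assuming the helper name `make_sum_expr` and the toy list `cols` (both hypothetical, not from the thread). The text is parsed without quote(), so a single eval() suffices.

```r
# Hypothetical helper: build the expression sum(A<first> + ... + A<last>).
make_sum_expr <- function(first, last) {
  terms <- paste("A", first:last, sep = "", collapse = "+")
  parse(text = paste("sum(", terms, ")", sep = ""))
}

# Toy columns A1..A7, each holding 1:10 (illustrative data only).
cols <- setNames(rep(list(1:10), 7), paste("A", 1:7, sep = ""))

# One expression per window of five consecutive columns.
exprs <- lapply(1:3, function(i) make_sum_expr(i, i + 4))
window_sums <- sapply(exprs, eval, envir = cols)

print(sapply(exprs, function(e) deparse(e[[1]])))
print(window_sums)

# In data.table the stored expression would go in j, following the form
# used in the thread (not run here):
#   e <- make_sum_expr(1, 5); DT1[, eval(e), by = grp]
```

Keeping the expressions in a list makes it easy to loop over many candidate sums without retyping each one.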