[datatable-help] Unexpectedly getting "Didn't allocate enough rows..."

Mon Aug 2 18:01:40 CEST 2010

It looks good.  Thanks.

Harish

--- On Thu, 7/22/10, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> From: Matthew Dowle <mdowle at mdowle.plus.com>
> Subject: Re: [datatable-help] Unexpectedly getting "Didn't allocate enough rows..."
> To: "Harish" <harishv_99 at yahoo.com>, datatable-help at lists.r-forge.r-project.org
> Date: Thursday, July 22, 2010, 6:43 AM
> This is now fixed, just committed.
> Bug #952 raised by Georg closed.
> Tests 173-175 added, which include Harish's test below.
> If anyone can test and confirm, much appreciated.
> Matthew
> 
> 
> On Tue, 2010-07-13 at 00:24 +0100, Matthew Dowle wrote: 
> > Thanks once again Harish. When I made the changes for fast grouping I
> > didn't quite finish it off, hence the 'to implement' in the message. I
> > didn't want to hold up 1.4 going to CRAN because of it. Sure enough,
> > pretty quickly after that, and more quickly than I expected, Georg V
> > added it to the bug tracker (#952) and I've been meaning to get to it.
> > 
> > What happens is this. It allocates memory for the largest group and
> > re-uses that same memory for all groups. No allocation and no garbage
> > collection. That's one reason it's fast on the input data side of things
> > for each eval(j). However, on the output side it's also fast because it
> > allocates the data.table result in advance. When it gets the result of
> > the j expression for each group, it sticks that data directly into the
> > result data.table at the correct row. It doesn't build a list() of
> > results which is then collapsed down.
> > 
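
[Illustration, not part of the original message.] The difference between collapsing a list of per-group results and writing into a result allocated in advance can be sketched in base R. This is only an analogy under stated assumptions: the toy vectors `values` and `groups` are made up, and the real work in data.table is done in C, not with an R loop.

```r
# Hypothetical toy data: one value column and one grouping column.
values <- c(5, 2, 3, 4, 9, 5, 1, 9)
groups <- c("a", "a", "b", "b", "d", "c", "a", "d")
pieces <- split(values, groups)   # one numeric vector per group

# Approach 1: build a list of per-group results, then collapse it.
collapsed <- unlist(lapply(pieces, sum), use.names = FALSE)

# Approach 2: allocate the whole result once, then write each group's
# result directly into its row. No intermediate list is built.
prealloc <- numeric(length(pieces))
for (g in seq_along(pieces)) {
  prealloc[g] <- sum(pieces[[g]])
}
```

Both approaches give the same per-group sums; the second avoids the intermediate list, which is the point made above about the output side.
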
> > It can't possibly know how many rows to allocate in advance though,
> > until it has run the j for all the groups, right? True, so it tries to
> > make a very good guess, optimised for most tasks. Most of the time we
> > either do i) single row aggregates (j is sum, mean, lm etc), or ii) a
> > subset of the group data (j is cumprod or [ or similar returning
> > multiple rows per group) or iii) NULL for the side effect of plotting
> > where no data output is required.
> > 
> > First, it runs the j for the first group. Depending on the number of
> > rows returned by the j on the first group it decides how to allocate the
> > result. If that is a single row for example, it allocates 1*number of
> > groups rows for the result. Most of the time that's what we need. Then it
> > proceeds to the 2nd group etc.
> > 
> > If it gets the guess wrong, then it needs to re-allocate memory for the
> > result using information from the later groups. That's what isn't done
> > yet. It's the right way to do it I think, but the re-allocate just isn't
> > implemented yet. In the vast majority of cases, it should only need one
> > re-allocate.
> > 
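
[Illustration, not part of the original message.] A minimal base-R sketch of "guess from the first group, re-allocate once if the guess is too small", using the data from Harish's report. Everything here is an assumption made for illustration: the helper `group_apply` is hypothetical, groups are processed in sorted key order, and the growth rule (rows needed so far plus the sizes of the groups not yet processed) is just one plausible reading of "using information from the later groups". It is not the code in dogroups.c.

```r
# Data from the original report, as plain vectors.
A <- c("a", "a", "b", "b", "d", "c", "a", "d")
B <- c("x1", "x2", "x2", "x1", "x2", "x1", "x1", "x2")
C <- c(5, 2, 3, 4, 9, 5, 1, 9)

# Hypothetical helper: apply j to each group of x, writing the results
# into a preallocated vector and counting how often it had to grow.
group_apply <- function(x, key, j) {
  pieces <- split(x, key)
  sizes <- lengths(pieces)
  n_groups <- length(pieces)

  # Run j on the first group and guess that every group returns
  # the same number of rows.
  first <- j(pieces[[1]])
  ans <- numeric(length(first) * n_groups)
  used <- 0L
  reallocs <- 0L

  for (g in seq_len(n_groups)) {
    res <- if (g == 1L) first else j(pieces[[g]])
    need <- used + length(res)
    if (need > length(ans)) {
      # Guess was too small: grow to what is needed now plus the
      # sizes of all groups still to come (assumed upper bound).
      remaining <- sum(sizes[seq_len(n_groups) > g])
      bigger <- numeric(need + remaining)
      bigger[seq_len(used)] <- ans[seq_len(used)]
      ans <- bigger
      reallocs <- reallocs + 1L
    }
    if (length(res) > 0L) ans[used + seq_along(res)] <- res
    used <- need
  }
  # Trim any unused tail of the allocation.
  list(values = ans[seq_len(used)], reallocs = reallocs)
}

key <- paste(A, B)
tight <- group_apply(C, key, function(v) v[v - min(v) < 3])
loose <- group_apply(C, key, function(v) v[v - min(v) < 5])
```

With the `< 3` filter the first group returns one row, so the guess is one row for each of the six groups, but seven rows are needed in total; that shortfall is the situation that produced the error in the report. With the `< 5` filter the first group returns two rows, the guess is large enough, and no re-allocate happens.
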
> > 'slow grow' means the method of either building up a growing list()
> > which is later collapsed, or growing the result slowly, for example in
> > powers of two or by a fixed number of rows somehow.
> > 
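
[Illustration, not part of the original message.] For contrast, a sketch of one of the 'slow grow' strategies mentioned above: a buffer that doubles each time it fills up. The function `grow_doubling` is hypothetical and only counts how many times the data has to be copied into a bigger buffer.

```r
# Hypothetical: append n values one at a time into a buffer that
# doubles in size whenever it is full, counting the copies made.
grow_doubling <- function(n) {
  buf <- numeric(1)
  used <- 0L
  copies <- 0L
  for (i in seq_len(n)) {
    if (used == length(buf)) {
      bigger <- numeric(2 * length(buf))
      bigger[seq_len(used)] <- buf   # copy everything written so far
      buf <- bigger
      copies <- copies + 1L
    }
    used <- used + 1L
    buf[used] <- i
  }
  list(values = buf[seq_len(used)], copies = copies)
}

out <- grow_doubling(1000)
```

Appending 1000 values this way copies the buffer ten times, whereas a good initial guess with at most one re-allocate copies it at most once.
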
> > Why the first group? I did try with the largest group, which improves the
> > guess, but that messes up side-effect only plotting. The plot appears
> > for the largest group first, followed by group 1, group 2, etc. It
> > wasn't right and even then a re-allocate might still be needed. So it's
> > cleaner to run for the first group, then make the good guess, then
> > proceed through groups 2 to n.
> > 
> > Long answer to a simple question I'm afraid.
> > 
> > Btw, you don't need to wrap with list in that example :
> >    DT[ , list(C[ C-min(C) < 5 ]), by=list(A,B) ]
> > you can just do this :
> >    DT[ , C[ C-min(C) < 5 ], by=list(A,B) ]
> > 
> > I'll see if I can implement the re-allocate soon. Or if there are any C
> > programmers listening, then it's this line
> >    // TO DO: implement R_realloc(?) here
> > that needs doing in dogroups.c.
> > 
> > Matthew
> > 
> > 
> > On Sun, 2010-07-11 at 02:09 -0700, Harish wrote:
> > > I am unexpectedly getting an error -- Didn't allocate enough rows. Must grow ans (to implement as we don't want default slow grow)
> > > 
> > > 
> > > DT <- data.table(
> > >         A=c("a","a","b","b","d","c","a","d"),
> > >         B=c("x1","x2","x2","x1","x2","x1","x1","x2"),
> > >         C=c(5,2,3,4,9,5,1,9)
> > >          )
> > > DT[ , list(C[ C-min(C) < 3 ]), by=list(A,B) ]    # Get error
> > > 
> > > DT[ , list(C[ C-min(C) < 5 ]), by=list(A,B) ]    # No error (as expected)
> > > 
> > > 
> > > Am I doing something that I shouldn't be?
> > > 
> > > 
> > > Harish
> > > 
> > > 
> > > 
> > >       
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help