[datatable-help] Unexpectedly getting "Didn't allocate enough rows..."

Matthew Dowle mdowle at mdowle.plus.com
Tue Jul 13 01:24:41 CEST 2010


Thanks once again Harish. When I made the changes for fast grouping I
didn't quite finish it off, hence the 'to implement' in the message. I
didn't want to hold up 1.4 going to CRAN because of it. Sure enough,
pretty quickly after that, and more quickly than I expected, Georg V
added it to the bug tracker (#952) and I've been meaning to get to it.

What happens is this. It allocates memory for the largest group and
re-uses that same memory for all groups. No further allocation and no
garbage collection per group. That's one reason it's fast on the input
data side of things for each eval(j). However, on the output side it's
also fast because it
allocates the data.table result in advance. When it gets the result of
the j expression for each group, it sticks that data directly into the
result data.table at the correct row. It doesn't build a list() of
results which is then collapsed down.
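A minimal C sketch of the two ideas above, for illustration only and not the code in dogroups.c. The names group_sums and eval_j_sum are hypothetical, j is assumed to be a plain sum(), and the groups are assumed to be contiguous in x (described by start/len arrays).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the code in dogroups.c.
 * Stands in for eval(j): here j is simply sum(). */
static double eval_j_sum(const double *group, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += group[i];
    return s;
}

/* One scratch buffer sized for the largest group is allocated once and
 * reused for every group; each result goes straight into out[g], so no
 * per-group list of results is built and collapsed afterwards.
 * Returns 0 on success, -1 if the scratch allocation fails. */
int group_sums(const double *x, const size_t *start, const size_t *len,
               size_t ngroups, double *out)
{
    assert(x != NULL && start != NULL && len != NULL && out != NULL);

    size_t maxlen = 1;  /* at least 1 so malloc never gets a zero size */
    for (size_t g = 0; g < ngroups; g++)
        if (len[g] > maxlen) maxlen = len[g];

    double *scratch = malloc(maxlen * sizeof *scratch);
    if (scratch == NULL) return -1;

    for (size_t g = 0; g < ngroups; g++) {
        /* Input side: copy this group's rows into the shared buffer. */
        memcpy(scratch, x + start[g], len[g] * sizeof *scratch);
        /* Output side: write directly into the preallocated result row. */
        out[g] = eval_j_sum(scratch, len[g]);
    }
    free(scratch);
    return 0;
}
```

The caller allocates out with one slot per group before the loop starts, which is the single-row-aggregate case.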

It can't possibly know how many rows to allocate in advance though,
until it has run the j for all the groups, right? True, so it tries to
make a very good guess, optimised for most tasks. Most of the time we
either do i) single row aggregates (j is sum, mean, lm etc), or ii) a
subset of the group data (j is cumprod or [ or similar returning
multiple rows per group) or iii) NULL for the side effect of plotting
where no data output is required.

First, it runs the j for the first group. Depending on the number of
rows returned by the j on the first group it decides how to allocate the
result. If that is a single row for example, it allocates 1*number of
groups rows for the result. Most of the time that's what we need. Then it
proceeds to the 2nd group etc.
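To see how the guess can fall short, here is a hedged C sketch applied to the data in the quoted example below. It assumes the guess is (rows returned by the first group) * (number of groups); only the single-row case is confirmed above, so the multi-row case is an assumption. The function names are hypothetical, and the values of C are laid out group by group for the six (A,B) groups.

```c
#include <assert.h>
#include <stddef.h>

/* Rows returned by C[ C-min(C) < thr ] for one group of n values. */
size_t rows_for_group(const double *c, size_t n, double thr)
{
    assert(n > 0);
    double mn = c[0];
    for (size_t i = 1; i < n; i++)
        if (c[i] < mn) mn = c[i];

    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (c[i] - mn < thr) k++;
    return k;
}

/* Assumed guess: rows from the first group times the number of groups.
 * Stores the guess and the rows actually needed; returns 1 if the guess
 * is large enough and 0 if the result would run out of rows. */
int check_guess(const double *c, const size_t *start, const size_t *len,
                size_t ngroups, double thr, size_t *guess, size_t *needed)
{
    assert(ngroups > 0);
    *guess = rows_for_group(c + start[0], len[0], thr) * ngroups;
    *needed = 0;
    for (size_t g = 0; g < ngroups; g++)
        *needed += rows_for_group(c + start[g], len[g], thr);
    return *needed <= *guess;
}
```

With threshold 3 the first group (a,x1) holds C = 5, 1 and returns one row, so under this assumption the guess is 6 rows. The (d,x2) group holds C = 9, 9 and returns two rows, so 7 rows are needed and the result runs out of room. With threshold 5 the first group returns two rows and the guess is large enough.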

If it gets the guess wrong, then it needs to re-allocate memory for the
result using information from the later groups. That's what isn't done
yet. It's the right way to do it I think, but the re-allocate just isn't
implemented yet. In the vast majority of cases, it should only need one
re-allocate.
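One possible shape for that re-allocate, as a C sketch only. The policy here is an assumption, not what dogroups.c does: when a group does not fit, the final size is projected from the average rows per group seen so far and the result is grown once to that size with the standard realloc(). The names result_t, ensure_room and fill_result are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    double *rows;   /* result column */
    size_t  used;   /* rows filled so far */
    size_t  cap;    /* rows allocated */
} result_t;

/* Make room for `extra` rows while processing group g (0-based).
 * Returns 0 if there was room, 1 if it re-allocated, -1 on failure.
 * Overflow of needed * ngroups is ignored in this sketch. */
int ensure_room(result_t *r, size_t extra, size_t g, size_t ngroups)
{
    assert(g < ngroups);
    size_t needed = r->used + extra;
    if (needed <= r->cap) return 0;

    /* Assumed policy: scale rows seen so far up to all groups (ceiling). */
    size_t done = g + 1;
    size_t projected = (needed * ngroups + done - 1) / done;

    double *p = realloc(r->rows, projected * sizeof *p);
    if (p == NULL) return -1;
    r->rows = p;
    r->cap = projected;
    return 1;
}

/* Simulate filling a result given the rows each group returns.
 * Returns the number of re-allocations, or -1 on allocation failure. */
int fill_result(const size_t *rows_per_group, size_t ngroups,
                size_t initial_cap, size_t *final_cap)
{
    result_t r = { NULL, 0, initial_cap };
    r.rows = malloc((initial_cap ? initial_cap : 1) * sizeof *r.rows);
    if (r.rows == NULL) return -1;

    int reallocs = 0;
    for (size_t g = 0; g < ngroups; g++) {
        int rc = ensure_room(&r, rows_per_group[g], g, ngroups);
        if (rc < 0) { free(r.rows); return -1; }
        reallocs += rc;
        for (size_t i = 0; i < rows_per_group[g]; i++)
            r.rows[r.used++] = (double)g;   /* placeholder result values */
    }
    *final_cap = r.cap;
    free(r.rows);
    return reallocs;
}
```

Projecting from the groups already processed is what keeps this to a single re-allocate in the common case, rather than growing a little each time the result overflows.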

'slow grow' in the message means the method of either building up a
growing list() which is later collapsed, or growing the result
gradually, for example in powers of two or by a fixed number of rows.

Why the first group? I did try with the largest group which improves the
guess, but that messes up side-effect only plotting. The plot appears
for the largest group first, followed by group 1, group 2, etc. It
wasn't right and even then a re-allocate might still be needed. So it's
cleaner to run for the first group, then make the good guess, then
proceed through groups 2 to n.

Long answer to a simple question I'm afraid.

Btw, you don't need to wrap with list in that example :
   DT[ , list(C[ C-min(C) < 5 ]), by=list(A,B) ]
you can just do this :
   DT[ , C[ C-min(C) < 5 ], by=list(A,B) ]

I'll see if I can implement the re-allocate soon. Or if there are any C
programmers listening, then it's this line
   // TO DO: implement R_realloc(?) here
that needs doing in dogroups.c.

Matthew


On Sun, 2010-07-11 at 02:09 -0700, Harish wrote:
> I am unexpectedly getting an error -- Didn't allocate enough rows. Must grow ans (to implement as we don't want default slow grow)
> 
> 
> DT <- data.table(
>          A=c("a","a","b","b","d","c","a","d"),
>          B=c("x1","x2","x2","x1","x2","x1","x1","x2"),
>          C=c(5,2,3,4,9,5,1,9)
>          )
> DT[ , list(C[ C-min(C) < 3 ]), by=list(A,B) ]    # Get error
> 
> DT[ , list(C[ C-min(C) < 5 ]), by=list(A,B) ]    # No error (as expected)
> 
> 
> Am I doing something that I shouldn't be?
> 
> 
> Harish



