[Rcpp-devel] Largest size of a NumericMatrix, segfaults and error messages

Mon Apr 1 17:04:03 CEST 2013

On Mon, 1 Apr 2013 08:15:48 -0500, Dirk Eddelbuettel <edd at debian.org> wrote:

> On 1 April 2013 at 14:48, Ramon Diaz-Uriarte wrote:
> | 
> | Dear All,
> | 
> | I am confused about creating Rcpp Numeric Matrices larger than
> | .Machine$integer.max. The code below illustrates some of the points
> | (probably with too much detail ;-). These are some things that puzzle me:

> Which R version did you use?  

Ooops, sorry. 

> version
               _                                           
platform       x86_64-pc-linux-gnu                         
arch           x86_64                                      
os             linux-gnu                                   
system         x86_64, linux-gnu                           
status         Patched                                     
major          2                                           
minor          15.3                                        
year           2013                                        
month          03                                          
day            03                                          
svn rev        62150                                       
language       R                                           
version.string R version 2.15.3 Patched (2013-03-03 r62150)
nickname       Security Blanket  

> Does what you attempt work _in straight C code
> bypassing Rcpp_ ?

In straight C++, using std::vector, this works (though not with a naive
C-style array, at least as I tried it; see the commented-out lines). It
will use ~ 35 GB of memory:

#include <iostream>
#include <vector>
#include <cstddef>  // size_t

int main() {

  // Both of these segfault, presumably because an automatic (stack) array
  // of this size far exceeds the stack limit:
  // double v1[500000L * 9000L];
  // double v1[4300000000];

  std::vector<double> v2(500000L * 9000L);
  std::cout << " Max size v2: " << v2.max_size() << std::endl;
  std::cout << " Current size v2: " << v2.size() << std::endl;

  double tt = 0;
  for(size_t t = 0; t < v2.size(); ++t)
     v2[t] = ++tt;
  std::cout << "\n Assigned to vector" << std::endl;
  std::cout << "\n Last value is " << v2[(500000L * 9000L) - 1] << std::endl;
  return 0;
}

Anyway, I guess the example is not really relevant for this case.

> If you used R 2.*, then the attempt makes little sense AFAICT.

Sorry, I was not clear. I was not (consciously) _attempting_ to do
that. In my "for real" code the dimensions of the object are set almost at
the end of a long simulation and in a few cases those numbers were much
larger than I expected (I did not realize how big until I started looking
into the segfaults and the errors).
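
Since the dimensions only become known at the end of the simulation, a
check just before allocation might turn this into an early, readable
error. A minimal sketch, assuming the R 2.x limit of 2^31 - 1 elements per
vector; checked_cells is a hypothetical helper, not an Rcpp or R function:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of Rcpp or R): compute nrow * ncol in
// 64-bit arithmetic and reject sizes that a 32-bit length cannot hold.
std::int64_t checked_cells(int nrow, int ncol) {
  if (nrow < 0 || ncol < 0)
    throw std::invalid_argument("dimensions must be non-negative");

  // Widen before multiplying so the product itself cannot overflow.
  const std::int64_t cells =
      static_cast<std::int64_t>(nrow) * static_cast<std::int64_t>(ncol);

  // Assumed limit: 2^31 - 1 elements, as for vectors in R 2.x.
  if (cells > std::numeric_limits<std::int32_t>::max())
    throw std::length_error("nrow * ncol exceeds 2^31 - 1");

  return cells;
}
```

In f1 this would be called right before "NumericMatrix outM(nrow, ncol);".
Exceptions thrown from a function exported via Rcpp attributes are
normally turned into R errors, so the caller should see a message rather
than a crash; Rcpp::stop() would be another way to report the problem.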

What I found confusing was the segmentation fault, because the behavior
seems inconsistent. Sometimes there was no segfault because the error
("negative length vectors are not allowed (...)") was triggered. But
sometimes the object seemed to have been created (so I assumed the sizes
were OK; yes, that was before I looked at the actual sizes) and then the
segfault took place later.
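
One reading that fits the three cases in the quoted script is that the
product of the dimensions ends up in a signed 32-bit int somewhere along
the allocation path. This is an assumption; it has not been checked
against the R or Rcpp sources. The sketch below only does the arithmetic:
wrapped_to_int32 is a hypothetical helper that shows what value a 64-bit
product would have after such a truncation (two's complement).

```cpp
#include <cstdint>

// Illustration only: the value nrow * ncol would take if it were truncated
// to a signed 32-bit int. Assumes non-negative inputs whose product fits
// in 64 bits. Not taken from the R or Rcpp sources.
std::int64_t wrapped_to_int32(std::int64_t nrow, std::int64_t ncol) {
  const std::uint64_t two32 = 4294967296ULL;  // 2^32
  const std::uint64_t low =
      static_cast<std::uint64_t>(nrow * ncol) % two32;  // keep low 32 bits

  // Bit patterns at or above 2^31 read back as negative numbers.
  if (low >= two32 / 2)
    return static_cast<std::int64_t>(low) - static_cast<std::int64_t>(two32);
  return static_cast<std::int64_t>(low);
}
```

Under that assumption, 4294967 x 500 stays at 2147483500 (it fits),
4294967 x 501 becomes -2143188829 (hence "negative length vectors"), and
500000 x 9000 becomes 205032704, a positive but much too small length. The
last case would explain a matrix that appears to be created and only
segfaults later, when cells beyond the allocated block are touched.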

R.

> If you used R 3.0.0, then you may have noticed that R is ahead of us, and you
> are welcome to help close the gap :)

> Dirk

> | 1. For some values of number of rows and columns, creating the matrix is
> | not allowed, with the message "negative length vectors are not allowed",
> | but with other values the creation of the matrix proceeds without
> | (apparent) troubles, even when the total size is >> 2^31 - 1.
> | 
> | 1.a. Is this intended? 
> | 
> | 1.b. I understand the error message is coming from R (not Rcpp) and thus
> | this is not something that can be made easier to understand?
> | 
> | 
> | 2. The part I found confusing is that the same problem (number of cells >
> | 2^31 - 1) is sometimes caught at object creation, but sometimes manifests
> | itself much later (either in the C++ code or later in R).
> | 
> | I was expecting (maybe the problem is my expectations) an error early on,
> | when creating the matrix; if the creation proceeds without trouble, I was
> | not expecting a segfault (as I think all cells are initialized to zero).
> | 
> | Is the recommended procedure to check if the product of dimensions is <
> | 2^31 - 1 before creation? (But then, this will change in R-3.0 in 64 bit
> | systems?). 
> | 
> | 
> | Best,
> | 
> | R.
> | 
> | 
> | 
> | // Beginning of file max-size.cpp
> | 
> | #include <Rcpp.h>
> | 
> | using namespace Rcpp;
> | 
> | 
> | // [[Rcpp::export]]
> | 
> | NumericMatrix f1(IntegerVector nr, IntegerVector nc,
> | 		 IntegerVector sf = 0) {
> |   int nrow = as<int>(nr);
> |   int ncol = as<int>(nc);
> |   int segf = as<int>(sf);
> |   
> |   NumericMatrix outM(nrow, ncol);
> |   std::cout << " After creating outM" << std::endl;
> |   outM(nrow - 1, 0) = 1;
> |   std::cout << " After assigning to last row, first column" 
> |             << std::endl;
> | 
> |   std::cout << " Some other value: 1, 0:   " 
> | 	    << outM(1, 0) << std::endl;
> | 
> |   if( (nrow > 1) && (ncol > 3) )
> |     std::cout << " Some other value: nrow - 1, ncol - 3:   " 
> | 	      << outM(nrow - 1, ncol - 3) << std::endl;
> | 
> |   outM(nrow - 1, ncol - 1) = 1;
> |   std::cout << " After assigning something to last cell" 
> |             << std::endl;
> | 
> |   std::cout << " Try to return the last assignment: " 
> | 	    << outM(nrow - 1, ncol - 1) << std::endl;
> | 
> |   if((nrow >= 500000) && segf) {
> |     std::cout << "\n Assign a few around/beyond 2^31 - 1. Should segfault\n";
> |     for(int i = 4290; i < 4300; ++i) {
> |       std::cout << "    i = " << i << std::endl;
> |       outM(nrow - 1, i) = 0;
> |     }
> |   }
> | 
> |   return wrap(outM);
> | }
> | 
> | // End of file max-size.cpp
> | 
> | 
> | 
> | 
> | 
> | ################################################
> | library(Rcpp)
> | sourceCpp("max-size.cpp", verbose = TRUE)
> | 
> | (tmp <- f1(4, 5))
> | 
> | 
> | 4294967 * 500 > .Machine$integer.max
> | tmp <- f1(4294967, 500)
> | object.size(tmp)/(4294967 * 500) ## ~ 8
> | 
> | 4294967 * 501 > .Machine$integer.max
> | tmp <- f1(4294967, 501) ## negative length vectors 
> | 
> | 500000 * 9000 > .Machine$integer.max
> | tmp <- f1(500000, 9000) ## sometimes segfaults
> | tmp[500000, 9000]
> | object.size(tmp) ## things are missing 
> | prod(dim(tmp)) > .Machine$integer.max
> | 
> | ## using either of these usually leads to segfault
> | 
> | for(i in (4290:4300)) print(tmp[500000, i]) 
> | 
> | f1(500000, 9000, 1)
> | 
> | #####################################################
> | 
> | 
> | -- 
> | Ramon Diaz-Uriarte
> | Department of Biochemistry, Lab B-25
> | Facultad de Medicina 
> | Universidad Autónoma de Madrid 
> | Arzobispo Morcillo, 4
> | 28029 Madrid
> | Spain
> | 
> | Phone: +34-91-497-2412
> | 
> | Email: rdiaz02 at gmail.com
> |        ramon.diaz at iib.uam.es
> | 
> | http://ligarto.org/rdiaz
> | 
> | 
> -- 
> Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
-- 
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina 
Universidad Autónoma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz