[Rcpp-devel] Largest size of a NumericMatrix, segfaults and error messages

Mon Apr 1 17:13:54 CEST 2013

On 1 April 2013 at 17:04, Ramon Diaz-Uriarte wrote:
| 
| 
| 
| On Mon, 1 Apr 2013 08:15:48 -0500, Dirk Eddelbuettel <edd at debian.org> wrote:
| 
| > On 1 April 2013 at 14:48, Ramon Diaz-Uriarte wrote:
| > | 
| > | Dear All,
| > | 
| > | I am confused about creating Rcpp Numeric Matrices larger than
| > | .Machine$integer.max. The code below illustrates some of the points
| > | (probably with too much detail ;-). These are some things that puzzle me:
| 
| > Which R version did you use?  
| 
| Ooops, sorry. 
| 
| > version
|                _                                           
| platform       x86_64-pc-linux-gnu                         
| arch           x86_64                                      
| os             linux-gnu                                   
| system         x86_64, linux-gnu                           
| status         Patched                                     
| major          2                                           
| minor          15.3                                        

I think you can't really expect this to work.  R, up to this version, has the
very famous 2^31 - 1 index limit.
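A guard along the following lines could be run before a matrix is created. It is only a sketch: checked_cell_count is a hypothetical helper, not part of Rcpp or R, and it assumes the classic limit of 2^31 - 1 cells applies to the object being created.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not an Rcpp or R function): returns nrow * ncol
// if the product fits the classic 2^31 - 1 limit, and throws otherwise.
inline std::int32_t checked_cell_count(std::int32_t nrow, std::int32_t ncol) {
  if (nrow < 0 || ncol < 0)
    throw std::range_error("negative dimension");
  // Multiply in 64 bits so the product itself cannot overflow.
  const std::int64_t cells = static_cast<std::int64_t>(nrow) * ncol;
  assert(cells >= 0);
  if (cells > std::numeric_limits<std::int32_t>::max())
    throw std::range_error("matrix would exceed 2^31 - 1 cells");
  return static_cast<std::int32_t>(cells);
}
```

With the dimensions from the example script, checked_cell_count(4294967, 500) returns 2147483500, while (4294967, 501) and (500000, 9000) both throw before anything is allocated.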

| year           2013                                        
| month          03                                          
| day            03                                          
| svn rev        62150                                       
| language       R                                           
| version.string R version 2.15.3 Patched (2013-03-03 r62150)
| nickname       Security Blanket  
| 
| 
| 
| > Does what you attempt work _in straight C code
| > bypassing Rcpp_ ?
| 
| In straight C++, using std::vector, this works (though not, as I tried it,
| in naive straight C, as shown in the comments). It will use ~ 35 GB of
| memory:

Sure, but that does not matter here, as it happens outside of R.

In R, you can do this _if you go the route of outside memory management_ as,
e.g., bigmemory and ff do.
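To illustrate the general idea of keeping the data outside R's own heap and indexing it with 64-bit offsets, here is a small sketch. It is not how bigmemory or ff are implemented; FileBackedDoubles is a hypothetical class that simply stores doubles in an ordinary file through std::fstream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical illustration only: doubles live in a file, not in memory,
// and are addressed with 64-bit indices.
class FileBackedDoubles {
public:
  FileBackedDoubles(const std::string& path, std::uint64_t n)
    : path_(path), n_(n),
      file_(path, std::ios::in | std::ios::out |
                  std::ios::binary | std::ios::trunc) {
    if (!file_) throw std::runtime_error("cannot open backing file");
    // Zero-fill in blocks so every cell reads back as 0 until it is set.
    const std::vector<double> zeros(4096, 0.0);
    for (std::uint64_t done = 0; done < n_; ) {
      const std::uint64_t k =
        std::min<std::uint64_t>(zeros.size(), n_ - done);
      file_.write(reinterpret_cast<const char*>(zeros.data()),
                  static_cast<std::streamsize>(k * sizeof(double)));
      done += k;
    }
    if (!file_) throw std::runtime_error("cannot initialise backing file");
  }

  // The backing file is temporary in this sketch, so remove it on exit.
  ~FileBackedDoubles() { file_.close(); std::remove(path_.c_str()); }

  std::uint64_t size() const { return n_; }

  void set(std::uint64_t i, double x) {
    assert(i < n_);
    file_.seekp(offset(i));
    file_.write(reinterpret_cast<const char*>(&x), sizeof x);
  }

  double get(std::uint64_t i) {
    assert(i < n_);
    double x = 0.0;
    file_.seekg(offset(i));
    file_.read(reinterpret_cast<char*>(&x), sizeof x);
    return x;
  }

private:
  static std::streamoff offset(std::uint64_t i) {
    return static_cast<std::streamoff>(i * sizeof(double));
  }

  std::string path_;
  std::uint64_t n_;
  std::fstream file_;
};
```

Every access here goes through the file, which is slow; real packages are far more careful about caching and mapping. The point is only that the index type and the storage are no longer tied to R's vector limit.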

| #include <iostream>
| #include <vector>
| #include <iterator>
| 
| int main() {
|   
|   // double v1[500000L * 9000L]; // this segfaults
|   // double v1[4300000000]; // this segfaults
| 
|   std::vector<double> v2(500000L * 9000L);
|   std::cout << " Max size v2: " << v2.max_size() << std::endl;
|   std::cout << " Current size v2: " << v2.size() << std::endl;
|  
|   double tt = 0;
|   for(size_t t = 0; t < v2.size(); ++t)
|      v2[t] = ++tt;
|   std::cout << "\n Assigned to vector" << std::endl;
|   std::cout << "\n Last value is " << v2[(500000L * 9000L) - 1] << std::endl;
|   return 0;
| }
| 
| Anyway, I guess the example is not really relevant for this case.

Agreed.

| > If you used R 2.*, then the attempt makes little sense AFAICT.
| 
| Sorry, I was not clear. I was not (consciously) _attempting_ to do
| that. In my "for real" code the dimensions of the object are set almost at
| the end of a long simulation and in a few cases those numbers were much
| larger than I expected (I did not realize how big until I started looking
| into the segfaults and the errors).

I understand. But I think you should consider writing some sort of "reducers"
so that you do not need to swallow that whole object.
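As a rough sketch of what such a reducer might look like: instead of storing every simulated row, fold each row into a small summary as soon as it is produced. ColumnMeans below is a hypothetical example that keeps only one running mean per column; which summary is appropriate depends entirely on what the simulation needs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "reducer": keeps one running mean per column
// instead of the rows themselves.
class ColumnMeans {
public:
  explicit ColumnMeans(std::size_t ncol) : means_(ncol, 0.0), rows_(0) {}

  // Fold one row into the running means; the row can then be discarded.
  void add_row(const std::vector<double>& row) {
    assert(row.size() == means_.size());
    ++rows_;
    for (std::size_t j = 0; j < row.size(); ++j)
      means_[j] += (row[j] - means_[j]) / static_cast<double>(rows_);
  }

  std::size_t rows() const { return rows_; }
  double mean(std::size_t j) const { return means_.at(j); }

private:
  std::vector<double> means_;
  std::size_t rows_;
};
```

Memory use is then proportional to the number of columns only, however many rows the simulation ends up producing.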

| What I found confusing was the segmentation fault, because the behavior
| seems inconsistent. Sometimes there was no segfault because the error
| ("negative length vectors are not allowed (...)")  was triggered. But
| sometimes the object seemed to have been created (and thus I assumed sizes
| were OK ---yes, before looking at the actual sizes) and then the segfault
| took place later.

I think what we see is simply undefined behaviour: sometimes it surfaces as
an error, sometimes as a segfault.
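One possible explanation for the inconsistency, assuming the number of cells is computed somewhere in 32-bit int arithmetic (an assumption, not something verified in the Rcpp or R sources here), is plain wrap-around. The sketch below models that wrap-around with 64-bit integers, since real signed overflow is undefined in C++; exact_cells and wrapped_cells are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Exact number of cells, computed in 64 bits.
inline std::int64_t exact_cells(std::int32_t nrow, std::int32_t ncol) {
  return static_cast<std::int64_t>(nrow) * ncol;
}

// Value a 32-bit two's-complement int would hold after wrap-around.
// Modelled in 64 bits; assumes non-negative dimensions.
inline std::int64_t wrapped_cells(std::int32_t nrow, std::int32_t ncol) {
  assert(nrow >= 0 && ncol >= 0);
  const std::int64_t two32 = std::int64_t(1) << 32;
  std::int64_t low = exact_cells(nrow, ncol) % two32;  // keep the low 32 bits
  if (low >= two32 / 2) low -= two32;                  // reinterpret as signed
  return low;
}
```

For the three cases in the example script this gives: 4294967 x 500 stays at 2147483500 (it fits); 4294967 x 501 wraps to -2143188829, which would match the "negative length vectors" error; and 500000 x 9000 wraps to the positive value 205032704, so an allocation could succeed with far fewer cells than expected and only fail later, when some index lands outside it.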

Dirk

| 
| 
| 
| 
| R.
|   
| 
| > If you used R 3.0.0, then you may have noticed that R is ahead of us, and you
| > are welcome to help close the gap :)
| 
| > Dirk
| 
|  
| > | 1. For some values of number of rows and columns, creating the matrix is
| > | not allowed, with the message "negative length vectors are not allowed",
| > | but with other values the creation of the matrix proceeds without
| > | (apparent) troubles, even when the total size is >> 2^31 - 1.
| > | 
| > | 1.a. Is this intended? 
| > | 
| > | 1.b. I understand the error message is coming from R (not Rcpp) and thus
| > | this is not something that can be made easier to understand?
| > | 
| > | 
| > | 2. The part I found confusing is that the same problem (number of cells >
| > | 2^31 - 1) is sometimes caught at object creation, but sometimes manifests
| > | itself much later (either in the C++ code or later in R).
| > | 
| > | I was expecting (maybe the problem is my expectations) an error early on,
| > | when creating the matrix; if the creation proceeds without trouble, I was
| > | not expecting a segfault (as I think all cells are initialized to zero).
| > | 
| > | Is the recommended procedure to check if the product of dimensions is <
| > | 2^31 - 1 before creation? (But then, this will change in R-3.0 in 64 bit
| > | systems?). 
| > | 
| > | 
| > | Best,
| > | 
| > | R.
| > | 
| > | 
| > | 
| > | // Beginning of file max-size.cpp
| > | 
| > | #include <Rcpp.h>
| > | 
| > | using namespace Rcpp;
| > | 
| > | 
| > | // [[Rcpp::export]]
| > | 
| > | NumericMatrix f1(IntegerVector nr, IntegerVector nc,
| > | 		 IntegerVector sf = 0) {
| > |   int nrow = as<int>(nr);
| > |   int ncol = as<int>(nc);
| > |   int segf = as<int>(sf);
| > |   
| > |   NumericMatrix outM(nrow, ncol);
| > |   std::cout << " After creating outM" << std::endl;
| > |   outM(nrow - 1, 0) = 1;
| > |   std::cout << " After assigning to last row, first column" 
| > |             << std::endl;
| > | 
| > |   std::cout << " Some other value: 1, 0:   " 
| > | 	    << outM(1, 0) << std::endl;
| > | 
| > |   if( (nrow > 1) && (ncol > 3) )
| > |     std::cout << " Some other value: nrow - 1, ncol - 3:   " 
| > | 	      << outM(nrow - 1, ncol - 3) << std::endl;
| > | 
| > |   outM(nrow - 1, ncol - 1) = 1;
| > |   std::cout << " After assigning something to last cell" 
| > |             << std::endl;
| > | 
| > |   std::cout << " Try to return the last assignment: " 
| > | 	    << outM(nrow - 1, ncol - 1) << std::endl;
| > | 
| > |   if((nrow >= 500000) && segf) {
| > |     std::cout << "\n Assign a few around/beyond 2^31 - 1. Should segfault\n";
| > |     for(int i = 4290; i < 4300; ++i) {
| > |       std::cout << "    i = " << i << std::endl;
| > |       outM(nrow - 1, i) = 0;
| > |     }
| > |   }
| > | 
| > |   return wrap(outM);
| > | }
| > | 
| > | // End of file max-size.cpp
| > | 
| > | 
| > | 
| > | 
| > | 
| > | ################################################
| > | library(Rcpp)
| > | sourceCpp("max-size.cpp", verbose = TRUE)
| > | 
| > | (tmp <- f1(4, 5))
| > | 
| > | 
| > | 4294967 * 500 > .Machine$integer.max
| > | tmp <- f1(4294967, 500)
| > | object.size(tmp)/(4294967 * 500) ## ~ 8
| > | 
| > | 4294967 * 501 > .Machine$integer.max
| > | tmp <- f1(4294967, 501) ## negative length vectors 
| > | 
| > | 500000 * 9000 > .Machine$integer.max
| > | tmp <- f1(500000, 9000) ## sometimes segfaults
| > | tmp[500000, 9000]
| > | object.size(tmp) ## things are missing 
| > | prod(dim(tmp)) > .Machine$integer.max
| > | 
| > | ## using either of these usually leads to segfault
| > | 
| > | for(i in (4290:4300)) print(tmp[500000, i]) 
| > | 
| > | f1(500000, 9000, 1)
| > | 
| > | #####################################################
| > | 
| > | 
| > | -- 
| > | Ramon Diaz-Uriarte
| > | Department of Biochemistry, Lab B-25
| > | Facultad de Medicina 
| > | Universidad Autónoma de Madrid 
| > | Arzobispo Morcillo, 4
| > | 28029 Madrid
| > | Spain
| > | 
| > | Phone: +34-91-497-2412
| > | 
| > | Email: rdiaz02 at gmail.com
| > |        ramon.diaz at iib.uam.es
| > | 
| > | http://ligarto.org/rdiaz
| > | 
| > | 
| > | _______________________________________________
| > | Rcpp-devel mailing list
| > | Rcpp-devel at lists.r-forge.r-project.org
| > | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
| > -- 
| > Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com

-- 
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com