[Rcpp-devel] Largest size of a NumericMatrix, segfaults and error messages

Tue Apr 2 21:01:11 CEST 2013

On Mon, 1 Apr 2013 10:13:54 -0500, Dirk Eddelbuettel <edd at debian.org> wrote:

> On 1 April 2013 at 17:04, Ramon Diaz-Uriarte wrote:
> | 
> | On Mon, 1 Apr 2013 08:15:48 -0500, Dirk Eddelbuettel <edd at debian.org> wrote:
> | 
> | > On 1 April 2013 at 14:48, Ramon Diaz-Uriarte wrote:
> | > | 
> | > | Dear All,
> | > | 
> | > | I am confused about creating Rcpp Numeric Matrices larger than
> | > | .Machine$integer.max. The code below illustrates some of the points
> | > | (probably with too much detail ;-). These are some things that puzzle me:
> | 
> | > Which R version did you use?  
> | 
> | Ooops, sorry. 
> | 
> | > version
> |                _                                           
> | platform       x86_64-pc-linux-gnu                         
> | arch           x86_64                                      
> | os             linux-gnu                                   
> | system         x86_64, linux-gnu                           
> | status         Patched                                     
> | major          2                                           
> | minor          15.3                                        

> I think you can't really expect this to work.  R, up to this version, has the
> very famous 2^31 - 1 index limit.

> | year           2013                                        
> | month          03                                          
> | day            03                                          
> | svn rev        62150                                       
> | language       R                                           
> | version.string R version 2.15.3 Patched (2013-03-03 r62150)
> | nickname       Security Blanket  
> | 
> | 
> | 
> | > Does what you attempt work _in straight C code
> | > bypassing Rcpp_ ?
> | 
> | In straight C++, using std::vector, this works (though not, as I tried it,
> | in naive straight C, as shown in the comments). It will use ~ 35 GB of
> | memory:

> Sure, but "does not matter" as it is outside of R.

> In R, you can do this _if you go the route of outside memory management_ as
> eg bigmemory and ff do.

Thanks! However, for the current stuff I definitely want the output to
stay well within the 2^31 - 1 limit.
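A minimal sketch of a check that could run before the matrix is created,
assuming the relevant bound is the 2^31 - 1 element limit of R 2.x that Dirk
mentions. The name fits_r_vector_limit is hypothetical (it is not part of
Rcpp or R); the point is only that the product is formed in 64 bits, so the
check itself cannot overflow:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Rcpp or R): returns true when a matrix
// with nrow * ncol cells stays within the 2^31 - 1 element limit of R 2.x.
bool fits_r_vector_limit(int nrow, int ncol) {
  assert(nrow >= 0 && ncol >= 0);
  // Widen both operands first so the multiplication is done in 64 bits.
  const std::int64_t cells =
      static_cast<std::int64_t>(nrow) * static_cast<std::int64_t>(ncol);
  return cells <= std::numeric_limits<std::int32_t>::max();
}
```

With the dimensions from the R session quoted further down, this accepts
(4294967, 500) and rejects both (4294967, 501) and (500000, 9000). Inside an
exported function one would presumably call Rcpp::stop() when the check
fails, instead of going on to allocate.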

> | #include <iostream>
> | #include <vector>
> | #include <iterator>
> | 
> | int main() {
> |   
> |   // double v1[500000L * 9000L]; // this segfaults
> |   // double v1[4300000000]; // this segfaults
> | 
> |   std::vector<double> v2(500000L * 9000L);
> |   std::cout << " Max size v2: " << v2.max_size() << std::endl;
> |   std::cout << " Current size v2: " << v2.size() << std::endl;
> |  
> |   double tt = 0;
> |   for(size_t t = 0; t < v2.size(); ++t)
> |      v2[t] = ++tt;
> |   std::cout << "\n Assigned to vector" << std::endl;
> |   std::cout << "\n Last value is " << v2[(500000L * 9000L) - 1] << std::endl;
> |   return 0;
> | }
> | 
> | Anyway, I guess the example is not really relevant for this case.

> Agreed.

> | > If you used R 2.*, then the attempt makes little sense AFAICT.
> | 
> | Sorry, I was not clear. I was not (consciously) _attempting_ to do
> | that. In my "for real" code the dimensions of the object are set almost at
> | the end of a long simulation and in a few cases those numbers were much
> | larger than I expected (I did not realize how big until I started looking
> | into the segfaults and the errors).

> I understand. But I think you should consider writing some sort of "reducers"
> so that you do not need to swallow that whole object.

Yes, agreed; that is what I'm trying now.
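As an illustration only of what a "reducer" could look like: the sketch below
keeps a running mean and variance (Welford's update) instead of storing every
cell. The struct name RunningSummary and the choice of summary statistics are
assumptions made for the example; the summaries the real simulation needs may
be quite different.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reducer: keeps a running summary of the simulated values
// instead of storing every cell of the would-be matrix.
struct RunningSummary {
  std::size_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // sum of squared deviations from the current mean

  // Welford's update: consumes one value using O(1) memory.
  void add(double x) {
    ++n;
    const double delta = x - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (x - mean);
  }

  // Sample variance; needs at least two values.
  double variance() const {
    assert(n > 1);
    return m2 / static_cast<double>(n - 1);
  }
};
```

Each simulated value is passed to add() as it is produced, and only the
summary (a few doubles per quantity of interest) is returned to R, so the
size of the returned object no longer depends on the number of cells.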

> | What I found confusing was the segmentation fault, because the behavior
> | seems inconsistent. Sometimes there was no segfault because the error
> | ("negative length vectors are not allowed (...)")  was triggered. But
> | sometimes the object seemed to have been created (and thus I assumed sizes
> | were OK ---yes, before looking at the actual sizes) and then the segfault
> | took place later.

> <insert Oscar Wilde quote about consistency being ...   just kidding>

C++ is still way tooooo big for me to try the imaginative route; for now,
I'll stay inside the box ;-).

R.

> I think we simply see an error condition for undefined behaviour.

> Dirk
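One possible reading of the inconsistency, under two assumptions that are not
verified here: that the number of cells ends up being computed as a 32-bit
int product, and that the overflow wraps around in two's complement (signed
overflow is formally undefined behaviour in C++, so nothing guarantees this).
The helper wrapped_int32_product is hypothetical and only models that
wraparound using well-defined arithmetic:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the value a 32-bit two's-complement int would hold
// if nrow * ncol wrapped around. Real signed overflow is undefined
// behaviour in C++, so this only models the commonly observed result.
std::int64_t wrapped_int32_product(std::int32_t nrow, std::int32_t ncol) {
  assert(nrow >= 0 && ncol >= 0);
  const std::int64_t exact = static_cast<std::int64_t>(nrow) * ncol;
  // Conversion to an unsigned type is well defined: it keeps the low 32 bits.
  const std::uint32_t low = static_cast<std::uint32_t>(exact);
  // Reinterpret those bits as signed: values above 2^31 - 1 become negative.
  const std::int64_t two_pow_32 = std::int64_t{1} << 32;
  return low > 0x7FFFFFFFu ? static_cast<std::int64_t>(low) - two_pow_32
                           : static_cast<std::int64_t>(low);
}
```

Under those assumptions, 4294967 * 500 = 2147483500 still fits; 4294967 * 501
wraps to -2143188829, which would match "negative length vectors are not
allowed"; and 500000 * 9000 wraps to the positive value 205032704, so a much
smaller object could be allocated without any error and the trouble would
only show up later, when cells beyond that size are touched.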

> | 
> | R.
> |   
> | 
> | > If you used R 3.0.0, then you may have noticed that R is ahead of us, and you
> | > are welcome to help close the gap :)
> | 
> | > Dirk
> | 
> |  
> | > | 1. For some values of number of rows and columns, creating the matrix is
> | > | not allowed, with the message "negative length vectors are not allowed",
> | > | but with other values the creation of the matrix proceeds without
> | > | (apparent) troubles, even when the total size is >> 2^31 - 1.
> | > | 
> | > | 1.a. Is this intended? 
> | > | 
> | > | 1.b. I understand the error message is coming from R (not Rcpp) and thus
> | > | this is not something that can be made easier to understand?
> | > | 
> | > | 
> | > | 2. The part I found confusing is that the same problem (number of cells >
> | > | 2^31 - 1) is sometimes caught at object creation, but sometimes manifests
> | > | itself much later (either in the C++ code or later in R).
> | > | 
> | > | I was expecting (maybe the problem is my expectations) an error early on,
> | > | when creating the matrix; if the creation proceeds without trouble, I was
> | > | not expecting a segfault (as I think all cells are initialized to zero).
> | > | 
> | > | Is the recommended procedure to check that the product of dimensions is <
> | > | 2^31 - 1 before creation? (But then, will this change in R-3.0 on 64-bit
> | > | systems?)
> | > | 
> | > | 
> | > | Best,
> | > | 
> | > | R.
> | > | 
> | > | 
> | > | 
> | > | // Beginning of file max-size.cpp
> | > | 
> | > | #include <Rcpp.h>
> | > | 
> | > | using namespace Rcpp;
> | > | 
> | > | 
> | > | // [[Rcpp::export]]
> | > | 
> | > | NumericMatrix f1(IntegerVector nr, IntegerVector nc,
> | > | 		 IntegerVector sf = 0) {
> | > |   int nrow = as<int>(nr);
> | > |   int ncol = as<int>(nc);
> | > |   int segf = as<int>(sf);
> | > |   
> | > |   NumericMatrix outM(nrow, ncol);
> | > |   std::cout << " After creating outM" << std::endl;
> | > |   outM(nrow - 1, 0) = 1;
> | > |   std::cout << " After assigning to last row, first column" 
> | > |             << std::endl;
> | > | 
> | > |   std::cout << " Some other value: 1, 0:   " 
> | > | 	    << outM(1, 0) << std::endl;
> | > | 
> | > |   if( (nrow > 1) && (ncol > 3) )
> | > |     std::cout << " Some other value: nrow - 1, ncol - 3:   " 
> | > | 	      << outM(nrow - 1, ncol - 3) << std::endl;
> | > | 
> | > |   outM(nrow - 1, ncol - 1) = 1;
> | > |   std::cout << " After assigning something to last cell" 
> | > |             << std::endl;
> | > | 
> | > |   std::cout << " Try to return the last assignment: " 
> | > | 	    << outM(nrow - 1, ncol - 1) << std::endl;
> | > | 
> | > |   if((nrow >= 500000) && segf) {
> | > |     std::cout << "\n Assign a few around/beyond 2^31 - 1. Should segfault\n";
> | > |     for(int i = 4290; i < 4300; ++i) {
> | > |       std::cout << "    i = " << i << std::endl;
> | > |       outM(nrow - 1, i) = 0;
> | > |     }
> | > |   }
> | > | 
> | > |   return wrap(outM);
> | > | }
> | > | 
> | > | // End of file max-size.cpp
> | > | 
> | > | ################################################
> | > | library(Rcpp)
> | > | sourceCpp("max-size.cpp", verbose = TRUE)
> | > | 
> | > | (tmp <- f1(4, 5))
> | > | 
> | > | 
> | > | 4294967 * 500 > .Machine$integer.max
> | > | tmp <- f1(4294967, 500)
> | > | object.size(tmp)/(4294967 * 500) ## ~ 8
> | > | 
> | > | 4294967 * 501 > .Machine$integer.max
> | > | tmp <- f1(4294967, 501) ## negative length vectors 
> | > | 
> | > | 500000 * 9000 > .Machine$integer.max
> | > | tmp <- f1(500000, 9000) ## sometimes segfaults
> | > | tmp[500000, 9000]
> | > | object.size(tmp) ## things are missing 
> | > | prod(dim(tmp)) > .Machine$integer.max
> | > | 
> | > | ## using either of these usually leads to segfault
> | > | 
> | > | for(i in (4290:4300)) print(tmp[500000, i]) 
> | > | 
> | > | f1(500000, 9000, 1)
> | > | 
> | > | #####################################################
> | > | 
> | > | 
> | > | -- 
> | > | Ramon Diaz-Uriarte
> | > | Department of Biochemistry, Lab B-25
> | > | Facultad de Medicina 
> | > | Universidad Autónoma de Madrid 
> | > | Arzobispo Morcillo, 4
> | > | 28029 Madrid
> | > | Spain
> | > | 
> | > | Phone: +34-91-497-2412
> | > | 
> | > | Email: rdiaz02 at gmail.com
> | > |        ramon.diaz at iib.uam.es
> | > | 
> | > | http://ligarto.org/rdiaz
> | > | 
> | > | 
> | > | _______________________________________________
> | > | Rcpp-devel mailing list
> | > | Rcpp-devel at lists.r-forge.r-project.org
> | > | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
> | > -- 
> | > Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
-- 
Ramon Diaz-Uriarte
Department of Biochemistry, Lab B-25
Facultad de Medicina 
Universidad Autónoma de Madrid 
Arzobispo Morcillo, 4
28029 Madrid
Spain

Phone: +34-91-497-2412

Email: rdiaz02 at gmail.com
       ramon.diaz at iib.uam.es

http://ligarto.org/rdiaz