[Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores

Sun Mar 3 02:02:47 CET 2024

Hi Robin,

On 2 March 2024 at 16:34, Robin Liu wrote:
| sessionInfo() was the right clue. Indeed the version of R on machine B was not
| linked to OpenBLAS. Switching to a version with OpenBLAS allows the test code
| to use all cores.
| 
| A clear way to check which library is linked is to run the following:
| 
| > extSoftVersion()["BLAS"]

Ah yes -- I keep forgetting about that one. Good reminder!
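
For anyone retracing this thread: the OpenMP side can be checked separately from the BLAS side. Below is a minimal standalone C++ sketch, not part of Rcpp or RcppArmadillo; the helper names `max_openmp_threads`, `env_or_unset` and `threading_report` are made up for illustration. It reports what the OpenMP runtime would use and what the usual environment variables say. `OPENBLAS_NUM_THREADS` only matters if OpenBLAS is the library actually linked; a BLAS without threading support would stay single-threaded regardless of these settings, which would be consistent with what machine B showed.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#ifdef _OPENMP
#include <omp.h>   // only pulled in when compiled with -fopenmp
#endif

// Hypothetical helper: threads the OpenMP runtime would use for the next
// parallel region, or 1 if this file was compiled without -fopenmp.
int max_openmp_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Hypothetical helper: value of an environment variable, or "(unset)".
std::string env_or_unset(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Collect the run-time settings in one printable string.
std::string threading_report() {
    std::ostringstream out;
    out << "OMP_NUM_THREADS      = " << env_or_unset("OMP_NUM_THREADS") << "\n"
        << "OPENBLAS_NUM_THREADS = " << env_or_unset("OPENBLAS_NUM_THREADS") << "\n"
        << "OpenMP max threads   = " << max_openmp_threads() << "\n";
    return out.str();
}
```

Printing `threading_report()` from a small `main` on both machines, built once with and once without -fopenmp, should show whether the OpenMP settings differ at all. If they match and only one machine is slow, the linked BLAS/LAPACK is the more likely suspect.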

| Thanks for your help!

Always a pleasure. Glad you are all set.

Dirk

| On Sat, Feb 24, 2024 at 9:17 AM Dirk Eddelbuettel <edd at debian.org> wrote:
| 
| 
|     On 24 February 2024 at 11:44, Robin Liu wrote:
|     | Thank you Dirk for the response.
|     |
|     | I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both
|     | machines and correctly see that machine A and B have 20 and 40 cores,
|     | respectively. I also see that calling the setter changes this value.
|     |
|     | However, calling the setter does not seem to change the number of cores
|     | used on either machine A or B. I have updated my code example as below:
|     | the execution uses 20 cores on machine A and 1 core on machine B as
|     | before, despite my setting the number of omp threads to 5. Do you have
|     | any further hints?
| 
|     I fear you need to debug that on the machine 'B' in question. It's all open
|     source. I do not think either Conrad or myself put code in to constrain you
|     to one core on 'B' (and then not on 'A', as you see).
| 
|     You can grep around both the RcppArmadillo wrapper code and the included
|     Armadillo code; I suggest making a local copy and peppering in some print
|     statements.
| 
|     Also keep in mind that (Rcpp)Armadillo hands off the computation to the
|     actual LAPACK / BLAS implementation on that machine. Lots of things can go
|     wrong there: maybe R was compiled with its own embedded BLAS/LAPACK sources
|     (preventing a call out to OpenBLAS even when the machine has it). Or maybe
|     R was compiled correctly but a single-threaded set of libraries is on the
|     machine.
| 
|     You have not supplied any of that information. Many bug report suggestions
|     hint that showing `sessionInfo()` helps -- and it does show the BLAS/LAPACK
|     libraries. You are not forced to show us this, but by not showing us you
|     prevent us from being more focussed on suggestions.  So maybe start at your
|     end by glancing at sessionInfo() on A and B?
| 
|     Dirk
| 
| 
|     | library(RcppArmadillo)
|     | library(Rcpp)
|     |
|     | RcppArmadillo::armadillo_set_number_of_omp_threads(5)
|     | print(sprintf("There are %d threads",
|     |       RcppArmadillo::armadillo_get_number_of_omp_threads()))
|     |
|     | src <-
|     | r"(#include <RcppArmadillo.h>
|     |
|     | // [[Rcpp::depends(RcppArmadillo)]]
|     |
|     | // [[Rcpp::export]]
|     | arma::vec getEigenValues(arma::mat M) {
|     |   return arma::eig_sym(M);
|     | })"
|     |
|     | size <- 10000
|     | m <- matrix(rnorm(size^2), size, size)
|     | m <- m * t(m)
|     |
|     | # This line compiles the above code with the -fopenmp flag.
|     | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
|     | result <- getEigenValues(m)
|     | print(result[1:10])
|     |
|     | On Fri, Feb 23, 2024 at 12:53 PM Dirk Eddelbuettel <edd at debian.org> wrote:
|     |
|     |
|     |     On 23 February 2024 at 09:35, Robin Liu wrote:
|     |     | Hi all,
|     |     |
|     |     | Here is an R script that uses Armadillo to decompose a large matrix
|     |     | and print the first 10 eigenvalues.
|     |     |
|     |     | library(RcppArmadillo)
|     |     | library(Rcpp)
|     |     |
|     |     | src <-
|     |     | r"(#include <RcppArmadillo.h>
|     |     |
|     |     | // [[Rcpp::depends(RcppArmadillo)]]
|     |     |
|     |     | // [[Rcpp::export]]
|     |     | arma::vec getEigenValues(arma::mat M) {
|     |     |   return arma::eig_sym(M);
|     |     | })"
|     |     |
|     |     | size <- 10000
|     |     | m <- matrix(rnorm(size^2), size, size)
|     |     | m <- m * t(m)
|     |     |
|     |     | # This line compiles the above code with the -fopenmp flag.
|     |     | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
|     |     | result <- getEigenValues(m)
|     |     | print(result[1:10])
|     |     |
|     |     | When I run this code on server A, I see that arma can implicitly
|     |     | leverage all available cores by running top -H. However, on server B
|     |     | it can only use one core despite multiple being available: there is
|     |     | just one process entry in top -H. Both processes successfully exit and
|     |     | return an answer. The process on server B is of course much slower.
|     |
|     |     It is documented in the package how this is applied, and the policy is
|     |     to NOT blindly enforce one use case (say all cores, or half, or a
|     |     magically chosen value of N for whatever value of N) but to follow the
|     |     local admin setting and respect standard environment variables.
|     |
|     |     So I suspect that your machine 'B' differs from machine 'A' in this
|     |     regard.
|     |
|     |     Note that this is a _run-time_ and not a _compile-time_ behavior, as it
|     |     is for multicore-enabled LAPACK and BLAS libraries, the OpenMP library,
|     |     and basically most software of this type.
|     |
|     |     You can override it, see
|     |       RcppArmadillo::armadillo_set_number_of_omp_threads
|     |       RcppArmadillo::armadillo_get_number_of_omp_threads
|     |
|     |     Can you try and see if these help you?
|     |
|     |     Dirk
|     |
|     |     | Here is the compilation on server A:
|     |     | /usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file197c21cbec564.cpp'
|     |     | g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include -fopenmp  -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/usr/local/lib/R/site-library/RcppArmadillo/include" -I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9" -I/usr/local/include   -fpic  -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c file197c21cbec564.cpp -o file197c21cbec564.o
|     |     | g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR
|     |     |
|     |     | and here it is for server B:
|     |     | /sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file158165b9c4ae1.cpp'
|     |     | g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG -I../inst/include -fopenmp  -I"/home/my_username/.R/library/Rcpp/include" -I"/home/my_username/.R/library/RcppArmadillo/include" -I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10" -I/usr/local/include   -fpic  -g -O2  -c file158165b9c4ae1.cpp -o file158165b9c4ae1.o
|     |     | g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64 -o sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas -lgfortran -lm -lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR
|     |     |
|     |     | I thought that the -fopenmp flag should let arma implicitly
|     |     | parallelize matrix computations. Any hints as to why this may not
|     |     | work on server B?
|     |     |
|     |     | The actual code I'm running is an R package that includes
|     |     | RcppArmadillo and RcppEnsmallen. Server B is the login node to an hpc
|     |     | cluster, but the code does not use all cores on the compute nodes
|     |     | either.
|     |     |
|     |     | Best,
|     |     | Robin
|     |     | _______________________________________________
|     |     | Rcpp-devel mailing list
|     |     | Rcpp-devel at lists.r-forge.r-project.org
|     |     | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
|     |
|     |     --
|     |     dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
|     |
| 
|     --
|     dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
| 

-- 
dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org