[Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Dirk Eddelbuettel
edd at debian.org
Sat Feb 24 18:17:37 CET 2024
On 24 February 2024 at 11:44, Robin Liu wrote:
| Thank you Dirk for the response.
|
| I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both machines
| and correctly see that machine A and B have 20 and 40 cores, respectively. I
| also see that calling the setter changes this value.
|
| However, calling the setter does not seem to change the number of cores used on
| either machine A or B. I have updated my code example as below: the execution
| uses 20 cores on machine A and 1 core on machine B as before, despite my
| setting the number of omp threads to 5. Do you have any further hints?

I fear you need to debug that on machine 'B' itself. It's all open source.
I do not think either Conrad or I put in code that would constrain you to
one core on 'B' (and then not do so on 'A', as you see).

You can grep around both the RcppArmadillo wrapper code and the included
Armadillo code. I suggest making a local copy and peppering in some print
statements.

Also keep in mind that (Rcpp)Armadillo hands off computation to the actual
LAPACK / BLAS implementation on that machine. Lots of things can go wrong
there: maybe R was compiled with its own embedded BLAS/LAPACK sources
(preventing a call out to OpenBLAS even when the machine has it). Or maybe R
was compiled correctly but only a single-threaded set of libraries is on the
machine.
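
One way to separate the BLAS question from anything Rcpp-related is to time a
BLAS-bound operation in plain R on each machine. The following is only a rough
sketch: the matrix size and variable names are illustrative, and the reading of
the ratio assumes a Linux-style system where "user" time is summed over threads.

```r
# Time a BLAS-bound operation in plain R, with no Rcpp or Armadillo involved.
n <- 2000                          # illustrative size; increase for a clearer signal
x <- matrix(rnorm(n * n), n, n)

timing <- system.time(crossprod(x))   # crossprod() is dispatched to the BLAS
print(timing)

# If user time is well above elapsed time, the BLAS most likely ran on
# several cores; a ratio near 1 points to a single-threaded library.
ratio <- unname(timing["user.self"] / timing["elapsed"])
cat(sprintf("user/elapsed ratio: %.2f\n", ratio))
```

If the ratio stays near 1 on 'B' but is clearly larger on 'A', the difference
probably sits in the BLAS/LAPACK setup rather than in the compiled code.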

You have not supplied any of that information. Many bug-reporting guides
suggest showing `sessionInfo()` -- and it does help, as it lists the
BLAS/LAPACK libraries in use. You are not forced to show us this, but by not
showing it you prevent us from making more focused suggestions. So maybe
start at your end by glancing at sessionInfo() on A and B?
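
As a hedged sketch of what could be compared between A and B, the snippet below
gathers the relevant pieces using base R only. The `report` list and its field
names are made up for illustration, and the environment variables listed are
just the ones commonly honoured by OpenMP runtimes and threaded BLAS builds;
your site may use others.

```r
# Collect BLAS/LAPACK and threading details worth comparing across machines.
si <- sessionInfo()

report <- list(
  r_version  = R.version.string,
  blas       = si$BLAS,        # path of the BLAS library R is using
  lapack     = si$LAPACK,      # path of the LAPACK library R is using
  la_version = La_version(),   # LAPACK version string
  # An empty string means the variable is not set in this session.
  env = Sys.getenv(c("OMP_NUM_THREADS", "OMP_THREAD_LIMIT",
                     "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")),
  detected_cores = parallel::detectCores()
)

str(report)
```

Running this on both machines and diffing the output should show quickly
whether the libraries or the environment differ.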

Dirk

| library(RcppArmadillo)
| library(Rcpp)
|
| RcppArmadillo::armadillo_set_number_of_omp_threads(5)
| print(sprintf("There are %d threads",
| RcppArmadillo::armadillo_get_number_of_omp_threads()))
|
| src <-
| r"(#include <RcppArmadillo.h>
|
| // [[Rcpp::depends(RcppArmadillo)]]
|
| // [[Rcpp::export]]
| arma::vec getEigenValues(arma::mat M) {
| return arma::eig_sym(M);
| })"
|
| size <- 10000
| m <- matrix(rnorm(size^2), size, size)
| m <- m * t(m)
|
| # This line compiles the above code with the -fopenmp flag.
| sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| result <- getEigenValues(m)
| print(result[1:10])
|
| On Fri, Feb 23, 2024 at 12:53 PM Dirk Eddelbuettel <edd at debian.org> wrote:
|
|
| On 23 February 2024 at 09:35, Robin Liu wrote:
| | Hi all,
| |
| | Here is an R script that uses Armadillo to decompose a large matrix and print
| | the first 10 eigenvalues.
| |
| | library(RcppArmadillo)
| | library(Rcpp)
| |
| | src <-
| | r"(#include <RcppArmadillo.h>
| |
| | // [[Rcpp::depends(RcppArmadillo)]]
| |
| | // [[Rcpp::export]]
| | arma::vec getEigenValues(arma::mat M) {
| | return arma::eig_sym(M);
| | })"
| |
| | size <- 10000
| | m <- matrix(rnorm(size^2), size, size)
| | m <- m * t(m)
| |
| | # This line compiles the above code with the -fopenmp flag.
| | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| | result <- getEigenValues(m)
| | print(result[1:10])
| |
| | When I run this code on server A, I see that arma can implicitly leverage all
| | available cores by running top -H. However, on server B it can only use one
| | core despite multiple being available: there is just one process entry in top
| | -H. Both processes successfully exit and return an answer. The process on
| | server B is of course much slower.
|
| It is documented in the package how this is applied and the policy is to NOT
| blindly enforce one use case (say all cores, or half, or a magically chosen
| value of N for whatever value of N) but to follow the local admin setting and
| respect standard environment variables.
|
| So I suspect that your machine 'B' differs from machine 'A' in this regard.
|
| Note that this is a _run-time_ and not a _compile-time_ behavior, as it is for
| multicore-enabled LAPACK and BLAS libraries, the OpenMP library and basically
| most software of this type.
|
| You can override it, see
| RcppArmadillo::armadillo_set_number_of_omp_threads
| RcppArmadillo::armadillo_get_number_of_omp_threads
|
| Can you try and see if these help you?
|
| Dirk
|
| | Here is the compilation on server A:
| | /usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file197c21cbec564.cpp'
| | g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include
| |   -fopenmp -I"/usr/local/lib/R/site-library/Rcpp/include"
| |   -I"/usr/local/lib/R/site-library/RcppArmadillo/include"
| |   -I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9" -I/usr/local/include
| |   -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security
| |   -Wdate-time -D_FORTIFY_SOURCE=2 -g -c file197c21cbec564.cpp
| |   -o file197c21cbec564.o
| | g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib
| |   -o sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas -lgfortran
| |   -lm -lquadmath -L/usr/local/lib/R/lib -lR
| |
| | and here it is for server B:
| | /sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file158165b9c4ae1.cpp'
| | g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG -I../inst/include
| |   -fopenmp -I"/home/my_username/.R/library/Rcpp/include"
| |   -I"/home/my_username/.R/library/RcppArmadillo/include"
| |   -I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10" -I/usr/local/include
| |   -fpic -g -O2 -c file158165b9c4ae1.cpp -o file158165b9c4ae1.o
| | g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64
| |   -o sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas -lgfortran
| |   -lm -lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR
| |
| | I thought that the -fopenmp flag should let arma implicitly parallelize matrix
| | computations. Any hints as to why this may not work on server B?
| |
| | The actual code I'm running is an R package that includes RcppArmadillo and
| | RcppEnsmallen. Server B is the login node to an hpc cluster, but the code does
| | not use all cores on the compute nodes either.
| |
| | Best,
| | Robin
|
| --
| dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
|
--
dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org