[Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Dirk Eddelbuettel
edd at debian.org
Sun Mar 3 02:02:47 CET 2024
Hi Robin,
On 2 March 2024 at 16:34, Robin Liu wrote:
| sessionInfo() was the right clue. Indeed the version of R on machine B was not
| linked to OpenBLAS. Switching to a version with OpenBLAS allows the test code
| to use all cores.
|
| A clear way to check which library is linked is to run the following:
|
| > extSoftVersion()["BLAS"]
Ah yes -- I keep forgetting about that one. Good reminder!
| Thanks for your help!
Always a pleasure. Glad you are all set.
Dirk
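
For anyone finding this thread later, the check Robin describes above can be sketched in a few lines of base R. This is only an illustration, not code from the exchange: the helper name describe_linear_algebra_setup is invented here, and the environment variables listed are the ones commonly honoured by OpenMP, OpenBLAS and MKL builds, which may or may not matter on a given machine.

```r
# Hypothetical helper (name invented for this sketch): gather the facts that
# decide whether a LAPACK-backed call such as eig_sym() can use several cores.
describe_linear_algebra_setup <- function() {
  info <- sessionInfo()
  # Assumption: these are the usual thread-count variables; others may apply.
  env_names <- c("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
  list(
    blas        = unname(extSoftVersion()["BLAS"]),  # BLAS library R is using
    blas_path   = info$BLAS,                         # same info via sessionInfo()
    lapack_path = La_library(),                      # LAPACK library in use
    thread_env  = Sys.getenv(env_names, unset = NA), # NA means "not set"
    cores       = parallel::detectCores()            # cores the OS reports
  )
}

setup <- describe_linear_algebra_setup()
str(setup)
```

Running this on both machines and comparing the output should show at a glance whether one of them reports R's own bundled BLAS instead of OpenBLAS. In that case the OpenMP thread setter is unlikely to help, since the decomposition is handed to a single-threaded library.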
| On Sat, Feb 24, 2024 at 9:17 AM Dirk Eddelbuettel <edd at debian.org> wrote:
|
|
| On 24 February 2024 at 11:44, Robin Liu wrote:
| | Thank you Dirk for the response.
| |
| | I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both machines
| | and correctly see that machine A and B have 20 and 40 cores, respectively. I
| | also see that calling the setter changes this value.
| |
| | However, calling the setter does not seem to change the number of cores used on
| | either machine A or B. I have updated my code example as below: the execution
| | uses 20 cores on machine A and 1 core on machine B as before, despite my
| | setting the number of omp threads to 5. Do you have any further hints?
|
| I fear you need to debug that on the machine 'B' in question. It's all open
| source. I do not think either Conrad or myself put code in to constrain you
| to one core on 'B' (while not doing so on 'A', as you see).
|
| You can grep around both the RcppArmadillo wrapper code and the included
| Armadillo code; I suggest making a local copy and peppering in some print
| statements.
|
| Also keep in mind that (Rcpp)Armadillo hands off computation to the actual
| LAPACK / BLAS implementation on that machine. Lots of things can go wrong
| there: maybe R was compiled with its own embedded BLAS/LAPACK sources
| (preventing a call out to OpenBLAS even when the machine has it). Or maybe R
| was compiled correctly but a single-threaded set of libraries is on the
| machine.
|
| You have not supplied any of that information. Many bug report suggestions
| hint that showing `sessionInfo()` helps -- and it does show the BLAS/LAPACK
| libraries. You are not forced to show us this, but by not showing us you
| prevent us from being more focussed on suggestions. So maybe start at your
| end by glancing at sessionInfo() on A and B?
|
| Dirk
|
|
| | library(RcppArmadillo)
| | library(Rcpp)
| |
| | RcppArmadillo::armadillo_set_number_of_omp_threads(5)
| | print(sprintf("There are %d threads",
| | RcppArmadillo::armadillo_get_number_of_omp_threads()))
| |
| | src <-
| | r"(#include <RcppArmadillo.h>
| |
| | // [[Rcpp::depends(RcppArmadillo)]]
| |
| | // [[Rcpp::export]]
| | arma::vec getEigenValues(arma::mat M) {
| | return arma::eig_sym(M);
| | })"
| |
| | size <- 10000
| | m <- matrix(rnorm(size^2), size, size)
| | m <- m * t(m)
| |
| | # This line compiles the above code with the -fopenmp flag.
| | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| | result <- getEigenValues(m)
| | print(result[1:10])
| |
| | On Fri, Feb 23, 2024 at 12:53 PM Dirk Eddelbuettel <edd at debian.org> wrote:
| |
| |
| | On 23 February 2024 at 09:35, Robin Liu wrote:
| | | Hi all,
| | |
| | | Here is an R script that uses Armadillo to decompose a large matrix and print
| | | the first 10 eigenvalues.
| | |
| | | library(RcppArmadillo)
| | | library(Rcpp)
| | |
| | | src <-
| | | r"(#include <RcppArmadillo.h>
| | |
| | | // [[Rcpp::depends(RcppArmadillo)]]
| | |
| | | // [[Rcpp::export]]
| | | arma::vec getEigenValues(arma::mat M) {
| | | return arma::eig_sym(M);
| | | })"
| | |
| | | size <- 10000
| | | m <- matrix(rnorm(size^2), size, size)
| | | m <- m * t(m)
| | |
| | | # This line compiles the above code with the -fopenmp flag.
| | | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
| | | result <- getEigenValues(m)
| | | print(result[1:10])
| | |
| | | When I run this code on server A, I see that arma can implicitly leverage all
| | | available cores by running top -H. However, on server B it can only use one
| | | core despite multiple being available: there is just one process entry in top
| | | -H. Both processes successfully exit and return an answer. The process on
| | | server B is of course much slower.
| |
| | It is documented in the package how this is applied and the policy is to NOT
| | blindly enforce one use case (say all cores, or half, or a magically chosen
| | value of N for whatever value of N) but to follow the local admin setting and
| | respect standard environment variables.
| |
| | So I suspect that your machine 'B' differs from machine 'A' in this regard.
| |
| | Note that this is a _run-time_ and not _compile-time_ behavior. As it is for
| | multicore-enabled LAPACK and BLAS libraries, the OpenMP library and basically
| | most software of this type.
| |
| | You can override it, see
| | RcppArmadillo::armadillo_set_number_of_omp_threads
| | RcppArmadillo::armadillo_get_number_of_omp_threads
| |
| | Can you try and see if these help you?
| |
| | Dirk
| |
| | | Here is the compilation on server A:
| | | /usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file197c21cbec564.cpp'
| | | g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include -fopenmp -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/usr/local/lib/R/site-library/RcppArmadillo/include" -I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9" -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c file197c21cbec564.cpp -o file197c21cbec564.o
| | | g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib -o sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR
| | |
| | | and here it is for server B:
| | | /sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so' 'file158165b9c4ae1.cpp'
| | | g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG -I../inst/include -fopenmp -I"/home/my_username/.R/library/Rcpp/include" -I"/home/my_username/.R/library/RcppArmadillo/include" -I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10" -I/usr/local/include -fpic -g -O2 -c file158165b9c4ae1.cpp -o file158165b9c4ae1.o
| | | g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64 -o sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas -lgfortran -lm -lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR
| | |
| | | I thought that the -fopenmp flag should let arma implicitly parallelize matrix
| | | computations. Any hints as to why this may not work on server B?
| | |
| | | The actual code I'm running is an R package that includes RcppArmadillo and
| | | RcppEnsmallen. Server B is the login node to an hpc cluster, but the code does
| | | not use all cores on the compute nodes either.
| | |
| | | Best,
| | | Robin
| | | _______________________________________________
| | | Rcpp-devel mailing list
| | | Rcpp-devel at lists.r-forge.r-project.org
| | | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
| |
| | --
| | dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
| |
|
| --
| dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
|
--
dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org