[Rcpp-devel] RcppArmadillo with -fopenmp: Not using all available cores
Robin Liu
robin28liu at gmail.com
Sun Mar 3 01:34:34 CET 2024
Hi Dirk,
sessionInfo() was the right clue. Indeed the version of R on machine B was
not linked to OpenBLAS. Switching to a version with OpenBLAS allows the
test code to use all cores.
A clear way to check which library is linked is to run the following:
> extSoftVersion()["BLAS"]
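For anyone comparing two machines the same way, a slightly fuller check might look like the sketch below. It assumes a reasonably recent R in which extSoftVersion() reports a "BLAS" entry and La_library() is available. The variable name blas_info is only illustrative, and on some platforms the reported paths may be empty strings.

```r
# Collect the BLAS/LAPACK libraries this R session is actually using.
# Run this on both machines and compare the output.
blas_info <- list(
  blas           = unname(extSoftVersion()["BLAS"]),  # BLAS library file, if known
  lapack         = La_library(),                      # LAPACK library file, if known
  lapack_version = La_version()                       # LAPACK version string
)
print(blas_info)
```

A path pointing at R's own libRblas rather than at an OpenBLAS library would be consistent with what I saw on machine B.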
Thanks for your help!
On Sat, Feb 24, 2024 at 9:17 AM Dirk Eddelbuettel <edd at debian.org> wrote:
>
> On 24 February 2024 at 11:44, Robin Liu wrote:
> | Thank you Dirk for the response.
> |
> | I called RcppArmadillo::armadillo_get_number_of_omp_threads() on both
> | machines and correctly see that machine A and B have 20 and 40 cores,
> | respectively. I also see that calling the setter changes this value.
> |
> | However, calling the setter does not seem to change the number of cores
> | used on either machine A or B. I have updated my code example as below:
> | the execution uses 20 cores on machine A and 1 core on machine B as
> | before, despite my setting the number of omp threads to 5. Do you have
> | any further hints?
>
> I fear you need to debug that on the machine 'B' in question. It's all open
> source. I do not think either Conrad or myself put code in to constrain you
> to one core on 'B' (and not on 'A', as you can see).
>
> You can grep around both the RcppArmadillo wrapper code and the included
> Armadillo code. I suggest making a local copy and peppering in some print
> statements.
>
> Also keep in mind that (Rcpp)Armadillo hands off computation to the actual
> LAPACK / BLAS implementation on that machine. Lots of things can go wrong
> there: maybe R was compiled with its own embedded BLAS/LAPACK sources
> (preventing a call out to OpenBLAS even when the machine has it). Or maybe
> R was compiled correctly but a single-threaded set of libraries is on the
> machine.
>
> You have not supplied any of that information. Many bug report suggestions
> hint that showing `sessionInfo()` helps -- and it does show the BLAS/LAPACK
> libraries. You are not forced to show us this, but by not showing us you
> prevent us from being more focussed on suggestions. So maybe start at your
> end by glancing at sessionInfo() on A and B?
>
> Dirk
>
>
> | library(RcppArmadillo)
> | library(Rcpp)
> |
> | RcppArmadillo::armadillo_set_number_of_omp_threads(5)
> | print(sprintf("There are %d threads",
> | RcppArmadillo::armadillo_get_number_of_omp_threads()))
> |
> | src <-
> | r"(#include <RcppArmadillo.h>
> |
> | // [[Rcpp::depends(RcppArmadillo)]]
> |
> | // [[Rcpp::export]]
> | arma::vec getEigenValues(arma::mat M) {
> | return arma::eig_sym(M);
> | })"
> |
> | size <- 10000
> | m <- matrix(rnorm(size^2), size, size)
> | m <- m * t(m)
> |
> | # This line compiles the above code with the -fopenmp flag.
> | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
> | result <- getEigenValues(m)
> | print(result[1:10])
> |
> | On Fri, Feb 23, 2024 at 12:53 PM Dirk Eddelbuettel <edd at debian.org> wrote:
> |
> |
> | On 23 February 2024 at 09:35, Robin Liu wrote:
> | | Hi all,
> | |
> | | Here is an R script that uses Armadillo to decompose a large matrix
> | | and print the first 10 eigenvalues.
> | |
> | | library(RcppArmadillo)
> | | library(Rcpp)
> | |
> | | src <-
> | | r"(#include <RcppArmadillo.h>
> | |
> | | // [[Rcpp::depends(RcppArmadillo)]]
> | |
> | | // [[Rcpp::export]]
> | | arma::vec getEigenValues(arma::mat M) {
> | | return arma::eig_sym(M);
> | | })"
> | |
> | | size <- 10000
> | | m <- matrix(rnorm(size^2), size, size)
> | | m <- m * t(m)
> | |
> | | # This line compiles the above code with the -fopenmp flag.
> | | sourceCpp(code = src, verbose = TRUE, rebuild = TRUE)
> | | result <- getEigenValues(m)
> | | print(result[1:10])
> | |
> | | When I run this code on server A, top -H shows that arma implicitly
> | | uses all available cores. However, on server B it can only use one
> | | core despite multiple being available: there is just one process
> | | entry in top -H. Both processes successfully exit and return an
> | | answer. The process on server B is of course much slower.
> |
> | It is documented in the package how this is applied, and the policy is
> | to NOT blindly enforce one use case (say all cores, or half, or a
> | magically chosen value of N) but to follow the local admin setting and
> | respect standard environment variables.
> |
> | So I suspect that your machine 'B' differs from machine 'A' in this
> | regard.
> |
> | Note that this is a _run-time_ and not a _compile-time_ behavior, as it
> | is for multicore-enabled LAPACK and BLAS libraries, the OpenMP library,
> | and basically most software of this type.
> |
> | You can override it, see
> | RcppArmadillo::armadillo_set_number_of_omp_threads
> | RcppArmadillo::armadillo_get_number_of_omp_threads
> |
> | Can you try and see if these help you?
> |
> | Dirk
> |
> | | Here is the compilation on server A:
> | | /usr/local/lib/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so'
> | |   'file197c21cbec564.cpp'
> | | g++ -std=gnu++11 -I"/usr/local/lib/R/include" -DNDEBUG -I../inst/include
> | |   -fopenmp -I"/usr/local/lib/R/site-library/Rcpp/include"
> | |   -I"/usr/local/lib/R/site-library/RcppArmadillo/include"
> | |   -I"/tmp/RtmpwhGRi3/sourceCpp-x86_64-pc-linux-gnu-1.0.9"
> | |   -I/usr/local/include -fpic -g -O2 -fstack-protector-strong -Wformat
> | |   -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
> | |   -c file197c21cbec564.cpp -o file197c21cbec564.o
> | | g++ -std=gnu++11 -shared -L/usr/local/lib/R/lib -L/usr/local/lib
> | |   -o sourceCpp_2.so file197c21cbec564.o -fopenmp -llapack -lblas
> | |   -lgfortran -lm -lquadmath -L/usr/local/lib/R/lib -lR
> | |
> | | and here it is for server B:
> | | /sw/R/R-4.2.3/lib64/R/bin/R CMD SHLIB --preclean -o 'sourceCpp_2.so'
> | |   'file158165b9c4ae1.cpp'
> | | g++ -std=gnu++11 -I"/sw/R/R-4.2.3/lib64/R/include" -DNDEBUG
> | |   -I../inst/include -fopenmp
> | |   -I"/home/my_username/.R/library/Rcpp/include"
> | |   -I"/home/my_username/.R/library/RcppArmadillo/include"
> | |   -I"/tmp/RtmpvfPt4l/sourceCpp-x86_64-pc-linux-gnu-1.0.10"
> | |   -I/usr/local/include -fpic -g -O2
> | |   -c file158165b9c4ae1.cpp -o file158165b9c4ae1.o
> | | g++ -std=gnu++11 -shared -L/sw/R/R-4.2.3/lib64/R/lib -L/usr/local/lib64
> | |   -o sourceCpp_2.so file158165b9c4ae1.o -fopenmp -llapack -lblas
> | |   -lgfortran -lm -lquadmath -L/sw/R/R-4.2.3/lib64/R/lib -lR
> | |
> | | I thought that the -fopenmp flag should let arma implicitly parallelize
> | | matrix computations. Any hints as to why this may not work on server B?
> | |
> | | The actual code I'm running is an R package that includes RcppArmadillo
> | | and RcppEnsmallen. Server B is the login node to an hpc cluster, but the
> | | code does not use all cores on the compute nodes either.
> | |
> | | Best,
> | | Robin
> | | _______________________________________________
> | | Rcpp-devel mailing list
> | | Rcpp-devel at lists.r-forge.r-project.org
> | | https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel
> |
> | --
> | dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
> |
>
> --
> dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
>