[Rcpp-devel] OT: Boost NT

Simon Zehnder szehnder at uni-bonn.de
Tue Oct 15 10:58:47 CEST 2013


Darren,

this library looks interesting! Thank you for the link! 

The user-friendly provision of tools that make developing high-performance code easier seems to be a new trend: lately Dirk mentioned yeppp! to me (I am always interested in such things), OpenMP 4.0 goes in the same direction, and Intel provides special #pragma clauses and so-called intrinsics to enforce vectorization without the need to dive into assembly code (see http://software.intel.com/en-us/articles/intel-intrinsics-guide). Parallelization using either the CPU or an accelerator (e.g. GPUs) is not that new, and OpenMP (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) and OpenACC (http://www.openacc-standard.org; there are also CUDA and OpenCL, but you need much more code to make the same things happen) are well known to most of us, I assume. However, some things are not covered by these standards: hardware specifics (Intel intrinsics excluded). If you want to increase the throughput of your pipeline for a certain operation (e.g. a for loop), you should know your pipeline: How deep is it? How many and which operations are possible in one cycle of your CPU? Usually you have to look this up in the hardware documentation from your chip's manufacturer.

Furthermore, data loading: most systems today are so-called non-uniform memory access (NUMA) systems, i.e. several cores share a cache on a socket. It then makes sense to keep data where it is needed (when code is parallelized, for example) and to pin processes to certain cores so that they do not have to share on-chip memory but can use it on their own. Methods like memory distribution and process pinning are used to achieve these objectives.
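To make the intrinsics point concrete, here is a minimal sketch of my own (not taken from any of the libraries above) that adds two float arrays four elements at a time with SSE intrinsics; it assumes an x86 CPU with SSE, 16-byte-aligned pointers, and a length divisible by 4:

    #include <immintrin.h>

    // Add two float arrays four elements per iteration using SSE.
    // Assumes 16-byte-aligned pointers and n divisible by 4.
    void vec_add(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);            // load 4 floats
            __m128 vb = _mm_load_ps(b + i);            // load 4 floats
            _mm_store_ps(out + i, _mm_add_ps(va, vb)); // add and store
        }
    }

A good compiler can often generate exactly these instructions by itself, but intrinsics let you enforce it without writing assembly.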

Parallelizing code is a topic of its own: it only makes sense to parallelize a loop with a lot of iterations, and even then often only with a few threads, as you otherwise suffer from the overhead of creating and tearing down the thread pool. Nested loops are even more problematic if you want real efficiency gains. A very common rule is to parallelize code that performs a lot of operations per iteration. However you decide, there remains the task of deciding which variables are shared and which are private among the threads working on your parallelized region. You have to be aware of data races (several threads accessing the same data, at least one of them with a write operation). Such errors are often very difficult to find and produce undefined behavior (e.g. each time you run the loop you get a different result). The only tool so far with a 100% detection rate is, to my knowledge, the Intel Inspector (still something that can be bought by normal people - TotalView is another story).
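As a minimal sketch of the shared/private question (my own illustration, not tied to any particular library): OpenMP's reduction clause gives every thread a private accumulator and combines them at the end, avoiding the data race that a naively shared accumulator would cause:

    #include <omp.h>

    double parallel_sum(const double *x, long n) {
        double total = 0.0;
        // Without reduction(+:total), all threads would write to the
        // shared 'total' concurrently - a data race with undefined
        // results. With it, each thread sums into a private copy.
        #pragma omp parallel for reduction(+:total)
        for (long i = 0; i < n; ++i)
            total += x[i];
        return total;
    }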

Finally, you can decide where a parallel region should be executed, since your computer has not only a CPU but also, for example, a graphics card (GPU). A GPU usually has thousands of cores on a single chip, each of which can perform operations. A drawback, though, is the very limited caches and bandwidth (caches are the on-chip memories; bandwidth is the channel through which data is loaded). So operations with massive data loads, or with a lot of different data per operation, usually do not parallelize well on GPUs (in addition, fast GPU development via OpenACC is only possible for Nvidia cards; for others you must use CUDA or OpenCL).
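As an illustration, here is a minimal OpenACC sketch (assuming an OpenACC-capable compiler, e.g. PGI's, and an attached accelerator) that offloads a simple saxpy loop and states explicitly which data moves to and from the device:

    // y = a*x + y, offloaded to the accelerator.
    // copyin: x is only read on the device; copy: y is read and written.
    void saxpy(int n, float a, const float *x, float *y) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

Note how the copyin/copy clauses make the bandwidth cost visible: every byte named there travels over the comparatively slow bus to the device.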

So, you can see that each high-performance method has its advantages and drawbacks, depending on your hardware and the specific instructions in your code. The hardware can be checked very easily by a library, and this is what this new Boost library apparently does (it looks pretty interesting, by the way). Still, there remain areas where even the best parallelization is not really optimal, as the library cannot know your specific data: How big is it? Will the data to be loaded fit into your cache? The Boost library concentrates on very simple functionals (as does yeppp!, which in addition avoids DIV and SQRT operations and relies on linear approximations - really good ones from a mathematical point of view), and this could work well - it will be interesting to test it!

Is it possible to use all these methods/tools in R with C/C++ extensions? Yes it is, but you have to tell the compiler! Use OpenMP and OpenACC to parallelize code. Vectorization is a little bit more complicated, but wait for OpenMP 4.0 - it has a simd pragma with which you can enforce vectorization. It has to be said that vectorization is often done automatically by the compiler, but the compiler also often fails to do it, especially if you have parallelized code. On our high-performance computing cluster we tested a simple parallelized loop to compute the constant phi: the Intel compiler did vectorize the operations inside the parallel region, but gcc failed (you can check vectorization via the compiler flag -ftree-vectorizer-verbose=2). Compilers are not always as intelligent as we hope they are.
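A minimal Rcpp sketch of what such a loop could look like once your compiler supports the OpenMP 4.0 simd clause (the function is my own example; note that only plain element reads, no R API calls, happen inside the parallel region):

    // [[Rcpp::plugins(openmp)]]
    #include <Rcpp.h>

    // [[Rcpp::export]]
    double dot_prod(Rcpp::NumericVector x, Rcpp::NumericVector y) {
        double s = 0.0;
        int n = x.size();
        // 'parallel for simd' distributes iterations over threads AND
        // asks the compiler to vectorize each thread's chunk.
        #pragma omp parallel for simd reduction(+:s)
        for (int i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }

With sourceCpp() and a suitable compiler you can then call dot_prod(x, y) directly from R.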
In the end it remains to say: if a developer needs code tailored to her needs and high performance, she must think about hardware and software: Is there something that can be scaled? What does my pipeline look like, and how can I structure my code to use it? How big is my data, and how much of it is used in each step of my code? Where is my bottleneck: bandwidth, caches, or scalability? Testing the code, testing specific regions of it, and fine-tuning parallelization and nested parallelization with different numbers of threads is how it is done.

To conclude: libraries and tools like the Boost Numerical Template Toolbox or yeppp! can help you, but certainly not entirely and not for every problem. Whether they are needed (i.e. whether they do a better job for simple functionals than self-made solutions) remains to be tested. But you have all the tools to tune your code yourself. In addition: Rcpp has the great advantage of using memory from R, and in my experience this is one of the main reasons why C++ extensions called from R often perform better than plain C++ programs not using R at all - at least in my case. Data has its origin somewhere and must be passed to the program, and in my opinion this is done very well in R/C++ via the Rcpp API. You can of course always use the Boost library inside your extended C++ code and make use of its many functions.
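For example, a minimal sketch (the function name is my own) showing how an R vector crosses into C++ without a copy via Rcpp:

    #include <Rcpp.h>

    // [[Rcpp::export]]
    double sum_no_copy(Rcpp::NumericVector x) {
        // 'x' is a thin proxy around the memory R already allocated;
        // nothing is copied when the vector is passed from R to C++,
        // which is exactly the advantage described above.
        double s = 0.0;
        for (int i = 0; i < x.size(); ++i)
            s += x[i];
        return s;
    }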

I hope this helps. 


Simon


On Oct 15, 2013, at 1:57 AM, Darren Cook <darren at dcook.org> wrote:

> I was taking a look at this new C++ library, going for Boost review
> soon: (or maybe I misunderstood and only the SIMD sub-library is aiming
> to be in Boost)
>  http://nt2.metascale.org/doc/html/index.html
> 
> It looks a bit like trying to port Matlab to C++, with an emphasis on
> high-level definition of operations that are automatically parallelized.
> 
> I'd be interested to hear from the experts here if it is something that
> could usefully be made to work with Rcpp, or if it is a perfect subset of
> what can already be done with Rcpp and R.
> 
> Darren
> 
> 
> 
> -- 
> Darren Cook, Software Researcher/Developer
> 
> http://dcook.org/work/ (About me and my work)
> http://dcook.org/blogs.html (My blogs and articles)
> _______________________________________________
> Rcpp-devel mailing list
> Rcpp-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/rcpp-devel


