[GenABEL-dev] ProbABEL refactor

Maarten Kooyman kooyman at gmail.com
Fri Dec 21 18:31:46 CET 2012


Dear list,

Over the last few months we (mostly Maarten, with some work by Lennart) 
have been working on the refactoring of the ProbABEL code. Our aim was 
twofold:
1) Create a code base that is easier to maintain by splitting off parts 
of the code into separate files (e.g. data.h was getting bigger and 
bigger, contained 'real code' instead of only headers, etc.)
2) See if we could implement performance improvements by using a library 
for much of the matrix algebra involved.

I am now proud to announce that this work has come to fruition. As of 
SVN r1028 the changes from the probabel-refactoring branch have been 
integrated into trunk. An official release is planned before the 
new year.

The rest of this message gives some more details about the changes to 
the code.

Matrix Algebra library implementation:
After some research and testing the Eigen template library 
(http://eigen.tuxfamily.org/) was chosen. It provides an easy way to use 
the most common forms of matrix manipulation as well as various solvers 
(which we might use in the future). Furthermore, since Eigen is a C++ 
template library, only the header files are needed (at compile time), 
i.e. there is no need to build a library or for a user to have a library 
installed. This makes life easier for a user who wants to install the 
software, but doesn't have root permissions on his/her system. Eigen 
also has an Open Source Initiative approved, GPL-compatible licence and 
claims to be fast for matrix*matrix operations 
(http://download.tuxfamily.org/eigen/btl-results-110323/matrix_matrix.pdf). 
I thought of making use of BLAS libraries, but this is much less 
intuitive (look up the BLAS function for matrix multiplication, DGEMM, 
and compare it to the Eigen equivalent, i.e. the * operator). BLAS 
libraries can also be hard to install in a way that gives good 
performance (ATLAS: http://math-atlas.sourceforge.net/) or are 
proprietary (Intel MKL: http://software.intel.com/en-us/intel-mkl).
For the implementation I chose to make a parallel implementation, where 
both the old 'mematrix' operations and data structures can be used, as 
well as the Eigen variants of the code. This was done by making a set of 
'eigen_mematrix' functions that can be called in the same way as their 
non-Eigen counterparts and wrap those calls into Eigen calls. This way 
the other parts of the code can be used independently of the matrix 
backend and don't need to be rewritten. This also made comparing the new 
and old versions of the PA tools very easy.
For now both backends will be maintained, but the old mematrix 
implementation is considered deprecated and will be removed in a future 
release of ProbABEL.
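
To illustrate the wrapper idea described above, here is a minimal sketch. 
It is not taken from the ProbABEL sources: the names NaiveBackend, 
matrix_wrapper, matrix, identity, get and put are made up for this 
example, and a plain std::vector-based backend stands in for both 
mematrix and Eigen so that the sketch compiles without external 
dependencies. The point is only that calling code talks to the wrapper 
and never to the backend directly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a matrix backend (assumption: in the real code
// this role is played by the old mematrix implementation or by Eigen).
class NaiveBackend {
 public:
    NaiveBackend(std::size_t nrow, std::size_t ncol)
        : nrow_(nrow), ncol_(ncol), data_(nrow * ncol, 0.0) {}

    std::size_t rows() const { return nrow_; }
    std::size_t cols() const { return ncol_; }

    double& at(std::size_t r, std::size_t c) { return data_[r * ncol_ + c]; }
    double at(std::size_t r, std::size_t c) const {
        return data_[r * ncol_ + c];
    }

    // Plain triple-loop matrix product; a real backend would do this faster.
    NaiveBackend multiply(const NaiveBackend& rhs) const {
        assert(ncol_ == rhs.nrow_);
        NaiveBackend out(nrow_, rhs.ncol_);
        for (std::size_t i = 0; i < nrow_; ++i)
            for (std::size_t k = 0; k < ncol_; ++k)
                for (std::size_t j = 0; j < rhs.ncol_; ++j)
                    out.at(i, j) += at(i, k) * rhs.at(k, j);
        return out;
    }

 private:
    std::size_t nrow_;
    std::size_t ncol_;
    std::vector<double> data_;  // row-major storage
};

// Wrapper with one fixed interface; the backend is a template parameter,
// so swapping it does not require changes in the calling code.
template <typename Backend>
class matrix_wrapper {
 public:
    matrix_wrapper(std::size_t nrow, std::size_t ncol) : impl_(nrow, ncol) {}
    explicit matrix_wrapper(const Backend& impl) : impl_(impl) {}

    std::size_t nrow() const { return impl_.rows(); }
    std::size_t ncol() const { return impl_.cols(); }

    double get(std::size_t r, std::size_t c) const { return impl_.at(r, c); }
    void put(double value, std::size_t r, std::size_t c) {
        impl_.at(r, c) = value;
    }

    // Forward the product to whatever the backend provides.
    matrix_wrapper operator*(const matrix_wrapper& rhs) const {
        return matrix_wrapper(impl_.multiply(rhs.impl_));
    }

 private:
    Backend impl_;
};

// The rest of the code only ever sees this alias.
using matrix = matrix_wrapper<NaiveBackend>;

// Example of backend-independent calling code: build an n x n identity.
matrix identity(std::size_t n) {
    matrix m(n, n);
    for (std::size_t i = 0; i < n; ++i) m.put(1.0, i, i);
    return m;
}
```

Switching backends in such a setup would come down to changing the 
single alias line, which is also what makes side-by-side comparison of 
two backends straightforward.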

More general code refactoring:
Aside from the changes related to Eigen, I also split up the code into 
.cpp and corresponding .h files for the various classes (most of which 
previously lived in data.h). This improves the readability of the code 
as well as reducing compile time when changes are made in only one class.
Furthermore, many function arguments were set to 'const' (where 
appropriate) to help prevent bugs from showing up. This effort is not 
finished and will be continued in future releases.
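
As a small illustration of why 'const' arguments help, consider the 
sketch below. The function column_mean is hypothetical and does not 
occur in ProbABEL; it only shows the general pattern of taking an input 
by const reference so that the compiler rejects accidental changes to 
it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, not from the ProbABEL sources.
// Taking the argument by const reference avoids a copy and makes any
// attempt to modify the input inside the function a compile-time error.
double column_mean(const std::vector<double>& values) {
    assert(!values.empty());
    double sum = 0.0;
    for (double v : values) {
        sum += v;
    }
    // values[0] = 0.0;  // would not compile: 'values' is const
    return sum / static_cast<double>(values.size());
}
```

A caller can pass its data to such a function and rely on it being 
unchanged afterwards, which removes one class of hard-to-find bugs.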

Build monitoring and other tools:
In order to monitor all these changes and ensure that the project 
remained in a compilable state, I installed the Jenkins continuous 
integration platform on my workstation. Jenkins monitors SVN and each 
time a change is detected it tries to recompile ProbABEL. But that's not 
all, you can basically run any program. In this way we added checks for 
memory leaks using Valgrind, static code analysis using cppcheck, simian 
to find code duplication, etc. This helped us a lot, not only in finding 
(possible) bugs, but also in making the code cleaner.
Unfortunately it isn't possible to install Jenkins on the GenABEL.org 
web server as it is a Java-based web server and (of course) requires the 
various compile and check tools to be installed. R-forge doesn't seem to 
provide a similar service either.

Profiling:
In order to find out where ProbABEL (mostly palinear in our tests) 
spends most of its time the application was profiled using Valgrind's 
callgrind option as well as GNU gprof. Data was visualised using 
kcachegrind and Gprof2Dot allowing us to make informed decisions on 
which parts of the code are candidates for speedups. It turns out that 
more than 30% of the time when running with the --mmscore option (the 
main use case in the GenEpi group at the ErasmusMC, Rotterdam) was spent 
in creating the var-covar matrix.

The other 69% of the time was spent on matrix-matrix products. That is 
one of the reasons Eigen was chosen, as it makes use of the SSE 
instructions of the CPU for its calculations. As a result, operations on 
doubles are approximately twice as fast as before.

Benchmarking:
And now the moment you have all been waiting for: some benchmarks. These 
were done on a 12-core Intel Xeon X5680 @ 3.33GHz machine with 140 GB of 
memory, with jobs running in parallel: the short-running jobs had to 
share the CPU and memory resources with 10 other jobs (because our 
server was busy at that time). The long-running jobs had to share these 
resources with only 2 or 3 other tasks.

The metrics used in the graphs were produced with /usr/bin/time -f 
"%e\t%U\t%K\t%M\t%C" palinear. The options --mmscore, --chrom 9, 
--no-head and --map were enabled and the input data was in filevector 
format. As you can see the work paid off: a factor of ~4 decrease in 
computation time when using mmscore, as well as a reduction in memory 
usage (for large sample sizes).


If you have any questions I am happy to answer them.


Kind Regards,

Maarten



Below are some URLs of the tools that were used.

*Analyses:*
/Profiling:/
valgrind (--tool=callgrind): http://valgrind.org/
GNU gprof: http://www.gnu.org/software/binutils/
/Visualisation:/
Gprof2Dot: http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
kcachegrind: http://kcachegrind.sourceforge.net/html/Home.html

*Development:*
jenkins: http://jenkins-ci.org/
SLOCCount (lines-of-code count): http://www.dwheeler.com/sloccount/
cppcheck: http://cppcheck.sourceforge.net/
cpplint.py: http://google-styleguide.googlecode.com/svn/trunk/cpplint/
simian (to detect code duplication): http://www.harukizaemon.com/simian/
eclipse cdt: http://www.eclipse.org/cdt/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: speed.png
Type: image/png
Size: 29507 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20121221/000dd1d3/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memory.png
Type: image/png
Size: 20886 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20121221/000dd1d3/attachment-0003.png>

