[GenABEL-dev] ProbABEL refactor
Maarten Kooyman
kooyman at gmail.com
Fri Dec 21 18:31:46 CET 2012
Dear list,
Over the last few months we (mostly Maarten, with some work by Lennart)
have been working on the refactoring of the ProbABEL code. Our aim was
twofold:
1) Create a code base that is easier to maintain by splitting off parts
of the code into separate files (e.g. data.h was getting bigger and
bigger, contained 'real code' instead of only headers, etc.)
2) See if we could implement performance improvements by using a library
for much of the matrix algebra involved.
I am now proud to announce that this work has come to fruition. As of
SVN r1028 the changes from the probabel-refactoring branch have been
integrated into trunk. An official release is planned to occur before the
new year.
Some more details about the changes to the code follow below.
Matrix Algebra library implementation:
After some research and testing the Eigen template library
(http://eigen.tuxfamily.org/) was chosen. It provides an easy way to use
the most common forms of matrix manipulation as well as various solvers
(which we might use in the future). Furthermore, since Eigen is a C++
template library, only the header files are needed (at compile time),
i.e. there is no need to build a library or for a user to have a library
installed. This makes life easier for a user who wants to install the
software, but doesn't have root permissions on his/her system.
Furthermore, Eigen has an Open Source Initiative approved, GPL-compatible
licence and claims to be fast for matrix*matrix operations
(http://download.tuxfamily.org/eigen/btl-results-110323/matrix_matrix.pdf).
I thought of making use of BLAS libraries, but this is much less
intuitive (look up the matrix multiplication function called DGEMM and
compare this to the Eigen equivalent, i.e. the * operator). BLAS
libraries can also be hard to install in a way that gives good
performance (ATLAS: http://math-atlas.sourceforge.net/) or are
proprietary (Intel MKL: http://software.intel.com/en-us/intel-mkl).
For the implementation I chose to make a parallel implementation, where
both the old 'mematrix' operations and data structures can be used, as
well as the Eigen variants of the code. This was done by making a set of
'eigen_mematrix' functions that can be called in the same way as their
non-Eigen counterparts and wrap those calls into Eigen calls. This way
the other parts of the code can be used independently of the matrix
backend and don't need to be rewritten. This also made comparing the new
and old versions of the PA tools very easy.
For now both backends will be maintained, but the old mematrix
implementation is considered deprecated and will be removed in a future
release of ProbABEL.
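To illustrate the idea of two interchangeable backends, here is a minimal
sketch. It is not the actual ProbABEL code: the class name Matrix, its
get/put methods, the helper trace_of_product and the SKETCH_USE_EIGEN
macro are all hypothetical, and selecting the backend at compile time
(Eigen when its headers are found, a hand-written fallback otherwise) is
an assumption made for this example.

```cpp
// Sketch only: `Matrix`, `get`, `put` and `trace_of_product` are
// hypothetical names, not ProbABEL's mematrix / eigen_mematrix interface.
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: use Eigen if its headers are on the include path,
// otherwise fall back to a plain hand-written implementation.
#if defined(__has_include)
#  if __has_include(<Eigen/Dense>)
#    include <Eigen/Dense>
#    define SKETCH_USE_EIGEN 1
#  endif
#endif

#ifdef SKETCH_USE_EIGEN
// Eigen-backed variant: storage and product are delegated to Eigen.
class Matrix {
 public:
    Matrix(int rows, int cols) : data_(Eigen::MatrixXd::Zero(rows, cols)) {}
    int rows() const { return static_cast<int>(data_.rows()); }
    int cols() const { return static_cast<int>(data_.cols()); }
    double get(int r, int c) const { return data_(r, c); }
    void put(double value, int r, int c) { data_(r, c) = value; }
    Matrix operator*(const Matrix &rhs) const {
        assert(cols() == rhs.rows());
        Matrix out(rows(), rhs.cols());
        out.data_ = data_ * rhs.data_;  // Eigen's * operator does the work
        return out;
    }
 private:
    Eigen::MatrixXd data_;
};
#else
// Hand-written variant with exactly the same public interface.
class Matrix {
 public:
    Matrix(int rows, int cols)
        : rows_(rows), cols_(cols),
          data_(static_cast<std::size_t>(rows) * cols, 0.0) {}
    int rows() const { return rows_; }
    int cols() const { return cols_; }
    double get(int r, int c) const { return data_[index(r, c)]; }
    void put(double value, int r, int c) { data_[index(r, c)] = value; }
    Matrix operator*(const Matrix &rhs) const {
        assert(cols_ == rhs.rows_);
        Matrix out(rows_, rhs.cols_);
        for (int i = 0; i < rows_; ++i) {
            for (int j = 0; j < rhs.cols_; ++j) {
                double sum = 0.0;
                for (int k = 0; k < cols_; ++k)
                    sum += get(i, k) * rhs.get(k, j);
                out.put(sum, i, j);
            }
        }
        return out;
    }
 private:
    // Row-major position of element (r, c) in the flat storage.
    std::size_t index(int r, int c) const {
        return static_cast<std::size_t>(r) * cols_ + c;
    }
    int rows_;
    int cols_;
    std::vector<double> data_;
};
#endif

// Calling code is written once and does not know which backend is used.
double trace_of_product(const Matrix &a, const Matrix &b) {
    const Matrix c = a * b;
    double trace = 0.0;
    for (int i = 0; i < c.rows() && i < c.cols(); ++i)
        trace += c.get(i, i);
    return trace;
}
```

Because both variants expose the same interface, a function such as
trace_of_product compiles unchanged against either one, which is also
what makes comparing the results of the two backends straightforward.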
More general code refactoring:
Aside from the changes related to Eigen, I also split up the code into
.cpp and corresponding .h files for the various classes (most of which
previously lived in data.h). This improves the readability of the code
as well as reducing compile time when changes are made in only one class.
Furthermore, many function arguments were set to 'const' (where
appropriate) to help prevent bugs from showing up. This effort is not
finished and will be continued in future releases.
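As a small illustration of why 'const' arguments help (a sketch with a
hypothetical helper function, not code taken from ProbABEL): a parameter
passed by const reference is not copied, and any accidental write to it
inside the function is rejected by the compiler instead of turning into
a run-time bug.

```cpp
// Sketch only: `mean` is a hypothetical helper, not a ProbABEL function.
#include <cassert>
#include <cstddef>
#include <vector>

// `values` is a const reference: no copy is made and the function
// cannot modify the caller's data.
double mean(const std::vector<double> &values) {
    assert(!values.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i)
        sum += values[i];
    // values[0] = 0.0;  // would not compile: `values` is const
    return sum / static_cast<double>(values.size());
}
```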
Build monitoring and other tools:
In order to monitor all these changes and ensure that the project
remained in a compilable state, I installed the Jenkins continuous
integration platform on my workstation. Jenkins monitors SVN and each
time a change is detected it tries to recompile ProbABEL. But that's not
all, you can basically run any program. In this way we added checks for
memory leaks using Valgrind, static code analysis using cppcheck, simian
to find code duplication, etc. This helped us a lot, not only in finding
(possible) bugs, but also in making the code cleaner.
Unfortunately it isn't possible to install Jenkins on the GenABEL.org
web server as it is a Java-based web server and (of course) requires the
various compile and check tools to be installed. R-forge doesn't seem to
provide a similar service either.
Profiling:
In order to find out where ProbABEL (mostly palinear in our tests)
spends most of its time, the application was profiled using Valgrind's
callgrind tool as well as GNU gprof. Data was visualised using
kcachegrind and Gprof2Dot, allowing us to make informed decisions on
which parts of the code are candidates for speedups. It turns out that
more than 30% of the time when running with the --mmscore option (the
main use case in the GenEpi group at the ErasmusMC, Rotterdam) was spent
in creating the var-covar matrix.
The other 69% of the time was spent on matrix-matrix products. That is
one of the reasons Eigen was chosen, as it makes use of the SSE
instructions of the CPU for its calculations. As a result, operations on
doubles are approximately twice as fast as before.
Benchmarking:
And now the moment you have all been waiting for: some benchmarks. These
were done on a 12-core Intel Xeon X5680 @ 3.33GHz machine with 140 GB of
memory, in a parallel way: the short-running jobs had to share the CPU
and memory resources with 10 other jobs (because our server was busy at
that time). The long-running jobs had to share these resources with only
2 or 3 other tasks.
The metrics used in the graphs were produced with /usr/bin/time -f
"%e\t%U\t%K\t%M\t%C" palinear, i.e. elapsed wall-clock time, user CPU
time, average total memory use, maximum resident set size and the
command line. The options --mmscore, --chrom 9, --no-head and --map were
enabled and the input data was in filevector format. As you can see, the
work paid off: a factor of ~4 decrease in computation time when using
mmscore, as well as a reduction in memory usage (for large sample
sizes).
If you have any questions I am happy to answer them.
Kind Regards,
Maarten
Below are some URLs of the tools that were used.
*Analyses:*
/Profiling:/
valgrind (--tool=callgrind): http://valgrind.org/
GNU gprof: http://www.gnu.org/software/binutils/
/Visualisation:/
Gprof2Dot: http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
kcachegrind: http://kcachegrind.sourceforge.net/html/Home.html
*Development:*
jenkins: http://jenkins-ci.org/
lines of code count (SLOCCount): http://www.dwheeler.com/sloccount/
cppcheck: http://cppcheck.sourceforge.net/
cpplint.py: http://google-styleguide.googlecode.com/svn/trunk/cpplint/
simian (to detect code duplication): http://www.harukizaemon.com/simian/
eclipse cdt: http://www.eclipse.org/cdt/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: speed.png
Type: image/png
Size: 29507 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20121221/000dd1d3/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memory.png
Type: image/png
Size: 20886 bytes
Desc: not available
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20121221/000dd1d3/attachment-0003.png>