[GenABEL-dev] ProbABEL refactor

L.C. Karssen lennart at karssen.org
Fri Dec 21 19:59:13 CET 2012


Thanks for the summary, Maarten. And, of course, thanks for all the work
you put in!

With this major step forward as well as a few bugs fixed I plan to
release ProbABEL v0.3.0 before the end of the year. If people have
uncommitted changes, please let me know as soon as possible.


Enjoy the holidays,

Lennart.

On 21-12-12 18:31, Maarten Kooyman wrote:
> Dear list,
> 
> Over the last few months we (mostly Maarten, with some work by Lennart)
> have been working on the refactoring of the ProbABEL code. Our aim was
> twofold:
> 1) Create a code base that is easier to maintain by splitting off parts
> of the code into separate files (e.g. data.h was getting bigger and
> bigger, contained 'real code' instead of only headers, etc.)
> 2) See if we could implement performance improvements by using a library
> for much of the matrix algebra involved. 
> 
> I am now proud to announce that this work has come to fruition. As of
> SVN r1028 the changes from the probabel-refactoring branch have been
> integrated into trunk. An official release is planned before the
> new year. 
> 
> Some more details about the changes to the code are given below. 
> 
> Matrix Algebra library implementation:
> After some research and testing the Eigen template library
> (http://eigen.tuxfamily.org/) was chosen. It provides an easy way to
> use the most common forms of matrix manipulation as well as various
> solvers (which we might use in the future). Furthermore, since Eigen is
> a C++ template library, only the header files are needed (at compile
> time), i.e. there is no need to build a library or for a user to have a
> library installed. This makes life easier for a user who wants to
> install the software, but doesn't have root permissions on his/her
> system. In addition, Eigen has an Open Source Initiative approved,
> GPL-compatible licence and claims to be fast for matrix*matrix
> operations
> (http://download.tuxfamily.org/eigen/btl-results-110323/matrix_matrix.pdf).
> I thought of making use of BLAS libraries, but this is much less
> intuitive (look up the matrix multiplication function called DGEMM and
> compare this to the Eigen function * (i.e. the * operator)). BLAS
> libraries can also be hard to install in a way that gives good
> performance (ATLAS: http://math-atlas.sourceforge.net/) or are
> proprietary (Intel MKL: http://software.intel.com/en-us/intel-mkl).
> For the implementation I chose to make a parallel implementation, where
> both the old 'mematrix' operations and data structures can be used, as
> well as the Eigen variants of the code. This was done by making a set of
> 'eigen_mematrix' functions that can be called in the same way as their
> non-Eigen counterparts and wrap those calls into Eigen calls. This way
> the other parts of the code can be used independently of the matrix
> backend and don't need to be rewritten. This also made comparing the new
> and old versions of the PA tools very easy. 
> For now both backends will be maintained, but the old mematrix
> implementation is considered deprecated and will be removed in a future
> release of ProbABEL. 
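
As a rough illustration of the wrapper idea described above (not the actual ProbABEL code): calling code is written against one small matrix interface, and only the body of that class knows which backend does the arithmetic. The class name Matrix, its members, the helper cross_product and the naive triple loop are all hypothetical stand-ins; in an Eigen-backed variant the body of operator* would presumably just forward to Eigen's own * operator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 'mematrix'-like class. Calling code only
// uses this interface; the storage and arithmetic behind it could be
// swapped for an Eigen matrix without touching the callers.
class Matrix {
 public:
    Matrix(std::size_t nrow, std::size_t ncol)
        : nrow_(nrow), ncol_(ncol), data_(nrow * ncol, 0.0) {}

    std::size_t nrow() const { return nrow_; }
    std::size_t ncol() const { return ncol_; }

    // Element access, row-major storage.
    double& operator()(std::size_t r, std::size_t c) {
        return data_[r * ncol_ + c];
    }
    double operator()(std::size_t r, std::size_t c) const {
        return data_[r * ncol_ + c];
    }

    // Naive backend: plain triple loop. An Eigen-backed variant would
    // forward this call to Eigen's operator* instead.
    Matrix operator*(const Matrix& rhs) const {
        assert(ncol_ == rhs.nrow_);
        Matrix out(nrow_, rhs.ncol_);
        for (std::size_t i = 0; i < nrow_; ++i)
            for (std::size_t k = 0; k < ncol_; ++k)
                for (std::size_t j = 0; j < rhs.ncol_; ++j)
                    out(i, j) += (*this)(i, k) * rhs(k, j);
        return out;
    }

 private:
    std::size_t nrow_, ncol_;
    std::vector<double> data_;
};

// Backend-independent calling code: computes X'X for a matrix X using
// only the public interface above.
Matrix cross_product(const Matrix& x) {
    Matrix xt(x.ncol(), x.nrow());
    for (std::size_t i = 0; i < x.nrow(); ++i)
        for (std::size_t j = 0; j < x.ncol(); ++j)
            xt(j, i) = x(i, j);
    return xt * x;
}
```

Because cross_product only touches the public interface, running it against both backends and comparing the results is one way the old and new versions could be checked against each other.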
> 
> More general code refactoring:
> Aside from the changes related to Eigen, I also split up the code into
> .cpp and corresponding .h files for the various classes (most of which
> previously lived in data.h). This improves the readability of the code
> and reduces compile time when changes are made in only one class. 
> Furthermore, many function arguments were set to 'const' (where
> appropriate) to help prevent bugs from showing up. This effort is not
> finished and will be continued in future releases. 
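
To illustrate why 'const' arguments help (a generic sketch, not taken from the ProbABEL sources; the function name mean_of is made up): passing a read-only argument as a const reference lets the compiler reject any accidental modification of the caller's data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the argument is only read, so it is passed as a
// const reference. Any attempt to modify 'values' inside the function
// is rejected at compile time.
double mean_of(const std::vector<double>& values) {
    assert(!values.empty());  // precondition: at least one value
    double sum = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sum += values[i];
        // values[i] = 0.0;  // would not compile: 'values' is const
    }
    return sum / static_cast<double>(values.size());
}
```

Uncommenting the marked line turns a potential run-time bug into a compile error, which is the kind of protection meant above.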
> 
> Build monitoring and other tools:
> In order to monitor all these changes and ensure that the project
> remained in a compilable state, I installed the Jenkins continuous
> integration platform on my workstation. Jenkins monitors SVN and each
> time a change is detected it tries to recompile ProbABEL. But that's not
> all: it can run basically any program. In this way we added checks for
> memory leaks using Valgrind, static code analysis using cppcheck, simian
> to find code duplication, etc. This helped us a lot, not only in finding
> (possible) bugs, but also in making the code cleaner. 
> Unfortunately it isn't possible to install Jenkins on the GenABEL.org
> web server as it is a Java-based web server and (of course) requires the
> various compile and check tools to be installed. R-forge doesn't seem to
> provide a similar service either. 
> 
> Profiling:
> In order to find out where ProbABEL (mostly palinear in our tests)
> spends most of its time the application was profiled using Valgrind's
> callgrind option as well as GNU gprof. Data was visualised using
> kcachegrind and Gprof2Dot allowing us to make informed decisions on
> which parts of the code are candidates for speedups. It turns out that
> more than 30% of the time when running with the --mmscore option (the
> main use case in the GenEpi group at the ErasmusMC, Rotterdam) was spent
> in creating the var-covar matrix. 
> 
> The other 69% of the time is spent on matrix-matrix products. That is
> one of the reasons Eigen was chosen, as it makes use of the SSE
> instructions of the CPU for its calculations. As a result, operations on
> doubles are approximately twice as fast as before. 
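
A back-of-the-envelope check of what this could mean for total run time, using Amdahl's law and taking the two figures above at face value (roughly 69% of the run time in matrix products, and those products becoming about twice as fast). This is an illustrative calculation only; the function name is made up and the numbers are not additional measurements.

```cpp
#include <cassert>

// Amdahl's law: if a fraction 'p' of the run time is sped up by a
// factor 's', the overall speedup is 1 / ((1 - p) + p / s).
double overall_speedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.69 and s = 2 this evaluates to about 1.5, so under these assumptions the faster matrix products alone would give roughly a 1.5-fold overall speedup.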
> 
> Benchmarking:
> And now the moment you have all been waiting for: some benchmarks. These
> were done on a 12-core Intel Xeon X5680 @ 3.33GHz machine with 140 GB of
> memory, with jobs running in parallel: the short-running jobs had to
> share the CPU and memory resources with 10 other jobs (because our
> server was busy at that time). The long-running jobs had to share these
> resources with only 2 or 3 other tasks.
> 
> The metrics used in the graph were produced with /usr/bin/time -f
> "%e\t%U\t%K\t%M\t%C" palinear. The options --mmscore, --chrom 9,
> --no-head and --map were enabled and the input data was in filevector
> format. As you can see the work paid off: a factor of ~4 decrease in
> computation time when using mmscore, as well as a reduction in memory
> usage (for large sample sizes).
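
For reference, the GNU time format string above writes five tab-separated fields: %e (elapsed wall-clock seconds), %U (user CPU seconds), %K (average total memory use in KB), %M (maximum resident set size in KB) and %C (the command line being timed). Below is a small sketch of how such a line could be read back, e.g. for plotting. The struct and function names are hypothetical, and this is not how the graph in the original post is known to have been produced.

```cpp
#include <istream>
#include <sstream>
#include <string>

// One line of output from: /usr/bin/time -f "%e\t%U\t%K\t%M\t%C" ...
struct TimeRecord {
    double elapsed_s;     // %e: elapsed wall-clock time in seconds
    double user_s;        // %U: CPU seconds spent in user mode
    long avg_mem_kb;      // %K: average total memory use in KB
    long max_rss_kb;      // %M: maximum resident set size in KB
    std::string command;  // %C: command name and arguments
};

// Returns true and fills 'rec' if 'line' holds the five expected fields.
bool parse_time_line(const std::string& line, TimeRecord& rec) {
    std::istringstream in(line);
    // The four numeric fields; operator>> skips the separating tabs.
    if (!(in >> rec.elapsed_s >> rec.user_s
             >> rec.avg_mem_kb >> rec.max_rss_kb))
        return false;
    in >> std::ws;  // skip the tab before the command field
    // The rest of the line is the command, which may contain spaces.
    if (!std::getline(in, rec.command))
        return false;
    return !rec.command.empty();
}
```

A line consisting of four numbers followed by a command, all separated by tabs, parses into one TimeRecord; anything that does not start with the four numeric fields is rejected.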
> 
> 
> If you have any questions, I am happy to answer them.
> 
> 
> Kind Regards,
> 
> Maarten
> 
> 
> 
> Below are some URLs of the tools that were used.
> 
> *Analyses:*
> /Profiling:/
> valgrind (--tool=callgrind): http://valgrind.org/
> GNU gprof: http://www.gnu.org/software/binutils/
> /Visualisation:/
> Gprof2Dot: http://code.google.com/p/jrfonseca/wiki/Gprof2Dot
> kcachegrind: http://kcachegrind.sourceforge.net/html/Home.html
> 
> *Development:*
> jenkins: http://jenkins-ci.org/
> SLOCCount (lines-of-code count): http://www.dwheeler.com/sloccount/
> cppcheck: http://cppcheck.sourceforge.net/
> cpplint.py: http://google-styleguide.googlecode.com/svn/trunk/cpplint/
> simian (to detect code duplication): http://www.harukizaemon.com/simian/
> Eclipse CDT: http://www.eclipse.org/cdt/
> 
> 
> 
> 

-- 
-----------------------------------------------------------------
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org

Please do not send me Word or PowerPoint files!
See http://www.gnu.org/philosophy/no-word-attachments.nl.html
------------------------------------------------------------------
