[GenABEL-dev] multiple ProbABEL's palinear runs
Alvaro Jesus Frank
alvaro.frank at rwth-aachen.de
Mon Jul 15 17:07:26 CEST 2013
Dear all,
I am working on a high-performance implementation of an ordinary least-squares estimator (OLS model), similar to the one implemented in ProbABEL's palinear (without the --mmscore option), where X contains the given SNPs and Y the phenotypes.
(As given in the ProbABEL manual, section 7 "Methodology", at http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf)
b = (X'*X)^-1 * X' * y.
The goal is to solve this for multiple design matrices X (SNPs?) and phenotype vectors Y. For this we compute the formula as
for each X
    for each Y
        b = (X'*X)^-1 * X' * y
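To be concrete, this is roughly what I have in mind in C++, using Eigen for the linear algebra (the library choice is my assumption, not something ProbABEL prescribes); X'X is factorized once per design matrix and reused for every phenotype:

#include <Eigen/Dense>
#include <vector>

// Solve b = (X'X)^-1 X'y for every combination of design matrix X and
// phenotype vector y.
std::vector<Eigen::VectorXd> ols_all(const std::vector<Eigen::MatrixXd> &Xs,
                                     const std::vector<Eigen::VectorXd> &ys)
{
    std::vector<Eigen::VectorXd> betas;
    betas.reserve(Xs.size() * ys.size());
    for (const Eigen::MatrixXd &X : Xs) {
        // Cholesky factorization of the normal-equations matrix X'X,
        // computed once and reused for all phenotypes.
        Eigen::LLT<Eigen::MatrixXd> llt(X.transpose() * X);
        for (const Eigen::VectorXd &y : ys) {
            betas.push_back(llt.solve(X.transpose() * y));
        }
    }
    return betas;
}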
We want to offer the GenABEL community an estimator that can be used in the same way people use the current tools (ProbABEL in R), but that is faster and capable of handling LARGE datasets (on disk and in memory).
That is why I am writing it in C++, while making sure that it can be called directly from R.
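For the R interface I am currently assuming something along the lines of Rcpp/RcppEigen attributes; here is a minimal sketch of how a single solve could be exposed to R (the function name fast_ols is just a placeholder):

// fast_ols.cpp -- load from R with Rcpp::sourceCpp("fast_ols.cpp"),
// then call fast_ols(X, y).
// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>

// Returns the OLS coefficient vector for one design matrix and phenotype.
// [[Rcpp::export]]
Eigen::VectorXd fast_ols(const Eigen::MatrixXd &X, const Eigen::VectorXd &y)
{
    // Normal-equations solve; a QR factorization would be more robust
    // if X is close to rank-deficient.
    return (X.transpose() * X).ldlt().solve(X.transpose() * y);
}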
My understanding:
A few concerns came to mind when researching the workflow for using OMICS data in linear estimators.
There seems to be a long process before the real-life data from MaCH (test.mldose? for X and mlinfo? for Y), which sits in files, can be used in calculations. The first concern is how to obtain the design matrices X from those files.
It is my understanding that there are two types of data, imputed data and DatABEL data. Either way, the data seems to be pre-processed early in the workflow; my impression is that this preprocessing is done in R. It also seems that R cannot handle large amounts of data loaded into memory at once.
From what I see, the data comes with some irregularities in its values (missing values, invalid rows in the X/Y matrices), which makes it difficult to use linear estimators right away; this is why the preprocessing exists. DatABEL seems to be the R tool (implemented in C++) that can do fast preprocessing of big data sets. However, I think DatABEL only does the reading and writing of files in C++ (the part called filevector), while the preprocessing functions are defined and implemented in R. Am I correct?
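To illustrate where my question comes from, this is the kind of loading code I have in mind; the whitespace-delimited layout, the leading ID columns, and "NA" as the missing-value marker are all assumptions on my side, which is exactly the part I am unsure about:

#include <cmath>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Read one whitespace-delimited dosage file into a row-per-individual
// buffer, skipping `id_cols` leading identifier columns per line and
// turning the (assumed) "NA" marker into NaN so it can be detected later.
std::vector< std::vector<double> > read_dosages(const std::string &path,
                                                std::size_t id_cols)
{
    std::vector< std::vector<double> > rows;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string token;
        std::vector<double> row;
        for (std::size_t i = 0; fields >> token; ++i) {
            if (i < id_cols) continue;                   // skip ID columns
            row.push_back(token == "NA" ? std::nan("")   // missing value
                                        : std::stod(token));
        }
        rows.push_back(row);
    }
    return rows;
}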
My Problems:
This is where my troubles start. Since I am trying to make this tool usable for the GenABEL community while still being able to handle TERABYTES of data with fast computations, I would really like to include the preprocessing of X and Y in my C++ workflow. To work around the memory and performance limitations of R, I am trying to load the data from disk from within C++. Since my estimator function runs in C++, it expects matrices containing numbers that can be used directly for computation. Assuming the data must be preprocessed to obtain valid matrices with usable numbers, I have the following options:
A)
For performance reasons, I was considering having the data already preprocessed in files on disk. Is this feasible? (The preprocessed data would take at most as much space on disk as the original data; would this be cumbersome?)
B)
If there are only a few preprocessing functions that people use, I could re-implement them in C++ and apply them on the fly while loading the data from disk (see the first sketch after option D below). This would be more difficult if everyone has their own customized R preprocessing functions.
C)
Another alternative is to let users supply their own R preprocessing functions. I would then preprocess on the fly from inside C++ by calling back into R (see the second sketch after option D below). This would be slower and harder to implement than B).
D)
If DatABEL really does all the necessary preprocessing from inside C++, I could use it directly (or let the user specify what to use) and would not need to re-implement the preprocessing functions. It seems, though, that converting the data into the DatABEL filevector format takes from 30 minutes to an hour.
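For B), the kind of on-the-fly preprocessing I have in mind would, for example, drop every observation that has a missing value before the solve. A minimal sketch, assuming missing values have already been encoded as NaN while loading (my convention, not something the file formats guarantee):

#include <Eigen/Dense>
#include <cmath>

// Keep only the rows of X and y that contain no NaN in either, so the
// estimator always sees a complete-case design matrix and phenotype.
void drop_incomplete_rows(Eigen::MatrixXd &X, Eigen::VectorXd &y)
{
    Eigen::Index kept = 0;
    for (Eigen::Index i = 0; i < X.rows(); ++i) {
        if (X.row(i).allFinite() && std::isfinite(y(i))) {
            X.row(kept) = X.row(i);   // compact the kept rows to the top
            y(kept) = y(i);
            ++kept;
        }
    }
    X.conservativeResize(kept, X.cols());
    y.conservativeResize(kept);
}

For C), my current idea is to embed R via RInside and hand each freshly loaded chunk to a user-supplied R function; roughly like this (user_preprocess and the script name are placeholders):

#include <RInside.h>

int main(int argc, char *argv[])
{
    RInside R(argc, argv);                        // embedded R session
    R.parseEvalQ("source('user_preprocess.R')");  // load the user's R code

    Rcpp::NumericMatrix chunk(1000, 50);          // stand-in for a chunk
                                                  // loaded from disk
    R["chunk"] = chunk;                           // copy the chunk into R
    Rcpp::NumericMatrix clean =
        R.parseEval("user_preprocess(chunk)");    // call back into R
    // ... hand `clean` over to the estimator ...
    return 0;
}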
I would really appreciate any help clarifying my understanding of how the preprocessing of the data works and where it fits in the workflow.
Best regards,
- Alvaro Frank