[GenABEL-dev] multiple ProbABEL's palinear runs
Maarten Kooyman
kooyman at gmail.com
Sun Jul 21 21:21:04 CEST 2013
Dear Alvaro,
I did some benchmarking on ProbABEL's palinear (without the --mmscore
option) in the past, and I recall that most of the time was spent on
getting the genotype data to the OLS part, not on the OLS part itself.
I could not find the results of the profiling, so I am not sure this was
truly the case. Loading the genotypes only once instead of N times
(where N is the number of phenotypes) would give a speed-up. However,
be aware that with real-life data, outliers of the phenotypes are
removed. If these outliers are not removed in your data, the number of
false positives will be high. So the matrix X is unique for every
phenotype. Since the
(X'*X)^-1 * X'
part of
b = (X'*X)^-1 * X' * y
is not the same for each phenotype, the speed-up there will be hard(er)
to get.
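
To illustrate the point, a minimal sketch (Eigen is used here purely as
an example of a linear algebra backend) of why the projection part has
to be rebuilt inside the phenotype loop:

    #include <Eigen/Dense>
    #include <cmath>
    #include <vector>

    // b = (X'X)^-1 X'y for one phenotype: only samples with an observed,
    // non-outlier phenotype (encoded as non-NaN here) enter the regression,
    // so X -- and therefore (X'X)^-1 X' -- differs from phenotype to phenotype.
    Eigen::VectorXd ols_one_phenotype(const Eigen::MatrixXd &X_full,
                                      const Eigen::VectorXd &y_full)
    {
        std::vector<int> keep;
        for (int i = 0; i < y_full.size(); ++i)
            if (!std::isnan(y_full(i)))
                keep.push_back(i);

        Eigen::MatrixXd X(keep.size(), X_full.cols());
        Eigen::VectorXd y(keep.size());
        for (std::size_t k = 0; k < keep.size(); ++k) {
            X.row(k) = X_full.row(keep[k]);
            y(k) = y_full(keep[k]);
        }

        // X'X has to be formed and solved again for every phenotype
        return (X.transpose() * X).ldlt().solve(X.transpose() * y);
    }

Only if the same samples were kept for every phenotype could X'*X be
factored once and reused; the censoring of the phenotypes is exactly
what prevents that.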
I think that without the ability to censor phenotypes the program will
not have much real-life use.
Kind regards,
Maarten
On 07/15/2013 05:07 PM, Alvaro Jesus Frank wrote:
> Dear all,
>
> I am working on a high-performance implementation of an ordinary least-squares (OLS) linear estimator, similar to the one implemented in ProbABEL's palinear (without the --mmscore option), where X are the given SNPs and Y are the phenotypes.
> (As given in the ProbABEL manual, section 7 "Methodology", at http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf)
>
>
> b = (X'*X)^-1 * X' * y.
>
> The goal is to solve this for multiple design matrices X (SNPs?) and phenotypes Y. For this we compute the formula as
>
> for each X
>     for each Y
>         b = (X'*X)^-1 * X' * y
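>
> For illustration, a rough sketch of this double loop in C++ (I use Eigen here only as an example backend; this is not meant as the final implementation):
>
>     #include <Eigen/Dense>
>     #include <vector>
>
>     // b = (X'X)^-1 X'y for every combination of design matrix X and phenotype y.
>     // X'X is factored once per design matrix and reused for all phenotypes.
>     std::vector<Eigen::VectorXd>
>     estimate_all(const std::vector<Eigen::MatrixXd> &Xs,
>                  const std::vector<Eigen::VectorXd> &ys)
>     {
>         std::vector<Eigen::VectorXd> betas;
>         for (const Eigen::MatrixXd &X : Xs) {
>             const Eigen::LDLT<Eigen::MatrixXd> XtX(X.transpose() * X);
>             for (const Eigen::VectorXd &y : ys)
>                 betas.push_back(XtX.solve(X.transpose() * y));
>         }
>         return betas;
>     }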
>
>
> We want to offer the GenABEL community an estimator that can be used in the same way people use the current tools (ProbABEL in R), but faster, and capable of handling LARGE datasets (on disk and in memory).
> That is why I am writing it in C++, while making sure that it can be called directly from R.
>
> My understanding:
> A few concerns came to mind when researching the workflow of using OMICS data in linear estimators.
> There seems to be a long process before the real-life data from MaCH (test.mldose? for X and mlinfo? for Y) that sits in files can be used in calculations. The first concern is how to obtain the design matrices X from those files.
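>
> As an example of what I have in mind for the loading step (only a sketch; I am assuming a whitespace-delimited layout with a few leading ID/label columns followed by one dosage per SNP, which I still need to verify against the actual mldose format):
>
>     #include <fstream>
>     #include <sstream>
>     #include <string>
>     #include <vector>
>
>     // Hypothetical reader for a whitespace-delimited dose file:
>     // one sample per line, `skip_columns` leading ID/label fields,
>     // then one dosage value per SNP.
>     std::vector<std::vector<double> > read_dose_file(const std::string &path,
>                                                      int skip_columns)
>     {
>         std::vector<std::vector<double> > X;
>         std::ifstream in(path.c_str());
>         std::string line;
>         while (std::getline(in, line)) {
>             std::istringstream fields(line);
>             std::string label;
>             for (int i = 0; i < skip_columns; ++i)
>                 fields >> label;            // drop the ID/label columns
>             std::vector<double> row;
>             double dose;
>             while (fields >> dose)
>                 row.push_back(dose);
>             X.push_back(row);
>         }
>         return X;
>     }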
>
> It is my understanding that there are two types of data, imputed data and DatABEL data. Either way, the data seems to be pre-processed early in the workflow; my impression is that this preprocessing is done in R. It also seems that R can't handle large amounts of data loaded into memory at once.
>
> From what I see, the data comes with some irregularities in its values (missing values, invalid rows in the X/Y matrices), and this makes it difficult to use linear estimators right away; this is why the preprocessing exists. DatABEL seems to be the R tool (implemented in C++) that can do fast pre-processing of big data sets. However, I think that DatABEL only does the reading and writing of files in C++ (the part called filevector), while the pre-processing functions are defined and implemented in R. Am I correct?
>
>
> My Problems:
> This is where my troubles start. Since I am trying to make this tool usable for the GenABEL community while still being able to handle TERABYTES of data with fast computations, I would really like to include the preprocessing of X and Y in my C++ workflow. To avoid the memory and performance limitations of R, I am trying to load the data from disk within C++. Since my estimator function runs in C++, it expects matrices whose entries can be used directly for computation. Assuming that the data must be preprocessed to obtain valid matrices with usable numbers, I have the following options:
>
> A)
> For performance reasons, I was considering having the data already pre-processed in files on disk. Is this feasible? (The preprocessed data would take at most as much disk space as the original data; would this be too cumbersome?)
>
> B)
> If there are only a few preprocessing functions that people use, I could re-implement them inside C++ and use them on the fly while loading the data from disk. This would be more difficult if everyone has their own customized R pre-processing functions.
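>
> As a sketch of what I mean by such an on-the-fly step (purely illustrative; mean imputation of missing dosages is just one example of a preprocessing function):
>
>     #include <Eigen/Dense>
>     #include <cmath>
>
>     // Illustrative on-the-fly preprocessing: replace missing dosages (NaN)
>     // in one SNP column by the column mean while the column is in memory.
>     void mean_impute_column(Eigen::Ref<Eigen::VectorXd> col)
>     {
>         double sum = 0.0;
>         int n = 0;
>         for (int i = 0; i < col.size(); ++i)
>             if (!std::isnan(col(i))) { sum += col(i); ++n; }
>         const double mean = (n > 0) ? sum / n : 0.0;
>         for (int i = 0; i < col.size(); ++i)
>             if (std::isnan(col(i))) col(i) = mean;
>     }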
>
> C)
> Another alternative is to let users supply their own R pre-processing functions. I would then preprocess on the fly from inside C++ by calling back into R. This would be slower and harder to do than B).
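>
> If Rcpp were used as the bridge (an assumption on my part), such a callback could look roughly like this:
>
>     #include <Rcpp.h>
>
>     // Hypothetical callback: pass a raw phenotype vector to a user-supplied
>     // R preprocessing function and receive the cleaned vector back.
>     // [[Rcpp::export]]
>     Rcpp::NumericVector preprocess_in_r(Rcpp::NumericVector y,
>                                         Rcpp::Function user_preprocess)
>     {
>         // every call crosses the C++/R boundary, which is where the cost is
>         Rcpp::NumericVector cleaned = user_preprocess(y);
>         return cleaned;
>     }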
>
> D)
> If DatABEL really does all the necessary pre-processing from inside C++, I could just use it directly (or let the user specify what to use) and would not need to re-implement the pre-processing functions. It seems, though, that converting the data into the DatABEL filevector format takes from 30 minutes to an hour.
>
>
> I would really appreciate any help clarifying my understanding of how the pre-processing of the data works and where it fits in the workflow.
>
> Best regards,
>
> - Alvaro Frank
> _______________________________________________
> genabel-devel mailing list
> genabel-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel