[GenABEL-dev] multiple ProbABEL's palinear runs

L.C. Karssen lennart at karssen.org
Thu Jul 18 23:33:26 CEST 2013


Dear Alvaro,

Thank you for showing interest in the ProbABEL project!

On 15-07-13 17:07, Alvaro Jesus Frank wrote:
>
> Dear all,
>
> I am working on a high performance implementation of an ordinary
> linear estimator (OLS model), similar to the one implemented in
> ProbABEL's palinear (without --mmscore option), where X are SNP given
> and Y are the phenotypes. (As given by the ProbABEL manual in section
> 7 "Methodology" at
> http://www.genabel.org/sites/default/files/pdfs/ProbABEL_manual.pdf)
>
>
> b = (X'*X)^-1 * X' * y.
>
> The goal is to solve this with multiple design matrices (SNPs??) X

Indeed, the design matrix contains both SNP data and other covariates
(e.g. sex, age, etc.).
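
To make the per-SNP regression concrete, here is a minimal OLS sketch in
C++ using the Eigen library (the function name and toy data are just
placeholders, not ProbABEL's actual code):

#include <Eigen/Dense>
#include <iostream>

// b = (X'X)^-1 * X' * y, computed with an LDLT (Cholesky-type) solve
// instead of an explicit matrix inverse.
Eigen::VectorXd ols(const Eigen::MatrixXd& X, const Eigen::VectorXd& y)
{
    return (X.transpose() * X).ldlt().solve(X.transpose() * y);
}

int main()
{
    // Toy design matrix: 4 individuals, intercept + sex + SNP dosage.
    Eigen::MatrixXd X(4, 3);
    X << 1, 0, 0.1,
         1, 1, 1.9,
         1, 0, 1.0,
         1, 1, 0.2;
    Eigen::VectorXd y(4);           // phenotype values
    y << 1.2, 3.4, 2.1, 2.8;

    std::cout << ols(X, y) << std::endl;
    return 0;
}

In a genome-wide run this solve is simply repeated for every SNP, with
only the dosage column of X changing.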

> and Phenotypes Y. For this we compute the formula as
>
> for each X for each Y b=(X'*X)^-1 * X' * y.
>
>
> We want to offer the GenABEL community an Estimator to be used in the
> same way people use the current tools (ProbABEL in R)

Actually, ProbABEL is a command line tool. Even though several packages
of the GenABEL project are R packages, ProbABEL is not.

>, but faster,
> and capable of handling LARGE datasets (in disk & memory). That is
> why I am writing it in C++,

Sounds good! ProbABEL is written in a mixture of C and C++.

> while making sure that it can be called
> directly from R.

I'm not sure that that should be a requirement. At the moment the
workflow is roughly the following:
1) Prepare the phenotype data (e.g. specify covariates, do QC like
removing outliers, applying a log transformation, etc.). This is done by
each researcher independently, as they are the experts on their
phenotypes.
2) Imputation of the genetic data is done centrally, as this is a
time-consuming task that only needs to be redone when additional
individuals have been genotyped or when a genomic reference set has been
updated. This happens roughly once or twice per year.

>
> My understanding: A few concerns came to mind when researching the
> workflow in using OMICS data in Linear Estimators. There seems to be
> a long process before the real life data from MaCH (test.mldose? for
> X

Just to be sure: for each SNP, X contains the dosage or probability data
for that SNP and the covariate data as specified by the researcher.

> and mlinfo? for Y)

Nope, Y is not taken from the mlinfo file. The data from the mlinfo file
is not used in the regression. After the regression is done, the
information in the mlinfo file (e.g. SNP name, chromosome number, base
pair position) is simply copied to the output file.


> that is sitting on files can be used in
> calculations. The first concern is how to obtain the design matrices
> X from the files.

I agree.

>
> It is my understanding that there are two types of data, imputed data
> and databel data.

Almost correct. Imputed genotype data "comes out of" the imputation
software in the form of (possibly zipped) text files: the test.mldose
file (basically N_SNPs x N_ids) and the test.mlinfo file (N_SNPs x ~7).

The filevector/DatABEL file format is simply a way to store the dosage
data in such a (binary) way that we don't need to load a complete text
file into memory.
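
Just to illustrate the idea (this is NOT the real filevector layout,
only a sketch of the concept): if the dosages were stored as a flat
binary matrix of floats, one row per SNP, a single SNP's dosages could
be read with one seek instead of parsing a whole text file:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical flat binary layout: n_snps x n_ids floats, row-major.
std::vector<float> read_snp_dosages(const std::string& path,
                                    std::size_t snp_index,
                                    std::size_t n_ids)
{
    std::vector<float> dosages(n_ids);
    std::ifstream in(path, std::ios::binary);
    // Jump straight to the start of this SNP's row...
    in.seekg(static_cast<std::streamoff>(snp_index * n_ids * sizeof(float)));
    // ...and read the dosages for all individuals in one go.
    in.read(reinterpret_cast<char*>(dosages.data()),
            static_cast<std::streamsize>(n_ids * sizeof(float)));
    return dosages;
}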

FYI: An imputed data set of ~7000 individuals and ~20e6 imputed SNPs
uses 459 GB in DatABEL format; the text-based mlinfo files take up
881 MB, and the gzipped dosage text files take up 59 GB.
The top item on my wishlist is a compressed form of the
filevector/DatABEL files, as you can see from these numbers.

> Either way, data seems to be pre-processed early in
> the workflow;

Actually, there isn't too much preprocessing going on. If we only look
at dosage data, the only thing that needs to be done for each SNP is to
add the dosage data for each individual as a column to the (constant)
matrix of covariate data to form the design matrix X.
Because we want to allow for missing (genotype) data we have added some
routines to get the data without missing values.
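
As an illustration of that per-SNP step (hypothetical names, not
ProbABEL's internals), assuming missing values are encoded as NaN:

#include <Eigen/Dense>
#include <cmath>
#include <vector>

// Append the SNP dosage as the last column of the covariate matrix and
// drop individuals with a missing dosage or phenotype.
void build_design(const Eigen::MatrixXd& covariates,  // n_ids x n_cov
                  const Eigen::VectorXd& dosage,      // n_ids
                  const Eigen::VectorXd& phenotype,   // n_ids
                  Eigen::MatrixXd& X, Eigen::VectorXd& y)
{
    std::vector<Eigen::Index> keep;
    for (Eigen::Index i = 0; i < dosage.size(); ++i)
        if (!std::isnan(dosage[i]) && !std::isnan(phenotype[i]))
            keep.push_back(i);

    const Eigen::Index n_keep = static_cast<Eigen::Index>(keep.size());
    X.resize(n_keep, covariates.cols() + 1);
    y.resize(n_keep);
    for (Eigen::Index r = 0; r < n_keep; ++r) {
        const Eigen::Index i = keep[r];
        X.block(r, 0, 1, covariates.cols()) = covariates.row(i);
        X(r, covariates.cols()) = dosage[i];
        y[r] = phenotype[i];
    }
}

The resulting X and y can then go straight into the regression sketched
earlier.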

> my impression is that this preprocessing is done in R.

Usually only for the creation of the phenotype file. For a single
(non-omics) phenotype like height, disease status, a blood lipid level,
etc., this is easy. The researcher usually already has such a file
(N_IDs rows, one column for the phenotype and a few columns for
covariates like age, sex, age^2, etc.).

Of course, for omics data the number of phenotypes is much larger, but
OmicABEL was developed for that scenario.

> It also seems that R can't handle large amounts of data loaded in
> memory at once.

That is another reason why DatABEL (the R library interface to the
filevector format) was developed.



>
> From what I see, data comes with some irregularities in its values
> (missing values, invalid rows in X/Y matrices), and this makes it
> difficult to use Linear Estimators right away; this is why the
> preprocessing exists.

Correct. Most people use imputed genotype data, so there won't be many
NA's there. On the other hand, since genotype imputation is done
centrally for all genotyped individuals, it is very common to have
missing data in the phenotype file (i.e. Y and covariate data).

> DatABEL seems to be the R tool (implemented in
> C++) that can do fast pre-processing of big sets of data. 

A very common use case after running a GWAS (the genome-wide linear
regression we're talking about) is that a researcher wants to know the
exact dosages for all individuals for his/her top ten most significant
hits. This is when (s)he uses grep to find out in which mlinfo files the
SNPs are located. This is necessary because the genotype data is split
up into several files per chromosome to make handling the files easier
(parallel computation of the linear regression on a multicore cluster is
easy that way: we simply submit one job per 'chunk' and in that way
several chunks run in parallel). Then (s)he starts R and loads the
DatABEL library to read the genotype data from those specific files.

> Well, I
> think that DatABEL only does the reading and writing of files in C++
> (called filevector), 

Correct.

> while the pre-processing functions are defined
> and implemented in R. Am I correct?
>

Not quite. Apart from the one-time conversion of the text files with
(imputed) genotype data to DatABEL format (which is usually done in R,
although the filevector lib also has command line tools (written in
C++) to do this), the end user doesn't do much with DatABEL (for
pre-processing). Within ProbABEL we do some pre-processing (e.g. removal
of individuals without genotype information) and, in the loop over all
SNPs, we combine the genotype information with the other covariates
into the design matrix.

>
> My Problems: This is where my troubles start. Since I am trying to
> make this tool usable for the GenABEL community while still being
> able to handle TERABYTES of data with fast computations, I would
> really like to include the preprocessing of X and Y into my C++
> workflow. To solve the memory and performance limitations of R, I am
> trying to load the data from disk from within C++. Since I am
> performing my estimator function in C++, it expects those matrices to
> have numbers that can be used for computation. Assuming that data
> must be preprocessed to be able to get valid matrices with usable
> numbers, I have the following options:
>
> A) For performance reasons, I was considering having the data already
> pre-processed in disk files. Is this feasible, (preprocessed data
> would take at most as much space in disk as original data, is this
> cumbersome)?
>
> B) If there are only a few preprocessing functions that people use, I
> could re-implement them inside C++ and use them on the fly while
> loading the data from disk. This would be more difficult if everyone
> has their own customized R pre-processing functions.

>
> C) Another alternative is to allow users to use their own R
> pre-processing functions that pre-process the data. I would then go
> about preprocessing on the fly from inside C++ by doing calls back to
> R. This would be slower and harder to do than B).
>
> D) If DatABEL really does all the necessary pre-processing from inside
> C++, I could just directly use it or allow the user to specify what
> to use and won't need to re-implement the pre-processing functions.
> It seems, though, that preprocessing of the data into DatABEL
> filevector format takes from 30 minutes to an hour.

I think it would be a good idea to rethink the DatABEL/filevector
format. As I already mentioned, if we could store the data in a
compressed way (while still retaining good speed and (relatively) low
RAM usage), life for the user would be much better.

>
>
> I would really appreciate any help that would clarify my
> understanding of how the pre-processing of data works and where it
> fits in the work-flow.

If you like, we could set up a Skype call. I think that would help both
of us a lot in understanding each other. Maybe Yurii and Maarten would
like to participate as well?


Thanks again for showing interest in ProbABEL. I think we can learn a
lot from your expertise!


Best regards,

Lennart Karssen
(present maintainer of the ProbABEL package)

>
> Best regards,
>
> - Alvaro Frank
>
>
--
-----------------------------------------------------------------
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org

Please do not send me Word or PowerPoint files!
See http://www.gnu.org/philosophy/no-word-attachments.nl.html
------------------------------------------------------------------
