[GenABEL-dev] multiple ProbABEL's palinear runs

Alvaro Jesus Frank alvaro.frank at rwth-aachen.de
Sun Jul 21 20:28:52 CEST 2013


Dear Lennart,

Thanks for the reply with all the useful information. Perhaps once I have a prototype working (the computational core, excluding real data handling) we could set up the Skype call?
Here are some follow-up questions.
> 
> I'm not sure that that should be a requirement. At the moment the
> workflow is roughly the following:
> 1) prepare phenotype data (e.g. specify covariates, do QC like removing
> outliers, log transformation, etc.). This is done by each researcher
> independently, as they are the experts on their phenotypes.
> Usually only for the creation of the phenotype file. For a single
> (non-omics) phenotype like height, disease status, a blood lipid level,
> etc. this is easy. The researcher usually has these files (N_IDs rows,
> one column for the phenotype and a few columns for covariates like age,
> sex, age^2, etc).
> Of course, for omics data the number of phenotypes is much larger. But
> for that scenario OmicABEL is developed.

The purpose is to go along the lines of OmicABEL, where multiple phenotypes can be used in the computation, but to stay as flexible as possible with respect to the existing ways of storing multiple-phenotype data. I.e.: if the standard already is (for a single phenotype) to have a .txt file for the analysis, simply use those existing files in bulk. If everyone stores this data in their own way, then going the way of OmicABEL would be best, requiring all phenotype files to be re-packaged into the DatABEL format.
If everyone uses the same standard for phenotype files, then I can just support those directly (supporting low memory usage too, as this does not depend on how the data is stored, but on how it is accessed).
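Just to illustrate what I mean by using existing phenotype text files directly, here is a minimal sketch in C++. The file name pheno.txt and the column layout (id, phenotype, then covariates) are only my assumptions, based on your description of N_IDs rows with one phenotype column and a few covariate columns:

// Minimal sketch: read a whitespace-delimited phenotype file
// (header line, then one row per individual: id  phenotype  covariate...).
// The file name and column layout are assumptions for illustration only.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("pheno.txt");               // hypothetical file name
    std::string line;
    std::getline(in, line);                      // skip the header line

    std::vector<std::string> ids;
    std::vector<double> phenotype;               // Y
    std::vector<std::vector<double>> covariates; // rows of XL (without the SNP column)

    while (std::getline(in, line)) {
        std::istringstream ss(line);
        std::string id;
        double y;
        if (!(ss >> id >> y)) continue;          // skip malformed lines
        std::vector<double> row;
        double c;
        while (ss >> c) row.push_back(c);        // remaining columns are covariates
        ids.push_back(id);
        phenotype.push_back(y);
        covariates.push_back(row);
    }
    std::cout << "read " << ids.size() << " individuals\n";
}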

> 2) Imputation of genetic data is done centrally as this is a time
> consuming task,

To my understanding that takes hours, right?

> that only needs to be redone if additional individuals
> have been genotyped or whenever a genomic reference set has been
> updated. This happens roughly once or twice per year.

The data in the files on disk that are used in the computations has already gone through this process, right? (I.e. it is ready to compute.)

> FYI: An imputed data set of ~7000 individuals and ~20e6 imputed SNPs
> uses 459 GB in DatABEL format, the text-based mlinfo files take up 881
> MB and the gzipped dosage text files take up 59GB.
> The top item on my wishlist is a compressed form of the
> filevector/DatABEL files, as you can see from these numbers.

So the DatABEL binary file takes MORE space than the equivalent raw dosage text files *.mldose (when gzipped)?
What about when they are not compressed?
According to my calculations, if there are N = 10^9 entries, then in binary you can store each entry in single precision using 32 bits (4 bytes), for a total of about 3.72 GiB (N*4 bytes). In a raw text file each character requires 1 byte, so storing 9 characters per number already requires at least N*9 bytes, roughly 8.38 GiB, which is more than double the binary size.
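A quick check of that arithmetic (illustrative numbers only, not measurements of your files):

// Back-of-the-envelope comparison of binary vs. plain-text storage sizes.
#include <cstdio>

int main() {
    const double N   = 1e9;                       // number of dosage entries
    const double gib = 1024.0 * 1024.0 * 1024.0;
    double binary = N * 4 / gib;                  // 4 bytes per single-precision float
    double text   = N * 9 / gib;                  // ~9 characters per number, 1 byte each
    std::printf("binary: %.2f GiB, text: %.2f GiB\n", binary, text);
    // prints roughly: binary: 3.73 GiB, text: 8.38 GiB
}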

> Imputed genotype data "comes out of" the imputation
> software in the form of (possibly zipped) text files, the test.mldose
> (basically N_SNPs x N_ids) and test.mlinfo files (N_SNPs x ~7).
> 
> The filevector/DatABEL file format is simply a way to store the dosage
> data in such a (binary) way that we don't need to load a complete text
> file into memory.


If users had the choice, what would they rather have the application do:
a) Use the existing raw text .mldose file(s) they already have, without requiring their entire memory at once (similar to filevector); see the sketch after this list.
b) Force them to transform their files into yet more files in the filevector format, and have the application use those (also with low memory usage).
c) Something else?
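To make option (a) concrete, here is a rough sketch of what I have in mind: stream the existing dosage text file one line at a time, so only a single row is ever resident in memory. The leading label columns before the dosages are an assumption about the .mldose layout:

// Rough sketch of option (a): process an existing dosage text file
// line by line instead of loading the whole file into memory.
// The exact column layout (leading id/label fields, then dosages) is assumed.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("test.mldose");
    std::string line;
    std::size_t n_rows = 0;

    while (std::getline(in, line)) {              // one row at a time, never the whole file
        std::istringstream ss(line);
        std::string id, type;
        ss >> id >> type;                         // assumed leading label columns
        std::vector<double> dosages;
        double d;
        while (ss >> d) dosages.push_back(d);
        // ... hand `dosages` to the regression core here ...
        ++n_rows;
    }
    std::cout << "streamed " << n_rows << " rows\n";
}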


> Actually, there isn't too much preprocessing going on. If we only look
> at dosage data the only thing that needs to be done for each SNP is to
> add the dosage data for each individual as a column to the (constant)
> matrix of covariate data to form the design matrix X.

This is the process that I refer to as X = [ XL | XR ], where the design matrix X is formed from:
- the covariates XL, which are constant (of size N_ids rows by N_covariates columns);
- XR, which is built from the dosage data and is different for each ___ what? (And how? If the dosage data is one big sequence, how do you establish how much of it to take and append to XL to form X?)
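Here is a minimal sketch of how I currently picture this per-SNP step, assuming XR is a single column of dosages (one value per individual) appended to the constant covariate block. All names are illustrative, not ProbABEL's actual code:

// Forming X = [ XL | XR ] for a single SNP: XL is the constant covariate
// matrix (N_ids x N_covariates, including the intercept), XR is the dosage
// column for the current SNP.
#include <vector>

using Matrix = std::vector<std::vector<double>>; // row-major: N_ids rows

// Append the dosage column for one SNP to the constant covariate block.
// Assumes snp_dosage has one entry per row of XL.
Matrix build_design_matrix(const Matrix& XL,
                           const std::vector<double>& snp_dosage) {
    Matrix X = XL;                                // copy the constant part
    for (std::size_t i = 0; i < X.size(); ++i)
        X[i].push_back(snp_dosage[i]);            // XR: one extra column per SNP
    return X;
}

int main() {
    Matrix XL = {{1.0, 45.0}, {1.0, 52.0}};       // intercept + age for 2 individuals
    std::vector<double> dosage = {0.93, 1.47};    // imputed dosages for one SNP
    Matrix X = build_design_matrix(XL, dosage);   // X is now 2 x 3
}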

> Because we want to allow for missing (genotype) data we have added some
> routines to get the data without missing values.
> That is another reasons why DatABEL (the R library interface to the
> filevector format) was developed.

This is already done in that central process that happens only once or twice a year, like you mentioned before, right?
Does the data sitting in the files already account for these missing values?
  
> Most people use imputed genotype data, there won't be many NA's
> there. On the other hand, since genotype imputation is done centrally
> for all genotype individuals, it is very common to have missing data in
> the phenotype file (i.e. Y and covariate data).

How does the processing of the genotype data lead to missing phenotype data?
How is this then corrected? (By the user or by ProbABEL?)

Does this mean that if phenotype data is missing for an individual, then this individual is simply not used in the calculation?
I.e., in the part of the regression where X' * Y is computed, is that calculation simply not performed for them?
Or does "missing data in the phenotype file" mean that Y has missing rows and the data must be dropped/filled in (for non-covariate entries)?

I know that OmicABEL does averaging for missing covariate entries. Is this also done for missing non-covariate entries?
If each phenotype file comes with both covariate data (which is supposed to be constant) and phenotype data, does this mean that the constant data is duplicated on disk?
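To make sure I understand the two treatments I am asking about, here is how I currently picture them: individuals with a missing Y are dropped entirely, and missing covariate entries (marked as NaN) are replaced by the column mean, as I understand OmicABEL does. This is only my assumption, not necessarily what ProbABEL actually does:

// Sketch of the two treatments of missing data discussed above:
// 1) listwise deletion of individuals without a phenotype value,
// 2) mean imputation of remaining missing covariate entries.
#include <cmath>
#include <vector>

struct PhenoData {
    std::vector<double> y;                        // phenotype, one per individual
    std::vector<std::vector<double>> covariates;  // per-individual covariate rows
};

PhenoData handle_missing(const PhenoData& in) {
    PhenoData out;
    // 1) drop individuals whose phenotype is missing
    for (std::size_t i = 0; i < in.y.size(); ++i) {
        if (std::isnan(in.y[i])) continue;
        out.y.push_back(in.y[i]);
        out.covariates.push_back(in.covariates[i]);
    }
    // 2) replace missing covariate entries by the column mean
    if (out.covariates.empty()) return out;
    std::size_t n_cov = out.covariates[0].size();
    for (std::size_t j = 0; j < n_cov; ++j) {
        double sum = 0.0;
        std::size_t n = 0;
        for (const auto& row : out.covariates)
            if (!std::isnan(row[j])) { sum += row[j]; ++n; }
        double mean = (n > 0) ? sum / n : 0.0;
        for (auto& row : out.covariates)
            if (std::isnan(row[j])) row[j] = mean;
    }
    return out;
}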
 
> Not quite. Apart from the one-time only conversion of the text files
> with (imputed) genotype data to DatABEL format (which is done in R
> usually, but the filevector lib also has command line tools (written in
> C++) to do this), the end user doesn't do much with DatABEL (for
> pre-processing). Within ProbABEL we do some pre-processing (e.g. removal
> of individuals without genotype information), 

How do you determine which individuals these are? Does this mean that users leave their phenotypic data uncorrected in the files? (See my previous question.)
So if genotype data is missing for individuals whose Y DOES exist, are those individuals also dropped?

What other data manipulations not part of the regression process are done inside ProbABEL?

> and in the loop over all
> SNPs the combining of the genotype information with the other covariates
> into the design matrix.
> 
This is the formation of X = [ XL | XR ], right?

> The top item on my wishlist is a compressed form of the
> filevector/DatABEL files, as you can see from these numbers.
> 
> I think it would be a good idea to rethink the DatABEL/filevector
> format. As I already mentioned, if we could store the data in a
> compressed way (while still retaining good speed and (relatively) low
> RAM usage life for the user would be much better.

I have looked into this, and there are some solutions for compression of random floating-point data. I am not sure how efficient they are, but my guess is that disk usage can be reduced to around 60-70% of the original. It must be stated that loading data into memory is independent of how it is stored: it is ALWAYS possible to load only parts of a file into memory, be it in filevector format or imputed *.mldose data (see the sketch below).
The routines that DatABEL uses to load data into memory are the only thing that needs to be worked on to support low memory usage, not the format itself.
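As an illustration of what I mean by access being independent of storage: given any flat binary file of single-precision dosages laid out one SNP after another (N_ids values per SNP), a single SNP's data can be loaded with one seek and one read of N_ids * 4 bytes. This is not the actual filevector layout, just a sketch of the access pattern; the file name and layout are assumptions:

// Partial loading from a flat binary dosage file: read only one SNP's block.
#include <fstream>
#include <vector>

std::vector<float> read_snp(std::ifstream& in,
                            std::size_t snp_index,
                            std::size_t n_ids) {
    std::vector<float> dosages(n_ids);
    std::streamoff offset =
        static_cast<std::streamoff>(snp_index) * n_ids * sizeof(float);
    in.seekg(offset, std::ios::beg);              // jump to this SNP's block
    in.read(reinterpret_cast<char*>(dosages.data()),
            dosages.size() * sizeof(float));      // load only N_ids floats
    return dosages;
}

int main() {
    std::ifstream in("dosages.bin", std::ios::binary); // hypothetical file
    std::vector<float> snp0 = read_snp(in, 0, 7000);   // e.g. ~7000 individuals
}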

On another topic, related to OmicABEL: I would like to know to what extent it is used and, if it is not used widely, what the reason is.
What hinders its adoption for multiple-Xr and multiple-Y analyses?

Thanks again for the input!

-Alvaro Frank





