[GenABEL-dev] databel vs impute2 vs me
L.C. Karssen
lennart at karssen.org
Thu Sep 11 01:29:08 CEST 2014
Hi Alvaro,
Sorry for the late reply.
On 27-08-14 18:56, Frank, Alvaro Jesus wrote:
> Hi Lennart,
>
> I wanted to re-introduce the issue of compression, file sizes and formats.
Great! I think it's a fun topic and IIRC we disagreed last time, so lots
of opportunities for a good discussion :-).
>
> At the moment I am trying to use a file in impute2 format, which seems
> to contain mostly 0s and 1s, and every now and then a 0. followed by 3 digits.
Yup.
>
> When converting such a file to databel, the size is clearly BIGGER,
> since (instead of using 1 byte for 0s and 1s, like impute2) DATABEL will
> use 4 bytes. Databel has no idea what is binary and what is not, so it
> encodes everything as floats/doubles.
Indeed.
> Nevertheless, a compressed 7z of the databel format can reduce 200 MB
> to less than 4 MB.
> 80 MB of impute2 gets compressed to 5 MB in gz format and around 3 MB
> in 7z format.
>
> Compression is already an option for databel as is.
So far, we agree :-).
>
> Now to the real issue: compression of data SHOULD NEVER HAPPEN!
> (Decompressing data on the fly in order to analyze it just adds
> compute overhead: CPUs are being used to decompress!)
I think this is a point that you still need to convince me of (I accept
the fact that decompression uses CPU cycles, but I'm not convinced yet
that that is a bad thing).
I haven't yet read the rest of the e-mail, so I may be getting ahead of
things, but I can see that from a computer science/computational
efficiency point of view you are right. However, from the point of view
of a system administrator or a financial decision maker (storage (also
for backups) is expensive) I don't agree with that. The way I see it is
as follows: let's say that OmicABELnoMM is 10× faster than the current
'state of the art' ProbABEL, for example finishing a GWAS in a day
instead of in 1.5 weeks on a given system. If using compressed data
increases computation time by 10% or 25%, I would still be OK with that
if it means I reduce the amount of disk space for a given imputed data
set from 1 TB to e.g. 100 GB.
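To put rough numbers on my own hypothetical example: 24 h x 1.25 = 30 h
for the run on compressed data, versus roughly 10 days with ProbABEL, so
the speed-up drops from 10x to about 8x, while disk usage drops by a
factor of 10.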
Moreover, if you also use DatABEL format files to store output data, the
advantage of the decreased file size is even bigger. For example, an
imputed data set you probably back up only once, but user data changes
more often and thus will consume much more backup space in a scheme with
daily, weekly, monthly incremental backups.
But that's just to give you an idea of my current point of view. I'll
read on to see what's waiting there for me.
>
>
> To deal with (not using compressed) output data I developed a
> small-footprint format for the data and a program that reads it and
> outputs .txt human-readable versions of the results (for subsets of
> the results). The custom binary version of the output is data-aware
> and stores only significant values (user defined), as well as the
> data required to reproduce the entire output, independently of the
> source data used to produce it. This means that p-values, t-statistics
> and such can be recomputed from the output files, and only very
> minimal data is stored and virtually no compute time is required. As
> an extra, a .txt file is also produced automatically by omicabelnomm
> which contains only significant data (another parameter set by the
> user). The output binary data can then be used to produce new .txt
> files according to different degrees of significance, as long as the
> data has been stored.
That sounds very interesting! So just to see if I understand you:
OmicABELnoMM produces in principle two files:
- a small text file with significant hits (at a user-definable threshold
T_1)
- a 'reasonably' sized binary file containing significant values at
another user-definable threshold T_2. This file contains all data to
create new text files with the results at any threshold T < T_2 (if I
understand your example below correctly).
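If that is indeed how it works, regenerating a text file at a stricter
threshold is just a linear scan over the stored records. A minimal
sketch of what I imagine, in C++ (the record layout below is purely
hypothetical, I have not looked at your actual format; p is recomputed
here with a normal approximation):

    #include <cstdio>
    #include <cstdint>
    #include <cmath>

    // Hypothetical record layout -- NOT OmicABELnoMM's actual on-disk format.
    struct Record {
        uint32_t snp_idx;   // index into the SNP name list
        uint32_t phe_idx;   // index into the phenotype name list
        float    beta;      // effect size estimate
        float    se;        // its standard error
    };

    // Re-create a text file with all stored results below 'threshold'.
    void refilter(const char *binfile, const char *txtfile, double threshold)
    {
        FILE *in  = std::fopen(binfile, "rb");
        FILE *out = std::fopen(txtfile, "w");
        if (!in || !out) return;
        Record r;
        while (std::fread(&r, sizeof r, 1, in) == 1) {
            double t = r.beta / r.se;                            // t-statistic, recomputed
            double p = std::erfc(std::fabs(t) / std::sqrt(2.0)); // two-sided, normal approx.
            if (p < threshold)
                std::fprintf(out, "%u\t%u\t%g\t%g\t%g\n",
                             r.snp_idx, r.phe_idx, r.beta, r.se, p);
        }
        std::fclose(in);
        std::fclose(out);
    }

Since this only touches the (small) output file, it should indeed be
fast even for millions of stored records.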
>
> For example, from 1000 Phe and 1000 SNP, 10^6 results are meant to be
> computed. From those, only 0.1% are relevant/significant. The user says:
> display as txt only P < 0.05 and store all results with P < 0.1. This
> is done. File sizes are minimal. The user then comes in a week and wants
> to see not only what he had but perhaps only P < 0.0005. These results
> were stored. He also wants to see P < 0.9
Do you mean 0.09 here? Because only data with P < 0.1 was stored.
> and those were stored too, so for
> both cases he receives new .txts in human-readable format. If he wants
> to see all results with P > 0.1, those were not stored... so no luck
> there. Re-computation should not be an issue as it is FAST.
That sounds very convincing, I must say. Can you give me an indication
of the compute times we are talking about (i.e. what is FAST)? For
example, how fast would the above 1000×1000 analysis run in your case?
And what would the cost (in computation time) be should compression be
added (or is that too difficult to estimate without a proper
implementation)?
>
> That is just a sample of how to handle the "big data" problem, which,
> I insist, is not a problem at all.
> The next issue is storing data like the impute2 data I have
> encountered here.
> Is this kind of data normal? Or are there situations where EVERY entry
> (90%+?) is a floating-point number?
> Are 3 digits after the decimal point the maximum impute2 supports?
I haven't checked with Impute2, but Mach and minimac (two other programs
used for genetic imputation) indeed only output 3 decimals. From an
experimental precision point of view that is enough. Even if you assume
that the genotyping + imputation process is perfect (or has e.g. 1e-9
precision), most (if not all) phenotype measurements are much less
precise. For example, nobody measures human height in mm, and
concentrations of HDL cholesterol are measured with only two or three
significant digits.
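Just to illustrate how little information three decimals carry (this is
only a sketch, not something DatABEL currently does): a dosage in [0, 2]
with three decimals has only 2001 distinct values, so it fits in two
bytes instead of a 4-byte float or an 8-byte double:

    #include <cstdint>
    #include <cmath>

    // Dosage in [0, 2] with 3 decimals -> 16-bit fixed point (2001 distinct values).
    uint16_t pack_dosage(double d)     { return static_cast<uint16_t>(std::lround(d * 1000.0)); }
    double   unpack_dosage(uint16_t q) { return q / 1000.0; }

    // Round trip: pack_dosage(1.234) == 1234, unpack_dosage(1234) == 1.234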
> If so, I can already envision a super "compressed" file format to
> contain this impute2 like data with megabytes instead of
> gygabytes/terabytes. What other formats are used for bot Y and X?
> (genotypes/phenotypes) Do they have same impute2 structure?
Two other commonly used tools for genetic imputation are MaCH and its
newer sibling minimac [1] and Beagle [2]. Currently I'd say that minimac
and Impute2 are used the most.
The ProbABEL example files (.mldose, .mlprob and .mlinfo) are typical
examples of the MaCH/minimac formats. Rows are individuals and columns
contain SNP data (dosage or probabilities), all with ~3 digits after the
decimal point. By default minimac outputs these as gzipped text files.
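To make that concrete, the first lines of an .mldose file look roughly
like this (quoting from memory; the files in the ProbABEL examples
directory are the authoritative version):

    id1->id1 MLDOSE 1.994 0.001 0.897 ...
    id2->id2 MLDOSE 2.000 0.567 1.004 ...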
> I know there are non-imputed data types; how do they look?
I guess with non-imputed data types you mean what we call (measured)
genotype data. This is the type of data that comes from the biochemical
process of determining the genotypes (DNA bases) of a given individual
(see below for some more info). Incidentally, this type of measured
genotype data serves as input for the imputation process.
The files resulting from this process (after quality control) can be
stored in various formats. Typical dimensions would be 100 to 10000
people and 2e5 to 2e6 SNPs.
One format would be SNPs as rows, individuals as columns and each entry
would be AA or AC or TG or any other combination of two letters/DNA
bases A, C, T and G. In case a call cannot be made for a given person
and SNP, that entry is marked as missing.
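For instance, such a file could look like this (the exact code for
missing data varies between tools, e.g. '--', '00' or 'NN'):

    rs1234  AA  AC  CC  --  AC
    rs5678  TG  TT  GG  TT  --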
Another very common set of formats are the Plink (a tool [3]) formats.
There are three file formats, each encoding the same information:
- .ped files have people as rows, SNPs as columns and the first 6
columns contain additional information like person and family IDs, IDs
of the parents and sex.
- .tped is the transposed version of the above file, so SNPs as rows and
people as columns
- .bed files are the binary version of the above (either SNP major or
person major), see [4] or the specs.
And lastly the GenABEL format, i.e. the binary format of R objects of
the gwaa.data-class, which uses two bits to encode the four genotype
options (AA, AB, BB, missing) for a given person at a given DNA location.
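Since both the Plink .bed format and the GenABEL format boil down to
packing four genotypes into one byte, here is a generic sketch of that
idea in C++ (the actual bit codes differ between the two formats, so
treat the 0..3 values as placeholders):

    #include <cstdint>
    #include <vector>

    // Pack genotypes coded 0..3 (e.g. AA, AB, BB, missing), four per byte.
    std::vector<uint8_t> pack(const std::vector<uint8_t> &geno)
    {
        std::vector<uint8_t> out((geno.size() + 3) / 4, 0);
        for (size_t i = 0; i < geno.size(); ++i)
            out[i / 4] |= (geno[i] & 0x3) << (2 * (i % 4));
        return out;
    }

    // Recover genotype i from the packed buffer.
    uint8_t unpack(const std::vector<uint8_t> &packed, size_t i)
    {
        return (packed[i / 4] >> (2 * (i % 4))) & 0x3;
    }

That is a factor of 4 smaller than one byte per genotype and a factor of
16 smaller than a double, without any compression involved.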
A bit more background: this genotyping process is done on so-called
genotyping arrays, which contain roughly 1e6 SNPs per person. The
lab/machine measures fluorescence intensities. These intensity values
(usually between 0 and 1) are plotted as a 2D scatter plot, see for
example http://urr.cat/cnv/im1.jpg. There you see three groups of dots.
Each dot is the intensity data for one individual. This plot shows all
individuals for one SNP. The three groups are the three possible
genotypes. If at the DNA location of that SNP people can have an A or a
C there are three options: AA, AC or CC.
If the three clusters are well separated and all dots (people) fall well
into a cluster, confident calls (AA, AC or CC) can be made for each
person. However, if the data looks like plot A at
http://www.biomedcentral.com/content/figures/1471-2164-13-140-1-l.jpg
making good genotype calls is difficult/impossible. Or, for example, the
red dot in figure D: is it a good call or just one spurious measurement?
That is why after these measurements various QC steps are taken and the
resulting data are confident calls (no uncertainty).
And, just to give you a taste of what other stuff there is: another way
of measuring genotype data is through NGS (Next-Generation Sequencing).
With this method (nearly) all 3e9 base pairs of the human DNA can be
measured. But depending on the method accuracy can vary, so the genotype
call at a given location is usually accompanied by a quality metric.
Just to give you an idea: storing the intermediate data from this
process for 1300 people and 30e6 genotypes used 14 TB. Consequently,
people do a lot of filtering and quality control, reducing the file size
and actually ending up with files in the aforementioned Plink format
(thus losing all uncertainty information!).
But let's not go into this, because that's a completely different topic
and too much for an e-mail discussion. If you'd like to know more a call
would be better.
>
> Hope to commit the new omicabelnomm soon and will work on a real-life
> sample usage too.
That's splendid news! Looking forward to seeing/discussing the results.
>
> Thank you for any help on the matter!
>
Hope this helps! If not, let me know.
And, just to summarise my view of the compress or not compress discussion:
- I think your solution for the output data is a good one.
- As for the input data (imputed genetic data), I still think that
compression can help there (not for the computations, but to reduce disk
space usage).
One more thing to note is that neither DatABEL nor your binary format
takes care of endianness. So people on different architectures may run
into problems. Nowadays Apple's Macs no longer use PowerPC CPUs, but in
the future we may see ARM processors coming up (which are bi-endian
IIRC). So that may be something to keep in mind.
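For what it is worth, a runtime check plus a byte swap is not much code,
but it does have to be there somewhere; a sketch (not something DatABEL
currently does):

    #include <cstdint>
    #include <cstring>
    #include <utility>

    // True if this machine stores the least significant byte first.
    bool is_little_endian()
    {
        const uint16_t one = 1;
        uint8_t first;
        std::memcpy(&first, &one, 1);
        return first == 1;
    }

    // Byte-swap a 4-byte float read from a file written on the other endianness.
    float swap_float(float f)
    {
        uint8_t b[4];
        std::memcpy(b, &f, 4);
        std::swap(b[0], b[3]);
        std::swap(b[1], b[2]);
        std::memcpy(&f, b, 4);
        return f;
    }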
This is the right time to plug my idea of using the HDF5 format again
(or maybe the BioHDF subproject). It has several advantages:
- its hierarchical (by definition) nature allows it to be
self-describing, so understanding what information (e.g. phenotype,
measured genotypes, imputed genotypes) is stored where in the file is easy.
- allows compression (with various backends like gzip or LZ4),
- takes care of endianness,
- has C, C++, Python, Matlab and R bindings (and more)
- has an MPI interface that allows both parallel writing and reading
- is developed and maintained by a non-profit organisation
- is used by many institutions that have large data sets, e.g. NASA, so
it's proven technology.
Unfortunately, I haven't had the time to do proper performance testing,
but maybe you could have a look at it (I guess the MPI part is the most
relevant to your expertise) and tell me what you think.
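To give an idea of what that looks like in practice, here is an untested
sketch that writes a chunked, gzip-compressed dosage matrix with the
plain C API (the dataset name "/dosage" is just a placeholder):

    #include <hdf5.h>

    // Write an nsnp x nind matrix of float dosages, one SNP (row) per chunk.
    void write_dosages(const char *fname, const float *data,
                       hsize_t nsnp, hsize_t nind)
    {
        hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        hsize_t dims[2]  = { nsnp, nind };
        hsize_t chunk[2] = { 1,    nind };
        hid_t space = H5Screate_simple(2, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        H5Pset_deflate(dcpl, 6);          // gzip compression, level 6

        // Stored as little-endian IEEE floats; HDF5 converts from the native type.
        hid_t dset = H5Dcreate2(file, "/dosage", H5T_IEEE_F32LE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
    }

Reading a single SNP back is then a hyperslab selection on that dataset,
and decompression happens per chunk, so you never have to inflate the
whole matrix at once.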
Lennart.
[1] http://genome.sph.umich.edu/wiki/Minimac,
http://www.sph.umich.edu/csg/abecasis/MaCH/tour/imputation.html
[2] http://faculty.washington.edu/browning/beagle/beagle.html and the
manual for a description of the file formats:
http://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf
[3] http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped
[4] http://pngu.mgh.harvard.edu/~purcell/plink/binary.shtml
> -Alvaro
>
>
--
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands
lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-