[GenABEL-dev] databel vs impute2 vs me

Frank, Alvaro Jesus alvaro.frank at rwth-aachen.de
Wed Aug 27 18:56:44 CEST 2014


Hi Lennart,

I wanted to re-introduce the issue of compression, file sizes and formats.

At the moment I am trying to use a file in impute2 format, which seems to consist mostly of 0s and 1s, with the occasional value of the form 0.xxx (three digits after the decimal point).

When converting such a file to DatABEL, the size clearly gets BIGGER: DatABEL has no idea which values are binary and which are not, so it codes everything as floats/doubles, spending 4 bytes per value where impute2's text format needs about 1 byte for a 0 or a 1 (a sketch of that conversion follows below).
Nevertheless, a 7z-compressed version of the DatABEL file can reduce 200 MB to less than 4 MB.
80 MB of impute2 compresses to about 5 MB in gz format and around 3 MB in 7z format.
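The blow-up is easy to see in a toy converter. This is NOT the actual DatABEL conversion code, and it assumes a purely numeric matrix (no impute2 leading annotation columns):

#include <fstream>
#include <iostream>

// Toy text-to-binary converter: every token becomes a 4-byte float, so a
// "1" that costs ~2 bytes as text (digit + separator) ends up as 4 bytes.
int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: txt2float <in.txt> <out.bin>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2], std::ios::binary);
    double v;
    while (in >> v) {
        const float f = static_cast<float>(v);
        out.write(reinterpret_cast<const char*>(&f), sizeof(f));
    }
    return 0;
}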

Compression is already an option for DatABEL as is.

Now to the real issue: compression of data SHOULD NEVER HAPPEN! Decompressing data on the fly in order to analyze it just adds compute overhead (the CPUs are busy decompressing instead of computing).


To deal with output data without compressing it, I developed a small-footprint format for the results, plus a program that reads it and writes human-readable .txt versions (for subsets of the results). The custom binary output is very data-aware: it stores only significant values (the threshold is user-defined), together with the data required to reproduce the entire output, independently of the source data used to produce it. This means that p values, t statistics and the like can be recomputed from the output files alone, so only very minimal data is stored and virtually no compute time is required. As an extra, omicabelnomm also automatically produces a .txt file containing only the significant data (another parameter set by the user). The binary output can then be used to produce new .txt files for different degrees of significance, as long as the data was stored. A sketch of the idea follows below.
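To make that concrete, here is a minimal sketch of what one stored record could look like. This is only an illustration, not the actual omicabelnomm on-disk format:

#include <cstdint>

// Hypothetical record layout: keep the (Y, X) indices, the estimate and
// its standard error; the test statistics can be recomputed from these.
struct ResultRecord {
    uint32_t phe_idx;  // phenotype (Y) index
    uint32_t snp_idx;  // SNP (X) index
    float    beta;     // estimated effect size
    float    se;       // standard error of beta
};

// The t statistic is recovered on demand; the p value then follows from a
// Student-t CDF (e.g. boost::math::students_t) given the residual degrees
// of freedom, which are constant per analysis and need not be stored in
// every record.
inline double t_stat(const ResultRecord& r) {
    return static_cast<double>(r.beta) / r.se;
}

At 16 bytes per record, keeping the 0.1% interesting results of 10^6 tests costs on the order of 16 KB.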

For example, from 1000 phenotypes and 1000 SNPs, 10^6 results are to be computed, and of those only about 0.1% are relevant/significant. The user says: display as .txt only P < 0.05, and store all results with P < 0.1. This is done, and file sizes are minimal. The user then comes back in a week and wants to see only P < 0.0005; those results were stored. He also wants to see P < 0.09, and those were stored too, so in both cases he receives new .txt files in human-readable format. If he wants to see results with P > 0.1, those were not stored... so no luck there. Re-computation should not be an issue, though, as it is FAST.
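Re-extraction at a new threshold is then just a linear scan over the stored records. A sketch, reusing ResultRecord and t_stat from above; it filters on |t| rather than p so that the sketch needs no statistics library (for a fixed residual df, any p cutoff corresponds to a |t| cutoff):

#include <cmath>
#include <cstdio>
#include <vector>

// Write a human-readable .txt with every stored result whose |t| passes
// the new, user-chosen cutoff.
void extract(const std::vector<ResultRecord>& stored, double t_cut,
             std::FILE* txt) {
    std::fprintf(txt, "phe snp beta se t\n");
    for (const auto& r : stored) {
        const double t = t_stat(r);
        if (std::fabs(t) >= t_cut)
            std::fprintf(txt, "%u %u %g %g %g\n",
                         r.phe_idx, r.snp_idx, r.beta, r.se, t);
    }
}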

That is just a sample of how to handle the "big data" problem, which, I insist, is not a problem at all.
The next issue is storing data like the impute2 data I have encountered here.
Is this kind of data normal? Or are there situations where almost every entry (90%+?) is a floating point number? Are 3 digits after the decimal point the maximum impute2 supports?
If so, I can already envision a super "compressed" file format that holds this impute2-like data in megabytes instead of gigabytes/terabytes. What other formats are used for both Y and X (genotypes/phenotypes)? Do they have the same impute2 structure? I know there are non-imputed data types; how do they look?
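A sketch of what I have in mind, assuming the values really are probabilities in [0, 1] with at most 3 decimals: that leaves only 1001 distinct values, so each one fits losslessly in 16 bits (10 bits if bit-packed), i.e. half a float and a quarter of a double, with no decompression step at runtime:

#include <cstdint>

// Fixed-point coding: 0.000 .. 1.000 maps to the integers 0 .. 1000.
inline uint16_t encode(double p) {
    return static_cast<uint16_t>(p * 1000.0 + 0.5);
}
inline double decode(uint16_t code) {
    return code / 1000.0;  // exact round trip for 3-decimal inputs
}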

Hope to commit the new omicabelnomm soon; I will also work on a real-life sample usage.

Thank you for any help on the matter!

-Alvaro

