[GenABEL-dev] compressed dosage files and Big Data issues
Frank, Alvaro Jesus
alvaro.frank at rwth-aachen.de
Wed May 21 13:01:45 CEST 2014
Hi All,
It has been brought to my attention that the dosage files with imputed data used in regression analyses are usually stored on disk in compressed form. For the tool snptest, and perhaps others, users seem to simply pass the path to the compressed files. Do the other GenABEL tools also decompress the data on the fly before using it in the analysis?
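For illustration, here is a minimal Python sketch of what "decompressing on the fly" means in practice: the gzipped dosage file is streamed line by line through the gzip module, so no uncompressed copy is ever written to disk. The file name and column layout are made up for the example:

    import gzip

    # Hypothetical gzipped dosage file: one SNP per line, the SNP id
    # followed by per-sample dosage values (illustrative layout only).
    dosage_path = "chr1.dose.gz"

    with gzip.open(dosage_path, "rt") as handle:  # "rt" = decompress to text on the fly
        for line in handle:
            fields = line.split()
            snp_id = fields[0]
            dosages = [float(x) for x in fields[1:]]
            # ... feed 'dosages' into the regression for this SNP ...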
It seems to me that the project is meant to tackle the "Big Data" problem, but many aspects of handling the input and output data are being ignored.
The example dataset I am talking about takes up around 1.4 terabytes in compressed form; uncompressed, this grows to roughly 10-20 terabytes. Using the full computational power of a 16-core machine with pigz (parallel gzip), decompressing the data would take about 4 hours. Any other tool, whether in Unix, R, or C, working sequentially would take even longer. I am told that reported waiting times are around 24+ hours to fully decompress the data just to extract a subset of it. I can imagine that a great chunk of the runtime of the tools that do the regression also comes from having to decompress on the fly. The fully uncompressed data is never kept, only partial subsets of it as temporary files. Drives typically offer 4 TB of storage space, so storing 10-20 TB is overwhelming if not done or designed correctly.
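A quick back-of-the-envelope calculation, using only the numbers above (10-20 TB uncompressed, roughly 4 hours of pigz on 16 cores), shows the sustained throughput that decompression alone demands. These are the estimates from this email, not benchmarks:

    uncompressed_tb = (10, 20)   # rough uncompressed size range, as above
    hours = 4.0                  # estimated pigz wall-clock time on 16 cores

    seconds = hours * 3600
    for tb in uncompressed_tb:
        gb_per_s = tb * 1000 / seconds        # sustained output rate needed
        print(f"{tb} TB in {hours} h  ->  ~{gb_per_s:.2f} GB/s written")
    # roughly 0.7-1.4 GB/s of sustained output, before any statistics are computed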
Now consider a regression tool that is supposed to use all the data for a proper GWAS. Such a tool would have to spend days on decompression alone. If an entire research group is meant to use this data, and each member has to share resources on the same system, then decompressing on the fly is even less attractive. The only real alternative is to keep the data uncompressed on disk and spend computational resources only on calculating the regressions. But that runs into the problem of where to store it, since individual drives are small compared to the amount of data.
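To make the sharing argument concrete, a small, purely hypothetical example (the group size is invented; the per-pass cost is the 16-core / 4-hour estimate from above): decompressing on the fly repeats the same work for every user and every pass, while decompressing once and storing the result pays the cost a single time.

    members = 8                      # hypothetical research group size
    core_hours_per_pass = 16 * 4     # 16 cores for ~4 hours, as estimated above

    redundant = members * core_hours_per_pass
    print(f"~{redundant} core-hours if everyone decompresses independently")
    print(f"vs. ~{core_hours_per_pass} core-hours if decompressed once and stored")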
A solution to this aspect of "Big Data" comes from properly designed supercomputing clusters or databases. These systems do not treat the filesystem where files are stored as part of individual drives; they use some form of distributed file system, such as Apache's HDFS. The capacity of the filesystem can then be expanded simply by adding drives to it: ordinary hard drives for bulk storage and PCIe SSDs as a high-speed cache. All of this is transparent to the end user, who only sees a unified filesystem.
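As a sketch of what that transparency means for the tools (not a statement about how GenABEL works today): if the cluster's distributed filesystem is exposed as an ordinary path, for example through a FUSE or NFS gateway mount set up by the administrators, the reading code from the earlier example does not change at all. The mount point and file name below are made up:

    import gzip

    # Hypothetical path on a distributed filesystem mounted at /mnt/hdfs;
    # where the blocks physically live is handled by the filesystem layer.
    dosage_path = "/mnt/hdfs/dosages/chr1.dose.gz"

    with gzip.open(dosage_path, "rt") as handle:
        header = handle.readline()   # reads straight off the cluster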
To solve the "Big Data" problem, such aspects of IT infrastructure and systems like HDFS have to be included in the entire workflow. What is the stance of the GenABEL project on how data is stored and handled?
My recommendation would be to at least have a best-practice document in which these and other aspects of the workflow are included and discussed, so that the use of the computational tools becomes optimal. It is not feasible to tackle big data with faster or easier-to-use computational tools alone, since those tools have to adapt to the data going in and coming out.
Sorry for the long email.
TL;DR: Let's encourage decompressing the data and keeping it on disk using distributed file systems, so as to make the computational tools faster and the workflows more efficient for end users.
http://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/
https://en.wikipedia.org/wiki/Clustered_file_system#Distributed_file_systems