[GenABEL-dev] Large imputations, ProbABEL and DatABEL

Thu Sep 27 13:48:02 CEST 2012

Dear list,

As imputation with large reference sets becomes more common, the output
files keep growing. Until now we have always stored our imputation
results in one file per chromosome. This is becoming increasingly
problematic because we run our imputations on a cluster and therefore
divide the input files into chunks of several Mbase. After imputation we
combine all the dosage files, info files, etc. back into one file per
chromosome. This is time-consuming but, more importantly, very I/O
intensive because it involves merging files by column instead of by
row.
I'd like to start a discussion here on how to best deal with this on the
ProbABEL/DatABEL side of things.

In short these are the steps in our present imputation pipeline:
1) divide input data set into chunks of several Mbase per file
2) run imputations (mostly using MaCH/minimac); this results in several
gzipped files for each chunk for each chromosome
3) combine these output files into one file per type per chromosome
4) convert the dosage files to DatABEL format; these are then used by
ProbABEL for running GWAS.
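
As an illustration of step 1, here is a minimal Python sketch of how
chunk boundaries could be computed. The function name chunk_ranges, the
region length and the chunk size are assumptions made for this example
only; they are not taken from our pipeline. The sketch also ignores any
overlap between neighbouring chunks that an imputation tool may want.

```python
def chunk_ranges(chrom_length, chunk_size):
    """Yield (start, end) base-pair ranges, 1-based and inclusive."""
    start = 1
    while start <= chrom_length:
        # The last chunk is shortened so it never runs past the end.
        end = min(start + chunk_size - 1, chrom_length)
        yield start, end
        start = end + 1


# Illustrative numbers only: a 23 Mbase region split into 5 Mbase chunks.
chunks = list(chunk_ranges(23_000_000, 5_000_000))
for number, (start, end) in enumerate(chunks, start=1):
    print(f"chunk {number}: {start}-{end}")
```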

Step 3 is the bottleneck in terms of I/O. I'd like to get rid of this step.
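
To make the I/O pattern of step 3 concrete, here is a minimal Python
sketch of a column-wise merge. It assumes a dosage layout with one
individual per row, two leading ID columns and then one dosage per SNP,
and it assumes every chunk lists the same individuals in the same order.
The function name merge_chunks_by_column and the file names are
hypothetical. Every output row needs a row from each chunk file, so all
chunk files have to be read in lockstep rather than simply concatenated.

```python
import gzip
import os
import tempfile


def merge_chunks_by_column(chunk_paths, out_path, id_columns=2):
    """Column-bind chunk dosage files that share the same row order."""
    handles = [gzip.open(path, "rt") for path in chunk_paths]
    try:
        with gzip.open(out_path, "wt") as out:
            # zip() pulls one line from every chunk file per output row.
            for rows in zip(*handles):
                fields = [row.split() for row in rows]
                ids = fields[0][:id_columns]
                for other in fields[1:]:
                    if other[:id_columns] != ids:
                        raise ValueError("chunks disagree on individual order")
                dosages = [v for f in fields for v in f[id_columns:]]
                out.write(" ".join(ids + dosages) + "\n")
    finally:
        for handle in handles:
            handle.close()


# Tiny made-up example: two chunks with two individuals each.
with tempfile.TemporaryDirectory() as tmp:
    chunk1 = os.path.join(tmp, "chunk1.dose.gz")
    chunk2 = os.path.join(tmp, "chunk2.dose.gz")
    merged = os.path.join(tmp, "merged.dose.gz")
    with gzip.open(chunk1, "wt") as f:
        f.write("id1 DOSE 0.0 1.0\nid2 DOSE 2.0 0.5\n")
    with gzip.open(chunk2, "wt") as f:
        f.write("id1 DOSE 1.5\nid2 DOSE 0.1\n")
    merge_chunks_by_column([chunk1, chunk2], merged)
    with gzip.open(merged, "rt") as f:
        merged_lines = f.read().splitlines()

print(merged_lines)
```

A row-wise merge, by contrast, would only need to append one file after
another.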

Possible solutions:
A) Keep the imputation output files 'as is', i.e. in chunks, and change
the probabel.pl script to work with the chunked data (and hide the
chunks from the user). This will probably also mean we have to add a
chunk variable to probabel.cfg.
B) Skip step 3, convert each chunk file to DatABEL and then merge the
DatABEL files.

The positive side of option A is that we could add a '--chunk' option to
probabel.pl (like the start and stop chromosome) which would allow it to
be run massively parallel on a cluster, one chunk per CPU core (at the
moment we 'parallelise' it by running each chromosome on a different CPU
core).
A downside is that the sysadmin would need to convert the probabel.cfg
file to a new format that includes a chunk variable.
Furthermore, tools like the extract_snps script we use internally to
extract the dosage for all individuals for a given SNP would need to be
adapted as well. This option also makes using DatABEL less useful
because each chunk file will be small.
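
As a rough idea of what adapting a tool like extract_snps to chunked
data could involve, here is a minimal Python sketch that looks up which
chunk file contains a given SNP position. The chunk index, the file
names and the function chunk_for_position are all hypothetical; the
sketch only assumes that chunks are non-overlapping and sorted by start
position.

```python
import bisect

# Hypothetical chunk index: (start, end, file name), sorted by start.
CHUNK_INDEX = [
    (1, 5_000_000, "chr1.chunk1.dose.gz"),
    (5_000_001, 10_000_000, "chr1.chunk2.dose.gz"),
    (10_000_001, 15_000_000, "chr1.chunk3.dose.gz"),
]
CHUNK_STARTS = [start for start, _end, _name in CHUNK_INDEX]


def chunk_for_position(position):
    """Return the file name of the chunk containing position, or None."""
    # Find the last chunk whose start is <= position.
    i = bisect.bisect_right(CHUNK_STARTS, position) - 1
    if i < 0:
        return None
    start, end, name = CHUNK_INDEX[i]
    return name if start <= position <= end else None


print(chunk_for_position(7_250_000))
```

With such an index, a lookup tool would only have to open the one chunk
file that holds the requested SNP.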

Option B avoids these downsides, but probably increases the CPU and/or
I/O burden when combining the DatABEL files. I haven't tested this yet.

What are your ideas on this?
Thanks for thinking along,

Lennart.

-- 
-----------------------------------------------------------------
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org

Please do not send me Word or PowerPoint files!
See http://www.gnu.org/philosophy/no-word-attachments.nl.html
------------------------------------------------------------------
