[GenABEL-dev] [Genabel-commits] r1664 - branches/ProbABEL-0.50/src

Yurii Aulchenko yurii.aulchenko at gmail.com
Wed Apr 9 22:20:40 CEST 2014


Absolutely agree. More than supportive! It would be great to have all the
different packages and functions we have working with different types of
data via a centralized API. That would be a tremendous help in the
development of new methods, and something which would really make the
GenABEL project attractive for other developers.

Yurii

On Monday, March 31, 2014, Maarten Kooyman <kooyman at gmail.com> wrote:

> Dear All,
>
> It might be useful to make the next generation of DatABEL with an
> interface for IMPUTE2/SHAPEIT and mach/minimac. Having one library/package
> to read the data would improve the usability of all projects. I am not keen
> on converting my 1kg imputations into yet another format. Nobody (from a
> user perspective) feels like saving the same hundreds of GB of data in
> multiple formats. (And that is a practical reason for choosing a program to
> work with, which might not be the best program.)
>
> Centralizing these functions would also benefit method developers: they
> would not have to bother with writing another parser. Creating a reliable,
> fast, multi-format parser is boilerplate code, and not the kind of code you
> want to bother with if you have a powerful new methodology in mind.
> That is why lots of scientific software is picky about its input format.
> There are of course some problems caused by the nature of the data
> formats, e.g. [1].
>
>
> Kind regards,
>
> Maarten
>
>
>
>
> [1] One problem is that the number of predictors differs between those
> formats. It varies between 1 and 3, and in the case of IMPUTE2/SHAPEIT
> the probabilities do not sum to one. mach/minimac data might be converted
> to 3 predictors since its probabilities should add up to one.
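To illustrate the footnote above, here is a minimal sketch of expanding two
genotype probabilities into three. It assumes a two-column input holding
P(A1A1) and P(A1A2) that sum to at most one, and a dosage that counts A1
alleles. The function names are hypothetical and not part of ProbABEL or
DatABEL.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical helper: derive the third genotype probability from two,
// assuming the three probabilities should add up to one.
std::array<double, 3> expand_to_three(double p_hom, double p_het)
{
    assert(p_hom >= 0.0 && p_het >= 0.0);
    // Clamp at zero so rounding in the input cannot yield a negative value.
    double p_other = std::max(0.0, 1.0 - p_hom - p_het);
    return std::array<double, 3>{{p_hom, p_het, p_other}};
}

// Hypothetical helper: expected allele count (dosage) from three
// probabilities, assuming p[0] is the homozygote that is counted twice.
double dosage_from_probs(const std::array<double, 3>& p)
{
    return 2.0 * p[0] + p[1];
}
```

Going the other way (from a single dosage value to three probabilities) is
not possible without extra assumptions, which is part of why a common
interface across these formats needs some care.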
>
> On 31-03-14 20:46, Yury Aulchenko wrote:
>
> I personally find the fact that text outperforms binary disappointing
> (and, if you forget about technical details - well, strange). On the other
> hand this is probably good for the user as it eliminates the need for
> conversion. Especially if we could work with compressed files. Especially
> if we build an interface to work with other types of text output (e.g.
> IMPUTE2 would be a candidate)...
>
> Yurii
>
> ----------------
> Sent from mobile device, please excuse possible typos
>
>  On 28 Mar 2014, at 23:19, "L.C. Karssen" <lennart at karssen.org> wrote:
>
> Dear all,
>
> (I guess the previous version of this mail went to the commit email
> list, so here it is again for the devel list).
>
>
> Indeed: an impressive speed-up! Well done Maarten.
>
>  On 28-03-14 20:30, Maarten Kooyman wrote:
> I tested the speed of ProbABEL on a dataset of 33815 SNPs / 3485 people,
> adjusted for sex and age (I did not run it in triplicate, but it gives an
> idea):
>
> version   0.42   0.50_branch
> FV        58     52
> mldose    48     12
>
> All times are in seconds.
>
> As you can see, the filevector format is the part that slows down the
> program. When profiling, reading from FV takes up 86% of the total run
> time of the program.
>
>
> The current problem with reading from filevector is that the fv data is
> stored as floats (this is logical as it means half the disk space usage
> compared to storing doubles; moreover, the imputed data is never more
> precise than a float anyway).
> However, internally ProbABEL uses doubles for calculations. This means
> conversion from float to double must occur at some point.
>
> Simply casting to double gives imprecision. For example, casting the float
> 0.677 to double gives: 0.67699998617172241
> Therefore, with version 0.4.0 I changed this and used a string as
> intermediate form, followed by strtod(). First I used stringstreams, but
> these turn out to be much too slow for our use case. Now snprintf() is
> used. For the above example the double value is: 0.67700000000000005,
> much closer to what we would like to see. Using this two-step conversion
> means the output when using fv is equal to the output using txt data
> (and equal to using R), within float precision.
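The two-step conversion described above can be sketched roughly as follows.
This is a minimal illustration, not the actual ProbABEL code: the function
name, the buffer size and the "%.7g" format (7 significant digits) are
assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: convert a float to a double by going through a short
// decimal string, so that the result is the double nearest to that decimal
// value instead of the exact binary value of the float.
double float_to_double_via_string(float value)
{
    char buffer[32];
    // Step 1: print the float with 7 significant digits (an assumption).
    int written = std::snprintf(buffer, sizeof(buffer), "%.7g", value);
    assert(written > 0 && written < static_cast<int>(sizeof(buffer)));
    // Step 2: parse the decimal string back, this time as a double.
    return std::strtod(buffer, nullptr);
}
```

With this sketch, float_to_double_via_string(0.677f) gives the same double as
the literal 0.677, while static_cast<double>(0.677f) does not. The price is
one snprintf() and one strtod() call per value.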
>
> Using Maarten's 'strtod' will speed up this part as well, but the
> snprintf() call is still expensive.
>
> Apart from this two-step conversion we may also be inefficient because
> the dosage/probability values are converted one array element at a
> time. Maybe we can gain something there, as Maarten did for the txt
> format: simply sending a whole 'line'/array to the conversion may help.
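As an illustration of converting a whole line in one pass, here is a minimal
sketch that walks a text line with strtod(). It is not the code from r1664:
the function name is hypothetical, and it assumes the leading ID columns of
the line have already been removed so that only numbers remain.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: parse all whitespace-separated numbers on one line.
// strtod() skips leading whitespace and reports where it stopped, so the
// same buffer can be walked from start to end without copying tokens.
std::vector<double> parse_dose_line(const std::string& line)
{
    std::vector<double> values;
    const char* cursor = line.c_str();
    char* end = nullptr;
    for (;;) {
        double value = std::strtod(cursor, &end);
        if (end == cursor) {
            break;  // nothing more could be parsed as a number
        }
        values.push_back(value);
        cursor = end;
    }
    return values;
}
```

Parsing stops at the first token that is not a number, so a caller that needs
strict validation would have to check that the whole line was consumed.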
>
>
>
>
> Given that most people nowadays store their imputation results in chunks
> of chromosomes anyway (i.e. small(er) files), and the fact that I think
> implementing the ability to read gzipped files is not difficult, it may
> be time to give mldose.gz files another chance for ProbABEL users. It
> will save them the conversion from mldose.gz to DatABEL.
> Of course we can still support DatABEL files, but (depending on how fast
> reading from gzipped files is), our recommendation could change with the
> upcoming ProbABEL v0.5.0.
>
> Any thoughts on this?
>
>
> Best,
>
> Lennart.
>
>
>
>
>
>  On 28-03-14 20:15, Yury Aulchenko wrote:
> A 10-fold speed-up is good. An order of magnitude :)
>
> I wonder how it now compares to reading from plain text files?
>
> Y
>
> ----------------
> Sent from mobile device, please excuse possible typos
>
>  On 28 Mar 2014, at 20:12, noreply at r-forge.r-project.org wrote:
>
> Author: maartenk
> Date: 2014-03-28 20:12:41 +0100 (Fri, 28 Mar 2014)
> New Revision: 1664
>
> Modified:
>    branches/ProbABEL-0.50/src/gendata.cpp
>    branches/ProbABEL-0.50/src/gendata.h
> Log:
> new implementation of reading in numbers from an mldose file: this version
> is about 10(!) times faster than in ProbABEL 0.42
>
> Modified: branches/ProbABEL-0.50/src/gendata.cpp
> ===================================================================
> --- branches/ProbABEL-0.50/src/gendata.cpp    2014-03-27 21:16:16 UTC
> (rev 1663)
> +++ branches/ProbABEL-0.50/src/gendata.cpp    2014-03-28 19:12:41 UTC
>
>

-- 
-----------------------------------------------------
Yurii S. Aulchenko

[ LinkedIn <http://nl.linkedin.com/in/yuriiaulchenko> ] [
Twitter<http://twitter.com/YuriiAulchenko>] [
Blog <http://yurii-aulchenko.blogspot.nl/> ]