Absolutely agree, and more than supportive! It would be great to have all the different packages and functions we have work with different types of data via a centralized API. That would be a tremendous help in the development of new methods, and something that would really make the GenA project attractive to other developers. <div>
<br></div><div>Yurii<br><br>On Monday, March 31, 2014, Maarten Kooyman <<a href="mailto:kooyman@gmail.com">kooyman@gmail.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Dear All,<br>
<br>
It might be useful to make the next generation of DatABEL with an interface for IMPUTE2/SHAPEIT and mach/minimac. Having one library/package to read the data would improve usability for all projects. I, for one, am not keen to convert my 1kg imputations into another format. Nobody (from a user's perspective) feels like saving the same hundreds of GB of data in multiple formats. (That is a practical reason for choosing a program to work with, and the program chosen that way might not be the best one.)<br>
<br>
Centralizing these functions would also benefit method developers: they would not have to bother with writing yet another parser. Creating a reliable, fast, multi-format parser is boilerplate work, and that is the kind of code you do not want to bother with if you have a powerful new methodology in mind. That is why a lot of scientific software is picky about its input format. There are of course some problems caused by the nature of the data formats, e.g. [1].<br>
<br>
<br>
Kind regards,<br>
<br>
Maarten<br>
<br>
<br>
<br>
<br>
[1] One problem is that the number of predictors differs between those formats. It varies between 1 and 3, and in the case of IMPUTE2/SHAPEIT the probabilities do not sum to one. mach/minimac output might be converted to 3 predictors, since its probabilities should add up to one.<br>
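<br>
To make the footnote concrete, here is a minimal sketch of how two genotype probabilities could be expanded to three predictors when they are expected to add up to one. The function names and the choice of which allele is counted in the dosage are assumptions for illustration only, not code from any of the packages mentioned.<br>

```cpp
#include <algorithm>
#include <array>

// Hypothetical helper: given P(hom) and P(het), derive the third genotype
// probability under the assumption that all three sum to one.
// The result is clamped at zero to absorb small rounding errors in the input.
std::array<double, 3> expand_to_three(double p_hom, double p_het) {
    const double p_other = std::max(0.0, 1.0 - p_hom - p_het);
    return std::array<double, 3>{{p_hom, p_het, p_other}};
}

// Hypothetical helper: collapse three probabilities into a single predictor,
// the expected count of the allele carried twice by the first genotype.
double dosage(const std::array<double, 3>& p) {
    return 2.0 * p[0] + p[1];
}
```

This shortcut does not apply to input whose probabilities do not sum to one, so a common reader would have to keep all three values for such formats.<br>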
<br>
On 31-03-14 20:46, Yury Aulchenko wrote:<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I personally find the fact that text outperforms binary disappointing (and, if you forget about the technical details, well, strange). On the other hand, this is probably good for the user, as it removes the need to do conversion. Especially if we could work with compressed files. Especially if we build an interface to work with other types of text output (e.g. IMPUTE2 would be a candidate)...<br>
<br>
Yurii<br>
<br>
----------------<br>
Sent from mobile device, please excuse possible typos<br>
<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 28 Mar 2014, at 23:19, "L.C. Karssen" <<a>lennart@karssen.org</a>> wrote:<br>
<br>
Dear all,<br>
<br>
(I guess the previous version of this mail went to the commit email<br>
list, so here it is again for the devel list).<br>
<br>
<br>
Indeed: an impressive speed-up! Well done Maarten.<br>
<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 28-03-14 20:30, Maarten Kooyman wrote:<br>
I tested the speed of ProbABEL on a dataset of 33815 SNPs / 3485 people, adjusted<br>
for sex and age (I did not run it in triplicate, but it gives an idea)<br>
<br>
version 0.42 0.50_branch<br>
FV 58 52<br>
mldose 48 12<br>
All times are in seconds.<br>
<br>
As you can see, the filevector format is the part that slows down the<br>
program. When profiling, the reading from FV takes up 86% of all the time<br>
the program takes.<br>
</blockquote>
<br>
The current problem with reading from filevector is that the fv data is<br>
stored as floats (this is logical, as it means half the disk space usage<br>
compared to storing doubles; moreover, the imputed data is never more<br>
precise than a float anyway).<br>
However, internally ProbABEL uses doubles for calculations. This means<br>
conversion from float to double must occur at some point.<br>
<br>
Simply casting to double gives imprecision. For example, casting the float<br>
0.677 to double gives: 0.67699998617172241<br>
Therefore, with version 0.4.0 I changed this and used a string as an<br>
intermediate form, followed by strtod(). First I used stringstreams, but<br>
these turn out to be much too slow for our use case. Now snprintf() is<br>
used. For the above example the double value is: 0.67700000000000005,<br>
much closer to what we would like to see. Using this two-step conversion<br>
means the output when using fv is equal to the output using txt data<br>
(and equal to using R), within float precision.<br>
<br>
Using Maarten's 'strtod' will speed up this part as well, but the<br>
snprintf() call is still expensive.<br>
<br>
Apart from this two-step conversion we may also be inefficient because<br>
the dosage/probability values are converted one array element at a<br>
time. Maybe we can gain something there, as Maarten did for the txt<br>
format: simply sending a whole 'line'/array to the conversion may help.<br>
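<br>
As a rough sketch of what converting a whole line in one go could look like with strtod(): the function name and the assumption of whitespace-separated values are mine, and this is not the code from the commit.<br>

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert every number on a whitespace-separated line
// in a single pass, using strtod()'s end pointer to advance through the text.
std::vector<double> parse_doubles(const std::string& line) {
    std::vector<double> values;
    const char* cursor = line.c_str();
    char* end = nullptr;
    for (;;) {
        const double value = std::strtod(cursor, &end);
        if (end == cursor) {
            break;  // nothing converted: end of line or a non-numeric token
        }
        values.push_back(value);
        cursor = end;  // continue right after the number just read
    }
    return values;
}
```

Parsing stops silently at the first token that is not a number, so a real reader would need to check that the expected number of values was found.<br>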
<br>
<br>
<br>
<br>
Given that most people nowadays store their imputation results in chunks<br>
of chromosomes anyway (i.e. small(er) files), and the fact that I think<br>
implementing the ability to read gzipped files is not difficult, it may<br>
be time to give mldose.gz files another chance for ProbABEL users. It<br>
will save them the conversion from mldose.gz to DatABEL.<br>
Of course we can still support DatABEL files, but (depending on how fast<br>
reading from gzipped files is), our recommendation could change with the<br>
upcoming ProbABEL v0.5.0.<br>
<br>
Any thoughts on this?<br>
<br>
<br>
Best,<br>
<br>
Lennart.<br>
<br>
<br>
<br>
<br>
<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 28-03-14 20:15, Yury Aulchenko wrote:<br>
10-fold is a good speed-up. An order of magnitude :)<br>
<br>
I wonder how it now compares to reading from plain text files?<br>
<br>
Y<br>
<br>
----------------<br>
Sent from mobile device, please excuse possible typos<br>
<br>
<blockquote style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 28 Mar 2014, at 20:12, <a>noreply@r-forge.r-project.org</a> wrote:<br>
<br>
Author: maartenk<br>
Date: 2014-03-28 20:12:41 +0100 (Fri, 28 Mar 2014)<br>
New Revision: 1664<br>
<br>
Modified:<br>
branches/ProbABEL-0.50/src/gendata.cpp<br>
branches/ProbABEL-0.50/src/gendata.h<br>
Log:<br>
new implementation of reading in numbers of mldose file: this version<br>
is about 10(!) fold faster than in ProbABEL 0.42<br>
<br>
Modified: branches/ProbABEL-0.50/src/gendata.cpp<br>
===================================================================<br>
--- branches/ProbABEL-0.50/src/gendata.cpp 2014-03-27 21:16:16 UTC<br>
(rev 1663)<br>
+++ branches/ProbABEL-0.50/src/gendata.cpp 2014-03-28 19:12:41 UTC<</blockquote></blockquote></blockquote></blockquote></blockquote></blockquote></div><br><br>-- <br>-----------------------------------------------------<br>
Yurii S. Aulchenko<br><div><br></div><div>[ <a href="http://nl.linkedin.com/in/yuriiaulchenko" target="_blank">LinkedIn</a> ] [ <a href="http://twitter.com/YuriiAulchenko" target="_blank">Twitter</a> ] [ <a href="http://yurii-aulchenko.blogspot.nl/" target="_blank">Blog</a> ]</div>
<br>