[GenABEL-dev] [Genabel-commits] r1664 - branches/ProbABEL-0.50/src
L.C. Karssen
lennart at karssen.org
Mon Mar 31 23:28:56 CEST 2014
Hi Yurii,
On 31-03-14 20:46, Yury Aulchenko wrote:
> I personally find the fact that text outperforms binary disappointing
Me too...
One way out of the dilemma could be if we truncate the output to e.g. 4
significant digits. I don't think the output of mach/minimac will be
more precise.
I gave it a try, but it seems that the LRT-based chi^2 values don't play
along nicely: they differ at the 1e-3 or 1e-4 level in the example data.
My hunch is that a careful look at how we calculate the LRT may solve
this, as subtraction of two values is a likely candidate for loss of
precision.
> (and, if you forget about technical details - well, strange). On the
> other hand this is probably good for user as it eradicates the need
> to do conversion. Especially if we could work with compressed files.
That's my hope as well. Of course I haven't tested the impact of reading
zipped files yet. Maybe it'll put us back in square one.
> Especially if we build interface to work with other type of text
> outputs (e.g. IMPUTE2 would be a candidate)...
Indeed.
Lennart.
>
> Yurii
>
> ----------------
> Sent from mobile device, please excuse possible typos
>
>> On 28 Mar 2014, at 23:19, "L.C. Karssen" <lennart at karssen.org> wrote:
>>
>> Dear all,
>>
>> (I guess the previous version of this mail went to the commit email
>> list, so here it is again for the devel list).
>>
>>
>> Indeed: an impressive speed-up! Well done Maarten.
>>
>>> On 28-03-14 20:30, Maarten Kooyman wrote:
>>> I tested speed of ProbABEL on a dataset 33815 snp / 3485 people adjusted
>>> for sex and age (I did not run it in triplet but gives an idea)
>>>
>>> version 0.42 0.50_branch
>>> FV 58 52
>>> mldose 48 12
>>> all times ate in seconds.
>>>
>>> As you can see the filevector format in the part that slows down the
>>> program. When profiling the reading from FV takes up 86% of all the time
>>> the program takes.
>>
>>
>> The current problem with reading from filevector is that the fv dat ais
>> stored in floats (this is logical as it means half the disk space usage
>> compared to storing doubles, moreover, the imputed data is never more
>> precise than a float anyway).
>> However, internally ProbABEL uses doubles for calculations. This means
>> conversion from float to double must occur at some point.
>>
>> Simply casting to double gives impression. For example casting a float
>> 0.677 to double gives: 0.67699998617172241
>> Therefore, with version 0.4.0 I changed this and used a string as
>> intermediate form, followed by strtod(). First I used stringstreams, but
>> these turn out to be much too slow for our use case. Now snprintf() is
>> used. For the above example the double value is: 0.67700000000000005,
>> much closer to what we would like to see. Using this two-step conversion
>> means the output when using fv is equal to the output using txt data
>> (and equal to using R), within float precision.
>>
>> Using Maarten's 'strtod' will speed up this part as well, but the
>> snprintf() call is still expensive.
>>
>> Apart from this two-step conversion we may also be inefficient because
>> the dosage/probability values are converted one array element at the
>> time. Maybe we can gain something there, like Maarten did for the txt
>> format and simply sending a whole 'line'/array to the conversion may help.
>>
>>
>>
>>
>> Given that most people nowadays store their imputation results in chunks
>> of chromosomes anyway (i.e. small(er) files), and the fact that I think
>> implementing the ability to read gziped files is not difficult, it may
>> be time to give mldose.gz files another chance for ProbABEL users. It
>> will save them the conversion from mldose.gz to DatABEL.
>> Of course we can still support DatABEL files, but (depending on how fast
>> reading from gzipped files is), our recommendation could change with the
>> upcoming ProbABEL v0.5.0.
>>
>> Any thoughts on this?
>>
>>
>> Best,
>>
>> Lennart.
>>
>>
>>
>>
>>
>>>> On 28-03-14 20:15, Yury Aulchenko wrote:
>>>> 10 fold is good speed up. An order of magnitude :)
>>>>
>>>> Wonder how it compares now to the reading from plain text files?
>>>>
>>>> Y
>>>>
>>>> ----------------
>>>> Sent from mobile device, please excuse possible typos
>>>>
>>>>> On 28 Mar 2014, at 20:12, noreply at r-forge.r-project.org wrote:
>>>>>
>>>>> Author: maartenk
>>>>> Date: 2014-03-28 20:12:41 +0100 (Fri, 28 Mar 2014)
>>>>> New Revision: 1664
>>>>>
>>>>> Modified:
>>>>> branches/ProbABEL-0.50/src/gendata.cpp
>>>>> branches/ProbABEL-0.50/src/gendata.h
>>>>> Log:
>>>>> new implementation of reading in numbers of mldose file: this version
>>>>> is about a 10(!) fold faster than in ProABEL 0.42
>>>>>
>>>>> Modified: branches/ProbABEL-0.50/src/gendata.cpp
>>>>> ===================================================================
>>>>> --- branches/ProbABEL-0.50/src/gendata.cpp 2014-03-27 21:16:16 UTC
>>>>> (rev 1663)
>>>>> +++ branches/ProbABEL-0.50/src/gendata.cpp 2014-03-28 19:12:41 UTC
>>>>> (rev 1664)
>>>>> @@ -40,58 +40,69 @@
>>>>> #endif
>>>>> #include "utilities.h"
>>>>>
>>>>> -double mldose_strtod(const char *str_pointer) {
>>>>> - // This function is inspired on some answers found at
>>>>> stackoverflow :
>>>>> - // eg question 5678932
>>>>> - int sign = 0;
>>>>> - double result = 0;
>>>>> - //check if not a null pointer or NaN (right now checks only
>>>>> first character)
>>>>> -//TODO: make catching of NaN more rigid
>>>>> - if (!*str_pointer | *str_pointer == 'N'){
>>>>> - return std::numeric_limits<double>::quiet_NaN();
>>>>> +
>>>>> +void gendata::mldose_line_to_matrix(int k,const char
>>>>> *all_numbers,int amount_of_numbers){
>>>>> + int j = 0;
>>>>> + //check if not a null pointer
>>>>> + if (!*all_numbers){
>>>>> + perror("Error while reading genetic data (expected pointer
>>>>> to char but found a null pointer)");
>>>>> + exit(EXIT_FAILURE);
>>>>> }
>>>>> - //skip whitespace
>>>>> - while (*str_pointer == ' ')
>>>>> + while (j<amount_of_numbers)
>>>>> {
>>>>> - str_pointer++;
>>>>> - }
>>>>> - //set sign to -1 if negative: multiply by sign just before return
>>>>> - if (*str_pointer == '-')
>>>>> - {
>>>>> - str_pointer++;
>>>>> - sign = -1;
>>>>> - }
>>>>> - //read digits before dot
>>>>> - while (*str_pointer <= '9' && *str_pointer >= '0'){
>>>>> - result = result * 10 + (*str_pointer++ - '0');
>>>>> - }
>>>>> - //read digit after dot
>>>>> - if (*str_pointer == '.')
>>>>> - {
>>>>> - double decimal_counter = 1.0;
>>>>> - str_pointer++;
>>>>> - while (*str_pointer <= '9' && *str_pointer >= '0')
>>>>> + double result = 0;
>>>>> + //skip whitespace
>>>>> + while (*all_numbers == ' ')
>>>>> {
>>>>> - decimal_counter *= 0.1;
>>>>> - result += (*str_pointer++ - '0') * decimal_counter;
>>>>> + all_numbers++;
>>>>> }
>>>>> + //check NaN (right now checks only first character)
>>>>> + //TODO: make catching of NaN more rigid
>>>>> + if (*all_numbers == 'N')
>>>>> + {
>>>>> + result = std::numeric_limits<double>::quiet_NaN();
>>>>> + //skip other characters of NaN
>>>>> + while ((*all_numbers == 'a') | (*all_numbers == 'N'))
>>>>> + {
>>>>> + all_numbers++;
>>>>> + }
>>>>> + }
>>>>> + else
>>>>> + {
>>>>> + int sign = 0;
>>>>> + //set sign to -1 if negative: multiply by sign just
>>>>> before return
>>>>> + if (*all_numbers == '-')
>>>>> + {
>>>>> + all_numbers++;
>>>>> + sign = -1;
>>>>> + }
>>>>> + //read digits before dot
>>>>> + while (*all_numbers <= '9' && *all_numbers >= '0')
>>>>> + {
>>>>> + result = result * 10 + (*all_numbers++ - '0');
>>>>> + }
>>>>> + //read digit after dot
>>>>> + if (*all_numbers == '.')
>>>>> + {
>>>>> + double decimal_counter = 1.0;
>>>>> + all_numbers++;
>>>>> + while (*all_numbers <= '9' && *all_numbers >= '0')
>>>>> + {
>>>>> + decimal_counter *= 0.1;
>>>>> + result += (*all_numbers++ - '0') * decimal_counter;
>>>>> + }
>>>>> + }
>>>>> + //correct for negative number
>>>>> + if (sign == -1)
>>>>> + {
>>>>> + result = sign * result;
>>>>> + }
>>>>> + }
>>>>> + G.put(result, k, j);
>>>>> + j++;
>>>>> }
>>>>> - //str_pointer should be null since all characters are read.
>>>>> - if (*str_pointer){
>>>>> - perror("Error while reading genetic data (mldose_strtod)");
>>>>> - exit(EXIT_FAILURE);
>>>>> - }
>>>>> - //correct for negative number
>>>>> - if (sign == -1){
>>>>> - return sign * result;
>>>>> - }else{
>>>>> - return result;
>>>>> - }
>>>>> -
>>>>> }
>>>>>
>>>>> -
>>>>> -
>>>>> void gendata::get_var(int var, double * data)
>>>>> {
>>>>> // Read the genetic data for SNP 'var' and store in the array
>>>>> 'data'
>>>>> @@ -246,7 +257,7 @@
>>>>> size_t strpos = tmpstr.find("->");
>>>>> if (strpos != string::npos)
>>>>> {
>>>>> - tmpid = tmpstr.substr(strpos+2, string::npos);
>>>>> + tmpid = tmpstr.substr(strpos + 2, string::npos);
>>>>> }
>>>>> else
>>>>> {
>>>>> @@ -255,8 +266,8 @@
>>>>> if (tmpid != idnames[k])
>>>>> {
>>>>> cerr << "phenotype file and dose or probability
>>>>> file "
>>>>> - << "did not match at line " << i + 2 << "
>>>>> (" << tmpid
>>>>> - << " != " << idnames[k] << ")" << endl;
>>>>> + << "did not match at line " << i + 2 <<
>>>>> " ("
>>>>> + << tmpid << " != " << idnames[k] << ")"
>>>>> << endl;
>>>>> infile.close();
>>>>> exit(1);
>>>>> }
>>>>> @@ -267,47 +278,58 @@
>>>>> infile >> tmpstr;
>>>>> }
>>>>>
>>>>> - for (unsigned int j = 0; j < (nsnps * ngpreds); j++)
>>>>> + int oldstyle = 0;
>>>>> + if (oldstyle == 1)
>>>>> {
>>>>> - if (infile.good())
>>>>> + for (unsigned int j = 0; j < (nsnps * ngpreds); j++)
>>>>> {
>>>>> - infile >> inStr;
>>>>> - // tmpstr contains the dosage/probability in
>>>>> - // string form. Convert it to double (if tmpstr is
>>>>> - // NaN it will be set to nan).
>>>>> - double dosage;
>>>>> - char *endptr;
>>>>> - errno = 0; // To distinguish success/failure
>>>>> - // after strtod()
>>>>> + if (infile.good())
>>>>> + {
>>>>> + infile >> inStr;
>>>>> + // tmpstr contains the dosage/probability in
>>>>> + // string form. Convert it to double (if
>>>>> tmpstr is
>>>>> + // NaN it will be set to nan).
>>>>> + double dosage;
>>>>> + char *endptr;
>>>>> + errno = 0; // To distinguish
>>>>> success/failure
>>>>> + // after strtod()
>>>>>
>>>>> - dosage = mldose_strtod(inStr);
>>>>> - //dosage = strtod(tmpstr.c_str(), &endptr);
>>>>> -// if ((errno == ERANGE &&
>>>>> -// (dosage == HUGE_VALF || dosage ==
>>>>> HUGE_VALL))
>>>>> -// || (errno != 0 && dosage == 0)) {
>>>>> -// perror("Error while reading genetic data
>>>>> (strtod)");
>>>>> -// exit(EXIT_FAILURE);
>>>>> -// }
>>>>> + dosage = strtod(inStr, &endptr);
>>>>> + if ((errno == ERANGE
>>>>> + && (dosage == HUGE_VALF || dosage ==
>>>>> HUGE_VALL))
>>>>> + || (errno != 0 && dosage == 0))
>>>>> + {
>>>>> + perror("Error while reading genetic data
>>>>> (strtod)");
>>>>> + exit(EXIT_FAILURE);
>>>>> + }
>>>>>
>>>>> - if (endptr == tmpstr.c_str()) {
>>>>> - cerr << "No digits were found while reading
>>>>> genetic data"
>>>>> - << " (individual " << i + 1
>>>>> - << ", position " << j + 1 << ")"
>>>>> - << endl;
>>>>> - exit(EXIT_FAILURE);
>>>>> + if (endptr == tmpstr.c_str())
>>>>> + {
>>>>> + cerr
>>>>> + << "No digits were found while
>>>>> reading genetic data"
>>>>> + << " (individual " << i + 1 <<
>>>>> ", position "
>>>>> + << j + 1 << ")" << endl;
>>>>> + exit(EXIT_FAILURE);
>>>>> + }
>>>>> + /* If we got here, strtod() successfully
>>>>> parsed a number */
>>>>> + G.put(dosage, k, j);
>>>>> }
>>>>> -
>>>>> - /* If we got here, strtod() successfully parsed
>>>>> a number */
>>>>> - G.put(dosage, k, j);
>>>>> + else
>>>>> + {
>>>>> + std::cerr << "cannot read dose-file: " << fname
>>>>> + << "check skipd and ngpreds
>>>>> parameters\n";
>>>>> + infile.close();
>>>>> + exit(1);
>>>>> + }
>>>>> }
>>>>> - else
>>>>> - {
>>>>> - std::cerr << "cannot read dose-file: " << fname
>>>>> - << "check skipd and ngpreds
>>>>> parameters\n";
>>>>> - infile.close();
>>>>> - exit(1);
>>>>> - }
>>>>> }
>>>>> + else
>>>>> + {
>>>>> + std::string all_numbers;
>>>>> + all_numbers.reserve(nsnps * ngpreds * 7);
>>>>> + std::getline(infile, all_numbers);
>>>>> + mldose_line_to_matrix(k, all_numbers.c_str(), nsnps
>>>>> * ngpreds);
>>>>> + }
>>>>> k++;
>>>>> }
>>>>> else
>>>>>
>>>>> Modified: branches/ProbABEL-0.50/src/gendata.h
>>>>> ===================================================================
>>>>> --- branches/ProbABEL-0.50/src/gendata.h 2014-03-27 21:16:16 UTC
>>>>> (rev 1663)
>>>>> +++ branches/ProbABEL-0.50/src/gendata.h 2014-03-28 19:12:41 UTC
>>>>> (rev 1664)
>>>>> @@ -44,7 +44,7 @@
>>>>> unsigned int nids;
>>>>> unsigned int ngpreds;
>>>>> gendata();
>>>>> - double convert( char* source, char** endPtr );
>>>>> + void mldose_line_to_matrix(int k,const char *all_numbers,int
>>>>> amount_of_numbers);
>>>>>
>>>>> void re_gendata(char * fname, unsigned int insnps, unsigned int
>>>>> ingpreds,
>>>>> unsigned int npeople, unsigned int nmeasured,
>>>>>
>>>>> _______________________________________________
>>>>> Genabel-commits mailing list
>>>>> Genabel-commits at lists.r-forge.r-project.org
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>>>> _______________________________________________
>>>> Genabel-commits mailing list
>>>> Genabel-commits at lists.r-forge.r-project.org
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>>>
>>> _______________________________________________
>>> Genabel-commits mailing list
>>> Genabel-commits at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>>
>> --
>> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
>> L.C. Karssen
>> Utrecht
>> The Netherlands
>>
>> lennart at karssen.org
>> http://blog.karssen.org
>> GPG key ID: A88F554A
>> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>>
>> _______________________________________________
>> genabel-devel mailing list
>> genabel-devel at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
--
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands
lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: OpenPGP digital signature
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20140331/f96682e8/attachment-0001.sig>
More information about the genabel-devel
mailing list