[GenABEL-dev] [Genabel-commits] r1664 - branches/ProbABEL-0.50/src

L.C. Karssen lennart at karssen.org
Mon Mar 31 23:28:56 CEST 2014


Hi Yurii,

On 31-03-14 20:46, Yury Aulchenko wrote:
> I personally find the fact that text outperforms binary disappointing

Me too...

One way out of the dilemma could be if we truncate the output to e.g. 4
significant digits. I don't think the output of mach/minimac will be
more precise.
I gave it a try, but it seems that the LRT-based chi^2 values don't play
along nicely: they differ at the 1e-3 or 1e-4 level in the example data.
My hunch is that a careful look at how we calculate the LRT may solve
this, as subtraction of two values is a likely candidate for loss of
precision.


> (and, if you forget about technical details - well, strange). On the
> other hand this is probably good for user as it eradicates the need
> to do conversion. Especially if we could work with compressed files.

That's my hope as well. Of course I haven't tested the impact of reading
zipped files yet. Maybe it'll put us back in square one.

> Especially if we build interface to work with other type of text
> outputs (e.g. IMPUTE2 would be a candidate)...

Indeed.


Lennart.

> 
> Yurii
> 
> ----------------
> Sent from mobile device, please excuse possible typos
> 
>> On 28 Mar 2014, at 23:19, "L.C. Karssen" <lennart at karssen.org> wrote:
>>
>> Dear all,
>>
>> (I guess the previous version of this mail went to the commit email
>> list, so here it is again for the devel list).
>>
>>
>> Indeed: an impressive speed-up! Well done Maarten.
>>
>>> On 28-03-14 20:30, Maarten Kooyman wrote:
>>> I tested speed of ProbABEL on a dataset 33815 snp / 3485 people adjusted
>>> for sex and age (I did not run it in triplet but gives an idea)
>>>
>>> version 0.42 0.50_branch
>>> FV         58     52
>>> mldose  48    12
>>> all times ate in seconds.
>>>
>>> As you can see the filevector format in the part that slows down the
>>> program. When profiling the reading from FV takes up 86% of all the time
>>> the program takes.
>>
>>
>> The current problem with reading from filevector is that the fv dat ais
>> stored in floats (this is logical as it means half the disk space usage
>> compared to storing doubles, moreover, the imputed data is never more
>> precise than a float anyway).
>> However, internally ProbABEL uses doubles for calculations. This means
>> conversion from float to double must occur at some point.
>>
>> Simply casting to double gives impression. For example casting a float
>> 0.677 to double gives: 0.67699998617172241
>> Therefore, with version 0.4.0 I changed this and used a string as
>> intermediate form, followed by strtod(). First I used stringstreams, but
>> these turn out to be much too slow for our use case. Now snprintf() is
>> used. For the above example the double value is: 0.67700000000000005,
>> much closer to what we would like to see. Using this two-step conversion
>> means the output when using fv is equal to the output using txt data
>> (and equal to using R), within float precision.
>>
>> Using Maarten's 'strtod' will speed up this part as well, but the
>> snprintf() call is still expensive.
>>
>> Apart from this two-step conversion we may also be inefficient because
>> the dosage/probability values are converted one array element at the
>> time. Maybe we can gain something there, like Maarten did for the txt
>> format and simply sending a whole 'line'/array to the conversion may help.
>>
>>
>>
>>
>> Given that most people nowadays store their imputation results in chunks
>> of chromosomes anyway (i.e. small(er) files), and the fact that I think
>> implementing the ability to read gziped files is not difficult, it may
>> be time to give mldose.gz files another chance for ProbABEL users. It
>> will save them the conversion from mldose.gz to DatABEL.
>> Of course we can still support DatABEL files, but (depending on how fast
>> reading from gzipped files is), our recommendation could change with the
>> upcoming ProbABEL v0.5.0.
>>
>> Any thoughts on this?
>>
>>
>> Best,
>>
>> Lennart.
>>
>>
>>
>>
>>
>>>> On 28-03-14 20:15, Yury Aulchenko wrote:
>>>> 10 fold is good speed up. An order of magnitude :)
>>>>
>>>> Wonder how it compares now to the reading from plain text files?
>>>>
>>>> Y
>>>>
>>>> ----------------
>>>> Sent from mobile device, please excuse possible typos
>>>>
>>>>> On 28 Mar 2014, at 20:12, noreply at r-forge.r-project.org wrote:
>>>>>
>>>>> Author: maartenk
>>>>> Date: 2014-03-28 20:12:41 +0100 (Fri, 28 Mar 2014)
>>>>> New Revision: 1664
>>>>>
>>>>> Modified:
>>>>>   branches/ProbABEL-0.50/src/gendata.cpp
>>>>>   branches/ProbABEL-0.50/src/gendata.h
>>>>> Log:
>>>>> new implementation of reading in numbers of mldose file: this version
>>>>> is about a 10(!) fold faster than in ProABEL 0.42
>>>>>
>>>>> Modified: branches/ProbABEL-0.50/src/gendata.cpp
>>>>> ===================================================================
>>>>> --- branches/ProbABEL-0.50/src/gendata.cpp    2014-03-27 21:16:16 UTC
>>>>> (rev 1663)
>>>>> +++ branches/ProbABEL-0.50/src/gendata.cpp    2014-03-28 19:12:41 UTC
>>>>> (rev 1664)
>>>>> @@ -40,58 +40,69 @@
>>>>> #endif
>>>>> #include "utilities.h"
>>>>>
>>>>> -double mldose_strtod(const char *str_pointer) {
>>>>> -    // This function is inspired on some answers found at
>>>>> stackoverflow :
>>>>> -    // eg question 5678932
>>>>> -    int sign = 0;
>>>>> -    double result = 0;
>>>>> -    //check if not a null pointer or NaN (right now checks only
>>>>> first character)
>>>>> -//TODO: make catching of NaN more rigid
>>>>> -    if (!*str_pointer | *str_pointer == 'N'){
>>>>> -        return std::numeric_limits<double>::quiet_NaN();
>>>>> +
>>>>> +void gendata::mldose_line_to_matrix(int k,const char
>>>>> *all_numbers,int amount_of_numbers){
>>>>> +    int j = 0;
>>>>> +    //check if not a null pointer
>>>>> +    if (!*all_numbers){
>>>>> +        perror("Error while reading genetic data (expected pointer
>>>>> to char but found a null pointer)");
>>>>> +                       exit(EXIT_FAILURE);
>>>>>     }
>>>>> -    //skip whitespace
>>>>> -    while (*str_pointer == ' ')
>>>>> +    while (j<amount_of_numbers)
>>>>>     {
>>>>> -        str_pointer++;
>>>>> -    }
>>>>> -    //set sign to -1 if negative: multiply by sign just before return
>>>>> -    if (*str_pointer == '-')
>>>>> -    {
>>>>> -        str_pointer++;
>>>>> -        sign = -1;
>>>>> -    }
>>>>> -    //read digits before dot
>>>>> -    while (*str_pointer <= '9' && *str_pointer >= '0'){
>>>>> -        result = result * 10 + (*str_pointer++ - '0');
>>>>> -    }
>>>>> -    //read digit after dot
>>>>> -    if (*str_pointer == '.')
>>>>> -    {
>>>>> -        double decimal_counter = 1.0;
>>>>> -        str_pointer++;
>>>>> -        while (*str_pointer <= '9' && *str_pointer >= '0')
>>>>> +        double result = 0;
>>>>> +        //skip whitespace
>>>>> +        while (*all_numbers == ' ')
>>>>>         {
>>>>> -            decimal_counter *= 0.1;
>>>>> -            result += (*str_pointer++ - '0') * decimal_counter;
>>>>> +            all_numbers++;
>>>>>         }
>>>>> +        //check NaN (right now checks only first character)
>>>>> +        //TODO: make catching of NaN more rigid
>>>>> +        if (*all_numbers == 'N')
>>>>> +        {
>>>>> +            result = std::numeric_limits<double>::quiet_NaN();
>>>>> +            //skip other characters of NaN
>>>>> +            while ((*all_numbers == 'a') | (*all_numbers == 'N'))
>>>>> +            {
>>>>> +                all_numbers++;
>>>>> +            }
>>>>> +        }
>>>>> +        else
>>>>> +        {
>>>>> +            int sign = 0;
>>>>> +            //set sign to -1 if negative: multiply by sign just
>>>>> before return
>>>>> +            if (*all_numbers == '-')
>>>>> +            {
>>>>> +                all_numbers++;
>>>>> +                sign = -1;
>>>>> +            }
>>>>> +            //read digits before dot
>>>>> +            while (*all_numbers <= '9' && *all_numbers >= '0')
>>>>> +            {
>>>>> +                result = result * 10 + (*all_numbers++ - '0');
>>>>> +            }
>>>>> +            //read digit after dot
>>>>> +            if (*all_numbers == '.')
>>>>> +            {
>>>>> +                double decimal_counter = 1.0;
>>>>> +                all_numbers++;
>>>>> +                while (*all_numbers <= '9' && *all_numbers >= '0')
>>>>> +                {
>>>>> +                    decimal_counter *= 0.1;
>>>>> +                    result += (*all_numbers++ - '0') * decimal_counter;
>>>>> +                }
>>>>> +            }
>>>>> +            //correct for negative number
>>>>> +            if (sign == -1)
>>>>> +            {
>>>>> +                result = sign * result;
>>>>> +            }
>>>>> +        }
>>>>> +        G.put(result, k, j);
>>>>> +        j++;
>>>>>     }
>>>>> -    //str_pointer should be null since all characters are read.
>>>>> -    if (*str_pointer){
>>>>> -        perror("Error while reading genetic data (mldose_strtod)");
>>>>> -                          exit(EXIT_FAILURE);
>>>>> -    }
>>>>> -    //correct for negative number
>>>>> -    if (sign == -1){
>>>>> -        return sign * result;
>>>>> -    }else{
>>>>> -        return result;
>>>>> -    }
>>>>> -
>>>>> }
>>>>>
>>>>> -
>>>>> -
>>>>> void gendata::get_var(int var, double * data)
>>>>> {
>>>>>     // Read the genetic data for SNP 'var' and store in the array
>>>>> 'data'
>>>>> @@ -246,7 +257,7 @@
>>>>>                 size_t strpos = tmpstr.find("->");
>>>>>                 if (strpos != string::npos)
>>>>>                 {
>>>>> -                    tmpid = tmpstr.substr(strpos+2, string::npos);
>>>>> +                    tmpid = tmpstr.substr(strpos + 2, string::npos);
>>>>>                 }
>>>>>                 else
>>>>>                 {
>>>>> @@ -255,8 +266,8 @@
>>>>>                 if (tmpid != idnames[k])
>>>>>                 {
>>>>>                     cerr << "phenotype file and dose or probability
>>>>> file "
>>>>> -                         << "did not match at line " << i + 2 << "
>>>>> (" << tmpid
>>>>> -                         << " != " << idnames[k] << ")" << endl;
>>>>> +                            << "did not match at line " << i + 2 <<
>>>>> " ("
>>>>> +                            << tmpid << " != " << idnames[k] << ")"
>>>>> << endl;
>>>>>                     infile.close();
>>>>>                     exit(1);
>>>>>                 }
>>>>> @@ -267,47 +278,58 @@
>>>>>                 infile >> tmpstr;
>>>>>             }
>>>>>
>>>>> -            for (unsigned int j = 0; j < (nsnps * ngpreds); j++)
>>>>> +            int oldstyle = 0;
>>>>> +            if (oldstyle == 1)
>>>>>             {
>>>>> -                if (infile.good())
>>>>> +                for (unsigned int j = 0; j < (nsnps * ngpreds); j++)
>>>>>                 {
>>>>> -                    infile >> inStr;
>>>>> -                    // tmpstr contains the dosage/probability in
>>>>> -                    // string form. Convert it to double (if tmpstr is
>>>>> -                    // NaN it will be set to nan).
>>>>> -                    double dosage;
>>>>> -                    char *endptr;
>>>>> -                    errno = 0;      // To distinguish success/failure
>>>>> -                                    // after strtod()
>>>>> +                    if (infile.good())
>>>>> +                    {
>>>>> +                        infile >> inStr;
>>>>> +                        // tmpstr contains the dosage/probability in
>>>>> +                        // string form. Convert it to double (if
>>>>> tmpstr is
>>>>> +                        // NaN it will be set to nan).
>>>>> +                        double dosage;
>>>>> +                        char *endptr;
>>>>> +                        errno = 0;      // To distinguish
>>>>> success/failure
>>>>> +                                        // after strtod()
>>>>>
>>>>> -                    dosage = mldose_strtod(inStr);
>>>>> -                    //dosage = strtod(tmpstr.c_str(), &endptr);
>>>>> -//                    if ((errno == ERANGE &&
>>>>> -//                         (dosage == HUGE_VALF || dosage ==
>>>>> HUGE_VALL))
>>>>> -//                        || (errno != 0 && dosage == 0)) {
>>>>> -//                        perror("Error while reading genetic data
>>>>> (strtod)");
>>>>> -//                        exit(EXIT_FAILURE);
>>>>> -//                    }
>>>>> +                        dosage = strtod(inStr, &endptr);
>>>>> +                        if ((errno == ERANGE
>>>>> +                                && (dosage == HUGE_VALF || dosage ==
>>>>> HUGE_VALL))
>>>>> +                                || (errno != 0 && dosage == 0))
>>>>> +                        {
>>>>> +                            perror("Error while reading genetic data
>>>>> (strtod)");
>>>>> +                            exit(EXIT_FAILURE);
>>>>> +                        }
>>>>>
>>>>> -                    if (endptr == tmpstr.c_str()) {
>>>>> -                        cerr << "No digits were found while reading
>>>>> genetic data"
>>>>> -                             << " (individual " << i + 1
>>>>> -                             << ", position " << j + 1 << ")"
>>>>> -                             << endl;
>>>>> -                        exit(EXIT_FAILURE);
>>>>> +                        if (endptr == tmpstr.c_str())
>>>>> +                        {
>>>>> +                            cerr
>>>>> +                                    << "No digits were found while
>>>>> reading genetic data"
>>>>> +                                    << " (individual " << i + 1 <<
>>>>> ", position "
>>>>> +                                    << j + 1 << ")" << endl;
>>>>> +                            exit(EXIT_FAILURE);
>>>>> +                        }
>>>>> +                        /* If we got here, strtod() successfully
>>>>> parsed a number */
>>>>> +                        G.put(dosage, k, j);
>>>>>                     }
>>>>> -
>>>>> -                    /* If we got here, strtod() successfully parsed
>>>>> a number */
>>>>> -                    G.put(dosage, k, j);
>>>>> +                    else
>>>>> +                    {
>>>>> +                        std::cerr << "cannot read dose-file: " << fname
>>>>> +                                << "check skipd and ngpreds
>>>>> parameters\n";
>>>>> +                        infile.close();
>>>>> +                        exit(1);
>>>>> +                    }
>>>>>                 }
>>>>> -                else
>>>>> -                {
>>>>> -                    std::cerr << "cannot read dose-file: " << fname
>>>>> -                              << "check skipd and ngpreds
>>>>> parameters\n";
>>>>> -                    infile.close();
>>>>> -                    exit(1);
>>>>> -                }
>>>>>             }
>>>>> +            else
>>>>> +            {
>>>>> +                std::string all_numbers;
>>>>> +                all_numbers.reserve(nsnps * ngpreds * 7);
>>>>> +                std::getline(infile, all_numbers);
>>>>> +                mldose_line_to_matrix(k, all_numbers.c_str(), nsnps
>>>>> * ngpreds);
>>>>> +            }
>>>>>             k++;
>>>>>         }
>>>>>         else
>>>>>
>>>>> Modified: branches/ProbABEL-0.50/src/gendata.h
>>>>> ===================================================================
>>>>> --- branches/ProbABEL-0.50/src/gendata.h    2014-03-27 21:16:16 UTC
>>>>> (rev 1663)
>>>>> +++ branches/ProbABEL-0.50/src/gendata.h    2014-03-28 19:12:41 UTC
>>>>> (rev 1664)
>>>>> @@ -44,7 +44,7 @@
>>>>>     unsigned int nids;
>>>>>     unsigned int ngpreds;
>>>>>     gendata();
>>>>> -    double convert(   char* source,  char** endPtr );
>>>>> +    void mldose_line_to_matrix(int k,const char *all_numbers,int
>>>>> amount_of_numbers);
>>>>>
>>>>>     void re_gendata(char * fname, unsigned int insnps, unsigned int
>>>>> ingpreds,
>>>>>             unsigned int npeople, unsigned int nmeasured,
>>>>>
>>>>> _______________________________________________
>>>>> Genabel-commits mailing list
>>>>> Genabel-commits at lists.r-forge.r-project.org
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>>>> _______________________________________________
>>>> Genabel-commits mailing list
>>>> Genabel-commits at lists.r-forge.r-project.org
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>>>
>>> _______________________________________________
>>> Genabel-commits mailing list
>>> Genabel-commits at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
>>
>> -- 
>> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
>> L.C. Karssen
>> Utrecht
>> The Netherlands
>>
>> lennart at karssen.org
>> http://blog.karssen.org
>> GPG key ID: A88F554A
>> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
>>
>> _______________________________________________
>> genabel-devel mailing list
>> genabel-devel at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 230 bytes
Desc: OpenPGP digital signature
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20140331/f96682e8/attachment-0001.sig>


More information about the genabel-devel mailing list