[GenABEL-dev] P-values

L.C. Karssen lennart at karssen.org
Mon Jun 23 18:39:35 CEST 2014


Hi Alvaro,

On 23-06-14 16:02, Frank, Alvaro Jesus wrote:
>     That sounds cool. An (unnecessary?) word of caution though: I don't know
>     how much research you have done on this before starting, but make sure
>     to look at existing implementations of p-value calculation. Given that
>     in genetics p-values can become extremely small (e.g. 1e-100 or smaller)
>     numerical problems abound.
>     See also http://dx.doi.org/10.1016/j.csda.2008.11.028
> 
>     Be sure to check out existing implementations (e.g. in Boost) on how
>     integration is being handled. Somewhere at the back of my mind the term
>     'incomplete Gamma function' comes to mind.
> 
> Such extreme values are directly related to the t-score, and it starts
> being a problem at around 20+.
> For those cases I am set to use arbitrary precise functions as in:
> http://www.mpfr.org/mpfr-current/mpfr.html

Thanks for the link. I'm not really familiar with arbitrary precision
math, and this seems like a good place to start reading and playing.
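
Just to make this concrete for myself: if I read the MPFR manual
correctly, computing the upper tail of the normal distribution for an
extreme t-score would look roughly like the sketch below (untested, and
the 256-bit precision and the t-score of 40 are just numbers I picked):

// Rough, untested sketch based on the MPFR manual: upper-tail normal
// p-value p = erfc(t / sqrt(2)) / 2, computed at 256-bit precision.
#include <stdio.h>
#include <mpfr.h>

int main()
{
    mpfr_t x, p;
    mpfr_init2(x, 256);               // 256 bits of mantissa
    mpfr_init2(p, 256);

    mpfr_set_d(x, 40.0, MPFR_RNDN);   // example t-score of 40
    mpfr_sqrt_ui(p, 2, MPFR_RNDN);    // p = sqrt(2)
    mpfr_div(x, x, p, MPFR_RNDN);     // x = t / sqrt(2)
    mpfr_erfc(p, x, MPFR_RNDN);       // p = erfc(x)
    mpfr_div_ui(p, p, 2, MPFR_RNDN);  // p = erfc(x) / 2

    mpfr_printf("p = %.10Re\n", p);
    mpfr_clears(x, p, (mpfr_ptr) 0);
    return 0;
}

For a t-score like that the result is far below DBL_MIN, so a plain
double would simply underflow to zero, which I guess is exactly why you
want arbitrary precision here.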

> 
> Which leads me to the criticism that if people only report p-values as
> their work, perhaps there is a cultural and scientific problem with the
> workflow and its conceptual basis. But if they want their 3.5*10^-300 I
> will calculate it.

For now, that's the way life is. ;-)

> 
>     I would make the actual threshold a user-definable option, by the way.
>     There may be use cases where people want to have accurate p-values even
>     for non-significant cases.
> 
> I was thinking the same. From local workflows here at Helmholtz I see
> that any non-significant value gets ignored completely, except when it
> comes to doing Manhattan plots.

That's mostly my experience as well. However, when it comes to
meta-analysing data from different groups, the data (sometimes including
p-values) need to be present for all SNPs.

> So another setting would be to not calculate or store anything at all
> for non-significant t-scores, which is helpful during the early
> discovery of significant signals.

I think that such an option would be a good thing.


> For final calculations for plotting and reporting, the setting can then
> be changed to enable those. This effectively removes the "big data"
> problem, since 99% of the data is irrelevant to determining whether the
> slope is significant.

I agree.
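
Just so we are talking about the same thing, here is a rough sketch of
how such a setting could look (hypothetical names, not actual ProbABEL
code), combining your t < 1.28 cut-off and the quadratic approximation
from the Wolfram Alpha link further down with Boost.Math for the
accurate tail:

// Rough sketch with hypothetical names -- not ProbABEL code.
#include <boost/math/distributions/students_t.hpp>
#include <cmath>
#include <limits>

// Upper-tail p-value for a t-score with df degrees of freedom.
// Below the user-definable threshold we either skip the calculation
// (return NaN) or use the cheap quadratic approximation
// 1 - Phi(t) ~= 0.5 - 0.1*t*(4.4 - t), absolute error ~1e-4 on [0, 1.28].
double upper_tail_p(double t, double df,
                    double t_threshold = 1.28,
                    bool skip_nonsignificant = false)
{
    t = std::fabs(t);
    if (t < t_threshold) {
        if (skip_nonsignificant)
            return std::numeric_limits<double>::quiet_NaN();  // "not computed"
        return 0.5 - 0.1 * t * (4.4 - t);
    }
    // Accurate tail via Boost.Math; cdf(complement(...)) avoids the
    // cancellation of 1 - cdf(dist, t) far out in the tail.
    boost::math::students_t dist(df);
    return boost::math::cdf(boost::math::complement(dist, t));
}

Whether the quadratic approximation is also good enough for the
t-distribution at small sample sizes is something you can judge better
than I can.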


> 
>     In ProbABEL we are now implementing the calculation of p-values, but the
>     actual calculation is being left to the Boost libraries. Currently we
>     calculate the p-values irrespective of the value of the statistic. I'll
>     profile the code to see how much time this takes and whether it makes
>     sense to only calculate p-values for statistics with a value above a
>     certain threshold.
> 
> I was considering Boost. If your profiling results are good, perhaps
> Boost is doing the thresholding already. Let me know, so I can see
> whether I should switch over to Boost too.

I'm not familiar with MPFR, but from a maintainability point of view it
may be worth considering Boost, since we're already using that in
ProbABEL. Moreover, since Boost contains many other algorithms and is
used extensively, I also assume it is well-tested. Another point to
consider is that Boost is C++ whereas MPFR is C. Since most of the code
we write is C++, it makes sense to go for Boost (again, from a
maintainability point of view; from a performance point of view that may
be different).
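
In case it helps the comparison: if I read the Boost documentation
correctly, the arbitrary-precision part could also be kept within Boost
itself via Boost.Multiprecision, along these lines (untested sketch; the
100-digit type and the t-score of 40 are arbitrary choices on my part):

// Untested sketch: the same extreme upper tail as in the MPFR example
// above, but using only Boost (Boost.Multiprecision + Boost.Math).
#include <boost/math/special_functions/erf.hpp>
#include <boost/multiprecision/cpp_dec_float.hpp>
#include <iostream>

int main()
{
    typedef boost::multiprecision::cpp_dec_float_100 mp_float;  // 100 digits
    mp_float t = 40;                        // example t-score
    mp_float x = t / sqrt(mp_float(2));
    mp_float p = boost::math::erfc(x) / 2;  // p = erfc(t / sqrt(2)) / 2
    std::cout << p << std::endl;            // far below what a double can hold
    return 0;
}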

As to the profiling: I'll keep you posted!


Best,

Lennart.


> 
> -A Frank
> ________________________________________
> From: genabel-devel-bounces at lists.r-forge.r-project.org
> [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of L.C.
> Karssen [lennart at karssen.org]
> Sent: Monday, June 23, 2014 3:51 PM
> To: genabel-devel at lists.r-forge.r-project.org
> Subject: Re: [GenABEL-dev] P-values
> 
> Hi Alvaro,
> 
> On 23-06-14 15:21, Frank, Alvaro Jesus wrote:
>> Hi all,
>>
>> I have been trying to figure out an efficient way to calculate p-values,
>> and it seems that I have managed to come to a good compromise between
>> speed and accuracy.
> 
> That sounds cool. An (unnecessary?) word of caution though: I don't know
> how much research you have done on this before starting, but make sure
> to look at existing implementations of p-value calculation. Given that
> in genetics p-values can become extremely small (e.g. 1e-100 or smaller)
> numerical problems abound.
> See also http://dx.doi.org/10.1016/j.csda.2008.11.028
> 
>> 99% of the regressions will yield a non-significant p-value, that is
>> to say p > 0.05, or even conservatively 0.1. It is easy to know a
>> priori whether the p-value is significant by looking at the t-score it
>> originates from. For any t-score < 1.28 the p-value will not go below
>> 0.1 for a t-distribution or normal distribution. In these cases (9X%)
>> an approximation of the p-value with an error of 10^-(4~5) is
> 
> Do you mean an absolute or relative error of 1e-4 here?
> 
>> enough. This calculation is efficient, involving only the evaluation
>> of a quadratic polynomial. For possibly significant p-values, with
>> t-score > 1.28, a proper calculation of the p-value can be done.
> 
> I like the idea of setting a threshold for t, below which doing slow but
> accurate calculations make no sense. If that gives a considerable speed
> up, go for it!
> I would make the actual threshold a user-definable option, by the way.
> There may be use cases where people want to have accurate p-values even
> for non-significant cases.
> 
>> Note
>> that a p-value calculation involves approximating the integral of the
>> distribution used, either Student's t (n < 1000?) or the normal
>> distribution.
> 
> Be sure to check out existing implementations (e.g. in Boost) on how
> integration is being handled. Somewhere at the back of my mind the term
> 'incomplete Gamma function' comes to mind.
> 
>>
>> How are the different GenABEL packages handling this at the moment?
>> This is a speedup that can be applied to any p-value calculation.
> 
> The R-based packages will most likely use the functions provided by R,
> e.g. pchisq() for calculating the p-value of a chi^2 statistic.
> In ProbABEL we are now implementing the calculation of p-values, but the
> actual calculation is being left to the Boost libraries. Currently we
> calculate the p-values irrespective of the value of the statistic. I'll
> profile the code to see how much time this takes and whether it makes
> sense to only calculate p-values for statistics with a value above a
> certain threshold.
> 
>>
>> Plotted error between 1 - ncdf(x) and the polynomial approximation:
>>
>>
> http://www.wolframalpha.com/input/?i=y+%3D+%28%281%2F2+-+1%2F2*erf%28x%2Fsqrt%282%29%29%29+-+%281%2F2-%280.1*x*%284.4-x%29%29%29%29+from+0+to+1.28
>>
> 
> Ah, that leads me to believe that the answer to my question above is
> that you mean an absolute error of ~1e-4, correct?
> 
> 
> Best,
> 
> Lennart.
> 
>>
>> -Alvaro
>>
>>
>> _______________________________________________
>> genabel-devel mailing list
>> genabel-devel at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>
> 
> --
> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
> L.C. Karssen
> Utrecht
> The Netherlands
> 
> lennart at karssen.org
> http://blog.karssen.org
> GPG key ID: A88F554A
> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
> 

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
