<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style id="owaParaStyle" type="text/css">P {margin-top:0;margin-bottom:0;}</style>

</head>

<body ocsi="0" fpstyle="1">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">

<blockquote>That sounds cool. An (unnecessary?) word of caution though: I don't know<br>

how much research you have done on this before starting, but make sure<br>

to look at existing implementations of p-value calculation. Given that<br>

in genetics p-values can become extremely small (e.g. 1e-100 or smaller)<br>

numerical problems abound.<br>

See also http://dx.doi.org/10.1016/j.csda.2008.11.028<br>

<br>

Be sure to check out existing implementations (e.g. in Boost) on how<br>

integration is being handled. Somewhere at the back of my mind the term<br>

'incomplete Gamma function' comes to mind.<br>

<br>

</blockquote>

Such extreme values are directly related to the t-score, and it starts being a problem at around 20+.<br>

For those cases I am set to use arbitrary precise functions as in: <font color="black" face="Tahoma" size="2">

<span style="font-size:10pt;" dir="ltr">http://www.mpfr.org/mpfr-current/mpfr.html</span></font><br>

<br>

Which leads me to the critic that if people only report p-values as their work, perhaps there is a cultural and scientific problem with the workflow and conceptual basis. But if they want their 3.5*10^-300 I will calculate it.<br>

<br>

<blockquote>I would make the actual threshold a user-definable option, by the way.<br>

There may be use cases where people want to have accurate p-values even<br>

for non-significant cases.<br>

</blockquote>

I was thinking the same. From Local workflows here at Helmholtz I see that any insignificant value gets ignored completely unless it comes to doing Manhattan plots. So another setting would be to not calculate or store at all anything for non significant t-scores,

 which is helpfull during early discovery of significant signals. For Final calculations for plotting and reporting, the setting can then be changed to enable those. This effectivly removes the "Big data" problem, since 99% of the data is meaningless to determining

 whether the slope is significant.<br>

<br>

<blockquote>In ProbABEL we are now implementing the calculation of p-values, but the<br>

actual calculation is being left to the Boost libraries. Currently we<br>

calculate the p-values irrespective of the value of the statistic. I'll<br>

profile the code to see how much time this takes and whether it makes<br>

sense to only calculate p-values for statistics with a value above a<br>

certain threshold.<br>

<br>

</blockquote>

I was considering boost, if your profiling is good, perhaps boost is doing the threshold already. Let me know to see if I should switch over to boost too.<br>

<br>

-A Frank<br>

________________________________________<br>

From: genabel-devel-bounces@lists.r-forge.r-project.org [genabel-devel-bounces@lists.r-forge.r-project.org] on behalf of L.C. Karssen [lennart@karssen.org]<br>

Sent: Monday, June 23, 2014 3:51 PM<br>

To: genabel-devel@lists.r-forge.r-project.org<br>

Subject: Re: [GenABEL-dev] P-values<br>

<br>

Hi Alvaro,<br>

<br>

On 23-06-14 15:21, Frank, Alvaro Jesus wrote:<br>

> Hi all,<br>

><br>

> I have been trying to figure out an efficient way to calculate p-values,<br>

> and it seems that I managed to come to an efficient compromise between<br>

> speed an accuracy.<br>

<br>

That sounds cool. An (unnecessary?) word of caution though: I don't know<br>

how much research you have done on this before starting, but make sure<br>

to look at existing implementations of p-value calculation. Given that<br>

in genetics p-values can become extremely small (e.g. 1e-100 or smaller)<br>

numerical problems abound.<br>

See also http://dx.doi.org/10.1016/j.csda.2008.11.028<br>

<br>

> 99% of the regressions will yield a non significant p<br>

> value, that is to say p > 0.05 or even conservatively 0.1 . It is easy<br>

> to know apriori if the p-value is significant, by looking at the t-score<br>

> where it originates from. For any t-score < 1.28 the pvalue will not go<br>

> below 0.1 for a t-distribution or normal distribution. In this cases<br>

> (9X%) an aproximation with an error of 10^-(4~5) of the p-value is<br>

<br>

Do you mean an absolute or relative error of 1e-4 here?<br>

<br>

> enough. This calculation is efficient involving only a quadratic<br>

> polynomial to be approximated. For possible significant p-values, with<br>

> t-score > 1.28 a proper calculation of the p-value can be done.<br>

<br>

I like the idea of setting a threshold for t, below which doing slow but<br>

accurate calculations make no sense. If that gives a considerable speed<br>

up, go for it!<br>

I would make the actual threshold a user-definable option, by the way.<br>

There may be use cases where people want to have accurate p-values even<br>

for non-significant cases.<br>

<br>

> Note<br>

> that a p-value calculation involves approximating the integral of the<br>

> distribution used, either t-students (n<1000?) or normal distribution.<br>

<br>

Be sure to check out existing implementations (e.g. in Boost) on how<br>

integration is being handled. Somewhere at the back of my mind the term<br>

'incomplete Gamma function' comes to mind.<br>

<br>

><br>

> How are the different genabel packages handling this at the moment? This<br>

> is a speedup that can be applied to any p-value calculation.<br>

<br>

The R-based packages will most likely use the functions provided by R,<br>

e.g. pchisq() for calculating the p-value of a chi^2 statistic.<br>

In ProbABEL we are now implementing the calculation of p-values, but the<br>

actual calculation is being left to the Boost libraries. Currently we<br>

calculate the p-values irrespective of the value of the statistic. I'll<br>

profile the code to see how much time this takes and whether it makes<br>

sense to only calculate p-values for statistics with a value above a<br>

certain threshold.<br>

<br>

><br>

> Plotted Error between 1-ncdf(x) and polynomial approx:<br>

><br>

> http://www.wolframalpha.com/input/?i=y+%3D+%28%281%2F2+-+1%2F2*erf%28x%2Fsqrt%282%29%29%29+-+%281%2F2-%280.1*x*%284.4-x%29%29%29%29+from+0+to+1.28<br>

><br>

<br>

ah, that leads me to believe the answer to my question above is that you<br>

mean an absolute error of ~1e-4, correct?<br>

<br>

<br>

Best,<br>

<br>

Lennart.<br>

<br>

><br>

> -Alvaro<br>

><br>

><br>

> _______________________________________________<br>

> genabel-devel mailing list<br>

> genabel-devel@lists.r-forge.r-project.org<br>

> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel<br>

><br>

<br>

--<br>

*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*<br>

L.C. Karssen<br>

Utrecht<br>

The Netherlands<br>

<br>

lennart@karssen.org<br>

http://blog.karssen.org<br>

GPG key ID: A88F554A<br>

-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-<br>

<br>

</div>

</body>

</html>