[GenABEL-dev] P-values
Frank, Alvaro Jesus
alvaro.frank at rwth-aachen.de
Mon Jun 23 16:38:43 CEST 2014
Sorry for the p-value spam, but...
On Another note, it is important to note that a t-score is equivalent (in the 2 tailed case or a well defined head/tail) to any p-value since there is a 1-1 correspondence of which t-score produces which p-value after integrating the corresponding CDF.
Reporting p<0.05 is as meaningful/meaningless as reporting t-score > 1.64485 (Normal CDF 1 tail test) .
Instead of reporting the p<3524.1646*10^-300 one could just report t-score > 50 too.
This is a cultural issue of how the workflow has always been and it really needs revisiting.
As a side side note, perhaps since all that matters is if a line is present (correlation) what about proposing a less expensive method that looks for this relation in very loose terms? Since the result really doesn't matter beyond identifying a "signal" for further study, the whole GWAS could be revisited to still give 'meaningful' results by identifying slopes ina visual or similar manner. Just a thought.
A Frank
________________________________
From: genabel-devel-bounces at lists.r-forge.r-project.org [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of Frank, Alvaro Jesus [alvaro.frank at rwth-aachen.de]
Sent: Monday, June 23, 2014 4:02 PM
To: L.C. Karssen; genabel-devel at lists.r-forge.r-project.org
Subject: Re: [GenABEL-dev] P-values
That sounds cool. An (unnecessary?) word of caution though: I don't know
how much research you have done on this before starting, but make sure
to look at existing implementations of p-value calculation. Given that
in genetics p-values can become extremely small (e.g. 1e-100 or smaller)
numerical problems abound.
See also http://dx.doi.org/10.1016/j.csda.2008.11.028
Be sure to check out existing implementations (e.g. in Boost) on how
integration is being handled. Somewhere at the back of my mind the term
'incomplete Gamma function' comes to mind.
Such extreme values are directly related to the t-score, and it starts being a problem at around 20+.
For those cases I am set to use arbitrary precise functions as in: http://www.mpfr.org/mpfr-current/mpfr.html
Which leads me to the critic that if people only report p-values as their work, perhaps there is a cultural and scientific problem with the workflow and conceptual basis. But if they want their 3.5*10^-300 I will calculate it.
I would make the actual threshold a user-definable option, by the way.
There may be use cases where people want to have accurate p-values even
for non-significant cases.
I was thinking the same. From Local workflows here at Helmholtz I see that any insignificant value gets ignored completely unless it comes to doing Manhattan plots. So another setting would be to not calculate or store at all anything for non significant t-scores, which is helpfull during early discovery of significant signals. For Final calculations for plotting and reporting, the setting can then be changed to enable those. This effectivly removes the "Big data" problem, since 99% of the data is meaningless to determining whether the slope is significant.
In ProbABEL we are now implementing the calculation of p-values, but the
actual calculation is being left to the Boost libraries. Currently we
calculate the p-values irrespective of the value of the statistic. I'll
profile the code to see how much time this takes and whether it makes
sense to only calculate p-values for statistics with a value above a
certain threshold.
I was considering boost, if your profiling is good, perhaps boost is doing the threshold already. Let me know to see if I should switch over to boost too.
-A Frank
________________________________________
From: genabel-devel-bounces at lists.r-forge.r-project.org [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of L.C. Karssen [lennart at karssen.org]
Sent: Monday, June 23, 2014 3:51 PM
To: genabel-devel at lists.r-forge.r-project.org
Subject: Re: [GenABEL-dev] P-values
Hi Alvaro,
On 23-06-14 15:21, Frank, Alvaro Jesus wrote:
> Hi all,
>
> I have been trying to figure out an efficient way to calculate p-values,
> and it seems that I managed to come to an efficient compromise between
> speed an accuracy.
That sounds cool. An (unnecessary?) word of caution though: I don't know
how much research you have done on this before starting, but make sure
to look at existing implementations of p-value calculation. Given that
in genetics p-values can become extremely small (e.g. 1e-100 or smaller)
numerical problems abound.
See also http://dx.doi.org/10.1016/j.csda.2008.11.028
> 99% of the regressions will yield a non significant p
> value, that is to say p > 0.05 or even conservatively 0.1 . It is easy
> to know apriori if the p-value is significant, by looking at the t-score
> where it originates from. For any t-score < 1.28 the pvalue will not go
> below 0.1 for a t-distribution or normal distribution. In this cases
> (9X%) an aproximation with an error of 10^-(4~5) of the p-value is
Do you mean an absolute or relative error of 1e-4 here?
> enough. This calculation is efficient involving only a quadratic
> polynomial to be approximated. For possible significant p-values, with
> t-score > 1.28 a proper calculation of the p-value can be done.
I like the idea of setting a threshold for t, below which doing slow but
accurate calculations make no sense. If that gives a considerable speed
up, go for it!
I would make the actual threshold a user-definable option, by the way.
There may be use cases where people want to have accurate p-values even
for non-significant cases.
> Note
> that a p-value calculation involves approximating the integral of the
> distribution used, either t-students (n<1000?) or normal distribution.
Be sure to check out existing implementations (e.g. in Boost) on how
integration is being handled. Somewhere at the back of my mind the term
'incomplete Gamma function' comes to mind.
>
> How are the different genabel packages handling this at the moment? This
> is a speedup that can be applied to any p-value calculation.
The R-based packages will most likely use the functions provided by R,
e.g. pchisq() for calculating the p-value of a chi^2 statistic.
In ProbABEL we are now implementing the calculation of p-values, but the
actual calculation is being left to the Boost libraries. Currently we
calculate the p-values irrespective of the value of the statistic. I'll
profile the code to see how much time this takes and whether it makes
sense to only calculate p-values for statistics with a value above a
certain threshold.
>
> Plotted Error between 1-ncdf(x) and polynomial approx:
>
> http://www.wolframalpha.com/input/?i=y+%3D+%28%281%2F2+-+1%2F2*erf%28x%2Fsqrt%282%29%29%29+-+%281%2F2-%280.1*x*%284.4-x%29%29%29%29+from+0+to+1.28
>
ah, that leads me to believe the answer to my question above is that you
mean an absolute error of ~1e-4, correct?
Best,
Lennart.
>
> -Alvaro
>
>
> _______________________________________________
> genabel-devel mailing list
> genabel-devel at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>
--
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands
lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20140623/6c10f3b7/attachment.html>
More information about the genabel-devel
mailing list