[GenABEL-dev] P-values

Mon Jun 23 17:38:28 CEST 2014

Hi Alvaro,

On 23-06-14 16:38, Frank, Alvaro Jesus wrote:
> Sorry for the p-value spam, but...

:-).

> 
> On Another note, it is important to note that a t-score is equivalent
> (in the 2 tailed case or a well defined head/tail) to any p-value since
> there is a 1-1 correspondence of which t-score produces which p-value
> after integrating the corresponding CDF.

I completely agree. However (and maybe superfluous), note that p-values
can also be calculated for statistics other than the t-statistic. In
that sense I can appreciate the fact that the p-value can be used to
'define' significance. Instead of remembering significance thresholds
for the t-distribution, chi^2 distribution, Z-distribution,
F-distribution and others, I can simply remember the p-value threshold.

That said, you are not the only one concerned about the simple reporting
of p-values or the taking of 0.05 as a strict threshold for
significance. See for example this recent feature in Nature:
http://dx.doi.org/10.1038/506150a

Another very good read is the paper "Why most published research
findings are false" by epidemiologist John Ioannidis, which puts the
p-value debate in a broader context by looking at reproducibility.

Back in 1998 Kenneth Rothman (a renowned epidemiologist) even
discouraged the use of p-values from the journal "Epidemiology", of
which he was the editor, see
http://journals.lww.com/epidem/Citation/1998/01000/That_Confounded_P_Value_.4.aspx.
It looks like he didn't succeed...

> Reporting p<0.05 is as meaningful/meaningless as reporting t-score >
> 1.64485 (Normal CDF 1 tail test) .
> Instead of reporting the p<3524.1646*10^-300 one could just report
> t-score > 50 too.

Yup. And numerically that would be more easy to do as well.

> 
> This is a cultural issue of how the workflow has always been and it
> really needs revisiting.
> As a side side note, perhaps since all that matters is if a line is
> present (correlation) what about proposing a less expensive method that
> looks for this relation in very loose terms? Since the result really
> doesn't matter beyond identifying a "signal" for further study, the
> whole GWAS could be revisited to still give 'meaningful' results by
> identifying slopes ina visual or similar manner. Just a thought.

Not only that, but remember that a p-value are the result of some form
of hypothesis testing. So it's not only about whether a certain
coefficient is different from zero (the usual hypothesis), but also
whether the alternative hypothesis makes sense in the first place.

While we're at it :-), here are few other things that should be changed
in the culture of the field:
- coefficients have physical units: these should be reported! So instead
of writing "beta is 3" one should add units (e.g. km/h, mmol/l per copy
of the A allele, kg/year, whatever).
- number of significant digits that are reported (and doing so
consistently throughout the paper).
- version numbers of software used for a given analysis should be reported.

A lot of these issues are addressed every once in a while, and until the
change in culture has taken place, the least thing you can do is keep
arguing the case(s) each time they come up (or when giving lectures, etc.).

Best,

Lennart.

> 
> A Frank
> 
> 
> 
> 
> ------------------------------------------------------------------------
> *From:* genabel-devel-bounces at lists.r-forge.r-project.org
> [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of Frank,
> Alvaro Jesus [alvaro.frank at rwth-aachen.de]
> *Sent:* Monday, June 23, 2014 4:02 PM
> *To:* L.C. Karssen; genabel-devel at lists.r-forge.r-project.org
> *Subject:* Re: [GenABEL-dev] P-values
> 
>     That sounds cool. An (unnecessary?) word of caution though: I don't know
>     how much research you have done on this before starting, but make sure
>     to look at existing implementations of p-value calculation. Given that
>     in genetics p-values can become extremely small (e.g. 1e-100 or smaller)
>     numerical problems abound.
>     See also http://dx.doi.org/10.1016/j.csda.2008.11.028
> 
>     Be sure to check out existing implementations (e.g. in Boost) on how
>     integration is being handled. Somewhere at the back of my mind the term
>     'incomplete Gamma function' comes to mind.
> 
> Such extreme values are directly related to the t-score, and it starts
> being a problem at around 20+.
> For those cases I am set to use arbitrary precise functions as in:
> http://www.mpfr.org/mpfr-current/mpfr.html
> 
> Which leads me to the critic that if people only report p-values as
> their work, perhaps there is a cultural and scientific problem with the
> workflow and conceptual basis. But if they want their 3.5*10^-300 I will
> calculate it.
> 
>     I would make the actual threshold a user-definable option, by the way.
>     There may be use cases where people want to have accurate p-values even
>     for non-significant cases.
> 
> I was thinking the same. From Local workflows here at Helmholtz I see
> that any insignificant value gets ignored completely unless it comes to
> doing Manhattan plots. So another setting would be to not calculate or
> store at all anything for non significant t-scores, which is helpfull
> during early discovery of significant signals. For Final calculations
> for plotting and reporting, the setting can then be changed to enable
> those. This effectivly removes the "Big data" problem, since 99% of the
> data is meaningless to determining whether the slope is significant.
> 
>     In ProbABEL we are now implementing the calculation of p-values, but the
>     actual calculation is being left to the Boost libraries. Currently we
>     calculate the p-values irrespective of the value of the statistic. I'll
>     profile the code to see how much time this takes and whether it makes
>     sense to only calculate p-values for statistics with a value above a
>     certain threshold.
> 
> I was considering boost, if your profiling is good, perhaps boost is
> doing the threshold already. Let me know to see if I should switch over
> to boost too.
> 
> -A Frank
> ________________________________________
> From: genabel-devel-bounces at lists.r-forge.r-project.org
> [genabel-devel-bounces at lists.r-forge.r-project.org] on behalf of L.C.
> Karssen [lennart at karssen.org]
> Sent: Monday, June 23, 2014 3:51 PM
> To: genabel-devel at lists.r-forge.r-project.org
> Subject: Re: [GenABEL-dev] P-values
> 
> Hi Alvaro,
> 
> On 23-06-14 15:21, Frank, Alvaro Jesus wrote:
>> Hi all,
>>
>> I have been trying to figure out an efficient way to calculate p-values,
>> and it seems that I managed to come to an efficient compromise between
>> speed an accuracy.
> 
> That sounds cool. An (unnecessary?) word of caution though: I don't know
> how much research you have done on this before starting, but make sure
> to look at existing implementations of p-value calculation. Given that
> in genetics p-values can become extremely small (e.g. 1e-100 or smaller)
> numerical problems abound.
> See also http://dx.doi.org/10.1016/j.csda.2008.11.028
> 
>> 99% of the regressions will yield a non significant p
>> value, that is to say p > 0.05 or even conservatively 0.1 . It is easy
>> to know apriori if the p-value is significant, by looking at the t-score
>> where it originates from. For any t-score < 1.28 the pvalue will not go
>> below 0.1 for a t-distribution or normal distribution. In this cases
>> (9X%) an aproximation with an error of 10^-(4~5) of the p-value is
> 
> Do you mean an absolute or relative error of 1e-4 here?
> 
>> enough. This calculation is efficient involving only a quadratic
>> polynomial to be approximated. For possible significant p-values, with
>> t-score > 1.28 a proper calculation of the p-value can be done.
> 
> I like the idea of setting a threshold for t, below which doing slow but
> accurate calculations make no sense. If that gives a considerable speed
> up, go for it!
> I would make the actual threshold a user-definable option, by the way.
> There may be use cases where people want to have accurate p-values even
> for non-significant cases.
> 
>> Note
>> that a p-value calculation involves approximating the integral of the
>> distribution used, either t-students (n<1000?) or normal distribution.
> 
> Be sure to check out existing implementations (e.g. in Boost) on how
> integration is being handled. Somewhere at the back of my mind the term
> 'incomplete Gamma function' comes to mind.
> 
>>
>> How are the different genabel packages handling this at the moment? This
>> is a speedup that can be applied to any p-value calculation.
> 
> The R-based packages will most likely use the functions provided by R,
> e.g. pchisq() for calculating the p-value of a chi^2 statistic.
> In ProbABEL we are now implementing the calculation of p-values, but the
> actual calculation is being left to the Boost libraries. Currently we
> calculate the p-values irrespective of the value of the statistic. I'll
> profile the code to see how much time this takes and whether it makes
> sense to only calculate p-values for statistics with a value above a
> certain threshold.
> 
>>
>> Plotted Error between 1-ncdf(x) and polynomial approx:
>>
>>
> http://www.wolframalpha.com/input/?i=y+%3D+%28%281%2F2+-+1%2F2*erf%28x%2Fsqrt%282%29%29%29+-+%281%2F2-%280.1*x*%284.4-x%29%29%29%29+from+0+to+1.28
>>
> 
> ah, that leads me to believe the answer to my question above is that you
> mean an absolute error of ~1e-4, correct?
> 
> 
> Best,
> 
> Lennart.
> 
>>
>> -Alvaro
>>
>>
>> _______________________________________________
>> genabel-devel mailing list
>> genabel-devel at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-devel
>>
> 
> --
> *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
> L.C. Karssen
> Utrecht
> The Netherlands
> 
> lennart at karssen.org
> http://blog.karssen.org
> GPG key ID: A88F554A
> -*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-
> 

-- 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
L.C. Karssen
Utrecht
The Netherlands

lennart at karssen.org
http://blog.karssen.org
GPG key ID: A88F554A
-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 213 bytes
Desc: OpenPGP digital signature
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20140623/7a9ef3eb/attachment-0001.sig>