[Genabel-commits] r1546 - in branches/ProbABEL-pvals/ProbABEL: . src

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Mon Jan 13 15:07:05 CET 2014


Author: lckarssen
Date: 2014-01-13 15:07:04 +0100 (Mon, 13 Jan 2014)
New Revision: 1546

Modified:
   branches/ProbABEL-pvals/ProbABEL/configure.ac
   branches/ProbABEL-pvals/ProbABEL/src/Makefile.am
   branches/ProbABEL-pvals/ProbABEL/src/main.cpp
   branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.cpp
   branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.h
Log:
First commit for adding calculation of p-values to ProbABEL.
This uses the statistical distributions from Boost::Math library (http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/dist.html). By default this is enabled. To disable this run: ./configure --without-boost. If the header file is not in the default include search path, use ./configure --with-boost-include-path "/some/path/to/boost_headers".

At the moment both the columns for the chi^2 values and the p-values are printed. We may want to disable printing the chi^2 values in the future, possibly enabling them with a command line switch.

The actual calculation is performed by the function pchisq() named like its R counterpart in the main_functions_dump.cpp file. In main.cpp an extra vector is created to contain the chi^2 values in double format (next to the ostringstream that was already there). This vector of doubles (one per model) is then used when calculating the p-values.

Compilation is succesfull, but the two checks that compare output against "known-good" output files fail (because I haven't updated those files yet). Compilation and checks when running ./configure --without-boost work as expected (this should be the old (v0.4.x) behaviour).

TODO: Like before (v0.4.x series) the chi^2 is not calculated when using mmscore and the 2DF model, because that requires the 2DF Wald test, which we haven't implemented yet (main.cpp lines 440--457.)


Modified: branches/ProbABEL-pvals/ProbABEL/configure.ac
===================================================================
--- branches/ProbABEL-pvals/ProbABEL/configure.ac	2014-01-12 20:10:41 UTC (rev 1545)
+++ branches/ProbABEL-pvals/ProbABEL/configure.ac	2014-01-13 14:07:04 UTC (rev 1546)
@@ -84,6 +84,39 @@
 AM_CONDITIONAL([WITH_EIGEN], test "x$with_eigen" != "xno")
 
 
+# See if we can use the Boost libraries for calculating p-values
+AC_ARG_WITH([boost],
+ AS_HELP_STRING([--with-boost], [Use the Boost Math library for
+ calculating p-values. This is enabled by default]
+               )
+)
+
+if test "x$with_boost" != "xno"; then
+    AC_MSG_NOTICE([building using the Boost Math library enabled])
+
+    AC_ARG_WITH([boost-include-path],
+        [AS_HELP_STRING([--with-boost-include-path],
+          [location of the Boost Math headers, defaults to /usr/include/boost])],
+        [CXXFLAGS+=" -I${withval}"
+        CPPFLAGS+=" -I${withval}"],
+        [CXXFLAGS+=' -I/usr/include/boost'
+        CPPFLAGS+=' -I/usr/include/boost'])
+
+    # Check for the Boost Math header files
+    AC_CHECK_HEADERS([boost/math/distributions.hpp])
+
+    if test x$ac_cv_header_boost_math_distributions_hpp = xno; then
+      AC_MSG_ERROR([Could not find the Boost Math header files. Did \
+you specify --with-boost-include-path correctly? Or use --without-boost \
+to disable the calculation of p-values.])
+    fi
+else
+   AC_MSG_NOTICE([not using the Boost libraries, so no p-values in the \
+output])
+fi
+AM_CONDITIONAL([WITH_BOOST], test "x$with_boost" != "xno")
+
+
 # Checks for typedefs, structures, and compiler characteristics.
 AC_HEADER_STDBOOL
 AC_C_INLINE

Modified: branches/ProbABEL-pvals/ProbABEL/src/Makefile.am
===================================================================
--- branches/ProbABEL-pvals/ProbABEL/src/Makefile.am	2014-01-12 20:10:41 UTC (rev 1545)
+++ branches/ProbABEL-pvals/ProbABEL/src/Makefile.am	2014-01-13 14:07:04 UTC (rev 1546)
@@ -1,13 +1,14 @@
 ## Process this file with automake to produce Makefile.in
 
-## Using wildcards in these lists doesn't work. Also GNU make's ($wildcard,) doesn't
-## work. It gives warning message about portability, but in the end doesn't work,
-## I tried :-).
-REGFILES = data.h data.cpp gendata.h gendata.cpp mematrix.h mematri1.h	\
- command_line_settings.h command_line_settings.cpp reg1.h usage.h		\
- usage.cpp main.cpp utilities.h utilities.cpp phedata.h phedata.cpp	\
- cholesky.h cholesky.cpp regdata.h regdata.cpp  \
- maskedmatrix.cpp maskedmatrix.h reg1.cpp main_functions_dump.h main_functions_dump.cpp
+## Using wildcards in these lists doesn't work. Also GNU make's
+## ($wildcard,) doesn't work. It gives warning message about
+## portability, but in the end doesn't work, I tried :-).
+REGFILES = data.h data.cpp gendata.h gendata.cpp mematrix.h		\
+ mematri1.h command_line_settings.h command_line_settings.cpp reg1.h	\
+ usage.h usage.cpp main.cpp utilities.h utilities.cpp phedata.h		\
+ phedata.cpp cholesky.h cholesky.cpp regdata.h regdata.cpp		\
+ maskedmatrix.cpp maskedmatrix.h reg1.cpp main_functions_dump.h		\
+ main_functions_dump.cpp
 
 EIGENFILES = eigen_mematrix.h eigen_mematrix.cpp
 
@@ -77,6 +78,12 @@
 pacoxph_CXXFLAGS += -DEIGEN
 endif
 
+if WITH_BOOST
+palinear_CXXFLAGS += -DWITH_BOOST
+palogist_CXXFLAGS += -DWITH_BOOST
+pacoxph_CXXFLAGS += -DWITH_BOOST
+endif
+
 extract_snp_SOURCES = extract-snp.cpp $(FVSRC) $(FVHEADERS)
 
 ## Install these scripts in the bin directory as well:

Modified: branches/ProbABEL-pvals/ProbABEL/src/main.cpp
===================================================================
--- branches/ProbABEL-pvals/ProbABEL/src/main.cpp	2014-01-12 20:10:41 UTC (rev 1545)
+++ branches/ProbABEL-pvals/ProbABEL/src/main.cpp	2014-01-13 14:07:04 UTC (rev 1546)
@@ -202,11 +202,16 @@
 
     int start_pos, end_pos;
 
+
+    // Output streams for the regression results we want to print in
+    // the outpute file(s).
     std::vector<std::ostringstream *> beta_sebeta;
-    // Han Chen
     std::vector<std::ostringstream *> covvalue;
-    // Oct 26, 2009
+    std::vector<double*> chi2val; // vector that contains the chi2 values
     std::vector<std::ostringstream *> chi2;
+#if WITH_BOOST
+    std::vector<std::ostringstream *> pval;
+#endif
 
     // Create string streams for betas, SEs, etc. These are used to
     // later store the various output values that will be written to
@@ -224,6 +229,11 @@
         chi2.push_back(new std::ostringstream());
         chi2[i]->precision(6);
         // *chi2[i] << scientific;
+        chi2val.push_back(new double);
+#if WITH_BOOST
+        pval.push_back(new std::ostringstream());
+        pval[i]->precision(6);
+#endif
     }
 
 
@@ -408,17 +418,21 @@
                                                  input_var.getNgpreds(),
                                                  true, 1, mli, csnp);
 #endif
-                            *chi2[model] << 2. * (loglik - new_null_rd.loglik);
+                            *chi2val[model] = 2. *
+                                (loglik - new_null_rd.loglik);
+                            *chi2[model] << *chi2val[model];
                         }
                         else
                         {
                             // No missing SNP data, we can compute the LRT
-                            *chi2[model] << 2. * (loglik - null_loglik);
+                            *chi2val[model] = 2. * (loglik - null_loglik);
+                            *chi2[model] << *chi2val[model];
                         }
                     } else
                     {
                         // We want score test output
-                        *chi2[model] << rd.chi2_score;
+                        *chi2val[model] = rd.chi2_score;
+                        *chi2[model] << *chi2val[model];
                     }
                 }  // END if( inv.sigma == NULL )
                 else if (input_var.getInverseFilename() != NULL)
@@ -432,12 +446,14 @@
                          * equation just below Eq.(4) in the ProbABEL
                          * paper. TODO LCK
                          */
+                        *chi2val[model] = NAN;
                         *chi2[model] << "nan";
                     }
                     else
                     {
                         double Z = rd.beta[start_pos] / rd.sebeta[start_pos];
-                        *chi2[model] << Z * Z;
+                        *chi2val[model] = Z * Z;
+                        *chi2[model] << *chi2val[model];
                     }
                 }
             }  // END first part of if(poly); allele not too rare
@@ -493,6 +509,7 @@
                     }
 #endif
                     // Oct 26, 2009
+                    *chi2val[model] = NAN;
                     *chi2[model] << "nan";
                 } else
                 { // ngpreds==1 (and SNP is rare)
@@ -508,9 +525,16 @@
 #endif
                         // Oct 26, 2009
                     }  // END if getInverseFilename == NULL
+                    *chi2val[model] = NAN;
                     *chi2[model] << "nan";
                 }  // END ngpreds == 1 (and SNP is rare)
             }  // END else: SNP is rare
+
+#if WITH_BOOST
+            // Calulate p-values based on the chi^2 values
+            int chi2df = 1;
+            *pval[model] << pchisq(*chi2val[model], chi2df);
+#endif
         }  // END of model cycle
 
 
@@ -528,8 +552,13 @@
                                     << input_var.getSep();
                 }
 #endif
-                *outfile[model] << chi2[model]->str()
-                                << "\n";
+                *outfile[model] << chi2[model]->str();
+
+#if WITH_BOOST
+                *outfile[model] << input_var.getSep() << pval[model]->str();
+#endif
+
+                *outfile[model] << endl;
             }  // END for loop over all models
         }
         else  // Dose data: only additive model. Only one output file
@@ -541,10 +570,15 @@
                 *outfile[0] << covvalue[0]->str() << input_var.getSep();
             }
 #endif
-            *outfile[0] << chi2[0]->str() << "\n";
+            *outfile[0] << chi2[0]->str();
+
+#if WITH_BOOST
+            *outfile[0] << input_var.getSep() << pval[0]->str();
+#endif
+
+            *outfile[0] << endl;
         }  // End ngpreds == 1 when writing output files
 
-
         // Clean chi2 and other streams
         for (int model = 0; model < maxmod; model++)
         {
@@ -553,6 +587,9 @@
             covvalue[model]->str("");
             // Oct 26, 2009
             chi2[model]->str("");
+#if WITH_BOOST
+            pval[model]->str("");
+#endif
         }
 
         update_progress_to_cmd_line(csnp, nsnps);
@@ -582,12 +619,14 @@
         delete *it;
         ++it;
     }
+
     it = covvalue.begin();
     while (it != covvalue.end())
     {
         delete *it;
         ++it;
     }
+
     it = chi2.begin();
     while (it != chi2.end())
     {
@@ -595,5 +634,14 @@
         ++it;
     }
 
+#if WITH_BOOST
+    it = pval.begin();
+    while (it != pval.end())
+    {
+        delete *it;
+        ++it;
+    }
+#endif
+
     return (0);
 }

Modified: branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.cpp
===================================================================
--- branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.cpp	2014-01-12 20:10:41 UTC (rev 1545)
+++ branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.cpp	2014-01-13 14:07:04 UTC (rev 1546)
@@ -15,6 +15,10 @@
 #include <iomanip>
 #include <vector>
 
+#if WITH_BOOST
+#include <boost/math/distributions.hpp>
+#endif
+
 #include "maskedmatrix.h"
 #include "phedata.h"
 #include "data.h"
@@ -55,6 +59,7 @@
     std::cout << std::setprecision(6);
 }
 
+
 /**
  * Open an output file for each model when using probability data
  * (npgreds == 2). This function creates the _2df.out.txt etc. files.
@@ -238,6 +243,9 @@
 #endif
         }
         *outfile[0] << input_var.getSep() << "chi2_SNP";
+#if WITH_BOOST
+        *outfile[0] << input_var.getSep() << "pval_SNP";
+#endif
         *outfile[0] << "\n";
     } // ngpreds == 1
     else if (input_var.getNgpreds() == 2) // prob data: all models
@@ -315,11 +323,25 @@
                 //Oct 26, 2009
             }
         }
-        *outfile[0] << input_var.getSep() << "chi2_SNP_2df\n";
-        *outfile[1] << input_var.getSep() << "chi2_SNP_A1\n";
-        *outfile[2] << input_var.getSep() << "chi2_SNP_domA1\n";
-        *outfile[3] << input_var.getSep() << "chi2_SNP_recA1\n";
-        *outfile[4] << input_var.getSep() << "chi2_SNP_odomA1\n";
+        *outfile[0] << input_var.getSep() << "chi2_SNP_2df";
+        *outfile[1] << input_var.getSep() << "chi2_SNP_A1";
+        *outfile[2] << input_var.getSep() << "chi2_SNP_domA1";
+        *outfile[3] << input_var.getSep() << "chi2_SNP_recA1";
+        *outfile[4] << input_var.getSep() << "chi2_SNP_odomA1";
+
+#ifdef WITH_BOOST
+        *outfile[0] << input_var.getSep() << "pval_SNP_2df";
+        *outfile[1] << input_var.getSep() << "pval_SNP_A1";
+        *outfile[2] << input_var.getSep() << "pval_SNP_domA1";
+        *outfile[3] << input_var.getSep() << "pval_SNP_recA1";
+        *outfile[4] << input_var.getSep() << "pval_SNP_odomA1";
+#endif
+
+        *outfile[0] << endl;
+        *outfile[1] << endl;
+        *outfile[2] << endl;
+        *outfile[3] << endl;
+        *outfile[4] << endl;
     } // End: ngpreds == 2
     else
     {
@@ -426,3 +448,34 @@
 
     return start_pos;
 }
+
+
+#ifdef WITH_BOOST
+/**
+ * Calculate the p-value based on the chi^2 distribution.
+ *
+ * \param chi2 chi^2 value for which the p-value should be
+ * calculated.
+ * \param df degrees of freedom of the chi^2 distribution
+ */
+double pchisq(const double chi2, const int df)
+{
+    double pval;
+
+    if (!std::isnan(chi2))
+    {
+        // Initialise the distribution
+        boost::math::chi_squared chi2dist(df);
+
+        /* Use the complement here (in R we would also set
+         * lower.tail=FALSE)
+         */
+        pval = boost::math::cdf(complement(chi2dist, chi2));
+    }
+    else
+    {
+        pval = NAN;
+    }
+        return pval;
+}
+#endif

Modified: branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.h
===================================================================
--- branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.h	2014-01-12 20:10:41 UTC (rev 1545)
+++ branches/ProbABEL-pvals/ProbABEL/src/main_functions_dump.h	2014-01-13 14:07:04 UTC (rev 1546)
@@ -35,4 +35,6 @@
 
 int get_start_position(const cmdvars& input_var, const int model,
         const int number_of_rows_or_columns);
+
+double pchisq(const double chi2, const int df);
 #endif /* MAIN_FUNCTIONS_DUMP_H_ */



More information about the Genabel-commits mailing list