[Genabel-commits] r762 - in pkg/ProbABEL: doc src
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Tue Aug 23 21:48:36 CEST 2011
Author: yurii
Date: 2011-08-23 21:48:36 +0200 (Tue, 23 Aug 2011)
New Revision: 762
Modified:
pkg/ProbABEL/doc/CHANGES.LOG
pkg/ProbABEL/doc/ProbABEL_manual.tex
pkg/ProbABEL/src/main.cpp
pkg/ProbABEL/src/mematri1.h
pkg/ProbABEL/src/reg1.h
pkg/ProbABEL/src/usage.h
Log:
update manual; changes.log; add '--mmscore' option for binary traits
Modified: pkg/ProbABEL/doc/CHANGES.LOG
===================================================================
--- pkg/ProbABEL/doc/CHANGES.LOG 2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/doc/CHANGES.LOG 2011-08-23 19:48:36 UTC (rev 762)
@@ -1,3 +1,54 @@
+***** v.0.2-0 (2011.08.23)
+
+* ProbABEL can now (experimentally!) analyze binary traits accounting for relationship
+structure (thanks to Yurii and Nicola Pirastu). This adds '--mmscore'
+option for logistic regression. Important: on the contrast to the
+'palinear' the matrix which should be supplied to palogist with
+--mmscore should contain an inverse of the correlation matrix (not
+the inverse of var-cov matrix, as is the case for 'palinear' with
+--mmscore). This matrix can be obtained through (GenABEL notation):
+
+h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
+
+where h2.object is the object returned by 'polygenic()'. Documentation
+explains this procedure in more details; a wrapper function to prepare
+and export correct objects for ProbABEL is planned. (thanks to Yurii)
+
+- in probabel.pl the location for the probabel config file is now set
+ to /etc/, the default location it will end up. Before it still
+ pointed to the location the probabel.pl file was in. However, a
+ better fix would be to somehow put in the actual value of the
+ --sysconfdir option to ./configure.
+- The PDF of the LaTeX documentation is now only generated if the
+ pdflatex binary can be found. So now we also build the documentation
+ using autotools.
+
+
+* ProbABEL uses autoconf and automake now (thanks to Lennart). Now the
+compilation process is like this: first, from the ProbABEL directory,
+generate documentation by
+
+cd doc
+make
+make clean_doc
+cd ..
+
+Then, run
+
+./configure
+make
+make install
+
+To see options, run
+
+./configure --help
+
+In the ProbABEL autotools integration branch:
+- Removed the old Makefile
+- Added configure.ac that servers as input for auto(re)conf
+- Added the Makefile.am files needed by automake to generate the final Makefiles.
+
+
***** v.0.1-9e (2011.05.15)
Previous version wouldn't compile because of missing frerror. file.
Fixes bug #1339 (the bug fix was done a month ago, but no package had
Modified: pkg/ProbABEL/doc/ProbABEL_manual.tex
===================================================================
--- pkg/ProbABEL/doc/ProbABEL_manual.tex 2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/doc/ProbABEL_manual.tex 2011-08-23 19:48:36 UTC (rev 762)
@@ -1,6 +1,9 @@
\title{ProbABEL manual}
-\author{Yurii Aulchenko, Maksim Struchalin, Lennart Karssen \\
- Erasmus MC Rotterdam
+\author{
+Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\
+\\
+$^{1}$Erasmus MC Rotterdam \\
+$^{2}$Institute of Cytology and Genetics SD RAS
}
\date{\today}
@@ -28,6 +31,10 @@
\makeindex
+\newcommand{\PA}{\texttt{ProbABEL-package}}
+\newcommand{\GA}{\texttt{GenABEL-package}}
+\newcommand{\DA}{\texttt{DatABEL-package}}
+
\begin{document}
\maketitle
\tableofcontents
@@ -76,16 +83,16 @@
performing association analysis by means of regression of the
outcome of interest onto estimated genotypic probabilities.
-The \texttt{ProbABEL} package was designed to perform such regression
+The \PA{} package was designed to perform such regression
in a fast, memory-efficient and consequently genome-wide feasible manner.
-Currently, \texttt{ProbABEL} implements linear, logistic regression,
+Currently, \PA{} implements linear, logistic regression,
and Cox proportional hazards models. The corresponding analysis
programs are called \texttt{palinear}, \texttt{palogist},
and \texttt{pacoxph}.
\section{Input files}
-\texttt{ProbABEL} takes three files as input: a file containing SNP
+\PA{} takes three files as input: a file containing SNP
information (e.g.~the MLINFO file of MACH), a file with genome- or
chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
and a file containing the phenotype of interest and covariates.
@@ -93,6 +100,13 @@
Optionally, the map information can be supplied (e.g.~the "legend"
files of HapMap).
+The dose/probability file may be supplied in filevector format
+in which case \PA{} will operate much faster, and
+in low-RAM mode (approx. $\approx$ 128 MB). See the R libraries \GA{} and
+\DA{} on how to convert MACH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
+
\subsection{SNP information file}
\label{ssec:infoin}
In the simplest scenario, the SNP information file is an MLINFO
@@ -103,7 +117,7 @@
``Rsq'', the proportion of variance decrease after imputations).
Actually,
-for \texttt{ProbABEL}, it does not matter what is written in this file --
+for \PA{}, it does not matter what is written in this file --
this information is just brought forward to the output. However,
\textbf{it is critical} that the number of columns is seven and the number
of lines in the file is equal to the number of SNPs in the
@@ -239,7 +253,7 @@
the program).
There are in total 11 command line options you can specify to the
-\texttt{ProbABEL} analysis functions \texttt{palinear} or
+\PA{} analysis functions \texttt{palinear} or
\texttt{palogist}. If you run either program without any argument, you
will get a short explanation to command line options:
\begin{verbatim}
@@ -350,25 +364,25 @@
first column contains id names, the second the trait. The others are
covariates.
-An example of how a polygenic object estimated by GenABEL can be used
+An example of how a polygenic object estimated by \GA{} can be used
with ProbABEL is provided in \texttt{../example/mmscore.R}
Though technically \texttt{--mmscore} allows for inclusion of multiple
covariates, these should be kept to minimum as this is a score test. We
suggest that any covariates explaining an essential proportion of
-variance should be fit as part of \texttt{GenABEL}'s
+variance should be fit as part of \GA{}'s
\texttt{polygenic} procedure.
%Option \texttt{--interaction\_only} is like \texttt{--interaction} but does not
%include in the model the main effect of the covariate, which is acting in
%interaction with SNP. This option is useful when running \texttt{--mmscore},
%in whch case the main effect should normally estimated in the polygenic
-%model and only the interaction term in the \texttt{ProbABEL} analysis.
+%model and only the interaction term in the \PA{} analysis.
\subsection{Running multiple analyses at once: \texttt{probabel.pl}}
The Perl script \texttt{bin/probabel.pl\_example} represents a handy
-wraper for \texttt{ProbABEL} functions. To start using it the
+wraper for \PA{} functions. To start using it the
configuration file \texttt{bin/probabel\_config.cfg\_example} needs to
be edited. The configuration file consists of five columns. Each
column except the first is a pattern for files produced by
@@ -383,7 +397,7 @@
for the ``mlinfo'' files. Probably you also have to change the variable
\texttt{\$config} in the script to point to the full path of the
configuration file and the variable \texttt{@anprog} to point full
-path to the \texttt{ProbABEL} scripts.
+path to the \PA{} scripts.
\section{Output file format}
@@ -412,7 +426,7 @@
At this point, unless you have complete measurements on all
subjects, you should feel alarmed if the number here is exactly the
number of people in the file -- this probably indicates you did not code
-missing values according to \texttt{ProbABEL} format ('NA', 'NaN', or 'N').
+missing values according to \PA{} format ('NA', 'NaN', or 'N').
The next column, nine (``Mean\_predictor\_allele''), gives the estimated
frequency of the predictor allele in subjects with complete phenotypic data.
@@ -433,14 +447,24 @@
\section{Memory use and performance}
Maximum likelihood regression is implemented in
-\texttt{ProbABEL}. With 6,000 people and 2.5 millions SNPs, a
+\PA{}. With 6,000 people and 2.5 millions SNPs, a
genome-wide scan is completed in less that an hour for a linear model
with 1-2 covariates and overnight for logistic regression or the Cox
-proportional hazards model.
+proportional hazards model (figures for a PC bought back in 2007).
-Memory is an issue with \texttt{ProbABEL} -- large chromosomes,
+Memory may be an issue with \PA{} if you use
+MACH text dose/probability files, e.g. for large chromosomes,
such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
+We suggest that dose/probability file is to be supplied in filevector format
+in which case \PA{} will operate about 2-3 times faster, and
+in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
+\DA{} on how to convert MACH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
+
+When '--mmscore' option is used, the analysis may take quite some time.
+
\section{Methodology}
\label{sec:methodology}
\subsection{Analysis of population-based data}
@@ -598,7 +622,7 @@
\index{Cox proportional hazards model}
\index{proportional hazards model}
\index{regression!Cox proportional hazards}
-in \texttt{ProbABEL} is entirely based on the code of \texttt{R}
+in \PA{} is entirely based on the code of \texttt{R}
library \texttt{survival} developed by Thomas Lumley
(function \texttt{coxfit2}), and is therefore not described here.
@@ -742,14 +766,14 @@
\section{How to cite}
-If you used \texttt{ProbABEL} for
-your analysis please give a link to the \texttt{ABEL} home
+If you used \PA{} for
+your analysis please give a link to the \texttt{GenABEL project} home
page
\begin{quote}
\url{http://www.genabel.org/}
\end{quote}
-and cite the \texttt{ProbABEL} paper to give us some credit:
+and cite the \PA{} paper to give us some credit:
\begin{quote}
Aulchenko YS, Struchalin MV, van Duijn CM.\\
\emph{ProbABEL package for genome-wide association analysis of imputed data.}\\
@@ -757,15 +781,15 @@
\end{quote}
A proper reference may look like
\begin{quote}
-For the analysis of imputed data, we used the \texttt{ProbABEL} package
-from the GenABEL suite of programs (Aulchenko \emph{et al.}, 2010).
+For the analysis of imputed data, we used the \PA{}
+from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
\end{quote}
If you have used the Cox proportional hazard model, please mention the
\texttt{R} package \texttt{survival} by Thomas Lumley. Additionally
to the above citation, please tell that
\begin{quote}
-The Cox proportional hazards model implemented in \texttt{ProbABEL}
+The Cox proportional hazards model implemented in \PA{}
makes use of the source code of the \texttt{R} package ''\texttt{survival}''
as implemented by T. Lumley.
\end{quote}
Modified: pkg/ProbABEL/src/main.cpp
===================================================================
--- pkg/ProbABEL/src/main.cpp 2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/main.cpp 2011-08-23 19:48:36 UTC (rev 762)
@@ -303,9 +303,12 @@
mematrix<double> invvarmatrix;
+/*
+ * now should be possible... delete this part later when everything works
#if LOGISTIC
if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for logistic regression\n";exit(1);}
#endif
+*/
#if COXPH
if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for cox regression\n";exit(1);}
Modified: pkg/ProbABEL/src/mematri1.h
===================================================================
--- pkg/ProbABEL/src/mematri1.h 2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/mematri1.h 2011-08-23 19:48:36 UTC (rev 762)
@@ -368,7 +368,7 @@
{
if (M.ncol != M.nrow)
{
- fprintf(stderr,"invert: only suare matrices possible\n");
+ fprintf(stderr,"invert: only square matrices possible\n");
exit(1);
}
if (M.ncol == 1)
@@ -376,12 +376,19 @@
mematrix<DT> temp(1,1);
temp[0] = 1./M[0];
}
+ /*
for (int i=0;i<M.ncol;i++)
if (M.data[i*M.ncol+i]==0)
{
fprintf(stderr,"invert: zero elements in diagonal\n");
- exit(1);
+ mematrix<DT> temp = M;
+ for (int i = 0; i < M.ncol; i++)
+ for (int j = 0; j < M.ncol; j++)
+ temp.put(NAN,i,j);
+ return temp;
+ //exit(1);
}
+ */
int actualsize = M.ncol;
int maxsize = M.ncol;
mematrix<DT> temp = M;
Modified: pkg/ProbABEL/src/reg1.h
===================================================================
--- pkg/ProbABEL/src/reg1.h 2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/reg1.h 2011-08-23 19:48:36 UTC (rev 762)
@@ -626,7 +626,7 @@
//Oct 26, 2009
residuals.reinit((rdata.X).nrow,1);
sigma2=-1.;
- loglik=-9.999e+32;
+ loglik=-9.999e+32; // should actually be MAX of the corresponding type
niter=-1;
chi2_score=-1.;
}
@@ -635,10 +635,36 @@
// delete beta;
// delete sebeta;
}
- void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int robust, int nullmodel=0)
+
+ void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol,
+ int model, int interaction, int ngpreds, mematrix<double> & invvarmatrixin, int robust,
+ int nullmodel=0)
{
+ // on the contrast to the 'linear' 'invvarmatrix' contains
+ // inverse of correlation matrix (not the inverse of var-cov matrix)
+ // h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
+ // the inverse of var-cov matrix scaled by total variance
regdata rdata = rdatain.get_unmasked_data();
+ // a lot of code duplicated between linear and logistic...
+ // e.g. a piece below...
+ mematrix<double> invvarmatrix;
+ if(invvarmatrixin.nrow!=0 && invvarmatrixin.ncol!=0)
+ {
+ invvarmatrix.reinit(rdata.nids,rdata.nids);
+ int i1=0, j1=0;
+ for (int i=0;i<rdata.nids;i++)
+ if (rdatain.masked_data[i]==0) {
+ j1 = 0;
+ for (int j=0;j<rdata.nids;j++) if (rdatain.masked_data[j]==0) {
+ invvarmatrix.put(invvarmatrixin.get(i,j),i1,j1);
+ j1++;
+ }
+ i1++;
+ }
+
+ }
+
mematrix<double> X = apply_model(rdata.X,model, interaction, ngpreds, false, nullmodel);
int length_beta = X.ncol;
beta.reinit(length_beta,1);
@@ -668,10 +694,20 @@
mematrix<double> tX = transpose(X);
- // if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0) tX = X*invvarmatrix;
+ if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0)
+ {
+ tX = tX*invvarmatrix;
+ }
+ /*
+ fprintf(stdout,"\n");
+ fprintf(stdout,"X %f %f %f\n",X.get(0,0),X.get(0,1),X.get(0,2));
+ if (X.ncol==4) fprintf(stdout,"X[4] %f\n",X.get(0,3));
+ fprintf(stdout,"Inv %f %f %f\n",invvarmatrix.get(0,0),invvarmatrix.get(0,1),invvarmatrix.get(0,2));
+ if (X.ncol==4) fprintf(stdout,"X[4] %f\n",invvarmatrix.get(0,3));
+ fprintf(stdout,"tXInv %f %f %f\n",tX.get(0,0),tX.get(1,0),tX.get(2,0));
+ if (X.ncol==4) fprintf(stdout,"X[4] %f\n",tX.get(3,0));
+ */
-
-
double N;
niter=0;
@@ -700,21 +736,29 @@
N = tXWX.get(0,0);
if (verbose) {printf("tXWX:\n");tXWX.print();}
+ //printf("tXWX:\n");tXWX.print();
//
// use cholesky to invert
//
tXWX_i = tXWX;
- cholesky2_mm(tXWX_i,tol_chol);
- if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
- chinv2_mm(tXWX_i);
+ //cholesky2_mm(tXWX_i,tol_chol);
+ //if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
+ //printf("chole tXWX:\n");tXWX_i.print();
+ //chinv2_mm(tXWX_i);
// was before
- // tXWX_i=invert(tXWX);
+ tXWX_i=invert(tXWX);
if (verbose) {printf("tXWX-1:\n");tXWX_i.print();}
+ //fprintf(stdout,"*** tXWX_i\n");tXWX_i.print();
mematrix<double> tmp1 = productMatrDiag(tX,W);
mematrix<double> tXWz = tmp1*z;
if (verbose) {printf("tXWz:\n");tXWz.print();}
beta = tXWX_i*tXWz;
+ //fprintf(stdout,"*** res: %f %f %f\n",residuals[0],residuals[1],residuals[2]);
+ //mematrix<double> txres = tx * residuals;
+ //fprintf(stdout,"*** txres\n");txres.print();
+ //beta = txwx_i* txres;
if (verbose) {printf("beta:\n");beta.print();}
+ //printf("beta:\n");beta.print();
// compute likelihood
prevlik = loglik;
loglik=0.;
@@ -780,6 +824,9 @@
}
}
if (verbose) {printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();}
+ //printf("sebeta (%d):\n",beta.nrow);beta.print();
+ //printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();
+ //exit(1);
}
// just a stupid copy from linear_reg
void score(mematrix<double> &resid,regdata &rdata,int verbose, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int nullmodel=0)
Modified: pkg/ProbABEL/src/usage.h
===================================================================
--- pkg/ProbABEL/src/usage.h 2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/usage.h 2011-08-23 19:48:36 UTC (rev 762)
@@ -22,7 +22,7 @@
fprintf(stdout,"\t --allcov : report estimates for all covariates (large outputs!)\n");
fprintf(stdout,"\t --interaction: Which covariate to use for interaction with SNP analysys (default is no ineraction, 0)\n");
fprintf(stdout,"\t --interaction_only: like previos but without covariate acting in interaction with SNP (default is no ineraction, 0)\n");
- fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix as input parameter\n");
+ fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix (for palinear) or inverse correlation (for palogist) as input parameter\n");
fprintf(stdout,"\t --robust : report robust (aka sandwich, aka Hubert-White) standard errors\n");
fprintf(stdout,"\t --help : print help\n");
exit(exit_code);
More information about the Genabel-commits
mailing list