[Genabel-commits] r762 - in pkg/ProbABEL: doc src

Tue Aug 23 21:48:36 CEST 2011

Author: yurii
Date: 2011-08-23 21:48:36 +0200 (Tue, 23 Aug 2011)
New Revision: 762

Modified:
   pkg/ProbABEL/doc/CHANGES.LOG
   pkg/ProbABEL/doc/ProbABEL_manual.tex
   pkg/ProbABEL/src/main.cpp
   pkg/ProbABEL/src/mematri1.h
   pkg/ProbABEL/src/reg1.h
   pkg/ProbABEL/src/usage.h
Log:
update manual; changes.log; add '--mmscore' option for binary traits

Modified: pkg/ProbABEL/doc/CHANGES.LOG
===================================================================

--- pkg/ProbABEL/doc/CHANGES.LOG	2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/doc/CHANGES.LOG	2011-08-23 19:48:36 UTC (rev 762)
@@ -1,3 +1,54 @@
+***** v.0.2-0 (2011.08.23)
+
+* ProbABEL can now (experimentally!) analyze binary traits accounting for relationship 
+structure (thanks to Yurii and Nicola Pirastu). This adds '--mmscore' 
+option for logistic regression. Important: on the contrast to the 
+'palinear' the matrix which should be supplied to palogist with 
+--mmscore should contain an inverse of the correlation matrix (not 
+the inverse of var-cov matrix, as is the case for 'palinear' with 
+--mmscore). This matrix can be obtained through (GenABEL notation):
+
+h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
+
+where h2.object is the object returned by 'polygenic()'. Documentation 
+explains this procedure in more details; a wrapper function to prepare 
+and export correct objects for ProbABEL is planned. (thanks to Yurii) 
+
+- in probabel.pl the location for the probabel config file is now set
+  to /etc/, the default location it will end up. Before it still
+  pointed to the location the probabel.pl file was in. However, a
+  better fix would be to somehow put in the actual value of the
+  --sysconfdir option to ./configure.
+- The PDF of the LaTeX documentation is now only generated if the
+  pdflatex binary can be found. So now we also build the documentation
+  using autotools.
+
+
+* ProbABEL uses autoconf and automake now (thanks to Lennart). Now the 
+compilation process is like this: first, from the ProbABEL directory, 
+generate documentation by
+
+cd doc
+make
+make clean_doc
+cd ..
+
+Then, run
+
+./configure
+make
+make install 
+
+To see options, run
+
+./configure --help
+
+In the ProbABEL autotools integration branch:
+- Removed the old Makefile
+- Added configure.ac that servers as input for auto(re)conf
+- Added the Makefile.am files needed by automake to generate the final Makefiles.
+
+
 ***** v.0.1-9e (2011.05.15)
 Previous version wouldn't compile because of missing frerror. file.
 Fixes bug #1339 (the bug fix was done a month ago, but no package had

Modified: pkg/ProbABEL/doc/ProbABEL_manual.tex
===================================================================
--- pkg/ProbABEL/doc/ProbABEL_manual.tex	2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/doc/ProbABEL_manual.tex	2011-08-23 19:48:36 UTC (rev 762)
@@ -1,6 +1,9 @@
 \title{ProbABEL manual}
-\author{Yurii Aulchenko, Maksim Struchalin, Lennart Karssen \\
-	Erasmus MC Rotterdam
+\author{
+Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\ 
+\\
+$^{1}$Erasmus MC Rotterdam \\ 
+$^{2}$Institute of Cytology and Genetics SD RAS 
 }
 \date{\today}
 
@@ -28,6 +31,10 @@
 
 \makeindex
 
+\newcommand{\PA}{\texttt{ProbABEL-package}}
+\newcommand{\GA}{\texttt{GenABEL-package}}
+\newcommand{\DA}{\texttt{DatABEL-package}}
+
 \begin{document}
 \maketitle
 \tableofcontents
@@ -76,16 +83,16 @@
 performing association analysis by means of regression of the
 outcome of interest onto estimated genotypic probabilities.
 
-The \texttt{ProbABEL} package was designed to perform such regression
+The \PA{} package was designed to perform such regression
 in a fast, memory-efficient and consequently genome-wide feasible manner.
-Currently, \texttt{ProbABEL} implements linear, logistic regression,
+Currently, \PA{} implements linear, logistic regression,
 and Cox proportional hazards models. The corresponding analysis
 programs are called \texttt{palinear},  \texttt{palogist},
 and \texttt{pacoxph}.
 
 
 \section{Input files}
-\texttt{ProbABEL} takes three files as input: a file containing SNP
+\PA{} takes three files as input: a file containing SNP
 information (e.g.~the MLINFO file of MACH), a file with genome- or
 chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
 and a file containing the phenotype of interest and covariates.
@@ -93,6 +100,13 @@
 Optionally, the map information can be supplied (e.g.~the "legend"
 files of HapMap).
 
+The dose/probability file may be supplied in filevector format
+in which case \PA{} will operate much faster, and
+in low-RAM mode (approx. $\approx$ 128 MB). See the R libraries \GA{} and
+\DA{} on how to convert MACH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
+
 \subsection{SNP information file}
 \label{ssec:infoin}
 In the simplest scenario, the SNP information file is an MLINFO
@@ -103,7 +117,7 @@
 ``Rsq'', the proportion of variance decrease after imputations).
 
 Actually,
-for \texttt{ProbABEL}, it does not matter what is written in this file --
+for \PA{}, it does not matter what is written in this file --
 this information is just brought forward to the output. However,
 \textbf{it is critical} that the number of columns is seven and the number
 of lines in the file is equal to the number of SNPs in the
@@ -239,7 +253,7 @@
 the program).
 
 There are in total 11 command line options you can specify to the
-\texttt{ProbABEL} analysis functions \texttt{palinear} or
+\PA{} analysis functions \texttt{palinear} or
 \texttt{palogist}. If you run either program without any argument, you
 will get a short explanation to command line options:
 \begin{verbatim}
@@ -350,25 +364,25 @@
 first column contains id names, the second the trait. The others are
 covariates.
 
-An example of how a polygenic object estimated by GenABEL can be used
+An example of how a polygenic object estimated by \GA{} can be used
 with ProbABEL is provided in \texttt{../example/mmscore.R}
 
 Though technically \texttt{--mmscore} allows for inclusion of multiple
 covariates, these should be kept to minimum as this is a score test. We
 suggest that any covariates explaining an essential proportion of
-variance should be fit as part of \texttt{GenABEL}'s
+variance should be fit as part of \GA{}'s
 \texttt{polygenic} procedure.
 
 %Option \texttt{--interaction\_only} is like \texttt{--interaction} but does not
 %include in the model the main effect of the covariate, which is acting in
 %interaction with SNP. This option is useful when running \texttt{--mmscore},
 %in whch case the main effect should normally estimated in the polygenic
-%model and only the interaction term in the \texttt{ProbABEL} analysis.
+%model and only the interaction term in the \PA{} analysis.
 
 \subsection{Running multiple analyses at once: \texttt{probabel.pl}}
 
 The Perl script \texttt{bin/probabel.pl\_example} represents a handy
-wraper for \texttt{ProbABEL} functions.  To start using it the
+wraper for \PA{} functions.  To start using it the
 configuration file \texttt{bin/probabel\_config.cfg\_example} needs to
 be edited. The configuration file consists of five columns. Each
 column except the first is a pattern for files produced by
@@ -383,7 +397,7 @@
 for the ``mlinfo'' files. Probably you also have to change the variable
 \texttt{\$config} in the script to point to the full path of the
 configuration file and the variable \texttt{@anprog} to point full
-path to the \texttt{ProbABEL} scripts.
+path to the \PA{} scripts.
 
 
 \section{Output file format}
@@ -412,7 +426,7 @@
 At this point, unless you have complete measurements on all
 subjects, you should feel alarmed if the number here is exactly the
 number of people in the file -- this probably indicates you did not code
-missing values according to \texttt{ProbABEL} format ('NA', 'NaN', or 'N').
+missing values according to \PA{} format ('NA', 'NaN', or 'N').
 
 The next column, nine (``Mean\_predictor\_allele''), gives the estimated
 frequency of the predictor allele in subjects with complete phenotypic data.
@@ -433,14 +447,24 @@
 
 \section{Memory use and performance}
 Maximum likelihood regression is implemented in
-\texttt{ProbABEL}. With 6,000 people and 2.5 millions SNPs, a
+\PA{}. With 6,000 people and 2.5 millions SNPs, a
 genome-wide scan is completed in less that an hour for a linear model
 with 1-2 covariates and overnight for logistic regression or the Cox
-proportional hazards model.
+proportional hazards model (figures for a PC bought back in 2007).
 
-Memory is an issue with \texttt{ProbABEL} -- large chromosomes,
+Memory may be an issue with \PA{} if you use 
+MACH text dose/probability files, e.g. for large chromosomes,
 such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
 
+We suggest that dose/probability file is to be supplied in filevector format
+in which case \PA{} will operate about 2-3 times faster, and
+in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
+\DA{} on how to convert MACH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
+
+When '--mmscore' option is used, the analysis may take quite some time. 
+
 \section{Methodology}
 \label{sec:methodology}
 \subsection{Analysis of population-based data}
@@ -598,7 +622,7 @@
 \index{Cox proportional hazards model}
 \index{proportional hazards model}
 \index{regression!Cox proportional hazards}
-in \texttt{ProbABEL} is entirely based on the code of \texttt{R}
+in \PA{} is entirely based on the code of \texttt{R}
 library \texttt{survival} developed by Thomas Lumley
 (function \texttt{coxfit2}), and is therefore not described here.
 
@@ -742,14 +766,14 @@
 
 \section{How to cite}
 
-If you used \texttt{ProbABEL} for
-your analysis please give a link to the \texttt{ABEL} home
+If you used \PA{} for
+your analysis please give a link to the \texttt{GenABEL project} home
 page
 
 \begin{quote}
 \url{http://www.genabel.org/}
 \end{quote}
-and cite the \texttt{ProbABEL} paper to give us some credit:
+and cite the \PA{} paper to give us some credit:
 \begin{quote}
 Aulchenko YS, Struchalin MV, van Duijn CM.\\
 \emph{ProbABEL package for genome-wide association analysis of imputed data.}\\
@@ -757,15 +781,15 @@
 \end{quote}
 A proper reference may look like
 \begin{quote}
-For the analysis of imputed data, we used the \texttt{ProbABEL} package
-from the GenABEL suite of programs (Aulchenko \emph{et al.}, 2010).
+For the analysis of imputed data, we used the \PA{} 
+from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
 \end{quote}
 
 If you have used the Cox proportional hazard model, please mention the
 \texttt{R} package \texttt{survival} by Thomas Lumley. Additionally
 to the above citation, please tell that
 \begin{quote}
-The Cox proportional hazards model implemented in \texttt{ProbABEL}
+The Cox proportional hazards model implemented in \PA{}
 makes use of the source code of the \texttt{R} package ''\texttt{survival}''
 as implemented by T. Lumley.
 \end{quote}

Modified: pkg/ProbABEL/src/main.cpp
===================================================================
--- pkg/ProbABEL/src/main.cpp	2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/main.cpp	2011-08-23 19:48:36 UTC (rev 762)
@@ -303,9 +303,12 @@
 
 	mematrix<double> invvarmatrix;
 
+/*
+ * now should be possible... delete this part later when everything works
 #if LOGISTIC		
 	if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for logistic regression\n";exit(1);}
 #endif
+*/
 
 #if COXPH
 	if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for cox regression\n";exit(1);}

Modified: pkg/ProbABEL/src/mematri1.h
===================================================================
--- pkg/ProbABEL/src/mematri1.h	2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/mematri1.h	2011-08-23 19:48:36 UTC (rev 762)
@@ -368,7 +368,7 @@
   {
 	if (M.ncol != M.nrow) 
 	{
-		fprintf(stderr,"invert: only suare matrices possible\n");
+		fprintf(stderr,"invert: only square matrices possible\n");
 		exit(1);
 	}
 	if (M.ncol == 1) 
@@ -376,12 +376,19 @@
 		mematrix<DT> temp(1,1);
 		temp[0] = 1./M[0];
 	}
+	/*
 	for (int i=0;i<M.ncol;i++) 
 		if (M.data[i*M.ncol+i]==0) 
 		{
 			fprintf(stderr,"invert: zero elements in diagonal\n");
-			exit(1);
+			mematrix<DT> temp = M;
+			for (int i = 0; i < M.ncol; i++)
+				for (int j = 0; j < M.ncol; j++)
+					temp.put(NAN,i,j);
+			return temp;
+			//exit(1);
 		}
+		*/
 	int actualsize = M.ncol;
 	int maxsize = M.ncol;
 	mematrix<DT> temp = M;

Modified: pkg/ProbABEL/src/reg1.h
===================================================================
--- pkg/ProbABEL/src/reg1.h	2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/reg1.h	2011-08-23 19:48:36 UTC (rev 762)
@@ -626,7 +626,7 @@
 		//Oct 26, 2009
 		residuals.reinit((rdata.X).nrow,1);
 		sigma2=-1.;
-		loglik=-9.999e+32;
+		loglik=-9.999e+32; // should actually be MAX of the corresponding type
 		niter=-1;
 		chi2_score=-1.;
 	}
@@ -635,10 +635,36 @@
 		//		delete beta;
 		//		delete sebeta;
 	}
-	void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int robust, int nullmodel=0)
+
+	void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol,
+			int model, int interaction, int ngpreds, mematrix<double> & invvarmatrixin, int robust,
+			int nullmodel=0)
 	{
+		// on the contrast to the 'linear' 'invvarmatrix' contains
+		// inverse of correlation matrix (not the inverse of var-cov matrix)
+		// h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
+		// the inverse of var-cov matrix scaled by total variance
 		regdata rdata = rdatain.get_unmasked_data();
 
+		// a lot of code duplicated between linear and logistic...
+		// e.g. a piece below...
+		mematrix<double> invvarmatrix;
+		if(invvarmatrixin.nrow!=0 && invvarmatrixin.ncol!=0)
+		{
+			invvarmatrix.reinit(rdata.nids,rdata.nids);
+			int i1=0, j1=0;
+			for (int i=0;i<rdata.nids;i++)
+				if (rdatain.masked_data[i]==0) {
+					j1 = 0;
+					for (int j=0;j<rdata.nids;j++) if (rdatain.masked_data[j]==0) {
+						invvarmatrix.put(invvarmatrixin.get(i,j),i1,j1);
+						j1++;
+					}
+					i1++;
+				}
+
+		}
+
 		mematrix<double> X = apply_model(rdata.X,model, interaction, ngpreds, false, nullmodel);
 		int length_beta = X.ncol;
 		beta.reinit(length_beta,1);
@@ -668,10 +694,20 @@
 
 		mematrix<double> tX = transpose(X);
 
-		//		if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0) tX = X*invvarmatrix;
+		if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0)
+		{
+			tX = tX*invvarmatrix;
+		}
+		/*
+		fprintf(stdout,"\n");
+		fprintf(stdout,"X %f %f %f\n",X.get(0,0),X.get(0,1),X.get(0,2));
+		if (X.ncol==4) fprintf(stdout,"X[4] %f\n",X.get(0,3));
+		fprintf(stdout,"Inv %f %f %f\n",invvarmatrix.get(0,0),invvarmatrix.get(0,1),invvarmatrix.get(0,2));
+		if (X.ncol==4) fprintf(stdout,"X[4] %f\n",invvarmatrix.get(0,3));
+		fprintf(stdout,"tXInv %f %f %f\n",tX.get(0,0),tX.get(1,0),tX.get(2,0));
+		if (X.ncol==4) fprintf(stdout,"X[4] %f\n",tX.get(3,0));
+		*/
 
-
-
 		double N;
 
 		niter=0;
@@ -700,21 +736,29 @@
 			N = tXWX.get(0,0);
 
 			if (verbose) {printf("tXWX:\n");tXWX.print();}
+			//printf("tXWX:\n");tXWX.print();
 			//
 			// use cholesky to invert
 			//
 			tXWX_i = tXWX;
-			cholesky2_mm(tXWX_i,tol_chol);
-			if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
-			chinv2_mm(tXWX_i);
+			//cholesky2_mm(tXWX_i,tol_chol);
+			//if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
+			//printf("chole tXWX:\n");tXWX_i.print();
+			//chinv2_mm(tXWX_i);
 			// was before
-			// tXWX_i=invert(tXWX);
+			tXWX_i=invert(tXWX);
 			if (verbose) {printf("tXWX-1:\n");tXWX_i.print();}
+			//fprintf(stdout,"*** tXWX_i\n");tXWX_i.print();
 			mematrix<double> tmp1 = productMatrDiag(tX,W);
 			mematrix<double> tXWz = tmp1*z;
 			if (verbose) {printf("tXWz:\n");tXWz.print();}
 			beta = tXWX_i*tXWz;
+			//fprintf(stdout,"*** res: %f %f %f\n",residuals[0],residuals[1],residuals[2]);
+			//mematrix<double> txres = tx * residuals;
+			//fprintf(stdout,"*** txres\n");txres.print();
+			//beta = txwx_i* txres;
 			if (verbose) {printf("beta:\n");beta.print();}
+			//printf("beta:\n");beta.print();
 			// compute likelihood
 			prevlik = loglik;
 			loglik=0.;
@@ -780,6 +824,9 @@
 			}
 		}
 		if (verbose) {printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();}
+		//printf("sebeta (%d):\n",beta.nrow);beta.print();
+		//printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();
+		//exit(1);
 	}
 	// just a stupid copy from linear_reg
 	void score(mematrix<double> &resid,regdata &rdata,int verbose, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int nullmodel=0)

Modified: pkg/ProbABEL/src/usage.h
===================================================================
--- pkg/ProbABEL/src/usage.h	2011-08-22 08:32:43 UTC (rev 761)
+++ pkg/ProbABEL/src/usage.h	2011-08-23 19:48:36 UTC (rev 762)
@@ -22,7 +22,7 @@
 	fprintf(stdout,"\t --allcov  : report estimates for all covariates (large outputs!)\n");
 	fprintf(stdout,"\t --interaction: Which covariate to use for interaction with SNP analysys (default is no ineraction, 0)\n");
 	fprintf(stdout,"\t --interaction_only: like previos but without covariate acting in interaction with SNP (default is no ineraction, 0)\n");
-	fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix as input parameter\n");
+	fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix (for palinear) or inverse correlation (for palogist) as input parameter\n");
 	fprintf(stdout,"\t --robust  : report robust (aka sandwich, aka Hubert-White) standard errors\n");
 	fprintf(stdout,"\t --help    : print help\n");
 	exit(exit_code);