[GenABEL-dev] [Genabel-commits] r762 - in pkg/ProbABEL: doc src

Yury Aulchenko yurii.aulchenko at gmail.com
Tue Aug 23 22:14:15 CEST 2011


Dear Lennart,

You did terrific job converting ProbABEL in quite professional-looking product!

I have just committed few changes related to documentation and adding '--mmscore' option for binary traits; also changed version to 0.2-0 as you suggested. Seems all worked all right, though I had to merge few files. 

One questions: what is the right way to clean up local source before committing? It seems that a number of newly generated files (Makefile's,srd/.deps ...) hang around after running 'make clean' (naturally). I can just force them to be ignored by commit, but wonder if there may be a more proper way to do this?

best wishes,
Yurii
-------------------------------------------------------
Yurii Aulchenko, PhD, Dr. Habil.
Independent researcher and consultant
yurii [dot] aulchenko [at] gmail [dot] com

On Aug 23, 2011, at 9:48 PM, noreply at r-forge.wu-wien.ac.at wrote:

> Author: yurii
> Date: 2011-08-23 21:48:36 +0200 (Tue, 23 Aug 2011)
> New Revision: 762
> 
> Modified:
>   pkg/ProbABEL/doc/CHANGES.LOG
>   pkg/ProbABEL/doc/ProbABEL_manual.tex
>   pkg/ProbABEL/src/main.cpp
>   pkg/ProbABEL/src/mematri1.h
>   pkg/ProbABEL/src/reg1.h
>   pkg/ProbABEL/src/usage.h
> Log:
> update manual; changes.log; add '--mmscore' option for binary traits
> 
> Modified: pkg/ProbABEL/doc/CHANGES.LOG
> ===================================================================
> --- pkg/ProbABEL/doc/CHANGES.LOG	2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/doc/CHANGES.LOG	2011-08-23 19:48:36 UTC (rev 762)
> @@ -1,3 +1,54 @@
> +***** v.0.2-0 (2011.08.23)
> +
> +* ProbABEL can now (experimentally!) analyze binary traits accounting for relationship 
> +structure (thanks to Yurii and Nicola Pirastu). This adds '--mmscore' 
> +option for logistic regression. Important: on the contrast to the 
> +'palinear' the matrix which should be supplied to palogist with 
> +--mmscore should contain an inverse of the correlation matrix (not 
> +the inverse of var-cov matrix, as is the case for 'palinear' with 
> +--mmscore). This matrix can be obtained through (GenABEL notation):
> +
> +h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
> +
> +where h2.object is the object returned by 'polygenic()'. Documentation 
> +explains this procedure in more details; a wrapper function to prepare 
> +and export correct objects for ProbABEL is planned. (thanks to Yurii) 
> +
> +- in probabel.pl the location for the probabel config file is now set
> +  to /etc/, the default location it will end up. Before it still
> +  pointed to the location the probabel.pl file was in. However, a
> +  better fix would be to somehow put in the actual value of the
> +  --sysconfdir option to ./configure.
> +- The PDF of the LaTeX documentation is now only generated if the
> +  pdflatex binary can be found. So now we also build the documentation
> +  using autotools.
> +
> +
> +* ProbABEL uses autoconf and automake now (thanks to Lennart). Now the 
> +compilation process is like this: first, from the ProbABEL directory, 
> +generate documentation by
> +
> +cd doc
> +make
> +make clean_doc
> +cd ..
> +
> +Then, run
> +
> +./configure
> +make
> +make install 
> +
> +To see options, run
> +
> +./configure --help
> +
> +In the ProbABEL autotools integration branch:
> +- Removed the old Makefile
> +- Added configure.ac that servers as input for auto(re)conf
> +- Added the Makefile.am files needed by automake to generate the final Makefiles.
> +
> +
> ***** v.0.1-9e (2011.05.15)
> Previous version wouldn't compile because of missing frerror. file.
> Fixes bug #1339 (the bug fix was done a month ago, but no package had
> 
> Modified: pkg/ProbABEL/doc/ProbABEL_manual.tex
> ===================================================================
> --- pkg/ProbABEL/doc/ProbABEL_manual.tex	2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/doc/ProbABEL_manual.tex	2011-08-23 19:48:36 UTC (rev 762)
> @@ -1,6 +1,9 @@
> \title{ProbABEL manual}
> -\author{Yurii Aulchenko, Maksim Struchalin, Lennart Karssen \\
> -	Erasmus MC Rotterdam
> +\author{
> +Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\ 
> +\\
> +$^{1}$Erasmus MC Rotterdam \\ 
> +$^{2}$Institute of Cytology and Genetics SD RAS 
> }
> \date{\today}
> 
> @@ -28,6 +31,10 @@
> 
> \makeindex
> 
> +\newcommand{\PA}{\texttt{ProbABEL-package}}
> +\newcommand{\GA}{\texttt{GenABEL-package}}
> +\newcommand{\DA}{\texttt{DatABEL-package}}
> +
> \begin{document}
> \maketitle
> \tableofcontents
> @@ -76,16 +83,16 @@
> performing association analysis by means of regression of the
> outcome of interest onto estimated genotypic probabilities.
> 
> -The \texttt{ProbABEL} package was designed to perform such regression
> +The \PA{} package was designed to perform such regression
> in a fast, memory-efficient and consequently genome-wide feasible manner.
> -Currently, \texttt{ProbABEL} implements linear, logistic regression,
> +Currently, \PA{} implements linear, logistic regression,
> and Cox proportional hazards models. The corresponding analysis
> programs are called \texttt{palinear},  \texttt{palogist},
> and \texttt{pacoxph}.
> 
> 
> \section{Input files}
> -\texttt{ProbABEL} takes three files as input: a file containing SNP
> +\PA{} takes three files as input: a file containing SNP
> information (e.g.~the MLINFO file of MACH), a file with genome- or
> chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
> and a file containing the phenotype of interest and covariates.
> @@ -93,6 +100,13 @@
> Optionally, the map information can be supplied (e.g.~the "legend"
> files of HapMap).
> 
> +The dose/probability file may be supplied in filevector format
> +in which case \PA{} will operate much faster, and
> +in low-RAM mode (approx. $\approx$ 128 MB). See the R libraries \GA{} and
> +\DA{} on how to convert MACH and IMPUTE files to
> +filevector format (functions: \texttt{mach2databel()} and
> +\texttt{impute2databel()}, respectively).
> +
> \subsection{SNP information file}
> \label{ssec:infoin}
> In the simplest scenario, the SNP information file is an MLINFO
> @@ -103,7 +117,7 @@
> ``Rsq'', the proportion of variance decrease after imputations).
> 
> Actually,
> -for \texttt{ProbABEL}, it does not matter what is written in this file --
> +for \PA{}, it does not matter what is written in this file --
> this information is just brought forward to the output. However,
> \textbf{it is critical} that the number of columns is seven and the number
> of lines in the file is equal to the number of SNPs in the
> @@ -239,7 +253,7 @@
> the program).
> 
> There are in total 11 command line options you can specify to the
> -\texttt{ProbABEL} analysis functions \texttt{palinear} or
> +\PA{} analysis functions \texttt{palinear} or
> \texttt{palogist}. If you run either program without any argument, you
> will get a short explanation to command line options:
> \begin{verbatim}
> @@ -350,25 +364,25 @@
> first column contains id names, the second the trait. The others are
> covariates.
> 
> -An example of how a polygenic object estimated by GenABEL can be used
> +An example of how a polygenic object estimated by \GA{} can be used
> with ProbABEL is provided in \texttt{../example/mmscore.R}
> 
> Though technically \texttt{--mmscore} allows for inclusion of multiple
> covariates, these should be kept to minimum as this is a score test. We
> suggest that any covariates explaining an essential proportion of
> -variance should be fit as part of \texttt{GenABEL}'s
> +variance should be fit as part of \GA{}'s
> \texttt{polygenic} procedure.
> 
> %Option \texttt{--interaction\_only} is like \texttt{--interaction} but does not
> %include in the model the main effect of the covariate, which is acting in
> %interaction with SNP. This option is useful when running \texttt{--mmscore},
> %in whch case the main effect should normally estimated in the polygenic
> -%model and only the interaction term in the \texttt{ProbABEL} analysis.
> +%model and only the interaction term in the \PA{} analysis.
> 
> \subsection{Running multiple analyses at once: \texttt{probabel.pl}}
> 
> The Perl script \texttt{bin/probabel.pl\_example} represents a handy
> -wraper for \texttt{ProbABEL} functions.  To start using it the
> +wraper for \PA{} functions.  To start using it the
> configuration file \texttt{bin/probabel\_config.cfg\_example} needs to
> be edited. The configuration file consists of five columns. Each
> column except the first is a pattern for files produced by
> @@ -383,7 +397,7 @@
> for the ``mlinfo'' files. Probably you also have to change the variable
> \texttt{\$config} in the script to point to the full path of the
> configuration file and the variable \texttt{@anprog} to point full
> -path to the \texttt{ProbABEL} scripts.
> +path to the \PA{} scripts.
> 
> 
> \section{Output file format}
> @@ -412,7 +426,7 @@
> At this point, unless you have complete measurements on all
> subjects, you should feel alarmed if the number here is exactly the
> number of people in the file -- this probably indicates you did not code
> -missing values according to \texttt{ProbABEL} format ('NA', 'NaN', or 'N').
> +missing values according to \PA{} format ('NA', 'NaN', or 'N').
> 
> The next column, nine (``Mean\_predictor\_allele''), gives the estimated
> frequency of the predictor allele in subjects with complete phenotypic data.
> @@ -433,14 +447,24 @@
> 
> \section{Memory use and performance}
> Maximum likelihood regression is implemented in
> -\texttt{ProbABEL}. With 6,000 people and 2.5 millions SNPs, a
> +\PA{}. With 6,000 people and 2.5 millions SNPs, a
> genome-wide scan is completed in less that an hour for a linear model
> with 1-2 covariates and overnight for logistic regression or the Cox
> -proportional hazards model.
> +proportional hazards model (figures for a PC bought back in 2007).
> 
> -Memory is an issue with \texttt{ProbABEL} -- large chromosomes,
> +Memory may be an issue with \PA{} if you use 
> +MACH text dose/probability files, e.g. for large chromosomes,
> such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
> 
> +We suggest that dose/probability file is to be supplied in filevector format
> +in which case \PA{} will operate about 2-3 times faster, and
> +in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
> +\DA{} on how to convert MACH and IMPUTE files to
> +filevector format (functions: \texttt{mach2databel()} and
> +\texttt{impute2databel()}, respectively).
> +
> +When '--mmscore' option is used, the analysis may take quite some time. 
> +
> \section{Methodology}
> \label{sec:methodology}
> \subsection{Analysis of population-based data}
> @@ -598,7 +622,7 @@
> \index{Cox proportional hazards model}
> \index{proportional hazards model}
> \index{regression!Cox proportional hazards}
> -in \texttt{ProbABEL} is entirely based on the code of \texttt{R}
> +in \PA{} is entirely based on the code of \texttt{R}
> library \texttt{survival} developed by Thomas Lumley
> (function \texttt{coxfit2}), and is therefore not described here.
> 
> @@ -742,14 +766,14 @@
> 
> \section{How to cite}
> 
> -If you used \texttt{ProbABEL} for
> -your analysis please give a link to the \texttt{ABEL} home
> +If you used \PA{} for
> +your analysis please give a link to the \texttt{GenABEL project} home
> page
> 
> \begin{quote}
> \url{http://www.genabel.org/}
> \end{quote}
> -and cite the \texttt{ProbABEL} paper to give us some credit:
> +and cite the \PA{} paper to give us some credit:
> \begin{quote}
> Aulchenko YS, Struchalin MV, van Duijn CM.\\
> \emph{ProbABEL package for genome-wide association analysis of imputed data.}\\
> @@ -757,15 +781,15 @@
> \end{quote}
> A proper reference may look like
> \begin{quote}
> -For the analysis of imputed data, we used the \texttt{ProbABEL} package
> -from the GenABEL suite of programs (Aulchenko \emph{et al.}, 2010).
> +For the analysis of imputed data, we used the \PA{} 
> +from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
> \end{quote}
> 
> If you have used the Cox proportional hazard model, please mention the
> \texttt{R} package \texttt{survival} by Thomas Lumley. Additionally
> to the above citation, please tell that
> \begin{quote}
> -The Cox proportional hazards model implemented in \texttt{ProbABEL}
> +The Cox proportional hazards model implemented in \PA{}
> makes use of the source code of the \texttt{R} package ''\texttt{survival}''
> as implemented by T. Lumley.
> \end{quote}
> 
> Modified: pkg/ProbABEL/src/main.cpp
> ===================================================================
> --- pkg/ProbABEL/src/main.cpp	2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/main.cpp	2011-08-23 19:48:36 UTC (rev 762)
> @@ -303,9 +303,12 @@
> 
> 	mematrix<double> invvarmatrix;
> 
> +/*
> + * now should be possible... delete this part later when everything works
> #if LOGISTIC		
> 	if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for logistic regression\n";exit(1);}
> #endif
> +*/
> 
> #if COXPH
> 	if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for cox regression\n";exit(1);}
> 
> Modified: pkg/ProbABEL/src/mematri1.h
> ===================================================================
> --- pkg/ProbABEL/src/mematri1.h	2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/mematri1.h	2011-08-23 19:48:36 UTC (rev 762)
> @@ -368,7 +368,7 @@
>   {
> 	if (M.ncol != M.nrow) 
> 	{
> -		fprintf(stderr,"invert: only suare matrices possible\n");
> +		fprintf(stderr,"invert: only square matrices possible\n");
> 		exit(1);
> 	}
> 	if (M.ncol == 1) 
> @@ -376,12 +376,19 @@
> 		mematrix<DT> temp(1,1);
> 		temp[0] = 1./M[0];
> 	}
> +	/*
> 	for (int i=0;i<M.ncol;i++) 
> 		if (M.data[i*M.ncol+i]==0) 
> 		{
> 			fprintf(stderr,"invert: zero elements in diagonal\n");
> -			exit(1);
> +			mematrix<DT> temp = M;
> +			for (int i = 0; i < M.ncol; i++)
> +				for (int j = 0; j < M.ncol; j++)
> +					temp.put(NAN,i,j);
> +			return temp;
> +			//exit(1);
> 		}
> +		*/
> 	int actualsize = M.ncol;
> 	int maxsize = M.ncol;
> 	mematrix<DT> temp = M;
> 
> Modified: pkg/ProbABEL/src/reg1.h
> ===================================================================
> --- pkg/ProbABEL/src/reg1.h	2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/reg1.h	2011-08-23 19:48:36 UTC (rev 762)
> @@ -626,7 +626,7 @@
> 		//Oct 26, 2009
> 		residuals.reinit((rdata.X).nrow,1);
> 		sigma2=-1.;
> -		loglik=-9.999e+32;
> +		loglik=-9.999e+32; // should actually be MAX of the corresponding type
> 		niter=-1;
> 		chi2_score=-1.;
> 	}
> @@ -635,10 +635,36 @@
> 		//		delete beta;
> 		//		delete sebeta;
> 	}
> -	void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int robust, int nullmodel=0)
> +
> +	void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol,
> +			int model, int interaction, int ngpreds, mematrix<double> & invvarmatrixin, int robust,
> +			int nullmodel=0)
> 	{
> +		// on the contrast to the 'linear' 'invvarmatrix' contains
> +		// inverse of correlation matrix (not the inverse of var-cov matrix)
> +		// h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
> +		// the inverse of var-cov matrix scaled by total variance
> 		regdata rdata = rdatain.get_unmasked_data();
> 
> +		// a lot of code duplicated between linear and logistic...
> +		// e.g. a piece below...
> +		mematrix<double> invvarmatrix;
> +		if(invvarmatrixin.nrow!=0 && invvarmatrixin.ncol!=0)
> +		{
> +			invvarmatrix.reinit(rdata.nids,rdata.nids);
> +			int i1=0, j1=0;
> +			for (int i=0;i<rdata.nids;i++)
> +				if (rdatain.masked_data[i]==0) {
> +					j1 = 0;
> +					for (int j=0;j<rdata.nids;j++) if (rdatain.masked_data[j]==0) {
> +						invvarmatrix.put(invvarmatrixin.get(i,j),i1,j1);
> +						j1++;
> +					}
> +					i1++;
> +				}
> +
> +		}
> +
> 		mematrix<double> X = apply_model(rdata.X,model, interaction, ngpreds, false, nullmodel);
> 		int length_beta = X.ncol;
> 		beta.reinit(length_beta,1);
> @@ -668,10 +694,20 @@
> 
> 		mematrix<double> tX = transpose(X);
> 
> -		//		if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0) tX = X*invvarmatrix;
> +		if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0)
> +		{
> +			tX = tX*invvarmatrix;
> +		}
> +		/*
> +		fprintf(stdout,"\n");
> +		fprintf(stdout,"X %f %f %f\n",X.get(0,0),X.get(0,1),X.get(0,2));
> +		if (X.ncol==4) fprintf(stdout,"X[4] %f\n",X.get(0,3));
> +		fprintf(stdout,"Inv %f %f %f\n",invvarmatrix.get(0,0),invvarmatrix.get(0,1),invvarmatrix.get(0,2));
> +		if (X.ncol==4) fprintf(stdout,"X[4] %f\n",invvarmatrix.get(0,3));
> +		fprintf(stdout,"tXInv %f %f %f\n",tX.get(0,0),tX.get(1,0),tX.get(2,0));
> +		if (X.ncol==4) fprintf(stdout,"X[4] %f\n",tX.get(3,0));
> +		*/
> 
> -
> -
> 		double N;
> 
> 		niter=0;
> @@ -700,21 +736,29 @@
> 			N = tXWX.get(0,0);
> 
> 			if (verbose) {printf("tXWX:\n");tXWX.print();}
> +			//printf("tXWX:\n");tXWX.print();
> 			//
> 			// use cholesky to invert
> 			//
> 			tXWX_i = tXWX;
> -			cholesky2_mm(tXWX_i,tol_chol);
> -			if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
> -			chinv2_mm(tXWX_i);
> +			//cholesky2_mm(tXWX_i,tol_chol);
> +			//if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
> +			//printf("chole tXWX:\n");tXWX_i.print();
> +			//chinv2_mm(tXWX_i);
> 			// was before
> -			// tXWX_i=invert(tXWX);
> +			tXWX_i=invert(tXWX);
> 			if (verbose) {printf("tXWX-1:\n");tXWX_i.print();}
> +			//fprintf(stdout,"*** tXWX_i\n");tXWX_i.print();
> 			mematrix<double> tmp1 = productMatrDiag(tX,W);
> 			mematrix<double> tXWz = tmp1*z;
> 			if (verbose) {printf("tXWz:\n");tXWz.print();}
> 			beta = tXWX_i*tXWz;
> +			//fprintf(stdout,"*** res: %f %f %f\n",residuals[0],residuals[1],residuals[2]);
> +			//mematrix<double> txres = tx * residuals;
> +			//fprintf(stdout,"*** txres\n");txres.print();
> +			//beta = txwx_i* txres;
> 			if (verbose) {printf("beta:\n");beta.print();}
> +			//printf("beta:\n");beta.print();
> 			// compute likelihood
> 			prevlik = loglik;
> 			loglik=0.;
> @@ -780,6 +824,9 @@
> 			}
> 		}
> 		if (verbose) {printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();}
> +		//printf("sebeta (%d):\n",beta.nrow);beta.print();
> +		//printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();
> +		//exit(1);
> 	}
> 	// just a stupid copy from linear_reg
> 	void score(mematrix<double> &resid,regdata &rdata,int verbose, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int nullmodel=0)
> 
> Modified: pkg/ProbABEL/src/usage.h
> ===================================================================
> --- pkg/ProbABEL/src/usage.h	2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/usage.h	2011-08-23 19:48:36 UTC (rev 762)
> @@ -22,7 +22,7 @@
> 	fprintf(stdout,"\t --allcov  : report estimates for all covariates (large outputs!)\n");
> 	fprintf(stdout,"\t --interaction: Which covariate to use for interaction with SNP analysys (default is no ineraction, 0)\n");
> 	fprintf(stdout,"\t --interaction_only: like previos but without covariate acting in interaction with SNP (default is no ineraction, 0)\n");
> -	fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix as input parameter\n");
> +	fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix (for palinear) or inverse correlation (for palogist) as input parameter\n");
> 	fprintf(stdout,"\t --robust  : report robust (aka sandwich, aka Hubert-White) standard errors\n");
> 	fprintf(stdout,"\t --help    : print help\n");
> 	exit(exit_code);
> 
> _______________________________________________
> Genabel-commits mailing list
> Genabel-commits at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20110823/fbf4f671/attachment-0001.htm>


More information about the genabel-devel mailing list