[GenABEL-dev] [Genabel-commits] r762 - in pkg/ProbABEL: doc src
Yury Aulchenko
yurii.aulchenko at gmail.com
Tue Aug 23 22:14:15 CEST 2011
Dear Lennart,
You did terrific job converting ProbABEL in quite professional-looking product!
I have just committed few changes related to documentation and adding '--mmscore' option for binary traits; also changed version to 0.2-0 as you suggested. Seems all worked all right, though I had to merge few files.
One questions: what is the right way to clean up local source before committing? It seems that a number of newly generated files (Makefile's,srd/.deps ...) hang around after running 'make clean' (naturally). I can just force them to be ignored by commit, but wonder if there may be a more proper way to do this?
best wishes,
Yurii
-------------------------------------------------------
Yurii Aulchenko, PhD, Dr. Habil.
Independent researcher and consultant
yurii [dot] aulchenko [at] gmail [dot] com
On Aug 23, 2011, at 9:48 PM, noreply at r-forge.wu-wien.ac.at wrote:
> Author: yurii
> Date: 2011-08-23 21:48:36 +0200 (Tue, 23 Aug 2011)
> New Revision: 762
>
> Modified:
> pkg/ProbABEL/doc/CHANGES.LOG
> pkg/ProbABEL/doc/ProbABEL_manual.tex
> pkg/ProbABEL/src/main.cpp
> pkg/ProbABEL/src/mematri1.h
> pkg/ProbABEL/src/reg1.h
> pkg/ProbABEL/src/usage.h
> Log:
> update manual; changes.log; add '--mmscore' option for binary traits
>
> Modified: pkg/ProbABEL/doc/CHANGES.LOG
> ===================================================================
> --- pkg/ProbABEL/doc/CHANGES.LOG 2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/doc/CHANGES.LOG 2011-08-23 19:48:36 UTC (rev 762)
> @@ -1,3 +1,54 @@
> +***** v.0.2-0 (2011.08.23)
> +
> +* ProbABEL can now (experimentally!) analyze binary traits accounting for relationship
> +structure (thanks to Yurii and Nicola Pirastu). This adds '--mmscore'
> +option for logistic regression. Important: on the contrast to the
> +'palinear' the matrix which should be supplied to palogist with
> +--mmscore should contain an inverse of the correlation matrix (not
> +the inverse of var-cov matrix, as is the case for 'palinear' with
> +--mmscore). This matrix can be obtained through (GenABEL notation):
> +
> +h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
> +
> +where h2.object is the object returned by 'polygenic()'. Documentation
> +explains this procedure in more details; a wrapper function to prepare
> +and export correct objects for ProbABEL is planned. (thanks to Yurii)
> +
> +- in probabel.pl the location for the probabel config file is now set
> + to /etc/, the default location it will end up. Before it still
> + pointed to the location the probabel.pl file was in. However, a
> + better fix would be to somehow put in the actual value of the
> + --sysconfdir option to ./configure.
> +- The PDF of the LaTeX documentation is now only generated if the
> + pdflatex binary can be found. So now we also build the documentation
> + using autotools.
> +
> +
> +* ProbABEL uses autoconf and automake now (thanks to Lennart). Now the
> +compilation process is like this: first, from the ProbABEL directory,
> +generate documentation by
> +
> +cd doc
> +make
> +make clean_doc
> +cd ..
> +
> +Then, run
> +
> +./configure
> +make
> +make install
> +
> +To see options, run
> +
> +./configure --help
> +
> +In the ProbABEL autotools integration branch:
> +- Removed the old Makefile
> +- Added configure.ac that servers as input for auto(re)conf
> +- Added the Makefile.am files needed by automake to generate the final Makefiles.
> +
> +
> ***** v.0.1-9e (2011.05.15)
> Previous version wouldn't compile because of missing frerror. file.
> Fixes bug #1339 (the bug fix was done a month ago, but no package had
>
> Modified: pkg/ProbABEL/doc/ProbABEL_manual.tex
> ===================================================================
> --- pkg/ProbABEL/doc/ProbABEL_manual.tex 2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/doc/ProbABEL_manual.tex 2011-08-23 19:48:36 UTC (rev 762)
> @@ -1,6 +1,9 @@
> \title{ProbABEL manual}
> -\author{Yurii Aulchenko, Maksim Struchalin, Lennart Karssen \\
> - Erasmus MC Rotterdam
> +\author{
> +Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\
> +\\
> +$^{1}$Erasmus MC Rotterdam \\
> +$^{2}$Institute of Cytology and Genetics SD RAS
> }
> \date{\today}
>
> @@ -28,6 +31,10 @@
>
> \makeindex
>
> +\newcommand{\PA}{\texttt{ProbABEL-package}}
> +\newcommand{\GA}{\texttt{GenABEL-package}}
> +\newcommand{\DA}{\texttt{DatABEL-package}}
> +
> \begin{document}
> \maketitle
> \tableofcontents
> @@ -76,16 +83,16 @@
> performing association analysis by means of regression of the
> outcome of interest onto estimated genotypic probabilities.
>
> -The \texttt{ProbABEL} package was designed to perform such regression
> +The \PA{} package was designed to perform such regression
> in a fast, memory-efficient and consequently genome-wide feasible manner.
> -Currently, \texttt{ProbABEL} implements linear, logistic regression,
> +Currently, \PA{} implements linear, logistic regression,
> and Cox proportional hazards models. The corresponding analysis
> programs are called \texttt{palinear}, \texttt{palogist},
> and \texttt{pacoxph}.
>
>
> \section{Input files}
> -\texttt{ProbABEL} takes three files as input: a file containing SNP
> +\PA{} takes three files as input: a file containing SNP
> information (e.g.~the MLINFO file of MACH), a file with genome- or
> chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
> and a file containing the phenotype of interest and covariates.
> @@ -93,6 +100,13 @@
> Optionally, the map information can be supplied (e.g.~the "legend"
> files of HapMap).
>
> +The dose/probability file may be supplied in filevector format
> +in which case \PA{} will operate much faster, and
> +in low-RAM mode (approx. $\approx$ 128 MB). See the R libraries \GA{} and
> +\DA{} on how to convert MACH and IMPUTE files to
> +filevector format (functions: \texttt{mach2databel()} and
> +\texttt{impute2databel()}, respectively).
> +
> \subsection{SNP information file}
> \label{ssec:infoin}
> In the simplest scenario, the SNP information file is an MLINFO
> @@ -103,7 +117,7 @@
> ``Rsq'', the proportion of variance decrease after imputations).
>
> Actually,
> -for \texttt{ProbABEL}, it does not matter what is written in this file --
> +for \PA{}, it does not matter what is written in this file --
> this information is just brought forward to the output. However,
> \textbf{it is critical} that the number of columns is seven and the number
> of lines in the file is equal to the number of SNPs in the
> @@ -239,7 +253,7 @@
> the program).
>
> There are in total 11 command line options you can specify to the
> -\texttt{ProbABEL} analysis functions \texttt{palinear} or
> +\PA{} analysis functions \texttt{palinear} or
> \texttt{palogist}. If you run either program without any argument, you
> will get a short explanation to command line options:
> \begin{verbatim}
> @@ -350,25 +364,25 @@
> first column contains id names, the second the trait. The others are
> covariates.
>
> -An example of how a polygenic object estimated by GenABEL can be used
> +An example of how a polygenic object estimated by \GA{} can be used
> with ProbABEL is provided in \texttt{../example/mmscore.R}
>
> Though technically \texttt{--mmscore} allows for inclusion of multiple
> covariates, these should be kept to minimum as this is a score test. We
> suggest that any covariates explaining an essential proportion of
> -variance should be fit as part of \texttt{GenABEL}'s
> +variance should be fit as part of \GA{}'s
> \texttt{polygenic} procedure.
>
> %Option \texttt{--interaction\_only} is like \texttt{--interaction} but does not
> %include in the model the main effect of the covariate, which is acting in
> %interaction with SNP. This option is useful when running \texttt{--mmscore},
> %in whch case the main effect should normally estimated in the polygenic
> -%model and only the interaction term in the \texttt{ProbABEL} analysis.
> +%model and only the interaction term in the \PA{} analysis.
>
> \subsection{Running multiple analyses at once: \texttt{probabel.pl}}
>
> The Perl script \texttt{bin/probabel.pl\_example} represents a handy
> -wraper for \texttt{ProbABEL} functions. To start using it the
> +wraper for \PA{} functions. To start using it the
> configuration file \texttt{bin/probabel\_config.cfg\_example} needs to
> be edited. The configuration file consists of five columns. Each
> column except the first is a pattern for files produced by
> @@ -383,7 +397,7 @@
> for the ``mlinfo'' files. Probably you also have to change the variable
> \texttt{\$config} in the script to point to the full path of the
> configuration file and the variable \texttt{@anprog} to point full
> -path to the \texttt{ProbABEL} scripts.
> +path to the \PA{} scripts.
>
>
> \section{Output file format}
> @@ -412,7 +426,7 @@
> At this point, unless you have complete measurements on all
> subjects, you should feel alarmed if the number here is exactly the
> number of people in the file -- this probably indicates you did not code
> -missing values according to \texttt{ProbABEL} format ('NA', 'NaN', or 'N').
> +missing values according to \PA{} format ('NA', 'NaN', or 'N').
>
> The next column, nine (``Mean\_predictor\_allele''), gives the estimated
> frequency of the predictor allele in subjects with complete phenotypic data.
> @@ -433,14 +447,24 @@
>
> \section{Memory use and performance}
> Maximum likelihood regression is implemented in
> -\texttt{ProbABEL}. With 6,000 people and 2.5 millions SNPs, a
> +\PA{}. With 6,000 people and 2.5 millions SNPs, a
> genome-wide scan is completed in less that an hour for a linear model
> with 1-2 covariates and overnight for logistic regression or the Cox
> -proportional hazards model.
> +proportional hazards model (figures for a PC bought back in 2007).
>
> -Memory is an issue with \texttt{ProbABEL} -- large chromosomes,
> +Memory may be an issue with \PA{} if you use
> +MACH text dose/probability files, e.g. for large chromosomes,
> such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
>
> +We suggest that dose/probability file is to be supplied in filevector format
> +in which case \PA{} will operate about 2-3 times faster, and
> +in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
> +\DA{} on how to convert MACH and IMPUTE files to
> +filevector format (functions: \texttt{mach2databel()} and
> +\texttt{impute2databel()}, respectively).
> +
> +When '--mmscore' option is used, the analysis may take quite some time.
> +
> \section{Methodology}
> \label{sec:methodology}
> \subsection{Analysis of population-based data}
> @@ -598,7 +622,7 @@
> \index{Cox proportional hazards model}
> \index{proportional hazards model}
> \index{regression!Cox proportional hazards}
> -in \texttt{ProbABEL} is entirely based on the code of \texttt{R}
> +in \PA{} is entirely based on the code of \texttt{R}
> library \texttt{survival} developed by Thomas Lumley
> (function \texttt{coxfit2}), and is therefore not described here.
>
> @@ -742,14 +766,14 @@
>
> \section{How to cite}
>
> -If you used \texttt{ProbABEL} for
> -your analysis please give a link to the \texttt{ABEL} home
> +If you used \PA{} for
> +your analysis please give a link to the \texttt{GenABEL project} home
> page
>
> \begin{quote}
> \url{http://www.genabel.org/}
> \end{quote}
> -and cite the \texttt{ProbABEL} paper to give us some credit:
> +and cite the \PA{} paper to give us some credit:
> \begin{quote}
> Aulchenko YS, Struchalin MV, van Duijn CM.\\
> \emph{ProbABEL package for genome-wide association analysis of imputed data.}\\
> @@ -757,15 +781,15 @@
> \end{quote}
> A proper reference may look like
> \begin{quote}
> -For the analysis of imputed data, we used the \texttt{ProbABEL} package
> -from the GenABEL suite of programs (Aulchenko \emph{et al.}, 2010).
> +For the analysis of imputed data, we used the \PA{}
> +from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
> \end{quote}
>
> If you have used the Cox proportional hazard model, please mention the
> \texttt{R} package \texttt{survival} by Thomas Lumley. Additionally
> to the above citation, please tell that
> \begin{quote}
> -The Cox proportional hazards model implemented in \texttt{ProbABEL}
> +The Cox proportional hazards model implemented in \PA{}
> makes use of the source code of the \texttt{R} package ''\texttt{survival}''
> as implemented by T. Lumley.
> \end{quote}
>
> Modified: pkg/ProbABEL/src/main.cpp
> ===================================================================
> --- pkg/ProbABEL/src/main.cpp 2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/main.cpp 2011-08-23 19:48:36 UTC (rev 762)
> @@ -303,9 +303,12 @@
>
> mematrix<double> invvarmatrix;
>
> +/*
> + * now should be possible... delete this part later when everything works
> #if LOGISTIC
> if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for logistic regression\n";exit(1);}
> #endif
> +*/
>
> #if COXPH
> if(inverse_filename != NULL) {std::cerr<<"ERROR: mmscore is forbidden for cox regression\n";exit(1);}
>
> Modified: pkg/ProbABEL/src/mematri1.h
> ===================================================================
> --- pkg/ProbABEL/src/mematri1.h 2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/mematri1.h 2011-08-23 19:48:36 UTC (rev 762)
> @@ -368,7 +368,7 @@
> {
> if (M.ncol != M.nrow)
> {
> - fprintf(stderr,"invert: only suare matrices possible\n");
> + fprintf(stderr,"invert: only square matrices possible\n");
> exit(1);
> }
> if (M.ncol == 1)
> @@ -376,12 +376,19 @@
> mematrix<DT> temp(1,1);
> temp[0] = 1./M[0];
> }
> + /*
> for (int i=0;i<M.ncol;i++)
> if (M.data[i*M.ncol+i]==0)
> {
> fprintf(stderr,"invert: zero elements in diagonal\n");
> - exit(1);
> + mematrix<DT> temp = M;
> + for (int i = 0; i < M.ncol; i++)
> + for (int j = 0; j < M.ncol; j++)
> + temp.put(NAN,i,j);
> + return temp;
> + //exit(1);
> }
> + */
> int actualsize = M.ncol;
> int maxsize = M.ncol;
> mematrix<DT> temp = M;
>
> Modified: pkg/ProbABEL/src/reg1.h
> ===================================================================
> --- pkg/ProbABEL/src/reg1.h 2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/reg1.h 2011-08-23 19:48:36 UTC (rev 762)
> @@ -626,7 +626,7 @@
> //Oct 26, 2009
> residuals.reinit((rdata.X).nrow,1);
> sigma2=-1.;
> - loglik=-9.999e+32;
> + loglik=-9.999e+32; // should actually be MAX of the corresponding type
> niter=-1;
> chi2_score=-1.;
> }
> @@ -635,10 +635,36 @@
> // delete beta;
> // delete sebeta;
> }
> - void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int robust, int nullmodel=0)
> +
> + void estimate(regdata &rdatain, int verbose, int maxiter, double eps, double tol_chol,
> + int model, int interaction, int ngpreds, mematrix<double> & invvarmatrixin, int robust,
> + int nullmodel=0)
> {
> + // on the contrast to the 'linear' 'invvarmatrix' contains
> + // inverse of correlation matrix (not the inverse of var-cov matrix)
> + // h2.object$InvSigma * h.object2$h2an$estimate[length(h2$h2an$estimate)]
> + // the inverse of var-cov matrix scaled by total variance
> regdata rdata = rdatain.get_unmasked_data();
>
> + // a lot of code duplicated between linear and logistic...
> + // e.g. a piece below...
> + mematrix<double> invvarmatrix;
> + if(invvarmatrixin.nrow!=0 && invvarmatrixin.ncol!=0)
> + {
> + invvarmatrix.reinit(rdata.nids,rdata.nids);
> + int i1=0, j1=0;
> + for (int i=0;i<rdata.nids;i++)
> + if (rdatain.masked_data[i]==0) {
> + j1 = 0;
> + for (int j=0;j<rdata.nids;j++) if (rdatain.masked_data[j]==0) {
> + invvarmatrix.put(invvarmatrixin.get(i,j),i1,j1);
> + j1++;
> + }
> + i1++;
> + }
> +
> + }
> +
> mematrix<double> X = apply_model(rdata.X,model, interaction, ngpreds, false, nullmodel);
> int length_beta = X.ncol;
> beta.reinit(length_beta,1);
> @@ -668,10 +694,20 @@
>
> mematrix<double> tX = transpose(X);
>
> - // if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0) tX = X*invvarmatrix;
> + if(invvarmatrix.nrow!=0 && invvarmatrix.ncol!=0)
> + {
> + tX = tX*invvarmatrix;
> + }
> + /*
> + fprintf(stdout,"\n");
> + fprintf(stdout,"X %f %f %f\n",X.get(0,0),X.get(0,1),X.get(0,2));
> + if (X.ncol==4) fprintf(stdout,"X[4] %f\n",X.get(0,3));
> + fprintf(stdout,"Inv %f %f %f\n",invvarmatrix.get(0,0),invvarmatrix.get(0,1),invvarmatrix.get(0,2));
> + if (X.ncol==4) fprintf(stdout,"X[4] %f\n",invvarmatrix.get(0,3));
> + fprintf(stdout,"tXInv %f %f %f\n",tX.get(0,0),tX.get(1,0),tX.get(2,0));
> + if (X.ncol==4) fprintf(stdout,"X[4] %f\n",tX.get(3,0));
> + */
>
> -
> -
> double N;
>
> niter=0;
> @@ -700,21 +736,29 @@
> N = tXWX.get(0,0);
>
> if (verbose) {printf("tXWX:\n");tXWX.print();}
> + //printf("tXWX:\n");tXWX.print();
> //
> // use cholesky to invert
> //
> tXWX_i = tXWX;
> - cholesky2_mm(tXWX_i,tol_chol);
> - if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
> - chinv2_mm(tXWX_i);
> + //cholesky2_mm(tXWX_i,tol_chol);
> + //if (verbose) {printf("chole tXWX:\n");tXWX_i.print();}
> + //printf("chole tXWX:\n");tXWX_i.print();
> + //chinv2_mm(tXWX_i);
> // was before
> - // tXWX_i=invert(tXWX);
> + tXWX_i=invert(tXWX);
> if (verbose) {printf("tXWX-1:\n");tXWX_i.print();}
> + //fprintf(stdout,"*** tXWX_i\n");tXWX_i.print();
> mematrix<double> tmp1 = productMatrDiag(tX,W);
> mematrix<double> tXWz = tmp1*z;
> if (verbose) {printf("tXWz:\n");tXWz.print();}
> beta = tXWX_i*tXWz;
> + //fprintf(stdout,"*** res: %f %f %f\n",residuals[0],residuals[1],residuals[2]);
> + //mematrix<double> txres = tx * residuals;
> + //fprintf(stdout,"*** txres\n");txres.print();
> + //beta = txwx_i* txres;
> if (verbose) {printf("beta:\n");beta.print();}
> + //printf("beta:\n");beta.print();
> // compute likelihood
> prevlik = loglik;
> loglik=0.;
> @@ -780,6 +824,9 @@
> }
> }
> if (verbose) {printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();}
> + //printf("sebeta (%d):\n",beta.nrow);beta.print();
> + //printf("sebeta (%d):\n",sebeta.nrow);sebeta.print();
> + //exit(1);
> }
> // just a stupid copy from linear_reg
> void score(mematrix<double> &resid,regdata &rdata,int verbose, double tol_chol, int model, int interaction, int ngpreds, mematrix<double> & invvarmatrix, int nullmodel=0)
>
> Modified: pkg/ProbABEL/src/usage.h
> ===================================================================
> --- pkg/ProbABEL/src/usage.h 2011-08-22 08:32:43 UTC (rev 761)
> +++ pkg/ProbABEL/src/usage.h 2011-08-23 19:48:36 UTC (rev 762)
> @@ -22,7 +22,7 @@
> fprintf(stdout,"\t --allcov : report estimates for all covariates (large outputs!)\n");
> fprintf(stdout,"\t --interaction: Which covariate to use for interaction with SNP analysys (default is no ineraction, 0)\n");
> fprintf(stdout,"\t --interaction_only: like previos but without covariate acting in interaction with SNP (default is no ineraction, 0)\n");
> - fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix as input parameter\n");
> + fprintf(stdout,"\t --mmscore : score test in samples of related individuals. File with inverse of variance-covariance matrix (for palinear) or inverse correlation (for palogist) as input parameter\n");
> fprintf(stdout,"\t --robust : report robust (aka sandwich, aka Hubert-White) standard errors\n");
> fprintf(stdout,"\t --help : print help\n");
> exit(exit_code);
>
> _______________________________________________
> Genabel-commits mailing list
> Genabel-commits at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/genabel-commits
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/genabel-devel/attachments/20110823/fbf4f671/attachment-0001.htm>
More information about the genabel-devel
mailing list