[Genabel-commits] r1015 - branches/ProbABEL-refactoring/ProbABEL/doc

Thu Nov 8 09:39:01 CET 2012

Author: lckarssen
Date: 2012-11-08 09:39:01 +0100 (Thu, 08 Nov 2012)
New Revision: 1015

Modified:
   branches/ProbABEL-refactoring/ProbABEL/doc/CHANGES.LOG
   branches/ProbABEL-refactoring/ProbABEL/doc/ProbABEL_manual.tex
   branches/ProbABEL-refactoring/ProbABEL/doc/packaging.txt
Log:
In the ProbABEL refactoring branch: merged the docs from trunk into the branch (r986 - r1014)

Modified: branches/ProbABEL-refactoring/ProbABEL/doc/CHANGES.LOG
===================================================================

--- branches/ProbABEL-refactoring/ProbABEL/doc/CHANGES.LOG	2012-11-06 23:03:55 UTC (rev 1014)
+++ branches/ProbABEL-refactoring/ProbABEL/doc/CHANGES.LOG	2012-11-08 08:39:01 UTC (rev 1015)
@@ -1,10 +1,18 @@
-*****
-* Fixed bug #2295: the inverse variance-covariance matrix (used with the
-  --mmscore option) was incorrectly subsetted when NAs are present for
-  one or more SNP dosages. As a result the invvarmatrix that was
-  actually used in the regression contained rows and columns of
-  zeroes. Thanks to Maarten Kooyman for reporting this bug.
+***** v.0.2.2 (2012.11.05)
+* No change in the code compared to v.0.2.1. Due to a mistake with the
+  Ubuntu packaging (which was based on SVN r997, which contained a
+  major bug in the tests and which was uploaded to the GenABEL PPA)
+  I'm releasing a new package based on the same source code as
+  ProbABEL v.0.2.1 (except for the version numbers of course).
 
+***** v.0.2.1 (2012.11.05)
+* Fixed bug #2295: the inverse variance-covariance matrix (used with
+  the --mmscore option) was incorrectly subsetted when NAs are present
+  for one or more SNP dosages (so this is not an issue for people using
+  imputed data). As a result the invvarmatrix that was actually used in
+  the regression contained rows and columns of zeroes. Thanks to Maarten
+  Kooyman for reporting this bug.
+
 * Fixed bug #1186: When .map file is missing (but --map option was
   given), the wrong error message was displayed. Thanks to Nicola
   Pirastu for reporting this bug.
@@ -18,7 +26,16 @@
   probabel.pl can now also run Y chromosome analysis and the help
   message has been updated.
 
+* probabel.pl and probabel_config.cfg now also accept chunks, where
+  dose, prob, info and map files are split into multiple chunks. This
+  is now the default for people following the 1000 genomes imputation
+  cookbook for MaCH/minimac (the recipe uses the chunkchromosome tool
+  to split the data into smaller pieces, speeding up imputation on
+  computer clusters). See probabel_config.cfg.example for an
+  example. (Lennart)
 
+
+
 ***** v.0.2.0 (2012.06.10)
 * The v.0.1-9e fix for working with prob files in pacoxph has been
   forward-ported to this branch as well (Lennart and Yurii).
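
The chunk support described in the CHANGES.LOG entry above relies on two
placeholders in the configuration file paths, _._chr_._ and _._chunk_._.
The following is a minimal Python sketch of how such a pattern could be
expanded. It is not taken from probabel.pl: the function name
expand_pattern and the example path are assumptions made purely for
illustration.

```python
CHR_TOKEN = "_._chr_._"      # placeholder for the chromosome number
CHUNK_TOKEN = "_._chunk_._"  # placeholder for the chunk number


def expand_pattern(pattern, chrom, chunk=None):
    """Fill in the chromosome (and, if present, chunk) placeholder.

    Hypothetical helper for illustration; probabel.pl may do this differently.
    """
    path = pattern.replace(CHR_TOKEN, str(chrom))
    if CHUNK_TOKEN in path:
        if chunk is None:
            raise ValueError("pattern has a chunk placeholder but no chunk was given")
        path = path.replace(CHUNK_TOKEN, str(chunk))
    return path


if __name__ == "__main__":
    # Hypothetical path pattern; chunk numbers start at 1 for each chromosome.
    pattern = "/data/STUDY_1/chr_._chr_._.chunk_._chunk_._.dose"
    for chunk in (1, 2, 3):
        print(expand_pattern(pattern, 22, chunk))
```

Patterns without a chunk placeholder pass through unchanged apart from
the chromosome substitution, so unchunked and chunked setups can share
the same helper.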

Modified: branches/ProbABEL-refactoring/ProbABEL/doc/ProbABEL_manual.tex
===================================================================
--- branches/ProbABEL-refactoring/ProbABEL/doc/ProbABEL_manual.tex	2012-11-06 23:03:55 UTC (rev 1014)
+++ branches/ProbABEL-refactoring/ProbABEL/doc/ProbABEL_manual.tex	2012-11-08 08:39:01 UTC (rev 1015)
@@ -1,13 +1,14 @@
-\title{Manual for ProbABEL v0.2.0}
+\documentclass[12pt,a4paper]{article}
+
+\title{Manual for ProbABEL v0.2.2}
 \author{
-Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\
+Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{2}$ \\
 \\
 $^{1}$Erasmus MC Rotterdam \\
 $^{2}$Institute of Cytology and Genetics SD RAS
 }
 \date{\today}
 
-\documentclass[12pt,a4paper]{article}
 \usepackage{verbatim}
 \usepackage{titleref}
 \usepackage{amsmath}
@@ -56,14 +57,14 @@
 confidence, and, in case of biallelic marker, can state whether
 genotype is $AA$, $AB$ or $BB$.
 
-On the contrary, when dealing with imputed or
-high-throughput sequencing data, for many of the genomic loci
-we are quite uncertain about the genotypic status of the person.
-Instead of dealing with known genotypes we work with a probability
-distribution that is based on observed information, and we have estimates that true underlying
-genotype is either $AA$, $AB$ or $BB$. The degree of confidence
-about the real status is measured with the
-probability distribution $\{P(AA), P(AB), P(BB)\}$.
+On the contrary, when dealing with imputed or high-throughput
+sequencing data, for many of the genomic loci we are quite uncertain
+about the genotypic status of the person. Instead of dealing with
+known genotypes we work with a probability distribution that is based
+on observed information, and we have estimates that the true underlying
+genotype is either $AA$, $AB$ or $BB$. The degree of confidence about
+the real status is measured with the probability distribution
+$\{P(AA), P(AB), P(BB)\}$.
 
 Several techniques may be applied to analyse such data. The most
 simplistic approach would be to pick up the genotype with highest
@@ -88,7 +89,7 @@
 Currently, \PA{} implements linear, logistic regression,
 and Cox proportional hazards models. The corresponding analysis
 programs are called \texttt{palinear},  \texttt{palogist},
-and \texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.0 the
+and \texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.2 the
   \texttt{pacoxph} program is not built by default because it is still
 too buggy for production use. Instructions on how to compile the
 \texttt{pacoxph} module can be found in the \texttt{CHANGES.LOG} file
@@ -97,8 +98,9 @@
 
 \section{Input files}
 \PA{} takes three files as input: a file containing SNP
-information (e.g.~the MLINFO file of MACH), a file with genome- or
-chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
+information (e.g.~the MLINFO file of MaCH), a file with genome- or
+chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file
+of MaCH or \texttt{minimac}),
 and a file containing the phenotype of interest and covariates.
 
 Optionally, the map information can be supplied (e.g.~the "legend"
@@ -107,14 +109,14 @@
 The dose/probability file may be supplied in filevector format
 in which case \PA{} will operate much faster, and
 in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
-\DA{} on how to convert MACH and IMPUTE files to
+\DA{} on how to convert MaCH and IMPUTE files to
 filevector format (functions: \texttt{mach2databel()} and
 \texttt{impute2databel()}, respectively).
 
 \subsection{SNP information file}
 \label{ssec:infoin}
 In the simplest scenario, the SNP information file is an MLINFO
-file generated by MACH. This must be a space or tab-delimited file
+file generated by MaCH/\texttt{minimac}. This must be a space or tab-delimited file
 containing SNP name, coding for allele 1 and 2 (e.g.~A, T, G or C),
 frequency of allele 1, minor allele frequency and two quality
 metrics (``Quality'', the average maximum posterior probability and
@@ -122,11 +124,13 @@
 
 Actually, for \PA{}, it (almost) does not matter what is written in
 this file -- this information is simply copied to the output. However,
-\textbf{it is critical} that the number of columns is seven and the
-number of lines in the file is equal to the number of SNPs in the
-corresponding DOSE file (plus one for the header line). Also make sure
-that the ``Rsq'' column contains values $>0$, otherwise you will end
-up with $\beta$'s set to \texttt{nan}.
+\textbf{it is critical} that the number of columns is
+seven\footnote{This means that for \texttt{minimac} output files the number of
+  columns needs to be reduced. This can be done using e.g.~GAWK or
+  \texttt{cut}.} and the number of lines in the file is equal to the
+number of SNPs in the corresponding DOSE file (plus one for the header
+line). Also make sure that the ``Rsq'' column contains values $>0$,
+otherwise you will end up with $\beta$'s set to \texttt{nan}.
 
 The example of SNP information file content follows here (also
 to be found in \texttt{ProbABEL/examples/test.mlinfo})
@@ -139,14 +143,15 @@
 \subsection{Genomic predictor file}
 \label{ssec:dosein}
 
-Again, in the simplest scenario this is an MLDOSE or MLPROB file generated by MACH.
-Such file starts with two special columns plus, for each of the SNPs
-under consideration, a column containing the estimated allele 1 dose (MLDOSE).
-In an MLPROB file, two columns for each SNP correspond to posterior probability
-that person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele 1.
-The first ``special'' column is made of the sequential id,
+Again, in the simplest scenario this is an MLDOSE or MLPROB file
+generated by MaCH or \texttt{minimac}.  Such a file starts with two special
+columns plus, for each of the SNPs under consideration, a column
+containing the estimated allele 1 dose (MLDOSE).  In an MLPROB file,
+two columns for each SNP correspond to the posterior probability that a
+person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele
+1.  The first ``special'' column is made of the sequential id,
 followed by an arrow followed by study ID (the one specified in the
-MACH input files). The second column contains the method  keyword
+MaCH input files). The second column contains the method keyword
 (e.g.~``MLDOSE'').
 
 An example of the few first lines of an MLDOSE file for
@@ -157,8 +162,9 @@
 %\immediate\write18{head -n 10 INSTALL > tmp.txt}
 
 
-\textbf{The order of SNPs in the SNP information file and DOSE-file
-must be the same}. This should be the case if you just used MACH outputs.
+\textbf{The order of SNPs in the SNP information file and DOSE or PROB
+  file must be the same}. This should be the case if you just used
+MaCH/\texttt{minimac} outputs.
 
 Therefore, by all means, the number of columns in the genomic predictor file
 must be the same as the number of lines in the SNP information file plus one.
@@ -170,21 +176,22 @@
 file as argument for the \texttt{--dose} option
 (cf.~section~\ref{sec:runanalysis} for more information on the options
 accepted by \texttt{ProbABEL}). See the R libraries GenABEL and
-DatABEL on how to convert MACH and IMPUTE files to filevector format
-(functions: \texttt{mach2databel()} and \texttt{impute2databel()},
-respectively).
+DatABEL on how to convert MaCH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
 
 
 \subsection{Phenotypic file}
 \label{ssec:phenoin}
 
-The phenotypic data file contains phenotypic data, but also specifies the
-analysis model. There is a header line, specifying the variable names.
-The first column should contain personal study IDs. It is assumed
-that \textbf{both the total number and the order of these IDs are
-exactly the same as in the genomic predictor (MLDOSE) file described in
-previous section}. This is not difficult to arrange using e.g.~\texttt{R};
-an example is given in the \texttt{examples} directory.
+The phenotypic data file contains phenotypic data, but also specifies
+the analysis model. There is a header line, specifying the variable
+names.  The first column should contain personal study IDs. It is
+assumed that \textbf{both the total number and the order of these IDs
+  are exactly the same as in the genomic predictor (DOSE/PROB) file
+  described in previous section}. This is not difficult to arrange
+using e.g.~\texttt{R}; an example is given in the \texttt{examples}
+directory.
 
 \textbf{Missing data should be coded with 'NA', 'N' or 'NaN' codes.} Any
 other coding will be converted to some number which will be used in
@@ -267,7 +274,7 @@
 To run linear regression, you should use the program called
 \texttt{palinear}; for logistic analysis use \texttt{palogist}, and
 for the Cox proportional hazards model use
-\texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.0 the
+\texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.2 the
   \texttt{pacoxph} program is not built by default because it is still
   too buggy for production use. Instructions on how to compile the
   \texttt{pacoxph} module can be found in the \texttt{CHANGES.LOG}
@@ -341,7 +348,7 @@
 \end{verbatim}
 
 To run a Cox proportional hazards model\footnote{Please note that in
-  ProbABEL v.0.2.0 the \texttt{pacoxph} program is not built by
+  ProbABEL v.0.2.2 the \texttt{pacoxph} program is not built by
   default because it is still too buggy for production
   use. Instructions on how to compile the \texttt{pacoxph} module can
   be found in the \texttt{CHANGES.LOG} file in the \texttt{doc/}
@@ -421,25 +428,31 @@
 
 The Perl script \texttt{bin/probabel.pl} represents a handy wrapper for
 \PA{} functions.  To start using it the configuration file
-\texttt{etc/probabel\_config.cfg.example} needs to be edited. The
-configuration file consists of five columns. Each column except the
-first is a pattern for files produced by \texttt{MACH} (imputation
-software). The column named ``cohort'' is an identifying name of a
-population (``ERGO'' in this example), the column ``mlinfo\_path'' is
-the full path to mlinfo files, including a pattern where the
-chromosome number has been replaced by \texttt{\_.\_chr\_.\_}. The
-columns ``mldose\_path'', ``mlprobe\_path'' and ``legend\_path'' are
-paths and patterns for ``mldose'', ``mlprob'' and ``legend'' files,
-respectively. These also need to include the pattern for the
-chromosome as used in the column for the ``mlinfo'' files. The
-\texttt{make install} installation procedure should have set all paths
-in the script correctly. If that is not the case you will have to
-change the variable \texttt{\$config} in the script to point to the
-full path of the configuration file and the variables
-\texttt{\$base\_path} and \texttt{@anprog} to point the full path to
-the \PA{} scripts.
+\texttt{etc/probabel\_config.cfg.example} needs to be edited and
+renamed to \texttt{etc/probabel\_config.cfg}. The configuration file
+consists of five columns, separated by commas. Each column except the
+first is a pattern for files produced by MaCH or \texttt{minimac}
+(imputation tools). The column named ``cohort'' is an identifying name
+of a population (``STUDY\_1'' in the example), the column
+``info\_path'' is the full path to ``info'' files, including a pattern
+where the chromosome number has been replaced by
+\texttt{\_.\_chr\_.\_}. In case the imputations were run on chunks of
+chromosomes, the pattern \texttt{\_.\_chunk\_.\_} will be replaced
+with the corresponding chunk number. Chunk numbers should start at 1
+for each chromosome. The columns ``dose\_path'', ``prob\_path''
+and ``legend\_path'' are paths and patterns for ``dose'', ``prob'' and
+``legend'' files, respectively. These also need to include the pattern
+for the chromosome as used in the column for the ``info'' files.
+Empty lines and lines starting with a \texttt{\#} are ignored.
 
+The \texttt{make install} installation procedure should have set all
+paths in the \texttt{probabel.pl} script correctly. If that is not the
+case you will have to change the variable \texttt{\$config} in the
+script to point to the full path of the configuration file and the
+variables \texttt{\$base\_path} and \texttt{@anprog} to point to the full
+path of the \PA{} scripts.
 
+
 \section{Output file format}
 Let us consider what comes out of the linear regression analysis
 described in the previous section. After the analysis has run, in
@@ -492,14 +505,14 @@
 with 1-2 covariates and overnight for logistic regression or the Cox
 proportional hazards model (figures for a PC bought back in 2007).
 
-Memory may be an issue with \PA{} if you use
-MACH text dose/probability files, e.g. for large chromosomes,
-such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
+Memory may be an issue with \PA{} if you use MaCH/\texttt{minimac}
+text dose/probability files; e.g.~a large chromosome such as
+chromosome one consumed up to 5 GB of RAM with 6,000 people.
 
 We suggest that dose/probability file is to be supplied in filevector format
 in which case \PA{} will operate about 2-3 times faster, and
 in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
-\DA{} on how to convert MACH and IMPUTE files to
+\DA{} on how to convert MaCH and IMPUTE files to
 filevector format (functions: \texttt{mach2databel()} and
 \texttt{impute2databel()}, respectively).
 
@@ -825,7 +838,7 @@
 \end{quote}
 A proper reference may look like
 \begin{quote}
-For the analysis of imputed data, we used the \PA{} v.0.2.0
+For the analysis of imputed data, we used the \PA{} v.0.2.2
 from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
 \end{quote}
 

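The manual changes above require the SNP information file to have exactly
seven columns, and the genomic predictor (DOSE) file to have one more
column than the information file has lines. The manual suggests GAWK or
cut for trimming minimac output; the Python sketch below shows the same
idea. It assumes that the seven wanted columns are the first seven, which
the manual does not state, and the function names and example data are
hypothetical.

```python
def trim_info_columns(lines, n_cols=7):
    """Keep only the first n_cols whitespace-delimited fields of each line.

    Assumption: the columns to keep are the leading ones.
    """
    trimmed = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < n_cols:
            raise ValueError("line has fewer than %d columns: %r" % (n_cols, line))
        trimmed.append(" ".join(fields[:n_cols]))
    return trimmed


def dose_matches_info(info_lines, dose_line):
    """Check that a DOSE row has one more column than the info file has lines.

    The info file has one header line plus one line per SNP; a DOSE row has
    two special columns plus one column per SNP.
    """
    return len(dose_line.split()) == len(info_lines) + 1


if __name__ == "__main__":
    # Made-up example data with two extra trailing columns.
    info = [
        "SNP Al1 Al2 Freq1 MAF Quality Rsq Extra1 Extra2",
        "rs1 A G 0.90 0.10 0.99 0.98 x y",
        "rs2 C T 0.75 0.25 0.97 0.95 x y",
    ]
    trimmed = trim_info_columns(info)
    print("\n".join(trimmed))
    print(dose_matches_info(trimmed, "1->id1 MLDOSE 0.678 1.122"))
```

The column-count check only applies to DOSE files; a PROB file carries
two columns per SNP and would need a different comparison.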
Modified: branches/ProbABEL-refactoring/ProbABEL/doc/packaging.txt
===================================================================
--- branches/ProbABEL-refactoring/ProbABEL/doc/packaging.txt	2012-11-06 23:03:55 UTC (rev 1014)
+++ branches/ProbABEL-refactoring/ProbABEL/doc/packaging.txt	2012-11-08 08:39:01 UTC (rev 1015)
@@ -8,6 +8,7 @@
    - dh-make
    - fakeroot
    - lintian
+   - devscripts (not necessary, but has some nice utilities like dch)
 ** Building the package for the first time
    First check to see if everything compiles and all files are included
    in the automake files:
@@ -30,7 +31,7 @@
    Hit the enter key to confirm the settings. Several files need to be
    edited.
    - debian/control
-   - debian/changelog
+   - debian/changelog (this file can be edited with 'dch' for convenience)
    - debian/copyright
    - debian/README.Debian
    dh_make also creates several example scripts in the debian/