[Genabel-commits] r1003 - in pkg/ProbABEL: doc src

Mon Nov 5 11:16:14 CET 2012

Author: lckarssen
Date: 2012-11-05 11:16:13 +0100 (Mon, 05 Nov 2012)
New Revision: 1003

Modified:
   pkg/ProbABEL/doc/ProbABEL_manual.tex
   pkg/ProbABEL/src/probabel_config.cfg.example
Log:
Updated documentation for upcoming v0.2.1 release of ProbABEL. Copyright of this change lies with the Erasmus MC, Rotterdam, NL.

Modified: pkg/ProbABEL/doc/ProbABEL_manual.tex
===================================================================

--- pkg/ProbABEL/doc/ProbABEL_manual.tex	2012-11-05 00:27:46 UTC (rev 1002)
+++ pkg/ProbABEL/doc/ProbABEL_manual.tex	2012-11-05 10:16:13 UTC (rev 1003)
@@ -1,13 +1,14 @@
-\title{Manual for ProbABEL v0.2.0}
+\documentclass[12pt,a4paper]{article}
+
+\title{Manual for ProbABEL v0.2.1}
 \author{
-Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\
+Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{2}$ \\
 \\
 $^{1}$Erasmus MC Rotterdam \\
 $^{2}$Institute of Cytology and Genetics SD RAS
 }
 \date{\today}
 
-\documentclass[12pt,a4paper]{article}
 \usepackage{verbatim}
 \usepackage{titleref}
 \usepackage{amsmath}
@@ -56,14 +57,14 @@
 confidence, and, in the case of a biallelic marker, can state whether
 the genotype is $AA$, $AB$ or $BB$.
 
-On the contrary, when dealing with imputed or
-high-throughput sequencing data, for many of the genomic loci
-we are quite uncertain about the genotypic status of the person.
-Instead of dealing with known genotypes we work with a probability
-distribution that is based on observed information, and we have estimates that true underlying
-genotype is either $AA$, $AB$ or $BB$. The degree of confidence
-about the real status is measured with the
-probability distribution $\{P(AA), P(AB), P(BB)\}$.
+On the contrary, when dealing with imputed or high-throughput
+sequencing data, for many of the genomic loci we are quite uncertain
+about the genotypic status of the person. Instead of dealing with
+known genotypes we work with a probability distribution that is based
+on observed information, and we have estimates that the true underlying
+genotype is either $AA$, $AB$ or $BB$. The degree of confidence about
+the real status is measured with the probability distribution
+$\{P(AA), P(AB), P(BB)\}$.
 
 Several techniques may be applied to analyse such data. The most
 simplistic approach would be to pick up the genotype with highest
@@ -88,7 +89,7 @@
 Currently, \PA{} implements linear regression, logistic regression,
 and Cox proportional hazards models. The corresponding analysis
 programs are called \texttt{palinear},  \texttt{palogist},
-and \texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.0 the
+and \texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.1 the
   \texttt{pacoxph} program is not built by default because it is still
 too buggy for production use. Instructions on how to compile the
 \texttt{pacoxph} module can be found in the \texttt{CHANGES.LOG} file
@@ -97,8 +98,9 @@
 
 \section{Input files}
 \PA{} takes three files as input: a file containing SNP
-information (e.g.~the MLINFO file of MACH), a file with genome- or
-chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
+information (e.g.~the MLINFO file of MaCH), a file with genome- or
+chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file
+of MaCH or \texttt{minimac}),
 and a file containing the phenotype of interest and covariates.
 
 Optionally, the map information can be supplied (e.g.~the ``legend''
@@ -107,14 +109,14 @@
 The dose/probability file may be supplied in filevector format
 in which case \PA{} will operate much faster, and
 in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
-\DA{} on how to convert MACH and IMPUTE files to
+\DA{} on how to convert MaCH and IMPUTE files to
 filevector format (functions: \texttt{mach2databel()} and
 \texttt{impute2databel()}, respectively).
 
 \subsection{SNP information file}
 \label{ssec:infoin}
 In the simplest scenario, the SNP information file is an MLINFO
-file generated by MACH. This must be a space or tab-delimited file
+file generated by MaCH/\texttt{minimac}. This must be a space or tab-delimited file
 containing SNP name, coding for allele 1 and 2 (e.g.~A, T, G or C),
 frequency of allele 1, minor allele frequency and two quality
 metrics (``Quality'', the average maximum posterior probability and
@@ -122,11 +124,13 @@
 
 Actually, for \PA{}, it (almost) does not matter what is written in
 this file -- this information is simply copied to the output. However,
-\textbf{it is critical} that the number of columns is seven and the
-number of lines in the file is equal to the number of SNPs in the
-corresponding DOSE file (plus one for the header line). Also make sure
-that the ``Rsq'' column contains values $>0$, otherwise you will end
-up with $\beta$'s set to \texttt{nan}.
+\textbf{it is critical} that the number of columns is
+seven\footnote{This means that for \texttt{minimac} output files the number of
+  columns needs to be reduced. This can be done using e.g.~GAWK or
+  \texttt{cut}.} and the number of lines in the file is equal to the
+number of SNPs in the corresponding DOSE file (plus one for the header
+line). Also make sure that the ``Rsq'' column contains values $>0$,
+otherwise you will end up with $\beta$'s set to \texttt{nan}.
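The column reduction mentioned in the footnote can be sketched as follows. This is a minimal illustration, not part of ProbABEL: the function name \texttt{keep\_first\_columns} is hypothetical, and it assumes that the seven columns to keep are the first seven of a whitespace-delimited info file. The header names and the two extra columns in the sample data are made up.

```python
def keep_first_columns(lines, n_columns=7):
    """Cut each non-empty line down to its first n_columns fields.

    Assumption: fields are separated by spaces or tabs and the columns
    to keep are the leading ones.
    """
    trimmed = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) < n_columns:
            raise ValueError(
                f"expected at least {n_columns} columns, got {len(fields)}"
            )
        trimmed.append(" ".join(fields[:n_columns]))
    return trimmed


# Hypothetical nine-column input: a header line and one SNP line.
example = [
    "SNP Al1 Al2 Freq1 MAF Quality Rsq Extra1 Extra2",
    "rs1 A G 0.9 0.1 0.95 0.88 x y",
]
result = keep_first_columns(example)
```

Raising an error on short lines, rather than padding them, keeps a malformed input from silently producing a file with the wrong number of columns.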
 
 An example of the SNP information file content follows (also
 to be found in \texttt{ProbABEL/examples/test.mlinfo})
@@ -139,14 +143,15 @@
 \subsection{Genomic predictor file}
 \label{ssec:dosein}
 
-Again, in the simplest scenario this is an MLDOSE or MLPROB file generated by MACH.
-Such file starts with two special columns plus, for each of the SNPs
-under consideration, a column containing the estimated allele 1 dose (MLDOSE).
-In an MLPROB file, two columns for each SNP correspond to posterior probability
-that person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele 1.
-The first ``special'' column is made of the sequential id,
+Again, in the simplest scenario this is an MLDOSE or MLPROB file
+generated by MaCH or \texttt{minimac}.  Such a file starts with two special
+columns plus, for each of the SNPs under consideration, a column
+containing the estimated allele 1 dose (MLDOSE).  In an MLPROB file,
+two columns for each SNP correspond to the posterior probability that the
+person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele
+1.  The first ``special'' column is made of the sequential id,
 followed by an arrow followed by study ID (the one specified in the
-MACH input files). The second column contains the method  keyword
+MaCH input files). The second column contains the method keyword
 (e.g.~``MLDOSE'').
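To make the relation between the two kinds of columns concrete: the expected number of copies of allele 1 follows from the two probabilities as $2 P_{A_1A_1} + P_{A_1A_2}$. The sketch below only illustrates this identity. The function name \texttt{allele1\_dose} is hypothetical, and whether a given MLDOSE file was produced in exactly this way depends on the imputation tool.

```python
def allele1_dose(p_a1a1, p_a1a2):
    """Expected number of copies of allele 1.

    p_a1a1: probability of carrying two copies of allele 1
    p_a1a2: probability of carrying one copy of allele 1
    """
    if not (0.0 <= p_a1a1 <= 1.0 and 0.0 <= p_a1a2 <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    if p_a1a1 + p_a1a2 > 1.0 + 1e-9:
        raise ValueError("P(A1A1) + P(A1A2) must not exceed 1")
    # Two copies weighted by P(A1A1), one copy weighted by P(A1A2).
    return 2.0 * p_a1a1 + p_a1a2


# Illustrative values only.
dose = allele1_dose(0.25, 0.5)
```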
 
 An example of the first few lines of an MLDOSE file for
@@ -157,8 +162,9 @@
 %\immediate\write18{head -n 10 INSTALL > tmp.txt}
 
 
-\textbf{The order of SNPs in the SNP information file and DOSE-file
-must be the same}. This should be the case if you just used MACH outputs.
+\textbf{The order of SNPs in the SNP information file and DOSE or PROB
+  file must be the same}. This should be the case if you just used
+MaCH/\texttt{minimac} outputs.
 
 Therefore, the number of columns in the genomic predictor file must always
 equal the number of lines in the SNP information file plus one.
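The column-count rule above can be checked before starting an analysis. The following is a minimal sketch under stated assumptions, not a ProbABEL feature: the function name \texttt{check\_column\_count} is hypothetical, the info file is assumed to have exactly one header line, and the sample lines are made up. For an MLPROB file, which has two columns per SNP, \texttt{columns\_per\_snp=2} is passed instead.

```python
def check_column_count(info_lines, predictor_line, columns_per_snp=1):
    """Compare the predictor file's column count with the info file.

    info_lines: all lines of the SNP information file, header included
    predictor_line: one data line of the dose or prob file
    Returns (ok, expected, observed).
    """
    n_snps = len(info_lines) - 1  # subtract the header line
    expected = 2 + columns_per_snp * n_snps  # two special columns first
    observed = len(predictor_line.split())
    return observed == expected, expected, observed


# Hypothetical data: a header plus two SNPs.
example_info = [
    "SNP Al1 Al2 Freq1 MAF Quality Rsq",
    "rs1 A G 0.9 0.1 0.95 0.88",
    "rs2 C T 0.7 0.3 0.90 0.75",
]
dose_check = check_column_count(example_info, "1->id1 MLDOSE 1.9 0.1")
prob_check = check_column_count(
    example_info, "1->id1 MLPROB 0.9 0.1 0.0 0.2", columns_per_snp=2
)
```

With one column per SNP the expected count equals the number of info lines plus one, which is the rule stated in the text.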
@@ -170,21 +176,22 @@
 file as argument for the \texttt{--dose} option
 (cf.~section~\ref{sec:runanalysis} for more information on the options
 accepted by \texttt{ProbABEL}). See the R libraries GenABEL and
-DatABEL on how to convert MACH and IMPUTE files to filevector format
-(functions: \texttt{mach2databel()} and \texttt{impute2databel()},
-respectively).
+DatABEL on how to convert MaCH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
 
 
 \subsection{Phenotypic file}
 \label{ssec:phenoin}
 
-The phenotypic data file contains phenotypic data, but also specifies the
-analysis model. There is a header line, specifying the variable names.
-The first column should contain personal study IDs. It is assumed
-that \textbf{both the total number and the order of these IDs are
-exactly the same as in the genomic predictor (MLDOSE) file described in
-previous section}. This is not difficult to arrange using e.g.~\texttt{R};
-an example is given in the \texttt{examples} directory.
+The phenotypic data file contains phenotypic data, but also specifies
+the analysis model. There is a header line, specifying the variable
+names.  The first column should contain personal study IDs. It is
+assumed that \textbf{both the total number and the order of these IDs
+  are exactly the same as in the genomic predictor (DOSE/PROB) file
+  described in the previous section}. This is not difficult to arrange
+using e.g.~\texttt{R}; an example is given in the \texttt{examples}
+directory.
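The requirement on number and order of IDs can be verified with a short script. The sketch below is an illustration only. Both function names are hypothetical, and it assumes that the arrow in the first column of the predictor file is written as \texttt{->}. The IDs are made up.

```python
def study_ids_from_predictor(first_column_values, arrow="->"):
    """Extract the study ID that follows the arrow in each value."""
    ids = []
    for value in first_column_values:
        _, separator, study_id = value.partition(arrow)
        if not separator:
            raise ValueError(f"no '{arrow}' found in {value!r}")
        ids.append(study_id)
    return ids


def first_mismatch(pheno_ids, predictor_ids):
    """Return the index of the first difference, or None if both
    lists have the same length and the same order."""
    for index, (pheno_id, predictor_id) in enumerate(
        zip(pheno_ids, predictor_ids)
    ):
        if pheno_id != predictor_id:
            return index
    if len(pheno_ids) != len(predictor_ids):
        return min(len(pheno_ids), len(predictor_ids))
    return None


# Hypothetical IDs: the phenotype file has two people swapped.
predictor_ids = study_ids_from_predictor(["1->id1", "2->id2", "3->id3"])
mismatch = first_mismatch(["id1", "id3", "id2"], predictor_ids)
```

Reporting the position of the first mismatch, rather than only true or false, makes it easier to find where the two files start to diverge.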
 
 \textbf{Missing data should be coded with 'NA', 'N' or 'NaN' codes.} Any
 other coding will be converted to some number which will be used in
@@ -267,7 +274,7 @@
 To run linear regression, you should use the program called
 \texttt{palinear}; for logistic analysis use \texttt{palogist}, and
 for the Cox proportional hazards model use
-\texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.0 the
+\texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.1 the
   \texttt{pacoxph} program is not built by default because it is still
   too buggy for production use. Instructions on how to compile the
   \texttt{pacoxph} module can be found in the \texttt{CHANGES.LOG}
@@ -341,7 +348,7 @@
 \end{verbatim}
 
 To run a Cox proportional hazards model\footnote{Please note that in
-  ProbABEL v.0.2.0 the \texttt{pacoxph} program is not built by
+  ProbABEL v.0.2.1 the \texttt{pacoxph} program is not built by
   default because it is still too buggy for production
   use. Instructions on how to compile the \texttt{pacoxph} module can
   be found in the \texttt{CHANGES.LOG} file in the \texttt{doc/}
@@ -421,25 +428,31 @@
 
 The Perl script \texttt{bin/probabel.pl} is a handy wrapper for
 \PA{} functions.  To start using it the configuration file
-\texttt{etc/probabel\_config.cfg.example} needs to be edited. The
-configuration file consists of five columns. Each column except the
-first is a pattern for files produced by \texttt{MACH} (imputation
-software). The column named ``cohort'' is an identifying name of a
-population (``ERGO'' in this example), the column ``mlinfo\_path'' is
-the full path to mlinfo files, including a pattern where the
-chromosome number has been replaced by \texttt{\_.\_chr\_.\_}. The
-columns ``mldose\_path'', ``mlprobe\_path'' and ``legend\_path'' are
-paths and patterns for ``mldose'', ``mlprob'' and ``legend'' files,
-respectively. These also need to include the pattern for the
-chromosome as used in the column for the ``mlinfo'' files. The
-\texttt{make install} installation procedure should have set all paths
-in the script correctly. If that is not the case you will have to
-change the variable \texttt{\$config} in the script to point to the
-full path of the configuration file and the variables
-\texttt{\$base\_path} and \texttt{@anprog} to point the full path to
-the \PA{} scripts.
+\texttt{etc/probabel\_config.cfg.example} needs to be edited and
+renamed to \texttt{etc/probabel\_config.cfg}. The configuration file
+consists of five columns, separated by commas. Each column except the
+first is a pattern for files produced by MaCH or \texttt{minimac}
+(imputation tools). The column named ``cohort'' is an identifying name
+of a population (``STUDY\_1'' in the example), the column
+``info\_path'' is the full path to ``info'' files, including a pattern
+where the chromosome number has been replaced by
+\texttt{\_.\_chr\_.\_}. In case the imputations were run on chunks of
+chromosomes, the pattern \texttt{\_.\_chunk\_.\_} will be replaced
+with the corresponding chunk number. Chunk numbers should start at 1
+for each chromosome. The columns ``dose\_path'', ``prob\_path''
+and ``legend\_path'' are paths and patterns for ``dose'', ``prob'' and
+``legend'' files, respectively. These also need to include the pattern
+for the chromosome as used in the column for the ``info'' files.
+Empty lines and lines starting with a \texttt{\#} are ignored.
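The pattern substitution described above can be illustrated as follows. This is a sketch of the idea only and not the code used by \texttt{probabel.pl}: the function name \texttt{expand\_path} and the example paths are hypothetical.

```python
CHR_TOKEN = "_._chr_._"
CHUNK_TOKEN = "_._chunk_._"


def expand_path(pattern, chromosome, chunk=None):
    """Fill in the chromosome and, if present, the chunk placeholder."""
    path = pattern.replace(CHR_TOKEN, str(chromosome))
    if CHUNK_TOKEN in path:
        if chunk is None:
            raise ValueError(
                "pattern contains a chunk placeholder but no chunk was given"
            )
        path = path.replace(CHUNK_TOKEN, str(chunk))
    return path


# Hypothetical patterns as they might appear in the configuration file.
info_path = expand_path("/data/STUDY_1/chr_._chr_._.info", 7)
dose_path = expand_path(
    "/data/STUDY_1/chr_._chr_._-chunk_._chunk_._.dose", 7, chunk=2
)
```

Here \texttt{info\_path} becomes \texttt{/data/STUDY\_1/chr7.info} and \texttt{dose\_path} becomes \texttt{/data/STUDY\_1/chr7-chunk2.dose}.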
 
+The \texttt{make install} installation procedure should have set all
+paths in the \texttt{probabel.pl} script correctly. If that is not the
+case you will have to change the variable \texttt{\$config} in the
+script to point to the full path of the configuration file and the
+variables \texttt{\$base\_path} and \texttt{@anprog} to point to the full
+path of the \PA{} scripts.
 
+
 \section{Output file format}
 Let us consider what comes out of the linear regression analysis
 described in the previous section. After the analysis has run, in
@@ -492,14 +505,14 @@
 with 1-2 covariates and overnight for logistic regression or the Cox
 proportional hazards model (figures for a PC bought back in 2007).
 
-Memory may be an issue with \PA{} if you use
-MACH text dose/probability files, e.g. for large chromosomes,
-such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
+Memory may be an issue with \PA{} if you use MaCH/\texttt{minimac}
+text dose/probability files: for large chromosomes, e.g.~chromosome one,
+\PA{} consumed up to 5 GB of RAM with 6,000 people.
 
 We suggest that the dose/probability file be supplied in filevector format,
 in which case \PA{} will operate about 2-3 times faster, and
 in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
-\DA{} on how to convert MACH and IMPUTE files to
+\DA{} on how to convert MaCH and IMPUTE files to
 filevector format (functions: \texttt{mach2databel()} and
 \texttt{impute2databel()}, respectively).
 
@@ -825,7 +838,7 @@
 \end{quote}
 A proper reference may look like
 \begin{quote}
-For the analysis of imputed data, we used the \PA{} v.0.2.0
+For the analysis of imputed data, we used the \PA{} v.0.2.1
 from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
 \end{quote}
 

Modified: pkg/ProbABEL/src/probabel_config.cfg.example
===================================================================
--- pkg/ProbABEL/src/probabel_config.cfg.example	2012-11-05 00:27:46 UTC (rev 1002)
+++ pkg/ProbABEL/src/probabel_config.cfg.example	2012-11-05 10:16:13 UTC (rev 1003)
@@ -1,4 +1,4 @@
-cohort,mlinfo_path,mldose_path,mlprob_path,legend_path
+cohort,info_path,dose_path,prob_path,legend_path
 # Configuration file for the probabel.pl wrapper script
 #
 # This file contains the location of the files with imputed data for the