[Genabel-commits] r1003 - in pkg/ProbABEL: doc src
noreply at r-forge.r-project.org
Mon Nov 5 11:16:14 CET 2012
Author: lckarssen
Date: 2012-11-05 11:16:13 +0100 (Mon, 05 Nov 2012)
New Revision: 1003
Modified:
pkg/ProbABEL/doc/ProbABEL_manual.tex
pkg/ProbABEL/src/probabel_config.cfg.example
Log:
Updated documentation for upcoming v0.2.1 release of ProbABEL. Copyright of this change lies with the Erasmus MC, Rotterdam, NL.
Modified: pkg/ProbABEL/doc/ProbABEL_manual.tex
===================================================================
--- pkg/ProbABEL/doc/ProbABEL_manual.tex 2012-11-05 00:27:46 UTC (rev 1002)
+++ pkg/ProbABEL/doc/ProbABEL_manual.tex 2012-11-05 10:16:13 UTC (rev 1003)
@@ -1,13 +1,14 @@
-\title{Manual for ProbABEL v0.2.0}
+\documentclass[12pt,a4paper]{article}
+
+\title{Manual for ProbABEL v0.2.1}
\author{
-Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{1,2}$ \\
+Maksim Struchalin$^{1}$, Lennart Karssen$^{1}$, Yurii Aulchenko$^{2}$ \\
\\
$^{1}$Erasmus MC Rotterdam \\
$^{2}$Institute of Cytology and Genetics SD RAS
}
\date{\today}
-\documentclass[12pt,a4paper]{article}
\usepackage{verbatim}
\usepackage{titleref}
\usepackage{amsmath}
@@ -56,14 +57,14 @@
confidence, and, in case of biallelic marker, can state whether
genotype is $AA$, $AB$ or $BB$.
-On the contrary, when dealing with imputed or
-high-throughput sequencing data, for many of the genomic loci
-we are quite uncertain about the genotypic status of the person.
-Instead of dealing with known genotypes we work with a probability
-distribution that is based on observed information, and we have estimates that true underlying
-genotype is either $AA$, $AB$ or $BB$. The degree of confidence
-about the real status is measured with the
-probability distribution $\{P(AA), P(AB), P(BB)\}$.
+On the contrary, when dealing with imputed or high-throughput
+sequencing data, for many of the genomic loci we are quite uncertain
+about the genotypic status of the person. Instead of dealing with
+known genotypes we work with a probability distribution that is based
+on observed information, and we have estimates that true underlying
+genotype is either $AA$, $AB$ or $BB$. The degree of confidence about
+the real status is measured with the probability distribution
+$\{P(AA), P(AB), P(BB)\}$.
Several techniques may be applied to analyse such data. The most
simplistic approach would be to pick up the genotype with highest
@@ -88,7 +89,7 @@
Currently, \PA{} implements linear, logistic regression,
and Cox proportional hazards models. The corresponding analysis
programs are called \texttt{palinear}, \texttt{palogist},
-and \texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.0 the
+and \texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.1 the
\texttt{pacoxph} program is not built by default because it is still
too buggy for production use. Instructions on how to compile the
\texttt{pacoxph} module can be found in the \texttt{CHANGES.LOG} file
@@ -97,8 +98,9 @@
\section{Input files}
\PA{} takes three files as input: a file containing SNP
-information (e.g.~the MLINFO file of MACH), a file with genome- or
-chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
+information (e.g.~the MLINFO file of MaCH), a file with genome- or
+chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file
+of MaCH or \texttt{minimac}),
and a file containing the phenotype of interest and covariates.
Optionally, the map information can be supplied (e.g.~the "legend"
@@ -107,14 +109,14 @@
The dose/probability file may be supplied in filevector format
in which case \PA{} will operate much faster, and
in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
-\DA{} on how to convert MACH and IMPUTE files to
+\DA{} on how to convert MaCH and IMPUTE files to
filevector format (functions: \texttt{mach2databel()} and
\texttt{impute2databel()}, respectively).
\subsection{SNP information file}
\label{ssec:infoin}
In the simplest scenario, the SNP information file is an MLINFO
-file generated by MACH. This must be a space or tab-delimited file
+file generated by MaCH/\texttt{minimac}. This must be a space or tab-delimited file
containing SNP name, coding for allele 1 and 2 (e.g.~A, T, G or C),
frequency of allele 1, minor allele frequency and two quality
metrics (``Quality'', the average maximum posterior probability and
@@ -122,11 +124,13 @@
Actually, for \PA{}, it (almost) does not matter what is written in
this file -- this information is simply copied to the output. However,
-\textbf{it is critical} that the number of columns is seven and the
-number of lines in the file is equal to the number of SNPs in the
-corresponding DOSE file (plus one for the header line). Also make sure
-that the ``Rsq'' column contains values $>0$, otherwise you will end
-up with $\beta$'s set to \texttt{nan}.
+\textbf{it is critical} that the number of columns is
+seven\footnote{This means that for \texttt{minimac} output files the number of
+ columns needs to be reduced. This can be done using e.g.~GAWK or
+ \texttt{cut}.} and the number of lines in the file is equal to the
+number of SNPs in the corresponding DOSE file (plus one for the header
+line). Also make sure that the ``Rsq'' column contains values $>0$,
+otherwise you will end up with $\beta$'s set to \texttt{nan}.
The example of SNP information file content follows here (also
to be found in \texttt{ProbABEL/examples/test.mlinfo})
@@ -139,14 +143,15 @@
\subsection{Genomic predictor file}
\label{ssec:dosein}
-Again, in the simplest scenario this is an MLDOSE or MLPROB file generated by MACH.
-Such file starts with two special columns plus, for each of the SNPs
-under consideration, a column containing the estimated allele 1 dose (MLDOSE).
-In an MLPROB file, two columns for each SNP correspond to posterior probability
-that person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele 1.
-The first ``special'' column is made of the sequential id,
+Again, in the simplest scenario this is an MLDOSE or MLPROB file
+generated by MaCH or \texttt{minimac}. Such a file starts with two special
+columns plus, for each of the SNPs under consideration, a column
+containing the estimated allele 1 dose (MLDOSE). In an MLPROB file,
+two columns for each SNP correspond to posterior probability that
+person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele
+1. The first ``special'' column is made of the sequential id,
followed by an arrow followed by study ID (the one specified in the
-MACH input files). The second column contains the method keyword
+MaCH input files). The second column contains the method keyword
(e.g.~``MLDOSE'').
An example of the few first lines of an MLDOSE file for
@@ -157,8 +162,9 @@
%\immediate\write18{head -n 10 INSTALL > tmp.txt}
-\textbf{The order of SNPs in the SNP information file and DOSE-file
-must be the same}. This should be the case if you just used MACH outputs.
+\textbf{The order of SNPs in the SNP information file and DOSE or PROB
+ file must be the same}. This should be the case if you just used
+MaCH/\texttt{minimac} outputs.
Therefore, by all means, the number of columns in the genomic predictor file
must be the same as the number of lines in the SNP information file plus one.
@@ -170,21 +176,22 @@
file as argument for the \texttt{--dose} option
(cf.~section~\ref{sec:runanalysis} for more information on the options
accepted by \texttt{ProbABEL}). See the R libraries GenABEL and
-DatABEL on how to convert MACH and IMPUTE files to filevector format
-(functions: \texttt{mach2databel()} and \texttt{impute2databel()},
-respectively).
+DatABEL on how to convert MaCH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
\subsection{Phenotypic file}
\label{ssec:phenoin}
-The phenotypic data file contains phenotypic data, but also specifies the
-analysis model. There is a header line, specifying the variable names.
-The first column should contain personal study IDs. It is assumed
-that \textbf{both the total number and the order of these IDs are
-exactly the same as in the genomic predictor (MLDOSE) file described in
-previous section}. This is not difficult to arrange using e.g.~\texttt{R};
-an example is given in the \texttt{examples} directory.
+The phenotypic data file contains phenotypic data, but also specifies
+the analysis model. There is a header line, specifying the variable
+names. The first column should contain personal study IDs. It is
+assumed that \textbf{both the total number and the order of these IDs
+ are exactly the same as in the genomic predictor (DOSE/PROB) file
+ described in previous section}. This is not difficult to arrange
+using e.g.~\texttt{R}; an example is given in the \texttt{examples}
+directory.
\textbf{Missing data should be coded with 'NA', 'N' or 'NaN' codes.} Any
other coding will be converted to some number which will be used in
@@ -267,7 +274,7 @@
To run linear regression, you should use the program called
\texttt{palinear}; for logistic analysis use \texttt{palogist}, and
for the Cox proportional hazards model use
-\texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.0 the
+\texttt{pacoxph}\footnote{Please note that in ProbABEL v.0.2.1 the
\texttt{pacoxph} program is not built by default because it is still
too buggy for production use. Instructions on how to compile the
\texttt{pacoxph} module can be found in the \texttt{CHANGES.LOG}
@@ -341,7 +348,7 @@
\end{verbatim}
To run a Cox proportional hazards model\footnote{Please note that in
- ProbABEL v.0.2.0 the \texttt{pacoxph} program is not built by
+ ProbABEL v.0.2.1 the \texttt{pacoxph} program is not built by
default because it is still too buggy for production
use. Instructions on how to compile the \texttt{pacoxph} module can
be found in the \texttt{CHANGES.LOG} file in the \texttt{doc/}
@@ -421,25 +428,31 @@
The Perl script \texttt{bin/probabel.pl} represents a handy wrapper for
\PA{} functions. To start using it the configuration file
-\texttt{etc/probabel\_config.cfg.example} needs to be edited. The
-configuration file consists of five columns. Each column except the
-first is a pattern for files produced by \texttt{MACH} (imputation
-software). The column named ``cohort'' is an identifying name of a
-population (``ERGO'' in this example), the column ``mlinfo\_path'' is
-the full path to mlinfo files, including a pattern where the
-chromosome number has been replaced by \texttt{\_.\_chr\_.\_}. The
-columns ``mldose\_path'', ``mlprobe\_path'' and ``legend\_path'' are
-paths and patterns for ``mldose'', ``mlprob'' and ``legend'' files,
-respectively. These also need to include the pattern for the
-chromosome as used in the column for the ``mlinfo'' files. The
-\texttt{make install} installation procedure should have set all paths
-in the script correctly. If that is not the case you will have to
-change the variable \texttt{\$config} in the script to point to the
-full path of the configuration file and the variables
-\texttt{\$base\_path} and \texttt{@anprog} to point the full path to
-the \PA{} scripts.
+\texttt{etc/probabel\_config.cfg.example} needs to be edited and
+renamed to \texttt{etc/probabel\_config.cfg}. The configuration file
+consists of five columns, separated by commas. Each column except the
+first is a pattern for files produced by MaCH or \texttt{minimac}
+(imputation tools). The column named ``cohort'' is an identifying name
+of a population (``STUDY\_1'' in the example), the column
+``info\_path'' is the full path to ``info'' files, including a pattern
+where the chromosome number has been replaced by
+\texttt{\_.\_chr\_.\_}. In case the imputations were run on chunks of
+chromosomes, the pattern \texttt{\_.\_chunk\_.\_} will be replaced
+with the corresponding chunk number. Chunk numbers should start at 1
+for each chromosome. The columns ``dose\_path'', ``prob\_path''
+and ``legend\_path'' are paths and patterns for ``dose'', ``prob'' and
+``legend'' files, respectively. These also need to include the pattern
+for the chromosome as used in the column for the ``info'' files.
+Empty lines and lines starting with a \texttt{\#} are ignored.
+The \texttt{make install} installation procedure should have set all
+paths in the \texttt{probabel.pl} script correctly. If that is not the
+case you will have to change the variable \texttt{\$config} in the
+script to point to the full path of the configuration file and the
+variables \texttt{\$base\_path} and \texttt{@anprog} to point to the
+full path of the \PA{} scripts.
+
\section{Output file format}
Let us consider what comes out of the linear regression analysis
described in the previous section. After the analysis has run, in
@@ -492,14 +505,14 @@
with 1-2 covariates and overnight for logistic regression or the Cox
proportional hazards model (figures for a PC bought back in 2007).
-Memory may be an issue with \PA{} if you use
-MACH text dose/probability files, e.g. for large chromosomes,
-such as chromosome one consumed up to 5 GB of RAM with 6,000 people.
+Memory may be an issue with \PA{} if you use MaCH/\texttt{minimac}
+text dose/probability files: for a large chromosome, such as
+chromosome one, up to 5 GB of RAM was consumed with 6,000 people.
We suggest that dose/probability file is to be supplied in filevector format
in which case \PA{} will operate about 2-3 times faster, and
in low-RAM mode (approx.~128 MB). See the R libraries \GA{} and
-\DA{} on how to convert MACH and IMPUTE files to
+\DA{} on how to convert MaCH and IMPUTE files to
filevector format (functions: \texttt{mach2databel()} and
\texttt{impute2databel()}, respectively).
@@ -825,7 +838,7 @@
\end{quote}
A proper reference may look like
\begin{quote}
-For the analysis of imputed data, we used the \PA{} v.0.2.0
+For the analysis of imputed data, we used the \PA{} v.0.2.1
from the \texttt{GenABEL} suite of programs (Aulchenko \emph{et al.}, 2010).
\end{quote}
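
The input-file rules in the manual text above (seven columns in the SNP information file, ``Rsq'' values above zero, and a DOSE file with as many columns as the information file has lines, plus one) can be checked before starting an analysis. The following is a minimal Python sketch, not part of ProbABEL. The function name check_inputs is hypothetical, and the code assumes that ``Rsq'' is the seventh column and that the predictor file is an MLDOSE file with one column per SNP (an MLPROB file has two columns per SNP, so the column check would not apply to it).

```python
def check_inputs(info_lines, dose_first_line):
    """Return a list of problems; an empty list means all checks passed.

    info_lines:      lines of the SNP information file, header included.
    dose_first_line: first line of the matching MLDOSE file.
    """
    problems = []
    # Split on any whitespace: the file may be space- or tab-delimited.
    # Blank lines are skipped, so line numbers count non-blank lines only.
    rows = [line.split() for line in info_lines if line.strip()]

    for number, row in enumerate(rows, start=1):
        if len(row) != 7:
            problems.append(
                f"info line {number}: {len(row)} columns, expected 7")

    # Skip the header. ASSUMPTION: "Rsq" is the seventh (last) column.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != 7:
            continue  # already reported above
        try:
            rsq = float(row[6])
        except ValueError:
            problems.append(f"info line {number}: Rsq is not a number")
            continue
        if not rsq > 0:  # also catches NaN
            problems.append(
                f"info line {number}: Rsq is {row[6]}, must be > 0")

    # Two special columns plus one column per SNP equals the number of
    # info lines (SNPs plus header) plus one.
    n_dose_columns = len(dose_first_line.split())
    if n_dose_columns != len(rows) + 1:
        problems.append(
            f"dose file has {n_dose_columns} columns, "
            f"expected {len(rows) + 1}")
    return problems
```

Calling check_inputs(open(info_path).readlines(), open(dose_path).readline()) with two hypothetical paths returns a list of messages that can be printed or used to stop a pipeline early.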
Modified: pkg/ProbABEL/src/probabel_config.cfg.example
===================================================================
--- pkg/ProbABEL/src/probabel_config.cfg.example 2012-11-05 00:27:46 UTC (rev 1002)
+++ pkg/ProbABEL/src/probabel_config.cfg.example 2012-11-05 10:16:13 UTC (rev 1003)
@@ -1,4 +1,4 @@
-cohort,mlinfo_path,mldose_path,mlprob_path,legend_path
+cohort,info_path,dose_path,prob_path,legend_path
# Configuration file for the probabel.pl wrapper script
#
# This file contains the location of the files with imputed data for the
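
The configuration format described in the manual text above (five comma-separated columns, empty lines and lines starting with # ignored, and the _._chr_._ and _._chunk_._ placeholders) can be illustrated with a short Python sketch. This is not how probabel.pl itself is implemented; the function names read_config and expand are hypothetical, and the sketch assumes that the column header is the first line that is neither empty nor a comment.

```python
import csv

CHR_TOKEN = "_._chr_._"      # placeholder for the chromosome number
CHUNK_TOKEN = "_._chunk_._"  # placeholder for the chunk number


def read_config(lines):
    """Map each cohort name to a dict of its path patterns."""
    # Drop empty lines and comment lines, as the manual describes.
    kept = [line for line in lines
            if line.strip() and not line.startswith("#")]
    # ASSUMPTION: the first remaining line is the column header.
    reader = csv.DictReader(kept)
    return {row["cohort"]: row for row in reader}


def expand(pattern, chrom, chunk=None):
    """Fill in the chromosome (and optionally chunk) placeholders."""
    path = pattern.replace(CHR_TOKEN, str(chrom))
    if chunk is not None:
        path = path.replace(CHUNK_TOKEN, str(chunk))
    return path
```

For a cohort named STUDY_1, expand(read_config(lines)["STUDY_1"]["dose_path"], 22) would give the dose file path for chromosome 22; passing chunk=1 as well fills in the first chunk of that chromosome.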