[Genabel-commits] r646 - pkg/ProbABEL/doc

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Thu Feb 10 20:58:13 CET 2011


Author: yurii
Date: 2011-02-10 20:58:07 +0100 (Thu, 10 Feb 2011)
New Revision: 646

Added:
   pkg/ProbABEL/doc/ProbABEL_manual.tex
Removed:
   pkg/ProbABEL/doc/manual.dvi
   pkg/ProbABEL/doc/manual.pdf
   pkg/ProbABEL/doc/manual.tex
Log:
fixing documentation for ProbABEL

Copied: pkg/ProbABEL/doc/ProbABEL_manual.tex (from rev 641, pkg/ProbABEL/doc/manual.tex)
===================================================================
--- pkg/ProbABEL/doc/ProbABEL_manual.tex	                        (rev 0)
+++ pkg/ProbABEL/doc/ProbABEL_manual.tex	2011-02-10 19:58:07 UTC (rev 646)
@@ -0,0 +1,798 @@
+\title{ProbABEL manual}
+\author{Yurii Aulchenko, Maksim Struchalin, Lennart Karssen \\
+	Erasmus MC Rotterdam
+}
+\date{\today}
+
+\documentclass[12pt,a4paper]{article}
+\usepackage{verbatim}
+\usepackage{titleref}
+\usepackage{amsmath}
+\usepackage{makeidx}
+\usepackage[pdftex,hyperfootnotes=false,pdfpagelabels]{hyperref}
+\hypersetup{%
+  linktocpage=false, % If true the page numbers in the toc are links
+  % instead of the section headings.
+  pdfstartview=FitH,% pdfstartpage=3,
+  breaklinks=true, pageanchor=true, %
+  pdfpagemode=UseOutlines, plainpages=false, bookmarksnumbered, %
+  bookmarksopen=true, bookmarksopenlevel=1, hypertexnames=true, %
+  pdfhighlight=/O, %hyperfootnotes=true,%nesting=true,%frenchlinks,%
+  pdfauthor={\textcopyright\ Y.~Aulchenko, M.~Struchalin, L.C.~Karssen},
+  pdfsubject={ProbABEL manual}
+}
+% get the links to the figures and tables right:
+\usepackage[all]{hypcap} % to be loaded after hyperref package
+
+\DeclareMathOperator{\var}{\mathbf{var}}
+
+\makeindex
+
+\begin{document}
+\maketitle
+\tableofcontents
+
+%\begin{abstract}
+%This document describes ProbABEL package for analysis
+%of genome-wide imputed data \ldots
+%\end{abstract}
+
+\section{Motivation}
+
+Many statistical and experimental techniques, such as imputations and
+high-throughput sequencing, generate data which are informative for
+genome-wide association analysis and are probabilistic in nature.
+
+When we work with directly genotyped markers, using techniques such as
+SNP or microsatellite typing, we would normally know the genotype of
+a particular person at a particular locus with a very high degree of
+confidence, and, in the case of a biallelic marker, can state whether the
+genotype is $AA$, $AB$ or $BB$.
+
+In contrast, when dealing with imputed or
+high-throughput sequencing data, for many of the genomic loci
+we are quite uncertain about the genotypic status of the person.
+Instead of dealing with known genotypes we work with a probability
+distribution, based on the observed information, that the true underlying
+genotype is $AA$, $AB$ or $BB$. The degree of confidence
+about the real status is thus measured by the
+probability distribution $\{P(AA), P(AB), P(BB)\}$.
+
+Several techniques may be applied to analyse such data. The most
+simplistic approach would be to pick the genotype with the highest
+probability, i.e.~$\max_g[P(g=AA), P(g=AB), P(g=BB)]$, and then
+analyse the data as if directly typed markers were used. The
+disadvantage of this approach is that it does not take into
+account the probability distribution -- i.e.~the uncertainty
+about the true genotypic status. Such an
+analysis is statistically wrong: the estimates of the association
+parameters (regression coefficients, odds or hazard ratios, etc.)
+are biased, and the bias becomes more pronounced with greater
+uncertainty (entropy) of the probability distribution.
+
+One solution that generates unbiased estimates
+of the association parameters and takes the
+probability distribution into account is to
+perform the association analysis by regressing the
+outcome of interest onto the estimated genotypic probabilities.
+
+The \texttt{ProbABEL} package was designed to perform such regression
+in a fast, memory-efficient and consequently genome-wide feasible manner.
+Currently, \texttt{ProbABEL} implements linear regression, logistic regression,
+and the Cox proportional hazards model. The corresponding analysis
+programs are called \texttt{palinear},  \texttt{palogist},
+and \texttt{pacoxph}.
+
+
+\section{Input files}
+\texttt{ProbABEL} takes three files as input: a file containing SNP
+information (e.g.~the MLINFO file of MACH), a file with genome- or
+chromosome-wide predictor information (e.g.~the MLDOSE or MLPROB file of MACH),
+and a file containing the phenotype of interest and covariates.
+
+Optionally, map information can be supplied (e.g.~the ``legend''
+files of HapMap).
+
+The dose/probability file may be supplied in filevector format,
+in which case \texttt{ProbABEL} will operate much faster and
+in low-RAM mode (approx.~128 MB). See the R libraries GenABEL and
+DatABEL on how to convert MACH and IMPUTE files to
+filevector format (functions: \texttt{mach2databel()} and
+\texttt{impute2databel()}, respectively).
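+
+As a minimal sketch of such a conversion (assuming recent versions of the
+GenABEL and DatABEL R packages; the exact arguments of
+\texttt{mach2databel()} may differ between versions, so please consult the
+package documentation), one could run in R:
+\begin{verbatim}
+# R sketch: convert MACH output to filevector format for ProbABEL
+library(GenABEL)   # conversion functions
+library(DatABEL)   # filevector support
+
+# assumed argument order: dose file, info file, output file name;
+# this writes test.dose.fvi/.fvd, which can then be passed to ProbABEL via -d
+dosefv <- mach2databel("test.mldose", "test.mlinfo", "test.dose")
+\end{verbatim}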
+
+\subsection{SNP information file}
+\label{ssec:infoin}
+In the simplest scenario, the SNP information file is an MLINFO
+file generated by MACH. This must be a space or tab-delimited file
+containing the SNP name, the coding for alleles 1 and 2 (e.g.~A, T, G or C),
+the frequency of allele 1, the minor allele frequency and two quality
+metrics (``Quality'', the average maximum posterior probability and
+``Rsq'', the proportion of variance decrease after imputations).
+
+Actually,
+for \texttt{ProbABEL}, it does not matter what is written in this file --
+this information is just brought forward to the output. However,
+\textbf{it is critical} that the number of columns is seven and the number
+of lines in the file is equal to the number of SNPs in the
+corresponding DOSE file (plus one for the header line).
+
+An example of the SNP information file content follows here (also
+to be found in \texttt{ProbABEL/example/test.mlinfo})
+
+\verbatiminput{../example/test.mlinfo}
+
+Note that a header line is present in the file. The file describes
+five SNPs.
+
+\subsection{Genomic predictor file}
+\label{ssec:dosein}
+
+Again, in the simplest scenario this is an MLDOSE or MLPROB file generated by MACH.
+Such a file starts with two special columns plus, for each of the SNPs
+under consideration, a column containing the estimated allele 1 dose (MLDOSE).
+In an MLPROB file, the two columns for each SNP correspond to the posterior probabilities
+that a person has two ($P_{A_1A_1}$) or one ($P_{A_1A_2}$) copies of allele 1.
+The first ``special'' column consists of a sequential id,
+followed by an arrow, followed by the study ID (the one specified in the
+MACH input files). The second column contains the method keyword
+(e.g.~``MLDOSE'').
+
+An example of the first few lines of an MLDOSE file for the
+five SNPs described in the SNP information file follows here (also
+to be found in the file \texttt{../example/test.mldose})
+
+\verbatiminput{short_test.mldose}
+
+\textbf{The order of SNPs in the SNP information file and DOSE-file
+must be the same}. This should be the case if you just used MACH outputs.
+
+Therefore, the number of columns in the genomic predictor file
+must be the same as the number of lines in the SNP information file plus one.
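+
+As a minimal sanity check (a sketch in base R, assuming the example file
+names used above and an MLDOSE-type predictor file), one could verify these
+dimensions before running \texttt{ProbABEL}:
+\begin{verbatim}
+# R sketch: check that the info and dose files are consistent
+info <- read.table("test.mlinfo", header = TRUE)
+dose <- read.table("test.mldose", header = FALSE)
+stopifnot(ncol(info) == 7)              # SNP information file: seven columns
+stopifnot(ncol(dose) == nrow(info) + 2) # 2 id/keyword columns + 1 per SNP
+\end{verbatim}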
+
+\subsection{Phenotypic file}
+\label{ssec:phenoin}
+
+The phenotypic data file contains phenotypic data, but also specifies the
+analysis model. There is a header line, specifying the variable names.
+The first column should contain personal study IDs. It is assumed
+that \textbf{both the total number and the order of these IDs are
+exactly the same as in the genomic predictor (MLDOSE) file described in
+previous section}. This is not difficult to arrange using e.g.~\texttt{R};
+an example is given in the \texttt{ProbABEL/examples} directory.
+
+\textbf{Missing data should be coded with the 'NA', 'N' or 'NaN' codes.} Any
+other coding will be converted to some number which will be used in the
+analysis! E.g.~coding missing values as '-999.9' will result in an analysis
+that treats -999.9 as a true measurement of the trait/covariates.
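+
+If your source data uses a numeric sentinel for missing values, a minimal
+sketch (in base R, using the hypothetical sentinel value '-999.9' and an
+arbitrary output file name) to recode it before analysis could be:
+\begin{verbatim}
+# R sketch: recode a numeric missing-value sentinel to NA
+phe <- read.table("height.txt", header = TRUE, stringsAsFactors = FALSE)
+phe[!is.na(phe) & phe == -999.9] <- NA      # NA is a code ProbABEL accepts
+write.table(phe, "height_recoded.txt", row.names = FALSE, quote = FALSE)
+\end{verbatim}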
+
+In the case of linear or logistic regression (programs \texttt{palinear} and
+\texttt{palogist}, respectively), the second column specifies the trait
+under analysis, while the third, fourth, etc.~provide information on
+covariates to be included into analysis.
+A few example lines of a phenotypic information file designed for
+linear regression analysis follow here (also
+to be found in \texttt{../example/height.txt})
+
+\verbatiminput{short_height.txt}
+
+Note again that the order of IDs is the same between the MLDOSE file
+and the phenotypic data file. The model specified by this file is
+$height \sim \mu + \textrm{sex} + \textrm{age}$, where $\mu$ is the intercept.
+
+Clearly, you can, for example, include a \texttt{sex $\times$ age} interaction term by
+adding another column containing the product of sex and age, as sketched below.
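+
+A minimal sketch of this (in R, assuming the column names \texttt{sex} and
+\texttt{age} shown in the example above; the output file name is arbitrary):
+\begin{verbatim}
+# R sketch: add a sex-by-age interaction column to the phenotype file
+phe <- read.table("height.txt", header = TRUE, stringsAsFactors = FALSE)
+phe$sex_age <- phe$sex * phe$age
+write.table(phe, "height_int.txt", row.names = FALSE, quote = FALSE)
+\end{verbatim}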
+
+For logistic regression, it is assumed that in the second column cases are
+coded as ``1'' and controls as ``0''. An couple of example lines of a phenotypic
+information file designed for logistic regression analysis follow here (also
+to be found in \texttt{../example/logist\_data.txt})
+
+\verbatiminput{short_logist_data.txt}
+
+You can see that in the first 10 people, there are three cases, as indicated
+by ``chd'' equal to one. The model specified by this file
+is $\textrm{chd} \sim \mu + \textrm{sex} + \textrm{age} + \textrm{othercov}$.
+
+In the case of the Cox proportional hazards model, the composition of the
+phenotypic input file is a bit different. In the second and
+third columns, you need to specify the outcome in terms of follow-up
+time (column two) and event status (column three, ``1'' if an event occurred
+and ``0'' if censored). Columns from the fourth one onwards specify
+covariates to be included in the analysis. A few example lines of
+a phenotypic information file designed for Cox proportional hazards model
+analysis follow here (also to be found in
+\texttt{../example/coxph\_data.txt})
+
+\verbatiminput{short_coxph_data.txt}
+
+You can see that for the first ten people, the event occurs for three of
+them, while for the other seven there is no event during the follow-up
+time, as indicated by the ``chd'' column. Follow-up time is specified in the preceding
+column. The covariates included in the model are age (presumably
+at baseline), sex and ``othercov''; thus the model, in terms of
+\texttt{R/survival} is \\ $\textrm{Surv}(\textrm{fuptime\_chd},
+\textrm{chd}) \sim \textrm{sex} + \textrm{age} + \textrm{othercov}$.
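+
+Since the model is expressed in \texttt{R/survival} terms, a minimal sketch of
+how the covariate-only (no SNP) model could be cross-checked in R is (purely an
+illustration, not part of the \texttt{ProbABEL} workflow):
+\begin{verbatim}
+# R sketch: fit the covariate-only Cox model with the survival package
+library(survival)
+phe <- read.table("coxph_data.txt", header = TRUE)
+coxph(Surv(fuptime_chd, chd) ~ sex + age + othercov, data = phe)
+\end{verbatim}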
+
+\subsection{Optional map file}
+If you would like map information (e.g.~base pair position) to
+be included in your outputs, you can supply a map file. These files follow
+the HapMap ``legend'' file format. For example, for the five SNPs considered
+above, the map file may look like
+
+\verbatiminput{../example/test.map}
+
+The order of the SNPs in the map file should follow that in the SNP information
+file. Only information from the second column -- the SNP location -- is
+actually used to generate the output.
+
+\section{Running an analysis}
+To run linear regression, you should use the program called
+\texttt{palinear}; for logistic analysis use \texttt{palogist}, and
+for the Cox proportional hazards model use \texttt{pacoxph} (all are
+found in the \texttt{ProbABEL/bin/} directory after you have compiled
+the program).
+
+There are a number of command line options you can specify to the
+\texttt{ProbABEL} analysis programs \texttt{palinear} or
+\texttt{palogist}. If you run either program without any arguments, you
+will get a short explanation of the command line options:
+\begin{verbatim}
+user at server:~$ palogist
+
+Usage: ../bin/palogist options
+
+Options:
+		--pheno       : phenotype file name
+		--info        : information (e.g. MLINFO) file name
+		--dose        : predictor (e.g. MLDOSE/MLPROB) file name
+		--map         : [optional] map file name
+		--nids        : [optional] number of people to analyse
+		--chrom       : [optional] chromosome (to be passed to output)
+		--out         : [optional] output file name (default is regression.out.txt)
+		--skipd       : [optional] how many columns to skip in predictor
+								    (dose/prob) file (default 2)
+		--ntraits     : [optional] how many traits are analysed (default 1)
+		--ngpreds     : [optional] how many predictor columns per marker
+								    (default 1 = MLDOSE; else use 2 for MLPROB)
+		--separat     : [optional] character to separate fields (default is space)
+		--score       : use score test
+		--no-head     : do not report header line
+		--allcov      : report estimates for all covariates (large outputs!)
+		--interaction : which covariate to use for interaction with SNP
+									  (default is no ineraction, 0)
+		--mmscore     : score test for association between a trait and genetic
+		    polymorphism, in samples of related individuals
+		--robust      : report robust (aka sandwich, aka Hubert-White) standard
+		    errors
+		--help        : print help
+\end{verbatim}
+%%		--interaction_only: like previos but without covariate acting in
+%%                    interaction with SNP
+%%									  (default is no ineraction, 0)
+
+
+\subsection{Basic analysis options}
+For a simple run, only three options are mandatory; they
+specify the files needed to run the regression analysis.
+
+These options are
+\texttt{--dose} (or \texttt{-d}),
+specifying the genomic predictor/MLDOSE file described in sub-section \ref{ssec:dosein};
+\texttt{--pheno} (or \texttt{-p}),
+specifying the phenotypic data file described in sub-section \ref{ssec:phenoin}; and
+\texttt{--info} (or \texttt{-i}),
+specifying the SNP information file described in sub-section \ref{ssec:infoin}.
+
+If you change to the \texttt{ProbABEL/example} directory you can run
+an analysis of height by running
+\begin{verbatim}
+user at server:~/ProbABEL/example/$ ../bin/palinear -p height.txt
+-d test.mldose -i test.mlinfo
+\end{verbatim}
+Output from the analysis will be directed to the
+\texttt{regression.out.csv} file.
+
+The analysis of a binary trait (e.g.~chd) can be run with
+\begin{verbatim}
+user at server:~/ProbABEL/example/$ ../bin/palogist -p logist_data.txt
+				 -d test.mldose -i test.mlinfo
+\end{verbatim}
+
+To run a Cox proportional hazards model, try
+\begin{verbatim}
+user at server:~/ProbABEL/example/$ ../bin/pacoxph -p coxph_data.txt
+				 -d test.mldose -i test.mlinfo
+\end{verbatim}
+
+Please have a look at the shell script files \texttt{example\_qt.sh},
+\texttt{example\_bt.sh} and \texttt{example\_all.sh} to get
+a better overview of the analysis options.
+
+To run an analysis with MLPROB files, you need to specify the MLPROB file
+with the \texttt{-d} option and also specify that there are two
+genetic predictors per SNP, e.g.~you can run a linear model with
+\begin{verbatim}
+user at server:~/ProbABEL/example/$ ../bin/palinear -p height.txt
+				 -d test.mlprob -i test.mlinfo
+				 --ngpreds=2
+\end{verbatim}
+
+\subsection{Advanced analysis options}
+The option \texttt{--interaction} allows you to include an interaction
+between the SNP and any covariate. If, for example, your model is
+\begin{equation*}
+  \textrm{trait} \sim \textrm{sex} + \textrm{age} + \textrm{SNP},
+\end{equation*}
+running the program with the option \texttt{--interaction=2} will model
+\begin{equation*}
+  \textrm{trait} \sim \textrm{sex} + \textrm{age} + \textrm{SNP} +
+  \textrm{age} \times \mathrm{SNP}.
+\end{equation*}
+The option \texttt{--robust} allows you to compute so-called
+``robust'' (a.k.a.~``sandwich'', a.k.a.~Huber-White) standard errors
+(cf.~section \ref{sec:methodology} ``Methodology'' for details).
+
+With the option \texttt{--mmscore} a score test for association
+between a trait and genetic polymorphisms in samples of related
+individuals is performed. A file containing the inverse of the
+variance-covariance matrix is passed as the parameter of this option,
+e.g.~\texttt{--mmscore <filename>}. The first column of this file must
+contain the id names exactly as in the phenotype file, BUT OMITTING
+people with no measured phenotype. The rest is a matrix. When the
+\texttt{--mmscore} argument is used, the phenotype file may contain any
+number of covariates (this is different from previous versions). The
+first column contains the id names, the second the trait. The others are
+covariates.
+
+An example of how a polygenic object estimated by GenABEL can be used
+with ProbABEL is provided in \texttt{../example/mmscore.R}.
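+
+The essential idea is sketched below (in R; the object and element names,
+such as \texttt{gendata}, \texttt{kin} and \texttt{h2\$InvSigma}, are
+assumptions -- please check \texttt{mmscore.R} and the \texttt{GenABEL}
+documentation for the exact code):
+\begin{verbatim}
+# R sketch: export the inverse variance-covariance matrix estimated by
+# GenABEL's polygenic() for use with ProbABEL's --mmscore option
+library(GenABEL)
+# gendata: GenABEL data object; kin: kinship matrix (both assumed to exist)
+h2 <- polygenic(height ~ sex + age, kinship.matrix = kin, data = gendata)
+InvSigma <- h2$InvSigma   # row names assumed to hold ids of measured people
+out <- data.frame(id = rownames(InvSigma), InvSigma)
+write.table(out, "mmscore_InvSigma.txt",
+            row.names = FALSE, col.names = FALSE, quote = FALSE)
+\end{verbatim}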
+
+Though technically \texttt{--mmscore} allows for inclusion of multiple
+covariates, these should be kept to a minimum as this is a score test. We
+suggest that any covariates explaining an essential proportion of
+variance should be fit as part of \texttt{GenABEL}'s
+\texttt{polygenic} procedure.
+
+%Option \texttt{--interaction\_only} is like \texttt{--interaction} but does not
+%include in the model the main effect of the covariate, which is acting in
+%interaction with SNP. This option is useful when running \texttt{--mmscore},
+%in which case the main effect should normally be estimated in the polygenic
+%model and only the interaction term in the \texttt{ProbABEL} analysis.
+
+\subsection{Running multiple analyses at once: \texttt{probabel.pl}}
+
+The Perl script \texttt{bin/probabel.pl\_example} is a handy
+wrapper for the \texttt{ProbABEL} programs. To start using it, the
+configuration file \texttt{bin/probabel\_config.cfg\_example} needs to
+be edited. The configuration file consists of five columns. Each
+column except the first is a pattern for files produced by
+\texttt{MACH} (imputation software). The column named ``cohort'' is an
+identifying name of a population (``ERGO'' in this example), the
+column ``mlinfo\_path'' is the full path to mlinfo files, including a
+pattern where the chromosome number has been replaced by
+\texttt{\_.\_chr\_.\_}. The columns ``mldose\_path'',
+``mlprobe\_path'' and ``legend\_path'' are paths and patterns for
+``mldose'', ``mlprob'' and ``legend'' files, respectively. These also
+need to include the pattern for the chromosome as used in the column
+for the ``mlinfo'' files. You probably also have to change the variable
+\texttt{\$config} in the script to point to the full path of the
+configuration file and the variable \texttt{@anprog} to point to the full
+path of the \texttt{ProbABEL} programs.
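+
+Purely as a hypothetical illustration of the layout described above (the
+actual field separator, header and file naming scheme should be taken from
+the shipped \texttt{bin/probabel\_config.cfg\_example} file), one entry could
+look like:
+\begin{verbatim}
+cohort mlinfo_path mldose_path mlprobe_path legend_path
+ERGO /data/ergo._._chr_._.mlinfo /data/ergo._._chr_._.mldose
+     /data/ergo._._chr_._.mlprob /data/ergo._._chr_._.legend
+\end{verbatim}
+Here the second line has been wrapped for display; in the real file each
+cohort occupies a single line.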
+
+
+\section{Output file format}
+Let us consider what comes out of the linear regression analysis
+described in the previous section. After the analysis has run, in
+the output file you will find something like
+\begin{small}
+\verbatiminput{short_height.out.csv}
+\end{small}
+
+Here, only the first three lines of output are shown. Note that lines
+starting with \texttt{...} are continuations of the
+previous line -- they have just been wrapped here so that these long
+lines remain visible.
+
+The header provides a short description of what can be found in a
+specific column. The first column provides the SNP name and the
+next six are descriptions taken directly from the
+SNP information file. Therefore, these describe the allele frequencies and
+the quality in your total set of imputations, not necessarily in the data under
+analysis.
+
+In contrast, starting with the next column, named \texttt{n},
+the output concerns the data analysed. Column 8 (\texttt{n}) gives the
+number of subjects for whom complete phenotypic information was available.
+At this point, unless you have complete measurements on all
+subjects, you should feel alarmed if the number here is exactly the
+number of people in the file -- this probably indicates you did not code
+missing values according to the \texttt{ProbABEL} format ('NA', 'NaN', or 'N').
+
+The next column, nine (``Mean\_predictor\_allele''), gives the estimated
+frequency of the predictor allele in subjects with complete phenotypic data.
+
+If the \texttt{--chrom} option was used, in the next column you will
+find the value specified by this option. If the \texttt{--map} option was
+used, in the subsequent column you will find the map location taken from
+the map file. The subsequent columns provide the coefficients of the
+regression of the phenotype onto the genotype, the corresponding standard
+errors, and the log-likelihood % $\chi^2$
+of the model at the point of the MLEs.
+
+
+\section{Preparing input files}
+In the \texttt{ProbABEL/bin} directory you can find the
+\texttt{prepare\_data.R} file -- an R script that arranges phenotypic
+data in the right format. Please read this script for details.
+
+\section{Memory use and performance}
+Maximum likelihood regression is implemented in
+\texttt{ProbABEL}. With 6,000 people and 2.5 million SNPs, a
+genome-wide scan is completed in less than an hour for a linear model
+with 1-2 covariates and overnight for logistic regression or the Cox
+proportional hazards model.
+
+Memory is an issue with \texttt{ProbABEL} -- large chromosomes,
+such as chromosome 1, consumed up to 5 GB of RAM with 6,000 people.
+
+\section{Methodology}
+\label{sec:methodology}
+\subsection{Analysis of population-based data}
+\subsubsection{Linear regression assuming normal distribution}
+Standard linear regression theory is used to estimate the regression
+coefficients and their standard errors. We assume a linear model with
+expectation
+\begin{equation}
+  E[\mathbf{Y}] = \mathbf{X}\, \boldsymbol{\beta}
+\label{expectation}
+\end{equation}
+and variance-covariance matrix
+$$
+\mathbf{V} = \sigma^2 \mathbf{I},
+$$
+where $\mathbf{Y}$ is the vector of phenotypes of interest,
+$\mathbf{X}$ is the design matrix, $\boldsymbol{\beta}$ is the vector
+of regression parameters, $\sigma^2$ is the variance and $\mathbf{I}$
+is the identity matrix.
+
+The maximum likelihood estimates (MLEs) for the regression parameters
+are given by
+\begin{equation}
+  \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}
+\end{equation}
+and the MLE of the residual variance is
+\begin{equation}
+  \hat{\sigma}^2 = \frac{(\mathbf{Y} - \mathbf{X}\hat{\beta})^T
+    (\mathbf{Y} - \mathbf{X}\hat{\beta})} {N-r_X},
+\end{equation}
+where $N$ is the number of observations and $r_X$ is the rank of
+$\mathbf{X}$ (i.e.~the number of columns of the design matrix).
+
+The variance-covariance matrix for the parameter estimates under
+the alternative hypothesis can be computed as
+\begin{equation}
+\var_{\hat{\beta}} = \hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}.
+\end{equation}
+
+For the $j$-th element $\hat{\beta}(j)$ of the vector of estimates, the standard
+error under the alternative hypothesis is given by the square root of the
+corresponding diagonal element of the above matrix, $\var_{\hat{\beta}}(jj)$,
+and the Wald test can be computed with
+$$
+T^2(j) = \frac{ \hat{\beta}(j)^2 }{ \var_{\hat{\beta}}(jj) },
+$$
+which asymptotically follows the $\chi^2$ distribution with one degree of
+freedom under the null hypothesis.
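+
+As a small illustration, the estimators above can be written out directly in
+R (a sketch assuming a numeric design matrix \texttt{X} that includes the
+intercept column, and a trait vector \texttt{Y}; it is equivalent to what
+\texttt{lm()} computes):
+\begin{verbatim}
+# R sketch: OLS estimates, residual variance and per-parameter Wald tests
+beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y
+res      <- Y - X %*% beta.hat
+sigma2   <- as.numeric(t(res) %*% res) / (length(Y) - ncol(X))
+var.beta <- sigma2 * solve(t(X) %*% X)
+T2       <- beta.hat^2 / diag(var.beta)             # Wald chi-square, 1 d.f.
+p.values <- pchisq(T2, df = 1, lower.tail = FALSE)
+\end{verbatim}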
+
+When testing significance for more than one parameter simultaneously,
+several alternatives are available. Let us first partition the vector of
+parameters into two components, $\beta = (\beta_g,\beta_x)$, where our
+interest is in testing the parameters contained in $\beta_g$ (SNP effects),
+while $\beta_x$ (e.g.~the effects of sex, age, etc.) are considered nuisance
+parameters. Let us define the vector of the parameters of interest
+which are fixed to certain values under the null hypothesis as
+$\beta_{g,0}$.
+
+Firstly, the likelihood ratio test can be obtained with
+$$
+LRT = 2 \left(\mathrm{logLik}(\hat{\beta}_g,\hat{\beta}_x) -
+\mathrm{logLik}(\beta_{g,0},\hat{\beta}_x) \right),
+$$
+which under the null hypothesis is asymptotically distributed as
+$\chi^2$ with the
+number of degrees of freedom equal to the number of parameters specified
+by $\beta_g$. Assuming the normal distribution, the log-likelihood of a
+model specified by the vector of parameters $\beta$ and residual variance
+$\sigma^2$ can be computed as
+$$
+\mathrm{logLik}(\beta,\sigma^2) =
+-\frac{1}{2} \left( N \cdot \log_e \sigma^2 +
+(\mathbf{Y} - \mathbf{X}\beta)^T (\mathbf{I}/\sigma^2) (\mathbf{Y} -
+\mathbf{X}\beta) \right).
+$$
+
+Secondly, the Wald test can be used; for that the inverse
+variance-covariance matrix of $\hat{\beta}_g$ should be computed as
+$$
+\var_{\hat{\beta}_g}^{-1} = \var_{\hat{\beta}}^{-1}(g,g) -
+\var_{\hat{\beta}}^{-1}(g,x) \left(
+  \var_{\hat{\beta}}^{-1}(x,x) \right)^{-1}
+  \var_{\hat{\beta}}^{-1}(x,g),
+$$
+where the $\var_{\hat{\beta}}^{-1}(a,b)$ denote sub-matrices of the
+inverse of the variance-covariance matrix of $\hat{\beta}$, involving
+either only parameters of interest $(g,g)$, nuisance parameters
+$(x,x)$ or combination of these $(x,g)$, $(g,x)$.
+
+The Wald test statistic is then computed as
+$$
+W^2 = (\hat{\beta}_g - \beta_{g,0})^T \,
+\var_{\hat{\beta}_g}^{-1} (\hat{\beta}_g - \beta_{g,0}),
+$$
+which asymptotically follows the $\chi^2$ distribution with the number
+of degrees of freedom equal to the number of parameters specified by
+$\beta_g$. The Wald test is generally computationally easier than the
+LRT, because it avoids estimation of the model specified by the
+parameter vector $(\beta_{g,0},\hat{\beta}_x)$.
+
+Lastly, similar to the Wald test, the score test can be performed by use
+of $\var_{(\beta_{g,0},\hat{\beta}_x)}$ instead of $\var_{\hat{\beta}}$.
+
+%% Comparative advantages of these testing approaches in the context of GWAS will be discussed in
+%% ''\titleref{implementation}'' section (\ref{implementation}).
+
+\subsubsection{Logistic regression}
+For logistic regression, the procedures to obtain the
+parameter estimates, their variance-covariance matrix, and the tests are
+similar to those outlined above, with several modifications.
+
+The expectation of the binary trait is defined as the expected
+probability of the event, given by the logistic
+function
+$$
+E[\mathbf{Y}] = \pi = \frac{ 1 }{ 1 + e^{-(\mathbf{X}\beta)} }.
+$$
+The estimates of the parameters are obtained not in one
+step, as is the case for the linear model, but using an iterative
+procedure (iteratively re-weighted least squares). This
+procedure is not described here for the sake of brevity.
+
+The log-likelihood of the data is computed using the
+binomial probability formula:
+$$
+\mathrm{logLik}(\beta) = \mathbf{Y}^T \log_e \pi + (\mathbf{1} -
+\mathbf{Y})^T \log_e (\mathbf{1}-\pi),
+$$
+where $\log_e \pi$ is a vector obtained by taking the natural
+logarithm of every value contained in the vector $\pi$.
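+
+As a small cross-check of this formula (a sketch in R, assuming a 0/1 outcome
+vector \texttt{Y} and a design matrix \texttt{X} that already contains the
+intercept column):
+\begin{verbatim}
+# R sketch: logistic fit and the binomial log-likelihood written out
+fit    <- glm(Y ~ X - 1, family = binomial)
+pi.hat <- fitted(fit)
+manual <- sum(Y * log(pi.hat) + (1 - Y) * log(1 - pi.hat))
+manual                  # should agree with logLik(fit)
+logLik(fit)
+\end{verbatim}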
+
+\subsubsection{Robust variance-covariance matrix of parameter estimates}
+For the linear model, these are computed using the formula
+$$
+\var_r = (\mathbf{X}^T\mathbf{X})^{-1} (\mathbf{X}^T\mathbf{R}\mathbf{X})
+(\mathbf{X}^T\mathbf{X})^{-1},
+$$
+where $\mathbf{R}$ is a diagonal matrix containing the squared residuals
+of $\mathbf{Y}$. The
+same formula may be used for the ``standard'' analysis, in which case
+the elements of the $\mathbf{R}$ matrix are constant, namely the mean
+residual sum of squares (the estimate of $\sigma^2$).
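+
+Continuing the earlier linear-regression sketch (reusing the assumed objects
+\texttt{X}, \texttt{Y} and \texttt{beta.hat}; this is an illustration, not
+\texttt{ProbABEL} code), the robust variance could be computed as:
+\begin{verbatim}
+# R sketch: robust (sandwich) variance for the linear model
+res    <- as.numeric(Y - X %*% beta.hat)
+bread  <- solve(t(X) %*% X)
+meat   <- t(X) %*% diag(res^2) %*% X    # X^T R X with R = diag(res^2)
+var.r  <- bread %*% meat %*% bread
+se.rob <- sqrt(diag(var.r))
+\end{verbatim}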
+
+Similarly, the robust matrix for logistic regression is computed with
+$$
+\var_r = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} (\mathbf{X}^T\mathbf{R}\mathbf{X})
+(\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1},
+$$
+where $\mathbf{W}$ is the diagonal matrix
+of ``weights'' used in logistic regression.
+
+
+\subsubsection{Cox proportional hazards model}
+The implementation of the Cox proportional hazards model used
+\index{Cox proportional hazards model}
+\index{proportional hazards model}
+\index{regression!Cox proportional hazards}
+in \texttt{ProbABEL} is entirely based on the code of \texttt{R}
+library \texttt{survival} developed by Thomas Lumley
+(function \texttt{coxfit2}), and is therefore not described here.
+
+Many thanks to Thomas for making his code available under GNU GPL!
+
+\subsection{Analysis of pedigree data}
+The framework for analysis of pedigree data follows the two-step logic
+developed in the works of Aulchenko \emph{et al}. (2007) and Chen and
+Abecasis (2007). The general analysis model is a linear mixed model, which
+defines the expectation of the trait as
+$$
+E[\mathbf{Y}] = \mathbf{X} \mathbf{\beta},
+$$
+identical to that defined for the linear model (cf.~equation~\ref{expectation}).
+To account for correlations between the phenotypes of
+relatives which may be induced by family relations, the variance-covariance
+matrix is defined to be proportional to a linear combination of the
+identity matrix $\mathbf{I}$ and the relationship matrix $\mathbf{\Phi}$:
+$$
+\mathbf{V}_{\sigma^2,h^2} = \sigma^2 \left( 2 h^2 \mathbf{\Phi} + (1-h^2)
+\mathbf{I} \right),
+$$
+where $h^2$ is the heritability of the trait.
+The relationship matrix $\mathbf{\Phi}$ is twice the matrix containing
+the coefficients of kinship between all pairs of individuals under consideration;
+its estimation is discussed in a separate section ``\titleref{kinship}'' (\ref{kinship}).
+
+Estimation of a model defined in such a way is possible by numerical
+maximization of the likelihood function; however, the estimation of
+this model for large pedigrees is laborious, and is not
+computationally feasible when hundreds of thousands to millions of SNPs
+are to be tested in the context of GWAS, as we have demonstrated
+previously (Aulchenko \emph{et al}., 2007).
+
+\subsubsection{Two-step score test for association}
+A two-step score test approach is therefore used to decrease the computational
+burden. Let us first re-define the expectation of the trait by splitting the
+design matrix into two parts: the ``base'' part $\mathbf{X}_x$, which includes all
+terms that do not change across the SNP models fit in a GWAS (e.g.~the effects of sex, age, etc.),
+and the part including SNP information, $\mathbf{X_g}$:
+$$
+E[\mathbf{Y}] = \mathbf{X}_x \mathbf{\beta}_x +
+\mathbf{X}_g \mathbf{\beta}_g.
+$$
+Note that the latter design matrix may include not only the main SNP effect, but
+e.g.\ SNP by environment interaction terms.
+
+At the first step, a linear mixed model not including SNP effects
+$$
+E[\mathbf{Y}] = \mathbf{X}_x \mathbf{\beta}_x
+$$
+is fitted. The maximum likelihood estimates (MLEs) of the model parameters (regression coefficients for
+the fixed effects $\hat{\mathbf{\beta}}_x$, the
+residual variance $\hat{\sigma}^2_x$ and the heritability $\hat{h}^2_x$) can
+be obtained by numerical maximization of the likelihood function
+$$
+\mathrm{logLik}(\beta_x,h^2,\sigma^2) = -\frac{1}{2} \left(
+  \log_e|\mathbf{V}_{\sigma^2,h^2}| + (\mathbf{Y} - \mathbf{X}_x
+  \beta_x)^T \, \mathbf{V}_{\sigma^2,h^2}^{-1} \, (\mathbf{Y} -
+  \mathbf{X}_x \beta_x) \right ),
+$$
+where $\mathbf{V}_{\sigma^2,h^2}^{-1}$ is the inverse and
+$|\mathbf{V}_{\sigma^2,h^2}|$ is the determinant of the variance-covariance matrix.
+
+At the second step, the unbiased estimates of the fixed effects of the terms
+involving SNP are obtained with
+$$
+\hat{\beta}_g = (\mathbf{X}^T_g
+\mathbf{V}^{-1}_{\hat{\sigma}^2,\hat{h}^2}
+\mathbf{X}_g)^{-1}
+\mathbf{X}^T_g \mathbf{V}^{-1}_{\hat{\sigma}^2,\hat{h}^2}
+\mathbf{R}_{\hat{\beta}_x},
+$$
+where $\mathbf{V}^{-1}_{\hat{\sigma}^2,\hat{h}^2}$ is the variance-covariance matrix at the point
+of the MLE estimates of $\hat{h}^2_x$ and $\hat{\sigma}^2_x$ and
+$\mathbf{R}_{\hat{\beta}_x} = \mathbf{Y} - \mathbf{X}_x \hat{\beta}_x$ is the
+vector of residuals obtained from the base regression model. Under the null
+model, the variance-covariance matrix of the parameter estimates is defined
+as
+$$
+\var_{\hat{\beta}_g} = \hat{\sigma}^2_x (\mathbf{X}^T_g \mathbf{V}^{-1}_{\hat{\sigma}^2,\hat{h}^2} \mathbf{X}_g)^{-1}.
+$$
+Thus the score test for joint significance of the terms involving SNP can be obtained with
+$$
+T^2 = (\hat{\beta}_g - \beta_{g,0})^T \,
+\var_{\hat{\beta}_g}^{-1} (\hat{\beta}_g - \beta_{g,0}),
+$$
+where $\beta_{g,0}$ are the values of parameters fixed under the null model.
+This test statistic under the null hypothesis asymptotically follows the
+$\chi^2$ distribution with the number of degrees
+of freedom equal to the number of parameters tested.
+The significance of an individual $j$-th element of the vector
+$\hat{\beta}_g$ can be tested with
+$$
+T^2_j = \hat{\beta}_{g}^2(j) \var_{\hat{\beta}_g}^{-1}(jj),
+$$
+where $\hat{\beta}_{g}^2(j)$ is the square of the $j$-th element of
+the vector of estimates $\hat{\beta}_{g}$, and
+$\var_{\hat{\beta}_g}^{-1}(jj)$ corresponds to the $j$-th diagonal
+element of $\var_{\hat{\beta}_g}^{-1}$. The latter statistic
+asymptotically follows $\chi^2_1$.
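+
+A compact sketch of this second step in R (assuming the first-step quantities
+already exist as objects: \texttt{Vinv}, the inverse variance-covariance
+matrix at the MLEs; \texttt{R}, the residuals of the base model;
+\texttt{Xg}, the SNP design matrix; and \texttt{sigma2.x}, the residual
+variance estimate; $\beta_{g,0}$ is taken to be zero):
+\begin{verbatim}
+# R sketch: second step of the mixed-model score test
+XtVi    <- t(Xg) %*% Vinv
+beta.g  <- solve(XtVi %*% Xg) %*% XtVi %*% R
+var.g   <- sigma2.x * solve(XtVi %*% Xg)
+T2      <- as.numeric(t(beta.g) %*% solve(var.g) %*% beta.g)
+p.value <- pchisq(T2, df = length(beta.g), lower.tail = FALSE)
+\end{verbatim}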
+
+\subsubsection{Estimation of the kinship matrix}
+\label{kinship}
+
+The relationship matrix $\mathbf{\Phi}$ used in estimation of the
+linear mixed model for pedigree data is twice the matrix containing
+the coefficients of kinship between all pairs of individuals under consideration.
+This coefficient is defined as the probability that two gametes randomly sampled
+from each member of the pair are identical-by-descent (IBD), that is, they are copies
+of exactly the same ancestral allele. The expectation of kinship
+can be estimated from pedigree data using standard methods; for example, the
+kinship for two outbred sibs is $1/4$, for a grandchild-grandparent pair it is $1/8$, etc.
+For an outbred person, the kinship coefficient with him- or herself is $1/2$ --
+that is, two gametes sampled from this person at random are IBD only if the
+same gamete is sampled twice. However, if the person is inbred, there is a
+chance that the maternal and paternal chromosomes are also IBD. The probability
+of this is characterized by the kinship between the individual's parents, which
+is defined as the individual's inbreeding coefficient, $F$. In this case, the
+kinship coefficient of the individual with him- or herself is $(1+F)/2$.
+Similar logic applies to the computation of the kinship
+coefficient for other types of pairs in inbred pedigrees.
+
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/genabel -r 646


More information about the Genabel-commits mailing list