[Depmix-commits] r280 - papers/jss

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Wed Jul 8 13:27:09 CEST 2009


Author: ingmarvisser
Date: 2009-07-08 13:27:08 +0200 (Wed, 08 Jul 2009)
New Revision: 280

Added:
   papers/jss/depmix-manual.Rnw
Log:
Added .Rnw file of the jss paper

Added: papers/jss/depmix-manual.Rnw
===================================================================
--- papers/jss/depmix-manual.Rnw	                        (rev 0)
+++ papers/jss/depmix-manual.Rnw	2009-07-08 11:27:08 UTC (rev 280)
@@ -0,0 +1,882 @@
+
+% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
+%&LaTeX
+
+\documentclass[]{report}
+
+\usepackage[]{amsmath, amsfonts, amstext, amsthm}
+
+\batchmode
+
+\usepackage{Sweave}
+
+\usepackage{Rd}
+
+\usepackage[round]{natbib}
+
+\title{depmix: An R-package for fitting mixture models on mixed
+multivariate data with Markov dependencies}
+
+\author{Ingmar Visser\thanks{
+Correspondence concerning this manual should be addressed to:
+Ingmar Visser, Department of Psychology, University of Amsterdam,
+Roetersstraat 15, 1018 WB, Amsterdam, The Netherlands}\\
+Department of Psychology, University of Amsterdam\\
+i.visser at uva.nl}
+
+\date{\today}
+
+%%\VignetteIndexEntry{depmix: An R-package for fitting mixture models on mixed multivariate data with Markov dependencies}
+%%\VignetteDepends{depmix}
+%%\VignetteKeywords{R, latent/hidden Markov model, latent class model, finite mixture model, direct optimization, gradients}
+%%\VignettePackage{depmix}
+
+\SweaveOpts{prefix.string=graphs/depmix}
+
+\newcommand{\vc}{\mathbf}
+\newcommand{\mat}{\mathbf}
+
+% \newcommand{\pkg}{\texttt}
+% \newcommand{\code}{\texttt}
+
+\begin{document}
+
+\bibliographystyle{plainnat}
+
+\maketitle
+
+\begin{abstract} 
+	
+	\pkg{depmix} implements a general class of mixture models with
+	Markovian dependencies between the latent states over time in the
+	R programming language
+	\citep{R2007}.  This includes standard Markov models, latent/hidden
+	Markov models, and latent class and finite mixture distribution models.
+	The models can be fitted on mixed multivariate data with multinomial
+	and/or gaussian distributions.  Parameters can be estimated subject to
+	general linear constraints, and with optional inclusion of regression
+	on (time-dependent) covariates.  Parameters estimation is done through
+	a direct optimization approach with gradients using the \code{Rdonlp2}
+	optimization routine.  A number of illustrative examples are included.
+
+\end{abstract}
+
+\tableofcontents
+
+\chapter{Introduction}
+
+Markov and latent Markov models are frequently used in the social
+sciences, in different areas and applications.  In psychology, they
+are used for modelling learning processes; see \citet{Wickens1982} for an
+overview and \citet{Schmittmann2006} for a recent application.  In
+economics, latent Markov models are commonly used as regime switching
+models, see e.g.\ \citet{Kim1994} and \citet{Ghysels1994}.  Further
+applications include speech recognition \citep{Rabiner1989}, EEG analysis
+\citep{Rainer2000}, and genetics \citep{Krogh1998}.  In those latter areas
+of application, latent Markov models are usually referred to as hidden
+Markov models.
+
+The \pkg{depmix} package was motivated by the fact that Markov models
+are used commonly in the social sciences, but no comprehensive package
+was available for fitting such models.  Common programs for Markovian
+models include Panmark \citep{Panmark3-96}, and for latent class models
+Latent Gold 
+\citep{LatentGold3-2003}.  Those programs lack a number of
+important features, besides not being freely available.  In
+particular, \pkg{depmix}: 1) handles multiple case, or multiple group,
+analysis; 2) handles arbitrarily long time series; 3) estimates models
+with general linear constraints between parameters; 4) analyzes mixed
+distributions, i.e., combinations of categorical and continuous
+observed variables; 5) fits mixtures of latent Markov models to deal
+with population heterogeneity; 6) can fit models with covariates.
+Although \pkg{depmix} is specifically meant for dealing with
+longitudinal or time series data, for say $T>100$, it can also handle
+the limiting case with $T=1$.  In that case, there are no time
+dependencies between observed data, and the model reduces to a finite
+mixture model, or a latent class model.  In the next chapter, an outline
+is provided of the model and the likelihood equations.  In the
+chapters after that a number of examples are presented.
+
+
+\section*{Acknowledgements} 
+
+I am indebted to many people for providing help in writing this package.
+First and foremost Maartje Raijmakers and Verena Schmittmann tested
+countless earlier versions, spotted bugs and suggested many features.
+Moreover, Maartje Raijmakers provided the discrimination data set.  Han van
+der Maas provided the speed-accuracy data and thereby necessitated
+implementing models with time-dependent covariates.  Conor Dolan and Raoul
+Grasman both provided valuable advice on statistics in general and
+optimization in particular.
+
+
+\chapter{Dependent mixture models}
+
+The data considered here have the general form $O_{1}^{1}, \ldots,
+O_{1}^{m}$, $O_{2}^{1}, \ldots, O_{2}^{m}$, \ldots, $O_{T}^{1}, \ldots,
+O_{T}^{m}$ for an $m$-variate time series of length $T$.  As an
+example, consider a time series of responses generated by a single
+subject in a reaction time experiment.  The data consist of three
+variables: reaction time, accuracy, and a covariate, a pay-off
+factor that determines the reward for speed and accuracy.  These
+variables are measured in three series of 168, 134, and 137 occasions
+respectively.  Below, a summary is provided for these data, as well as
+a plot of the first time series, which is selected by \code{nind=1}.
+
+<<echo=FALSE, results=hide>>=
+.libPaths(new="/Users/ivisser/Library/R/library/")
+library(depmix)
+load("models.Rda")
+@
+
+<<>>=
+data(speed)
+summary(speed)
+@
+\begin{figure}[htbp]
+  \begin{center}
+<<fig=TRUE>>=
+plot(speed,nind=1) 
+@
+	\caption{Reaction times, accuracy and pay-off values for
+	the first series of responses in dataset \texttt{speed}.}
+  \end{center}
+\end{figure}
+
+The latent Markov model is commonly associated with data of this type,
+although usually only multinomial variables are considered.  However,
+common estimation procedures, such as those implemented in
+\citet{Panmark3-96}, are not suitable for long time series due to
+underflow problems.  In contrast, the hidden Markov model is typically
+only used for `long' univariate time series.  In the next section, the
+likelihood and estimation procedure for the hidden Markov model are
+described, given data of the above form.
+
+The dependent mixture model is defined by the following elements:
+\begin{enumerate}
+	
+	\item a set $\vc{S}$ of latent classes or states $S_{i},\, i=1,
+	\ldots , n$,
+	
+	\item a matrix $\mat{A}$ of transition probabilities $a_{ij}$ for
+	the transition from state $S_{i}$ to state $S_{j}$,
+	
+	\item a set $\vc{B}$ of observation functions $b_{j}(\cdot)$ that
+	provide the conditional probabilities associated with latent state
+	$S_{j}$,
+	
+	\item a vector $\pmb{\pi}$ of latent state initial probabilities
+	$\pi_{i}$.
+\end{enumerate}
+When transitions are added to the latent class model, it is more 
+appropriate to refer to the classes as states: the word class 
+suggests a stable, trait-like attribute, whereas a state can change 
+over time. 
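+
+As a concrete illustration (the numbers here are arbitrary), a
+2-state model could have
+\begin{equation*}
+	\mat{A} = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix},
+	\qquad
+	\pmb{\pi} = (1, 0),
+\end{equation*}
+so that the process starts in state $S_{1}$ with certainty and moves
+to state $S_{2}$ with probability $0.1$ at each occasion.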
+
+
+\section{Likelihood}
+
+The loglikelihood of hidden Markov models is usually computed by the
+so-called forward-backward algorithm \citep{Bau66,Rab89}, or rather by
+the forward part of this algorithm.  \cite{Lys2002} changed the forward
+algorithm in such a way as to allow computing the gradients of the
+loglikelihood at the same time.  They start by rewriting the
+likelihood as follows (for ease of exposition the dependence on the
+model parameters is dropped here):
+\begin{equation}
+	L_{T} = Pr(\vc{O}_{1}, \ldots, \vc{O}_{T}) = \prod_{t=1}^{T} 
+Pr(\vc{O}_{t}|\vc{O}_{1}, 
+	\ldots, \vc{O}_{t-1}), 
+	\label{condLike}
+\end{equation}
+where $Pr(\vc{O}_{1}|\vc{O}_{0}):=Pr(\vc{O}_{1})$. Note that for a 
+simple, i.e.\ observed, Markov chain these probabilities reduce to 
+$Pr(\vc{O}_{t}|\vc{O}_{1},\ldots, 
+\vc{O}_{t-1})=Pr(\vc{O}_{t}|\vc{O}_{t-1})$.
+The log-likelihood can now be expressed as:
+\begin{equation}
+	l_{T} = \sum_{t=1}^{T} \log[Pr(\vc{O}_{t}|\vc{O}_{1}, \ldots, 
+\vc{O}_{t-1})].
+	\label{eq:condLogl}
+\end{equation}
+
+To compute the log-likelihood, \cite{Lys2002} define the following 
+(forward) recursion:
+\begin{align}
+	\phi_{1}(j) &:= Pr(\vc{O}_{1}, S_{1}=j) = \pi_{j} b_{j}(\vc{O}_{1}) 
+	\label{eq:fwd1} \\
+\begin{split}
+	\phi_{t}(j) &:= Pr(\vc{O}_{t}, S_{t}=j|\vc{O}_{1}, \ldots, 
+\vc{O}_{t-1}) \\
+	&= \sum_{i=1}^{N} [\phi_{t-1}(i)a_{ij}b_{j}(\vc{O}_{t})] \times 
+(\Phi_{t-1})^{-1},
+	\label{eq:fwdt} 
+\end{split} 
+\end{align}
+where $\Phi_{t}=\sum_{i=1}^{N} \phi_{t}(i)$. Combining 
+$\Phi_{t}=Pr(\vc{O}_{t}|\vc{O}_{1}, \ldots, \vc{O}_{t-1})$, and 
+equation~(\ref{eq:condLogl}) gives the following expression for the 
+log-likelihood:
+\begin{equation}
+	l_{T} = \sum_{t=1}^{T} \log \Phi_{t}.
+	\label{eq:logl}
+\end{equation}
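+
+To make the recursion concrete, the following chunk gives a minimal R
+sketch (not the package's internal implementation; the function name
+\code{fbLoglik} is ours).  Here \code{A} is the transition matrix,
+\code{init} the vector of initial state probabilities, and \code{B} a
+$T \times n$ matrix with entries $b_{j}(\vc{O}_{t})$:
+
+<<eval=FALSE>>=
+fbLoglik <- function(init, A, B) {
+    logl <- 0
+    phi <- init * B[1, ]              # phi_1(j) = pi_j b_j(O_1)
+    for (t in 1:nrow(B)) {
+        # phi was rescaled by Phi_{t-1} at the end of the previous
+        # iteration, as required by the recursion for phi_t(j)
+        if (t > 1) phi <- as.vector(phi %*% A) * B[t, ]
+        Phi <- sum(phi)               # Phi_t = Pr(O_t|O_1,...,O_{t-1})
+        logl <- logl + log(Phi)       # l_T = sum_t log Phi_t
+        phi <- phi / Phi              # rescaling prevents underflow
+    }
+    logl
+}
+@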
+
+The above forward recursion can readily be generalized to mixture 
+models, in which it is assumed that the data are realizations of a 
+number of different latent Markov models (LMMs) and the goal is to 
+assign posterior probabilities to sequences of observations. This 
+situation occurs, 
+for example, in learning data where different learning strategies may 
+lead to different answer patterns. From an observed sequence of 
+responses, it may not be immediately clear from which learning 
+process they stem. Hence, it is interesting to consider a mixture of 
+latent Markov models which incorporate restrictions that are 
+consistent with each of the learning strategies. 
+
+To compute the likelihood of a mixture of $K$ models, define the 
+forward recursion variables as follows (these variables now have an 
+extra index $k$ indicating that observation and transition 
+probabilities are from latent model $k$):
+\begin{align}
+\begin{split}
+\phi_{1}(j_{k}) &=  Pr(\vc{O}_{1}, 
+S_{1}=j_{k})=p_{k}\pi_{j_{k}}b_{j_{k}}(\vc{O}_{1}).
+\end{split}\label{eq:fwd1mix} \\
+\begin{split}
+\phi_{t}(j_{k})   &=   Pr(\vc{O}_{t}, S_{t}=j_{k}|\vc{O}_{1}, \ldots, 
+\vc{O}_{t-1}) \\
+			&= \left[ \sum_{k=1}^{K} \sum_{i=1}^{n_{k}} \phi_{t-1}(i_{k}) 
+			a_{ij_{k}}b_{j_{k}}(\vc{O}_{t}) \right] \times (\Phi_{t-1})^{-1},
+\end{split}\label{eq:fwdtmix} 
+\end{align}
+where $\Phi_{t} = \sum_{k=1}^{K}\sum_{j=1}^{n_{k}} \phi_{t}(j_{k})$.
+Note that the double sum over $k$ and $j$ is simply an enumeration
+of all the states of the model.  Now, because $a_{ij_{k}}=0$ whenever
+$S_{i}$ is not part of component $k$, the sum over $k$ can be dropped
+and hence equation~(\ref{eq:fwdtmix}) reduces to:
+\begin{equation}
+	\phi_{t}(j_{k}) = \left[ \sum_{i=1}^{n_{k}} \phi_{t-1}(i_{k}) 
+			a_{ij_{k}}b_{j_{k}}(\vc{O}_{t}) \right] \times (\Phi_{t-1})^{-1}.
+\end{equation}
+The loglikelihood is computed by applying equation~(\ref{eq:logl}) to
+these terms.  For multiple cases, the log-likelihood is simply the sum
+over the individual log-likelihoods. 
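+
+In implementation terms, a mixture of LMMs is a single model with a
+block-diagonal transition matrix and initial probabilities
+$p_{k}\pi_{j_{k}}$, so the hypothetical \code{fbLoglik} sketch above
+applies unchanged.  For instance (all numbers arbitrary):
+
+<<eval=FALSE>>=
+# mixture of a 2-state and a 3-state component; a_{ij_k} = 0 between
+# components, so the transition matrix A is block diagonal
+A1 <- matrix(c(0.9, 0.1, 0.2, 0.8), 2, 2, byrow = TRUE)
+A2 <- diag(3)
+A  <- rbind(cbind(A1, matrix(0, 2, 3)),
+            cbind(matrix(0, 3, 2), A2))
+p  <- c(0.6, 0.4)                              # mixing proportions p_k
+init <- c(p[1] * c(1, 0), p[2] * rep(1/3, 3))  # p_k * pi_{j_k}
+# with B holding b_j(O_t) for all 5 states, fbLoglik(init, A, B)
+# computes the mixture loglikelihood
+@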
+
+\paragraph{Computational considerations} From
+equations~(\ref{eq:fwd1}--\ref{eq:fwdt}), it can be seen that
+computing the forward variables, and hence the log-likelihood, takes
+$O(Tn^{2})$ computations, for an $n$-state model and a time series of
+length $T$.  Consider a mixture of two components, one with two states
+and the other with three states.  Using
+equations~(\ref{eq:fwd1}--\ref{eq:fwdt}) to compute the log-likelihood
+of this model one needs $O(Tn^{2})=O(T\times 25)$ computations whereas
+with the mixture equations~(\ref{eq:fwd1mix}--\ref{eq:fwdtmix}),
+$\sum_{n_{i}} O(n_{i}^{2}T)$ computations are needed, in this case
+$O(T \times 13)$.  So, it can be seen that in this easy example the
+computational cost is almost halved. 
+
+
+\section{Gradients}
+
+\newcommand{\fpp}{\frac{\partial} {\partial \lambda_{1}}}
+
+See equations 10--12  in \cite{Lys2002} for the score recursion 
+functions of the hidden Markov model for a univariate time series. 
+Here the corresponding score recursion for the multivariate mixture 
+case is provided. The $t=1$ components of this score recursion are 
+defined as (for an arbitrary parameter $\lambda_{1}$):
+\begin{align}
+\psi_{1}(j_{k};\lambda_{1}) &:=  \fpp Pr(\vc{O}_{1}, S_{1}=j_{k}) \\
+\begin{split} 
+	&= \left[  \fpp p_{k} \right] \pi_{j_{k}}b_{j_{k}}(\vc{O}_{1}) + 
+	p_{k}\left[ \fpp \pi_{j_{k}} \right] b_{j_{k}}(\vc{O}_{1}) \\
+	& \qquad  + p_{k}\pi_{j_{k}} \left[ \fpp 
+b_{j_{k}}(\vc{O}_{1})\right],
+\end{split} \label{eq:psi1}
+\end{align}
+and for $t>1$ the definition is:
+\begin{align}
+\psi_{t}(j_{k};\lambda_{1})  & =  \frac{\fpp Pr(\vc{O}_{1}, \ldots, 
+\vc{O}_{t}, S_{t}=j_{k})}
+			{Pr(\vc{O}_{1}, \ldots, \vc{O}_{t-1})}  \\
+\begin{split} 
+	& =  
+			 \sum_{i=1}^{n_{k}} \Bigg\{ \psi_{t-1}(i;\lambda_{1})a_{ij_{k}} 
+			 b_{j_{k}}(\vc{O}_{t}) \\ 
+			 &\qquad +\phi_{t-1}(i) \left[ \fpp a_{ij_{k}} \right] b_{j_{k}} 
+(\vc{O}_{t}) \\
+			&\qquad +\phi_{t-1}(i)a_{ij_{k}}  \left[ \fpp b_{j_{k}} 
+(\vc{O}_{t}) \right] \Bigg\} 
+			\times (\Phi_{t-1})^{-1}.
+\end{split} \label{eq:psit}
+\end{align}
+
+Using the above equations, \cite{Lys2002} derive the following equation 
+for the partial derivative of the loglikelihood:
+\begin{equation}
+	\fpp l_{T}= 	
+		\frac{\mathbf{\Psi}_{T}(\lambda_{1})}{\mathbf{\Phi}_{T}},
+\end{equation}
+where $\Psi_{t}=\sum_{k=1}^{K} \sum_{j=1}^{n_{k}} 
+\psi_{t}(j_{k};\lambda_{1})$. 
+Starting from the expression for the logarithm of the likelihood, this 
+is easily seen to be correct: 
+\begin{eqnarray*}
+	\fpp \log Pr(\vc{O}_{1}, \ldots, \vc{O}_{T}) &=& Pr(\vc{O}_{1}, 
+\ldots, \vc{O}_{T})^{-1} 
+	\fpp Pr(\vc{O}_{1}, \ldots, \vc{O}_{T}) \\
+	&=&  \frac{Pr(\vc{O}_{1}, \ldots, \vc{O}_{T-1})}{Pr(\vc{O}_{1}, 
+\ldots, \vc{O}_{T})}  \Psi_{T} (\lambda_{1}) \\
+	&=&  \frac{\mathbf{\Psi}_{T}(\lambda_{1})}{\mathbf{\Phi}_{T}}.
+\end{eqnarray*}
+
+Further, to actually compute the gradients, the partial derivatives of
+the parameters and observation distribution functions are necessary,
+i.e., $\fpp p_{k}$, $\fpp \pi_{i}$, $\fpp a_{ij}$, and $\fpp
+\vc{b}_{i}(\vc{O}_{t})$.  Only the latter case requires some
+attention.  We need the derivatives $\fpp
+\vc{b}_{j}(\vc{O}_{t})=\fpp \vc{b}_{j}(O_{t}^{1}, \ldots, O_{t}^{m})$, for
+arbitrary parameters $\lambda_{1}$. To stress that $\vc{b}_{j}$ is a
+vector of functions, we use boldface here. First note that because of local
+independence we can write:
+\begin{equation*}
+	\fpp \left[ b_{j}(O_{t}^{1}, \ldots, O_{t}^{m}) \right] = \fpp 
+	\left[ b_{j}(O_{t}^{1}) \times b_{j}(O_{t}^{2}) \times \cdots 
+	\times b_{j}(O_{t}^{m}) \right].
+\end{equation*}
+Applying the chain rule for products we get:
+\begin{equation}
+	\fpp [b_{j}(O_{t}^{1}, \ldots, O_{t}^{m})] =
+	\sum_{l=1}^{m} \left[ \prod_{i=1, \ldots, \hat{l}, \ldots, m} 
+b_{j}(O_{t}^{i}) \right] \times
+	\fpp  [b_{j}(O_{t}^{l})],
+	\label{partialProd}
+\end{equation}
+where $\hat{l}$ means that the $l$-th term is left out of the product. 
+These latter terms, $\frac{\partial} {\partial \lambda_{1} }  
+[b_{j}(O_{t}^{l})]$, are easy to compute given either multinomial or 
+gaussian observation densities $b_{j}(\cdot)$.
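+
+Equation~(\ref{partialProd}) is easily checked in code.  The following
+sketch (the function name \code{dProd} is ours) computes the partial
+derivative of the joint density from the $m$ per-item densities and
+their per-item partial derivatives:
+
+<<eval=FALSE>>=
+# dens[i] = b_j(O_t^i); grads[i] = d b_j(O_t^i) / d lambda_1
+dProd <- function(dens, grads) {
+    sum(sapply(seq_along(dens), function(l) prod(dens[-l]) * grads[l]))
+}
+dProd(dens = c(0.3, 0.8), grads = c(0.1, 0))  # = 0.8 * 0.1 = 0.08
+@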
+
+
+\section{Parameter estimation}
+
+Parameters are estimated in \pkg{depmix} using a direct optimization
+approach instead of the EM algorithm which is frequently used for this type
+of model.  The EM algorithm, however, has some drawbacks.  First, it can be
+slow to converge.  Second, applying constraints to parameters can be
+problematic.  The EM algorithm can sometimes lead to incorrect estimates
+when constraints are applied to parameters in the M-step of the algorithm.
+The package was designed to be used with the \code{npsol}-library, the
+main reason being that it handles general linear (in-)equality
+constraints very well. Unfortunately, \code{npsol} is not freeware and
+hence is not distributed with \pkg{depmix}. Two other options are
+available for optimization using  \code{nlm} and \code{optim}
+respectively. Linear equality constraints are fitted through
+reparametrization. Inequality constraints are fitted by adding a 
+penalty to the likelihood depending on the amount by which a
+constraint is not satisfied. The argument \code{vfactor} to the
+fitting function can be used to control this behavior; see the 
+chapter on fitting models for details, and the illustration below. 
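+
+As a toy illustration of this idea (this is not the package's internal
+code; the function name \code{penalty} is ours), a linear penalty for a
+single inequality constraint $l \leq g(\theta) \leq u$ could look like:
+
+<<eval=FALSE>>=
+# linear penalty added to the negative loglikelihood whenever the
+# constraint l <= g <= u is violated; vfactor scales the penalty
+penalty <- function(g, l, u, vfactor = 5) {
+    vfactor * (max(0, l - g) + max(0, g - u))
+}
+penalty(1.2, 0, 1)   # bound exceeded by 0.2: penalty of 1
+@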
+
+\chapter{Using depmix}
+
+Three steps are involved in using \pkg{depmix} which are illustrated
+below with examples:
+\begin{enumerate}
+	\item  data specification with function \code{markovdata}
+	
+	\item  model specification with function \code{dmm}
+	
+	\item  model fitting with function \code{fitdmm}
+\end{enumerate}
+
+To be able to fit models, data need to be in a specific format created
+for this package.  Basically, data should be in the form of a matrix
+with each row corresponding to measures taken at a single measurement
+occasion for a single subject.  Besides the data, the function 
+\code{markovdata} only requires one further argument, the itemtypes, 
+specifying for each variable whether it is categorical, continuous, or 
+a covariate.  A markovdata object is a matrix with a number of 
+attributes.  
+
+\section{Creating data sets}
+
+As an example we make a dataset with two variables measured in two
+series of 50 occasions each.
+
+<<>>=
+x=rnorm(100,10,2)
+y=ifelse(runif(100)<0.5,0,1)
+z=matrix(c(x,y),100,2)
+md=markovdata(z,itemtypes=c("cont","cat"),ntimes=c(50,50))
+md[1:10,]
+@
+
+In the example below, we split the dataset \code{speed} into three
+separate datasets, which we later use as an example to do multi-group
+analysis.
+
+<<>>=
+data(speed)
+r1=markovdata(dat=speed[1:168,],itemt=itemtypes(speed))
+r2=markovdata(dat=speed[169:302,],itemt=itemtypes(speed))
+r3=markovdata(dat=speed[303:439,],itemt=itemtypes(speed))
+summary(r2)
+@
+
+
+Here is the full specification of the \code{markovdata} function. 
+
+\input{markovdata}
+
+
+\section{Data set speed}
+
+Throughout this manual we will use a data set called \code{speed}, and
+hence we provide some background information on how these data were
+gathered.
+
+\input{speed}
+
+Interesting hypotheses to test are: is the switching regime symmetric?
+Is there evidence for two states or does one state suffice?  Is the
+guessing state actually a guessing state, i.e., is the probability of 
+a correct response at the chance level of 0.5?
+
+
+\section{Defining models}
+
+A dependent mixture model is defined by the number of states and by
+the item distribution functions, and can be created with the
+\code{dmm}-function as follows:
+
+<<>>= 
+mod <-dmm(nstates=2,itemtypes=c("gaus",2))
+summary(mod)
+@
+
+Here \code{itemtypes} is a vector, with one entry per item measured
+at each occasion, specifying the desired distributions; in
+this case the first item follows a normal distribution, and the
+second item follows a Bernoulli distribution.  Allowable
+distributions are listed in Table~\ref{tbl:dist}, along with their
+internal code and the parameter names.  The R-internal code is used
+for estimating these parameters.  Specifics of these distributions and
+their estimation can be found in their respective help files.
+Itemtypes can be specified by their name or by their internal code,
+except in the case of multinomial items, which have to be specified by
+a number: the number of categories.
+
+\begin{table}[tbp]
+	\centering
+	\begin{tabular}{lcc}
+		\hline
+		distribution & code & parameters \\
+		\hline
+		\hline
+		multinomial & $2,3,4, \ldots $ & $p_1, p_2, p_3, \ldots $  \\
+		\hline
+		gaussian, normal & 1  & $ \mu, \sigma $ \\
+		\hline
+		lognormal & -21 & $l\mu$ , $l\sigma$  \\
+		\hline
+		weibull & -22 & shape (a), scale (b) \\
+		\hline
+		gamma & -23 & shape (a), scale (s)  \\
+		\hline
+		3lognormal & -31 &  $l\mu$ , $l\sigma$, shift \\
+		\hline
+		3weibull & -32 & shape (a), scale (b), shift \\
+		\hline
+		3gamma & -33 & shape (a), scale (s), shift \\
+		\hline
+	\end{tabular}
+	\caption{Allowable distribution names, internal codes, and 
+	parameter names.}
+	\label{tbl:dist}
+\end{table}
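+
+For example, the following two calls specify the same model (a sketch:
+a gaussian item, given by name or by its internal code 1, and a
+3-category multinomial item, given by its number of categories):
+
+<<eval=FALSE>>=
+mod2 <- dmm(nstates=2, itemtypes=c("gaus", 3))  # by name
+mod3 <- dmm(nstates=2, itemtypes=c(1, 3))       # by internal code
+@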
+
+The function \code{dmm} returns an object of class \code{dmm} which
+has its own summary function providing the parameter values of the
+model. See the help files for further details. Except in simple 
+cases, starting values can be a problem in latent Markov models, and
+so in general it's best to provide them if you have a fairly good idea
+of what to expect. Providing starting values is done through the 
+\code{stval} argument: 
+
+<<>>=
+st <- c(1,0.9,0.1,0.2,0.8,2,1,0.7,0.3,5,2,0.2,0.8,0.5,0.5)
+mod <- dmm(nsta=2,itemt=c(1,2), stval=st)
+summary(mod)
+@
+
+\input{dmm}
+
+\subsection{Generating data}
+
+The \code{dmm}-class has a \code{generate} method that can be used to 
+generate data according to a specified model. 
+
+<<>>=
+gen<-generate(c(100,50),mod)
+summary(gen)
+@
+\begin{figure}[htbp]
+  \begin{center}
+<<fig=TRUE>>=
+plot(gen)
+@
+	\caption{Two time series generated by a 2-state model with one 
+	gaussian item and one binary item.}
+  \end{center}
+\end{figure}
+
+\input{generate}
+
+\section{Fitting models\label{fitmod}}
+
+Fitting models is done using the function \code{fitdmm}. The standard
+call only requires a dataset and a model as in:
+
+<<>>= 
+data(speed)
+mod <- dmm(nstates=2,itemtypes=c(1,2))
+fitex <- fitdmm(speed,mod)
+@
+
+Calling \code{fitdmm} produces some output about the progress of
+the optimization, which can be controlled with the \code{printlevel}
+argument.  Its default value of 1 just gives the first and the last
+iteration of the optimization; 0 gives no output, and setting it to 5
+or higher values will produce output at each iteration.  These values
+correspond to print levels 0, 1, and 2 of \code{nlm}.  When using
+\code{optim}, the \code{printlevel} argument is used to set the
+\code{REPORT} argument of \code{optim} (see its help page for
+details).  Printlevel 0 gives report 0, printlevel 1 gives report 10,
+printlevels 2--4 give report 5, and printlevels above 4 give report 1,
+producing output at every iteration.  Printlevels of 15 and
+higher produce increasingly annoying output from the C-routines that
+compute the loglikelihood.
+
+\code{fitdmm} returns an object of class \code{fit} which has a
+summary method showing the estimated parameter values, along with
+standard errors, and t-ratios whenever those are available.  Along
+with the log-likelihood, the AIC and BIC values are provided.  Apart
+from the printed values (see summary below), a \code{fit}-object has a
+number of other fields.  Most importantly, it contains a copy of the
+fitted model in \code{mod} and it has a field \code{post} containing
+posterior state estimates.  That is, for each group $g$,
+\code{post\$states[[g]]} is a matrix with \code{sum(ntimes(dat))} rows
+and a number of columns equal to the number of states of the model
+plus 2.  The first
+column contains the a posteriori component model, the second column
+has the state number within the component, and the other columns 
+contain the a posteriori probabilities of each of the states.
+
+<<>>=
+summary(fitex)
+@
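+
+For example, the posterior state sequence of the first (and here only)
+group can be inspected as follows (a sketch, assuming the fitted
+object \code{fitex} from above):
+
+<<eval=FALSE>>=
+post1 <- fitex$post$states[[1]]
+table(post1[, 2])   # occupancy of each state (second column)
+@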
+
+
+\input{fitdmm}
+
+\chapter{Extending and constraining models}
+
+\section{Fixing and constraining parameters}
+
+Continuing the example from above, it can be seen that in one of the
+states, the probability of a correct answer is about .5, as is the
+probability of an incorrect answer, i.e., these are parameters Item2,p1 and
+Item2,p2. This latent state is supposed to be a guessing state, and hence 
+it makes sense to constrain these parameters to their theoretical values of 
+.5. Similarly, the initial state probability for the slow state is one, and
+zero for the other state, and hence it makes sense to fix these parameters.
+The third constraint that we consider here is an equality constraint
+between the transition parameters. Using this constraint, we can test the
+hypothesis that the switching between states is a symmetric process.
+Hence, we constrain the transition parameters $a_{11}$ and $a_{22}$ to 
+be equal.
+
+Constraining and fixing parameters is done in a fashion similar to the
+\code{pa} command that is used in LISREL \citep{Lis99}. The
+\code{conpat} argument to the \code{fitdmm}-function specifies for
+each parameter in the model whether it's fixed (0) or free (1 or
+higher). Equality constraints can be imposed by giving two parameters 
+the same number in the \code{conpat} vector. When only fixed values are
+required, the \code{fixed} argument can be used instead of \code{conpat},
+with zeroes for fixed parameters and other values (e.g.\ ones) for non-fixed
+parameters. 
+
+Fitting the models subject to these constraints is mostly done through
+reparametrization.  Inequality constraints are enforced by adding a
+penalty to the loglikelihood when the constraint is not satisfied.
+The penalty is linear in the amount by which the constraint is not
+satisfied, rather than logarithmic, as is often used (see e.g.\ the 
+documentation for \code{constrOptim}, which uses a
+logarithmic boundary for inequality constraints).  This has advantages
+and disadvantages.  There are two marked disadvantages.  First, the
+loglikelihood is not smooth at the boundary of the parameter space.
+Second, it can happen that the constraint is not satisfied.  Whenever
+constraints are not satisfied, \code{fitdmm} exits with a warning
+stating the amount by which they are violated.  This can be remedied
+by increasing the \code{vfactor} argument, which simply increases the
+penalty by this factor (its default value is 5).  An advantage is that
+with a linear penalty it is possible that a parameter is estimated
+at the boundary, which is prohibited with logarithmic boundaries.
+
+
+\paragraph{Parameter numbering} When using the \code{conpat} and
+\code{fixed} arguments, complete vectors should be supplied, i.e., these
+vectors should have length equal to the number of parameters of the model.
+Parameters are numbered in the following order:
+\begin{enumerate}
+	\item  the mixing proportions of a mixture of latent Markov
+	models, i.e., just one parameter for a single component model
+	which has value 1 and is fixed
+
+	\item  the component parameters for each component consisting of
+	the following:
+	\begin{enumerate}
+		\item  transition parameters in row major order, $a_{11},
+		a_{12}, a_{13}, \ldots, a_{21}, a_{22}, a_{23}, \ldots $
+	
+		\item  the observation parameters per state and per item, in
+		the order listed in Table~\ref{tbl:dist}
+	
+		\item  the initial state probabilities per state
+	\end{enumerate}
+
+\end{enumerate}
+
+<<>>=
+conpat=rep(1,15)
+conpat[1]=0 
+conpat[14:15]=0 
+conpat[8:9]=0
+conpat[2]=conpat[5]=2
+stv=c(1,.896,.104,.084,.916,5.52,.20,.5,.5,6.39,.24,.098,.902,0,1)
+mod=dmm(nstates=2,itemt=c("n",2),stval=stv,conpat=conpat)
+@
+
+In the example above \code{conpat} is used to specify a number of
+constraints.  First, \code{conpat[1]=0} specifies that the mixing
+proportion of the model should be fixed (at its starting value of 1),
+which is always the case for single component models.  Second,
+\code{conpat[14:15]=0} fixes the initial state probabilities to zero
+and one respectively.  Similarly, for \code{conpat[8:9]=0}, which are
+the guessing state parameters for the accuracy scores.  They are both
+fixed at 0.5 so as to make the guessing state an actual guessing
+state.  Finally, by invoking \code{conpat[2]=conpat[5]=2}, transition
+parameters $a_{11}$ and $a_{22}$ are set to be equal.  Whenever
+equality constraints are not sufficient, general linear constraints
+can be specified using the \code{conrows} argument. 
+
+The constrained model has the following estimated
+parameters\footnote{Note that in running this example with the
+starting values from the unconstrained model, the initial
+loglikelihood is worse than the final loglikelihood because the
+initial likelihood is based on parameters that do not satisfy the
+constraints.}:
+
+% <<results=hide>>=
+% fitfix <- fitdmm(dat=speed,dmm=mod)
+% @
+% results in models.Rda
+
+<<>>=
+summary(fitfix)
+@
+
+
+\section{Multi group/case analysis}
+
+<<echo=FALSE>>=
+conpat=rep(1,15)
+conpat[1]=0
+conpat[8:9]=0
+conpat[14:15]=0
+stv=c(1,0.9,0.1,0.1,0.9,5.5,0.2,0.5,0.5,6.4,0.25,0.9,0.1,0.5,0.5)
+mod<-dmm(nstates=2,itemt=c("n",2),stval=stv,conpat=conpat)
+@
+
+\pkg{depmix} can handle multiple cases or multiple groups. A
+multigroup model is specified using the function \code{mgdmm} as
+follows:
+
+<<>>=
+mgr <- mgdmm(dmm=mod,ng=3,trans=TRUE,obser=FALSE)
+mgrfree <- mgdmm(dmm=mod,ng=3,trans=FALSE)
+@
+
+The \code{ng} argument specifies the number of groups, and the
+\code{dmm} argument specifies the model for each group.  \code{dmm}
+can be either a single model or a list of \code{ng} models.
+If it is a single model, each group has an identical structural model
+(same fixed and constrained parameters); otherwise each group has its
+own model. Three further arguments can be used to constrain parameters
+between groups: \code{trans}, \code{obser}, and \code{init}
+respectively. By setting any of these to \code{TRUE}, the
+corresponding transition, observation, and initial state parameters
+are estimated equal between groups\footnote{There is at this moment no way of 
+fine-tuning this to restrict equalities to individual parameters.
+However, this can be accomplished by manually changing the linear
+constraint matrix, and the corresponding upper and lower boundaries.}.
+
+In this example, the model from above was used and fitted on the three
+observed series, and \code{trans=TRUE} ensures that the
+transition matrix parameters are constrained to be equal between the
+models for these series, whereas the observation parameters are freely
+estimated, e.g.\ to capture learning effects. The resulting
+parameters are: 
+
+% <<>>=
+% fitmg <- fitdmm(dat=list(r1,r2,r3),dmm=mgr,ses=FALSE)
+% fitmgfree <- fitdmm(dat=list(r1,r2,r3),dmm=mgrfree,ses=FALSE)
+% @
+
+<<>>=
+summary(fitmg)
+@
+
+The loglikelihood ratio statistic can be used to test whether
+constraining these transition parameters significantly reduces the
+goodness-of-fit of the model.  The statistic has value
+$LR=\Sexpr{round(2*(fitmgfree$loglike-fitmg$loglike),3)}$, and it has
+an approximate $\chi^{2}$ distribution with $df=4$ because in each but
+the first model, two transition matrix parameters were estimated equal
+to the parameters in the first model (note that the other two
+transition parameters are already determined by the constraint
+that the rows of the transition matrices sum to 1).  The associated
+$p$-value for the statistic is
+$p=\Sexpr{round(pchisq(2*(fitmgfree$loglike-fitmg$loglike),4,lower.tail=FALSE),3)}$,
+indicating that constraining the transition matrix parameters does not
+significantly worsen the goodness-of-fit of the model.
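+
+The same test can be computed by hand (a sketch, using the fitted
+objects \code{fitmg} and \code{fitmgfree} from above):
+
+<<eval=FALSE>>=
+LR <- 2 * (fitmgfree$loglike - fitmg$loglike)
+pchisq(LR, df = 4, lower.tail = FALSE)   # p-value of the LR test
+@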
+
+
+\input{mgdmm}
+
+
+\section{Models with time-dependent covariates}
+
+Specifying a model with covariates is done by including two arguments 
+in a call to \code{dmm} called \code{tdfix} and \code{tdst}, where
+\code{td} means time dependent. \code{tdfix} is a logical vector of
+length the number of parameters of the model, specifying which
+parameters are to be estimated as time-dependent. For an arbitrary
+parameter $\lambda$, the model that is estimated has the form:
+\begin{equation}
+	\lambda_{t}=\lambda_{0}+\beta x_{t},
+\end{equation}
+where $\lambda_{0}$ is the intercept of the parameter, $\beta$ is the 
+regression coefficient, and $x_{t}$ is the time-dependent covariate.
+The covariate has to be scaled to lie between 0 and 1; this is
+necessary to be able to impose the right constraints on $\beta$ in
+order to ensure that $\lambda_{t}$ is always appropriate, i.e.\ within
+its lower and upper bounds (mostly 0 and 1, for multinomial item
+parameters and transition parameters etc.). The current version of
+\pkg{depmix} does not have non-time-dependent covariates, which can
+simply be faked by having $x_{t}$ be constant, and there is only
+support for a single covariate. 
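+
+Rescaling a covariate to the unit interval is straightforward; for an
+arbitrary numeric vector \code{x}:
+
+<<eval=FALSE>>=
+x01 <- (x - min(x)) / (max(x) - min(x))   # covariate scaled to [0,1]
+@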
+
+In the example below, the transition parameters (numbers 2--5) are
+defined to depend on the covariate, which is the pay-off for accuracy. 
+Providing starting values for the covariate effects is optional. If not
+provided, they are chosen at random around 0, which usually works just
+fine. 
+
+<<>>=
+conpat=rep(1,15)
+conpat[1]=0
+conpat[8:9]=0
+conpat[14:15]=0
+conpat[2]=2
+conpat[5]=2
+stv=c(1,0.9,0.1,0.1,0.9,5.5,0.2,0.5,0.5,6.4,0.25,0.9,0.1,0,1)
+tdfix=rep(0,15)
+tdfix[2:5]=1
+tdst=rep(0,15)
+tdst[2:5]=c(-0.4,0.4,0.15,-0.15)
+
+mod<-dmm(nstates=2,itemt=c("n",2),stval=stv,conpat=conpat,tdfix=tdfix,tdst=tdst,modname="twoboth+cov")
+@
+
+% <<results=hide>>=
+% fittd <-
+% fitdmm(dat=speed,dmm=mod,tdcov=1,print=5,der=0,ses=0,vf=130)
+% @
+% results in models.Rda
+
+<<>>=
+summary(fittd)
+@
+
+
+\chapter{Special topics}
+
+\section{Starting values}
+
+Although providing your own starting values is preferable,
+\pkg{depmix} has a routine for generating starting values using the
+\code{kmeans}-function from the \pkg{stats}-package.  This will
+usually provide reasonable starting values, but can be way off in a
+number of cases.  First, for univariate categorical time series,
+\code{kmeans} does not work at all, and \pkg{depmix} will provide a
+warning.  Second, for multivariate series with unordered categorical
+items with more than 2 categories, \code{kmeans} may provide good
+starting values, but they may similarly be completely off, due to the
+implicit assumption in \code{kmeans} that the categories are
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/depmix -r 280


More information about the depmix-commits mailing list