[Mattice-commits] r107 - in pkg: inst misc

Tue Dec 23 00:00:17 CET 2008

Author: andrew_hipp
Date: 2008-12-23 00:00:16 +0100 (Tue, 23 Dec 2008)
New Revision: 107

Added:
   pkg/inst/doc/
   pkg/misc/maticce.Rnw
Log:
drafting maticce vignette

Added: pkg/misc/maticce.Rnw
===================================================================

--- pkg/misc/maticce.Rnw	                        (rev 0)
+++ pkg/misc/maticce.Rnw	2008-12-22 23:00:16 UTC (rev 107)
@@ -0,0 +1,387 @@
+\documentclass{article}
+%\VignettePackage{maticce}
+% \VignetteIndexEntry{maticce: Mapping Transitions in Continuous Character Evolution}
+\usepackage{graphicx}
+\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
+\usepackage{array}
+\usepackage{color}
+
+\usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote()
+\newcommand{\code}[1]{{{\tt #1}}}
+\title{Utilizing \pkg{maticce} to estimate transitions in continuous character evolution}
+\author{Andrew Hipp}
+\date{\today}
+\begin{document}
+\maketitle
+\tableofcontents
+
+\section{Introduction}
+
+This document provides an overview of the \pkg{maticce} package, which serves three primary purposes. First, it implements an information-theoretic approach to estimating where on a phylogeny there has been a transition in a continuous character. As currently implemented, the approach assumes that (1) such transitions are appropriately modeled as shifts in optimum / equilibrium of a character evolving according to an Ornstein-Uhlenbeck process; (2) strength of constraint / rate of evolution toward the optimum is constant over the tree, as is variance; and (3) all branches on which a change could occur are identified. These assumptions can be relaxed in future versions if needed. Second, the package provides helper functions for the \pkg{ouch} package, in which all likelihood calculations are performed. For example, the package automates the process of painting regimes for the \code{hansen} function of \pkg{ouch}, specifying nodes at which the regime changes. It also provides functions for identifying most recent common ancestors and all descendents of a particular node. Users of \pkg{ouch} who want to handle large numbers of analyses may find the routines for summarizing analyses over trees and over regimes useful as well. Finally, \pkg{maticce} provides a flexible set of simulation functions for visualizing how different model parameters affect (i.e., what they \dquote{say} about) our inference of the evolution of a continuous character on a phylogenetic tree.
+
+This document also provides a worked example of analyzing a continuous character dataset that illustrates most of the \pkg{maticce} features. Working through this example will I expect address most questions that should come up during a typical analysis.
+
+XXXXXXXXXXXXXXXXXXX
+
+\section{Package Overview}
+
+The phylobase package currently implements the following functions and data structures:
+
+\begin{itemize}
+  \item Data structures for storing a single tree and multiple trees: \code{phylo4} and \code{multiPhylo4}?
+  \item A data structure for storing a tree with associated tip and node data: \code{phylo4d}
+  \item A data structure for storing multiple trees with one set of tip data: \code{multiPhylo4d}
+  \item Functions for reading nexus files into the above data structures
+  \item Functions for converting between the above data structures and \code{ape phylo} objects as well as \code{ade4 phylog} objects
+  \item Functions for editing trees and data (i.e., subsetting and replacing)
+  \item Functions for plotting trees and trees with data
+\end{itemize}
+
+\section{Using the S4 help system}
+
+The \code{S4} help system works similarly to the \code{S3} help system with some small differences relating to how \code{S4} methods are written.  The \code{plot()} function is a good example.  When we type \code{?plot} we are provided the help for the default plotting function which expects \code{x} and \code{y}.  \code{R} also provides a way to smartly dispatch the right type of plotting function.  In the case of an \code{ape phylo} object (a \code{S3} class object) \code{R} evaluates the class of the object and finds the correct functions, so the following works correctly.
+
+<<randtree1,fig=FALSE>>=
+library(ape)
+# Make a random tree with 10 tips
+rand_tree <- rcoal(10)
+plot(rand_tree)
+@ 
+
+However, typing \code{?plot} still takes us to the default \code{plot} help.  We have to type \code{plot.phylo} to find what we are looking for.  This is because \code{S3} generics are simply functions with a dot and the class name added.  
+
+The \code{S4} generic system is too complicated to describe here, but doesn't include the same dot notation.  As a result \code{?plot.phylo4} doesn't work, \code{R} does, however, find the right plotting function.
+
+<<convtree,fig=FALSE>>=
+library(phylobase)
+# convert rand_tree to a phylo4 object
+rand_p4_tree <- as(rand_tree, "phylo4")
+plot(rand_p4_tree)
+@ 
+
+All fine and good, but how to we find out about all the great features of the \code{phylobase} plotting function?  \code{R} has two nifty ways to find it, the first is to simply put a question mark in front of the whole call:
+
+\begin{verbatim}
+	> ?plot(rand_p4_tree)
+\end{verbatim}
+
+\code{R} looks at the class of the \code{rand\_p4\_tree} object and takes us to the correct help file (note: this only works with \code{S4} objects).  The second ways is handy if you already know the class of your object, or want to compare to generics for different classes:
+
+\begin{verbatim}
+	> method?plot("phylo4")
+\end{verbatim}
+
+More information about how \code{S4} documentation works 
+can be found in the methods package, by running the following command.
+
+<<doc,eval=FALSE>>=
+help('Documentation', package = "methods") 
+@ 
+
+\section{Trees without data}
+
+You can start with a tree --- an object of
+class \code{phylo} from the \code{ape} package
+(e.g., read in using the \code{read.tree()} or \code{read.nexus()}
+functions), and convert it to a \code{phylo4} object.
+
+For example, load the raw \emph{Geospiza} data:
+<<geodata>>=
+data(geospiza_raw)
+names(geospiza_raw)
+@ 
+
+Convert the \code{S3} tree to a \code{S4 phylo4} object using the \code{as()} function:
+<<convgeodata>>=
+library(phylobase)
+g1 <- as(geospiza_raw$tree,"phylo4")
+g1
+@ 
+
+Note that the nodes and edges are given default names if the tree contains no node or edge names.
+
+The \code{summary} method gives a little extra information, including information on branch lengths:
+<<sumgeodata>>=
+summary(g1)
+@ 
+
+Print tip labels:
+<<tiplabelgeodata>>=
+labels(g1)
+@ 
+
+Print internal node labels (R automatically assigns values):
+<<nodelabelgeodata>>=
+nodeLabels(g1)
+@ 
+
+Print edge labels (also automatically assigned):
+<<edgelabelgeodata>>=
+edgeLabels(g1)
+@ 
+
+Is it rooted?
+<<rootedgeodata>>=
+isRooted(g1)
+@
+
+Which node is the root?
+<<rootnodegeodata>>=
+rootNode(g1)
+@ 
+
+Does it have any polytomies?
+<<polygeodata>>=
+hasPoly(g1)
+@ 
+
+Does it have branch lengths?
+<<hasbrlengeodata>>=
+hasEdgeLength(g1)
+@ 
+
+You can modify labels and other aspects
+of the tree --- for example,
+<<modlabelsgeodata>>=
+labels(g1) <- tolower(labels(g1))
+@ 
+
+\section{Trees with data}
+
+The \code{phylo4d} class matches trees with data.
+(\textbf{fixme: need to be able to use ioNCL!})
+or combine it with a data frame to make a \code{phylo4d} (tree-with-data)
+object.
+
+Now we'll take the \emph{Geospiza} data from \verb+geospiza_raw$data+
+and merge it with the tree.  However, since \emph{G. olivacea} is included
+in the tree but not in the data set, we will initially run into some trouble:
+
+<<geomergedata,eval=FALSE>>=
+g2 <- phylo4d(g1,geospiza_raw$data)
+@ 
+
+gives
+<<geomergeerr1,echo=FALSE>>=
+err1 <- try(g2 <- phylo4d(g1,geospiza_raw$data),silent=TRUE)
+cat(as.character(err1))
+@ 
+
+We have two problems --- the first is that we forgot to lowercase
+the labels on the data to match the tip labels:
+
+<<geomergenames>>=
+gdata <- geospiza_raw$data
+row.names(gdata) <- tolower(row.names(gdata))
+@ 
+
+To deal with the second problem
+(missing data for \emph{G. olivacea}), we have a few choices.
+The easiest is to use \code{missing.tip.data="OK"}
+to allow R to create the new object:
+<<geomerge2>>=
+g2 <- phylo4d(g1,gdata,missing.tip.data="OK")
+@ 
+(setting \code{missing.tip.data} to \code{"warn"}
+would create the new object but print a warning).
+
+Another way to deal with this would be to 
+use \code{prune()} to drop
+the offending tip from the tree first:
+<<geomerge3,results=hide>>=
+g1B <- prune(g1,"olivacea")
+phylo4d(g1B,gdata)
+@ 
+
+You can summarize the new object:
+<<geomergesum>>=
+summary(g2)
+@ 
+
+Or use \code{tdata()} to extract the data (i.e., \code{tdata(g2)}). By default, \code{tdata()} will retrieve tip data, but you can also get internal node data only (\code{tdata(tree,"node")}) or --- if the tip and node data have the same format --- all the data combined (\code{tdata(tree,"allnode")}).
+
+Plotting calls \code{plot.phylog} from the \code{ade4} package.
+
+If you want to plot the data (e.g. for checking the input), \code{plot(tdata(g2))} will create the default plot for the data --- in this case, since it is a data frame [\textbf{this may change in future versions but should remain transparent}] this will be a \code{pairs} plot of the data.
+
+\section{Subsetting}
+
+The \code{subset} command offers a variety of ways of extracting portions of a \code{phylo4} or \code{phylo4d} tree, keeping any tip/node data consistent.
+
+\begin{description}
+\item[tips.include]{give a vector of tips (names or numbers) to retain}
+\item[tips.exclude]{give a vector of tips (names or numbers) to drop}
+\item[mrca]{give a vector of node or tip names or numbers; extract the clade containing these taxa}
+\item[node.subtree]{give a node (name or number); extract the subtree starting from this node}
+\end{description}
+
+Different ways to extract the \emph{fuliginosa}-\emph{scandens}
+clade:
+<<geoextract,results=hide>>=
+subset(g2,tips.include=c("fuliginosa","fortis","magnirostris",
+            "conirostris","scandens"))
+subset(g2,node.subtree="N07")
+subset(g2,mrca=c("scandens","fortis"))
+@ 
+
+One could drop the clade by  doing
+<<geodrop,results=hide>>=
+subset(g2,tips.exclude=c("fuliginosa","fortis","magnirostris",
+            "conirostris","scandens"))
+subset(g2,tips.exclude=names(descendants(g2,MRCA(g2,c("difficilis","fortis")))))
+@ 
+
+Another approach is to pick the subtree graphically, by plotting the tree and using \code{identify}, which returns the identify of the node you click on with the mouse.
+
+<<geoident,eval=FALSE>>=
+plot(g1)
+n1 <- identify(g1)
+subset(g2,node.subtree=n1)
+@ 
+
+\section{Tree-walking}
+
+\code{getnodes},
+\code{children}, \code{parent},
+\code{descendants}, \code{ancestors},
+\code{siblings},
+\code{MRCA} \ldots
+generally take a \code{phylo4} object, a node
+(specified by number or name) and return a named
+vector of node numbers.
+
+\section{multiPhylo classes}
+
+\section{Examples}
+
+\subsection{Constructing a Brownian motion trait simulator}
+
+This section will describe two (?) ways of constructing
+a simulator that generates trait values for extant species
+(tips) given a tree with branch lengths, assuming a model
+of Brownian motion.
+
+\subsubsection{the easy way}
+
+We can use the \code{vcv.phylo()} command from
+\code{ape} to construct the variance-covariance
+matrix of the tip traits, after which it's easy
+to use \code{mvrnorm} from the \code{MASS} package
+to generate a set of multivariate normally distributed
+values for the tips. (A benefit of this approach is
+that we can very quickly generate a very large
+number of replicates.)
+This example illustrates a common feature of
+working with \code{phylobase} --- combining tools from
+several different packages to operate on phylogenetic
+trees with data. 
+
+We start with a randomly generated tree using
+\code{rcoal()} from \code{ape} to generate the
+tree topology and branch lengths:
+<<rtree2>>=
+set.seed(1001)
+tree <- rcoal(12)
+@ 
+
+Next we generate the phylogenetic variance-covariance
+matrix (\code{ape::vcv.phylo}) and pick a single set
+of traits (\code{MASS:mvrnorm}).  Conveniently, the
+tip names of the original tree
+are inherited consistently by the variance-covariance
+matrix and the trait matrix:
+<<vcvphylo>>=
+vmat <- vcv.phylo(tree,cor=TRUE)
+library(MASS)
+trvec <- mvrnorm(1,mu=rep(0,12),Sigma=vmat)
+@ 
+
+The last step (easy) is to create the \code{phylo4d}
+object and plot it:
+<<plotvcvphylo,fig=TRUE>>=
+treed <- phylo4d(tree,tip.data=as.data.frame(trvec))
+plot(treed)
+@ 
+
+\subsubsection{The hard way?}
+
+Find root, traverse tree:
+
+
+% ========================================
+% = Table of commands, worth the effort? =
+% ========================================
+% \begin{tabular}{>{\tt}ll}
+% \hline
+% \rm Method & Description\\
+% \hline
+% tdata & Retrieve tip data\\
+% plot & plot tree with data if present\\
+% \hline
+% \end{tabular}
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%% Appendices %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\appendix
+\section{Definitions/slots}
+
+This section details the internal structure of the \code{phylo4}, \code{multiphylo4}, \code{phylo4d}, and \code{multiphylo4d} classes.  The basic building blocks of these classes are the \code{phylo4} object and a dataframe.  The \code{phylo4} tree format is largely similar to the one used by \code{phylo} class in the package \code{ape} \footnote{\url{http://ape.mpl.ird.fr/}}.
+
+
+\subsection{phylo4}
+Like \code{phylo}, the main components of
+the \code{phylo4} class are:
+\begin{description}
+\item[edge]{an $N \times 2$ matrix of integers,
+  where the first column \ldots}
+\item[edge.length]{numeric list of edge lengths
+(length $N$ or empty)}
+\item[Nnode]{integer, number of nodes}
+\item[tip.label]{character vector of tip labels (required)}
+\item[node.label]{character vector of node labels (maybe empty)}
+\item[root.edge]{integer defining root edge (maybe NA)}
+\end{description}
+
+We have defined basic methods for \code{phylo4}:\code{show}, \code{print} (copied from \code{print.phylo} in\code{ape}), and a variety of accessor functions (see help files). \code{summary} does not seem to be terribly useful in the context of a ``raw'' tree, because there is not much to compute: \textbf{end users?}
+
+Print method: add information about (ultrametric, scaled, polytomies (zero-length or structural))?
+
+\subsection{phylo4d}
+
+The \code{phylo4d} class extends \code{phylo4} with data.  Tip data, (internal) node data, and edge data are stored separately, but can be retrieved together or separately with \code{tdata(x,"tip")} or \code{tdata(x,"all")}.
+
+\textbf{edge data can also be included --- is this
+useful/worth keeping?}
+
+\subsection{multiphylo4}
+
+\section{Validity checking}
+
+\begin{itemize}
+\item number of rows of edge matrix ($N$) == length of edge-length vector (if $>0$)
+\item (number of tip labels)+(nNode)-1 == $N$
+\item data matrix must have row names
+\item row names must match tip labels (if not, spit out mismatches)
+\end{itemize}
+ 
+Default node labels:
+
+\section{Hacks/backward compatibility}
+
+There is a way to hack the \verb+$+ operator so that it would provide backward compatibility with code that is extracting internal elements of a \code{phylo4}. The basic recipe is: 
+
+<<eval=FALSE>>=
+setMethod("$","phylo4",function(x,name) { attr(x,name)})
+@ 
+
+but this has to be hacked slightly to intercept calls to elements that might be missing.  For example, \code{ape} detects whether log-likelihood, root edges, node labels, etc. are missing by testing whether they are \code{NULL}, whereas missing items are represented in \code{phylo4} by zero-length vectors in the slots (or \code{NA} for the root edge) --- so we need code like 
+<<eval=FALSE>>=
+if(!hasNodeLabels(x)) NULL else x at node.label
+@ 
+to handle these cases.
+
+
+\end{document}