[adegenet-commits] r879 - pkg/inst/doc

Thu May 26 12:35:58 CEST 2011

Author: jombart
Date: 2011-05-26 12:35:58 +0200 (Thu, 26 May 2011)
New Revision: 879

Modified:
   pkg/inst/doc/adegenet-dapc.Rnw
Log:
Started the DAPC tutorial. Shit this is gonna be long...


Modified: pkg/inst/doc/adegenet-dapc.Rnw
===================================================================

--- pkg/inst/doc/adegenet-dapc.Rnw	2011-05-26 09:53:53 UTC (rev 878)
+++ pkg/inst/doc/adegenet-dapc.Rnw	2011-05-26 10:35:58 UTC (rev 879)
@@ -36,14 +36,20 @@
 \maketitle
 
 \begin{abstract}
-  This vignette provides a tutorial for using the Discriminant Analysis of Principal Components
-  (DAPC \cite{tjart19})
+  This vignette provides a tutorial for applying the Discriminant Analysis of Principal Components
+  (DAPC \cite{tjart19}) using the \textit{adegenet} package \cite{tjart05} for the R software
+  \cite{np145}. This method aims to identify and describe genetic clusters, although it can in fact
+  be applied to any quantitative data. We illustrate how to use \code{find.clusters} to identify
+  clusters, and \code{dapc} to describe the relationships between these clusters. More advanced
+  topics are then introduced, such as the stability of DAPC results and supplementary individuals.
 \end{abstract}
 
 
 \newpage
 \tableofcontents
 
+
+
 %%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%
 \section{Introduction}
@@ -51,9 +57,43 @@
 %%%%%%%%%%%%%%%%
 
 
+%%%%%%%%%%%%%%%%
+\subsection{Rationale}
+%%%%%%%%%%%%%%%%
 
+Investigating genetic diversity using multivariate approaches relies on finding synthetic variables
+built as linear combinations of alleles (i.e. $a_1 \mbox{allele}_1 + a_2 \mbox{allele}_2 + \ldots $)
+and which reflect as well as possible the genetic variation between the studied individuals.
+However, most of the time we are interested not only in the diversity amongst individuals, but
+also, and possibly more so, in the diversity between groups of individuals.
+Typically, one will be analysing individual data to identify populations, or more generally genetic
+clusters, and then describe these clusters.
 
+A problem with traditional methods is that they focus on the entire genetic variation.
+Genetic data can be described using a standard multivariate ANOVA model:
+$$
+\mbox{total variance} = \mbox{(variance between groups)} + \mbox{(variance within groups)}
+$$
+or more simply, denoting $\m{X}$ the data matrix:
+$$
+VAR(\m{X}) = B(\m{X}) + W(\m{X})
+$$
 
+Usual approaches such as Principal Component Analysis (PCA) or Principal Coordinate
+Analysis (PCoA / MDS) focus on $VAR(\m{X})$. That is, they only describe the global diversity,
+possibly overlooking differences between groups. In contrast, DAPC maximizes $B(\m{X})$ while
+minimizing $W(\m{X})$: it seeks synthetic variables, the \textit{discriminant functions}, which show
+differences between groups as well as possible while minimizing variation within clusters.
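+
+As a minimal sketch of this decomposition (not part of DAPC itself), the corresponding sums of
+squares, which are proportional to the variances, can be computed by hand in base R. The
+\texttt{iris} dataset and its \texttt{Species} factor are used here only as an assumed stand-in
+for a data matrix $\m{X}$ with known groups:
+<<>>=
+## illustrative data: 4 quantitative variables, 3 known groups
+X <- as.matrix(iris[, 1:4])
+grp <- iris$Species
+
+## total sum of squares: squared deviations from the grand mean
+Xc <- scale(X, center = TRUE, scale = FALSE)
+totSS <- sum(Xc^2)
+
+## within-group sum of squares: squared deviations from each group mean
+grpMeans <- apply(X, 2, function(v) ave(v, grp))
+withinSS <- sum((X - grpMeans)^2)
+
+## between-group sum of squares: group means around the grand mean
+grandMean <- matrix(colMeans(X), nrow(X), ncol(X), byrow = TRUE)
+betweenSS <- sum((grpMeans - grandMean)^2)
+
+## the two components add up to the total
+all.equal(totSS, betweenSS + withinSS)
+@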
+
+
+
+
+
+
+
+
+
+
 %%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%
 \section{Identifying clusters using \code{find.clusters}}
@@ -63,19 +103,45 @@
 %%%%%%%%%%%%%%%%
 \subsection{Rationale}
 %%%%%%%%%%%%%%%%
+DAPC in itself requires prior groups to be defined. However, groups are often unknown or uncertain,
+and there is a need for identifying genetic clusters before describing them. This can be achieved by
+using $k$-means, a clustering algorithm which finds $k$ groups maximizing the variation between
+groups, $B(\m{X})$. To identify the optimal number of clusters, $k$-means is run sequentially with
+increasing values of $k$, and different clustering solutions are compared using the Bayesian
+Information Criterion (BIC). Ideally, the optimal clustering solution should correspond to the lowest BIC. In
+practice, the `best' BIC is often indicated by an elbow in the curve of BIC values as a function of
+$k$.
 
+While $k$-means could be performed on the raw data, we prefer running the algorithm after
+transforming the data using PCA. This transformation has the major advantage of reducing the
+number of variables so as to speed up the clustering algorithm. Note that this need not imply a loss of
+information, or results differing from those obtained on the raw data, since one can retain all the principal
+components (PCs) and therefore all the variation in the original data. However, in practice, a reduced
+number of PCs is often sufficient to identify the existing clusters, while allowing the clusters to
+be obtained essentially instantaneously.
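+
+The procedure described above can be sketched with base R functions only. This is an illustration
+rather than the actual code of \code{find.clusters}: the \texttt{iris} data, the number of retained
+PCs, the range of $k$, and the BIC formula $n \log(WSS/n) + k \log(n)$ (with $WSS$ the total
+within-group sum of squares) are all assumptions made for the example.
+<<>>=
+set.seed(1)
+X <- iris[, 1:4]                      # illustrative quantitative data
+n <- nrow(X)
+
+## step 1: transform the data using PCA and retain a few PCs
+pca <- prcomp(X, center = TRUE, scale. = FALSE)
+nPC <- 3                              # assumed number of retained PCs
+scores <- pca$x[, 1:nPC, drop = FALSE]
+
+## step 2: run k-means for increasing k, storing an (assumed) BIC
+kValues <- 2:8
+bic <- sapply(kValues, function(k) {
+    km <- kmeans(scores, centers = k, nstart = 10, iter.max = 100)
+    n * log(km$tot.withinss / n) + k * log(n)
+})
+
+## step 3: look for the lowest value, or an elbow, in the BIC values
+data.frame(k = kValues, BIC = bic)
+@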
 
+
 %%%%%%%%%%%%%%%%
 \subsection{In practice}
 %%%%%%%%%%%%%%%%
 
+Identification of the clusters is achieved by \code{find.clusters}. This function first transforms
+the data using PCA, asking the user to specify the number of retained PCs interactively unless the
+argument \code{n.pca} is provided. Then, it runs the $k$-means algorithm (function \code{kmeans} from
+the \textit{stats} package) with increasing values of $k$, unless the argument \code{n.clust} is
+provided. See \code{?find.clusters} for other arguments.
 
+\code{find.clusters} is a generic function with methods for \texttt{data.frame} objects, and for objects of
+class \texttt{genind} (usual genetic markers) and \texttt{genlight} (genome-wide SNP data).
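+
+As a hedged usage sketch, both choices can be fixed in advance so that the function runs without
+prompting. The \texttt{iris} data and the values given to \code{n.pca} and \code{n.clust} are
+arbitrary choices for illustration, and the \code{grp} component of the output is assumed to hold
+the inferred group memberships (see \code{?find.clusters}):
+<<>>=
+library(adegenet)
+
+## data.frame method; n.pca and n.clust are provided, so no interactive prompt
+clust <- find.clusters(iris[, 1:4], n.pca = 3, n.clust = 3)
+
+## compare the inferred clusters with the known species
+table(iris$Species, clust$grp)
+@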
 
 
 
 
 
 
+
+
+
 %%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%
 \section{Describing clusters using \code{dapc}}
@@ -121,8 +187,30 @@
 
 
 
+
 %%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%
+\section{Ensuring stability of DAPC results}
+%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%
+\subsection{Why could DAPC results vary?}
+%%%%%%%%%%%%%%%%
+
+
+
+%%%%%%%%%%%%%%%%
+\subsection{Using the $a$-score}
+%%%%%%%%%%%%%%%%
+
+
+
+
+
+%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%
 \section{Using supplementary individuals}
 %%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%