[adegenet-commits] r919 - in pkg/inst/doc: . figs
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Mon Jun 20 17:24:12 CEST 2011
Author: jombart
Date: 2011-06-20 17:24:11 +0200 (Mon, 20 Jun 2011)
New Revision: 919
Added:
pkg/inst/doc/adegenet-spca.Rnw
pkg/inst/doc/adegenet-spca.pdf
pkg/inst/doc/adegenet-spca.tex
pkg/inst/doc/figs/rupicacolors.png
Log:
New sPCA vignette.
Added: pkg/inst/doc/adegenet-spca.Rnw
===================================================================
--- pkg/inst/doc/adegenet-spca.Rnw (rev 0)
+++ pkg/inst/doc/adegenet-spca.Rnw 2011-06-20 15:24:11 UTC (rev 919)
@@ -0,0 +1,962 @@
+\documentclass{article}
+% \VignettePackage{spca}
+% \VignetteIndexEntry{A tutorial for spatial Analysis of Principal Components}
+
+\usepackage{graphicx}
+\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
+\usepackage{array}
+\usepackage{color}
+\usepackage{amsfonts}
+
+\usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote()
+
+
+% for bold symbols in mathmode
+\usepackage{bm}
+
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\beq}{\begin{equation}}
+\newcommand{\eeq}{\end{equation}}
+\newcommand{\m}[1]{\mathbf{#1}}
+
+
+
+\newcommand{\code}[1]{{{\tt #1}}}
+\title{A tutorial for spatial Analysis of Principal Components (sPCA) using \textit{adegenet} \Sexpr{packageDescription("adegenet", fields = "Version")}}
+\author{Thibaut Jombart}
+\date{\today}
+
+
+
+
+\sloppy
+\hyphenpenalty 10000
+
+
+\begin{document}
+
+
+\SweaveOpts{prefix.string = figs/spca, echo=TRUE, eval=TRUE, fig = FALSE, eps = FALSE, pdf = TRUE}
+
+
+\definecolor{Soutput}{rgb}{0,0,0.56}
+\definecolor{Sinput}{rgb}{0.56,0,0}
+\DefineVerbatimEnvironment{Sinput}{Verbatim}
+{formatcom={\color{Sinput}},fontsize=\footnotesize, baselinestretch=0.75}
+\DefineVerbatimEnvironment{Soutput}{Verbatim}
+{formatcom={\color{Soutput}},fontsize=\footnotesize, baselinestretch=0.75}
+
+\color{black}
+
+\maketitle
+
+\begin{abstract}
+ This vignette provides a tutorial for the spatial analysis of principal components (sPCA, \cite{tjart04}) using
+ the \textit{adegenet} package \cite{tjart05} for the R software \cite{np145}. sPCA is first
+ illustrated using a simple simulated dataset, and then using empirical data of Chamois
+ (\textit{Rupicapra rupicapra}) from the Bauges mountains (France). In particular, we illustrate
+ how sPCA complements classical PCA by being more powerful for retrieving non-trivial spatial genetic patterns.
+\end{abstract}
+
+
+\newpage
+\tableofcontents
+
+
+
+
+\newpage
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Introduction}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+This tutorial goes through the \emph{spatial Principal Component
+ Analysis} (sPCA, \cite{tjart04}), a multivariate method devoted to
+the identification of spatial genetic patterns.
+The purpose of this tutorial is to provide guidelines for the application of sPCA as well as to
+illustrate its usefulness for the investigation of spatial genetic patterns.
+After briefly going through the rationale of the method, we introduce the different tools
+implemented for sPCA in \texttt{adegenet}.
+This technical overview is then followed by the analysis of an empirical dataset which illustrates
+the advantage of sPCA over classical PCA.
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Rationale of sPCA}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Mathematical notations used in this tutorial are those from \cite{tjart04}.
+The sPCA analyses a data matrix $\m{X}$ which contains genotypes or
+populations (later refered to as 'entities') in rows and alleles in columns.
+Spatial information is stored inside a spatial weighting matrix
+$\m{L}$ which contains positive terms corresponding to some measurement
+(often binary) of spatial proximity among entities.
+Most often, these terms can be derived from a connection network built
+upon a given algorithm (for instance, pp.572-576 in \cite{tj88}).
+This matrix is row-standardized (\textit{i.e.}, each of its rows sums
+to one), and all its diagonal terms are zero.
+$\m{L}$ can be used to compute the spatial autocorrelation of a
+given centred variable $\m{x}$ (\texttt{i.e.}, with mean zero) with $n$ observations ($\m{x} \in
+\R^n$) using Moran's $I$ \cite{tj223,tj222,tj436}:
+\beq
+I(\m{x}) = \frac{\m{x}^T\m{Lx}}{\m{x}^T\m{x}}
+\label{eqn:I}
+\eeq
+
+In the case of genetic data, $\m{x}$ contains frequencies of an allele.
+Moran's $I$ can be used to measure spatial structure in the values
+of $\m{x}$: it is highly positive when values of $\m{x}$ observed at
+neighbouring sites tend to be similar (positive spatial
+autocorrelation, referred to as \emph{global structures}), while it is
+strongly negative when values of $\m{x}$ observed at
+neighbouring sites tend to be dissimilar (negative spatial
+autocorrelation, referred to as \emph{local structures}).
+\\
+
+
+However, Moran's index measures only spatial structures, does not take
+the variability of $\m{x}$ into account.
+The sPCA defines the following function to measure both spatial
+structure and variability in $\m{x}$:
+\beq
+C(\m{x}) = \mbox{var}(\m{x})I(\m{x}) = \frac{1}{n}\m{x}^T\m{Lx}
+\label{eqn:C}
+\eeq
+
+$C(\m{x})$ is highly positive when $\m{x}$ has a large variance and
+exhibits a global structure; conversely, it is largely negative
+when $\m{x}$ has a high variance and displays a local structure.
+This function is the criterion used in sPCA, which finds linear
+combinations of the alleles of $\m{X}$ (denoted $\m{Xv}$) decomposing $C$ from its
+maximum to its minimum value.
+Because $C(\m{Xv})$ is a product of variance and of autocorrelation,
+it is important, when interpreting the results, to detail both
+components and to compare their value with their range of variation
+(maximum attainable variance, as well as maximum and minimum $I$ are
+known analytically).
+A structure with a low spatial autocorrelation can barely be
+interpreted as a spatial pattern; similarly, a structure with a low
+variance would likely not reflect any genetic structure.
+We will later see how these information can be retrieved in \texttt{adegenet}.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{The \texttt{spca} function}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The simulated dataset used to illustrate this section has been
+analyzed in \cite{tjart04}, and corresponds to Figure 2A of the article.
+In \texttt{adegenet}, the matrix of alleles frequencies previously
+denoted $\m{X}$ exactly corresponds to the \texttt{@tab} slot of \texttt{genind} or
+\texttt{genpop} objects:
+<<>>=
+library(adegenet)
+data(spcaIllus)
+obj <- spcaIllus$dat2A
+obj
+head(truenames(obj[loc="L01"])$tab)
+@
+\noindent The object \texttt{obj} is a \texttt{genind} object; note
+that here, we only displayed the table for the first locus (\texttt{loc="L01"}).
+
+The function performing the sPCA is \texttt{spca}; it accepts a bunch
+of arguments, but only the first two are mandatory to perform the
+analysis (see \texttt{?spca} for further information):
+<<>>=
+args(spca)
+@
+The argument \texttt{obj} is a \texttt{genind}/\texttt{genpop} object.
+By definition in sPCA, the studied entities are georeferenced.
+The spatial information can be provided to the function \texttt{spca}
+in several ways, the first being through the \texttt{xy} argument,
+which is a matrix of spatial coordinates with
+'x' and 'y' coordinates in columns.
+Alternatively, these coordinates can be stored inside the
+\texttt{genind}/\texttt{genpop} object, preferably as
+\texttt{@other\$xy}, in which case the \texttt{spca} function will not
+require a \texttt{xy} argument.
+Basically, spatial information could be stored in any form and with
+any name in the \texttt{@other} slot, but the \texttt{spca} function
+would not recognize it directly.
+Note that \texttt{obj} already contains spatial coordinates at the
+appropriate place.
+Hence, the following uses are valid (\texttt{ask} and \texttt{scannf}
+are set to FALSE to avoid interactivity):
+<<>>=
+mySpca <- spca(obj, ask=FALSE, type=1, scannf=FALSE)
+mySpca2 <- spca(obj,xy=other(obj)$xy,ask=FALSE, type=1, scannf=FALSE)
+all.equal(mySpca, mySpca2)
+names(mySpca)[8]
+@
+\noindent Both objects are the same: they only differ by their call.
+\\
+
+
+Note, however, that spatial coordinates are not directly used in sPCA:
+the spatial information is included in the analysis by the spatial
+weighting matrix $\m{L}$ derived from a connection network (eq. \ref{eqn:I} and \ref{eqn:C}).
+Technically, the \texttt{spca} function does not directly use a
+matrix of spatial weightings, but a connection network with the class
+\texttt{nb} or a list of spatial weights of class \texttt{listw},
+which are both implemented in the \texttt{spdep} package.
+The function \texttt{chooseCN} is a wrapper for different functions
+spread in several packages implementing a variety of connection networks.
+If only spatial coordinates are provided to \texttt{spca},
+\texttt{chooseCN} is called to construct an appropriate graph.
+See \texttt{?chooseCN} for more information.
+Note that many of the \texttt{spca} arguments are in fact arguments
+for \texttt{chooseCN}: \texttt{type}, \texttt{ask}, \texttt{plot.nb},
+\texttt{edit.nb}, \texttt{d1}, \texttt{d2}, \texttt{k}, \texttt{a}, and \texttt{dmin}.
+For instance, the command:
+<<>>=
+mySpca <- spca(obj,type=1,ask=FALSE,scannf=FALSE)
+@
+\noindent performs a sPCA using the Delaunay triangulation as
+connection network (\texttt{type=1}, see \texttt{?chooseCN}), while
+the command:
+<<>>=
+mySpca <- spca(obj,type=5,d1=0,d2=2,scannf=FALSE)
+@
+\noindent computes a sPCA using a connection network which defines
+neighbouring entities from their distances (\texttt{type=5}),
+considering as neighbours two entities whose distance between 0 (\texttt{d1=0}) and 2(\texttt{d2=2}).
+\\
+
+
+Another possibility is of course to provide directly a connection
+network (\texttt{nb} object) or a list of spatial weights
+(\texttt{listw} object) to the \texttt{spca} function; this can be done via the \texttt{cn} argument.
+For instance:
+<<>>=
+myCn <- chooseCN(obj$other$xy, type=6, k=10, plot=FALSE)
+myCn
+class(myCn)
+mySpca2 <- spca(obj,cn=myCn,scannf=FALSE)
+@
+\noindent produces a sPCA using \texttt{myCn} ($k=10$ nearest
+neighbours) as a connection network.
+\\
+
+After providing a genetic dataset along with a spatial information,
+the \texttt{spca} function displays a barplot of eigenvalues and asks
+for a number of positive axes ('first number of axes') and negative
+axes ('second number of axes') to be retained (unless \texttt{scannf} is set to \texttt{FALSE}).
+For the object \texttt{mySpca}, this barplot would be (here we
+indicate in red the retained eigenvalue):
+<<fig=TRUE,pdf=TRUE>>=
+barplot(mySpca$eig,main="Eigenvalues of sPCA", col=rep(c("red","grey"),c(1,100)))
+@
+\noindent Positive eigenvalues (on the left) correspond to global
+structures, while negative eigenvalues (on the right) indicate local patterns.
+Actual structures should result in more extreme (positive or
+negative) eigenvalues; for instance, the object \texttt{mySpca} likely
+contains one single global structure, and no local structure.
+If one does not want to choose the number of retained axes
+interactively, the arguments \texttt{nfposi} (number of retained
+factors with positive eigenvalues) and \texttt{nfnega} (number of
+retained factors with negative eigenvalues) can be used.
+Once these information have been provided to \texttt{spca}, the
+analysis is computed and stored inside an object with the class \texttt{spca}.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Contents of a \texttt{spca} object}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Let us consider a \texttt{spca} object resulting from the analysis of
+the object \texttt{obj}, using a Delaunay triangulation as connection network:
+<<>>=
+mySpca <- spca(obj,type=1,scannf=FALSE,plot.nb=FALSE,nfposi=1,nfnega=0)
+class(mySpca)
+mySpca
+@
+
+\noindent An \texttt{spca} object is a list containing all required
+information about a performed sPCA.
+Details about the different components of such a list can be found in
+the \texttt{spca} documentation (\texttt{?spca}).
+The purpose of this section is to explicit how the elements described
+in \cite{tjart04} are stored inside a \texttt{spca} object.
+
+First, eigenvalues of the analysis are stored inside the
+\texttt{\$eig} component as a numeric vector stored in decreasing order:
+<<fig=TRUE,pdf=TRUE>>=
+head(mySpca$eig)
+tail(mySpca$eig)
+length(mySpca$eig)
+plot(mySpca$eig, type="h", lwd=2, main="A variant of the plot\n of sPCA eigenvalues")
+abline(h=0,col="grey")
+@
+
+\noindent The axes of the analysis, denoted $\m{v}$ in eq. (4) \cite{tjart04}
+are stored as columns inside the \texttt{\$c1} component.
+Each columns contains loadings for all the alleles:
+<<>>=
+mySpca$c1
+head(mySpca$c1)
+tail(mySpca$c1)
+dim(mySpca$c1)
+@
+
+
+\noindent The entity scores, denoted $\psi = \m{Xv}$ in the article, are stored
+in columns in the \texttt{\$li} component:
+<<>>=
+head(mySpca$li)
+tail(mySpca$li)
+dim(mySpca$li)
+@
+
+\noindent The lag vectors of the scores can be used to better perceive
+global structures.
+Lag vectors are stored in the \texttt{\$ls} component:
+<<>>=
+head(mySpca$ls)
+tail(mySpca$ls)
+dim(mySpca$ls)
+@
+
+\noindent Lastly, we can compare the axes of an ordinary, 'classical'
+PCA (denoted $\m{u}$ in the paper) to the axes of the sPCA ($\m{v}$).
+This is achieved by projecting $\m{u}$ onto $\m{v}$, but this
+projection is a particular one: because both $\m{u}$ and $\m{v}$ are
+centred to mean zero and scaled to unit variance, this value of the
+projection simply is the correlation between both axes.
+This information is stored inside the \texttt{\$as} component:
+<<>>=
+mySpca$as
+@
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Graphical display of \texttt{spca} results}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+The information contained inside a \texttt{spca} object can be displayed
+in several ways.
+While we have seen that a simple barplot of sPCA eigenvalues can give a first idea of the
+global and local structures to be retained, we have also seen that
+each eigenvalue can be decomposed into a \textit{variance} and a
+\textit{spatial autocorrelation} (Moran's $I$) component.
+This information is provided by the \texttt{summary} function, but it
+can also be represented graphically.
+The corresponding function is \texttt{screeplot}, and can be used on any
+\texttt{spca} object:
+<<screeplot,fig=TRUE>>=
+screeplot(mySpca)
+@
+
+\noindent The resulting figure represents eigenvalues of sPCA (denoted
+$\lambda_i$ with $i=1,\ldots,n-1$, where $\lambda_1$ is the strongest
+global eigenvalue, and $\lambda_{n-1}$ is the strongest negative
+eigenvalue) according the their variance and Moran's $I$ components.
+These eigenvalues are contained inside a rectangle indicated in dashed
+lines.
+The maximum attainable variance by a linear combination of alleles is
+the one from an ordinary PCA, indicated by the vertical dashed line on
+the right.
+The two horizontal dashed lines indicate the range of variation of
+Moran's $I$, given the spatial weighting matrix that was used.
+This figure is useful to assess whether a given score of entities contains
+relatively enough variability and spatial structuring to be interpreted.
+For instance, here, $\lambda_1$ clearly is the largest eigenvalue in
+terms of variance and of spatial autocorrelation, and can be well
+distinguished from all the other eigenvalues.
+Hence, only the first global structure, associated to $\lambda_1$, should be interpreted.
+\\
+
+
+The global and local tests proposed in \cite{tjart04} can
+be used to reinforce the decision of interpreting or not
+interpreting global and local structures.
+Each test can detect the presence of one kind of structure.
+We can apply them to the object \texttt{obj}, used in our sPCA:
+<<globalrtest,fig=TRUE>>=
+myGtest <- global.rtest(obj$tab,mySpca$lw,nperm=99)
+myGtest
+plot(myGtest)
+@
+
+\noindent The produced object is a \texttt{randtest} object (see
+\texttt{?randtest}), which is the class of objects for Monte-Carlo
+tests in the \textit{ade4} package.
+As shown, such object can be plotted using a \texttt{plot} function:
+the resulting figure shows an histogram of permuted test statistics
+and indicates the observed statistics by a black dot and a segment.
+Here, the plot clearly shows that the oberved test statistic is larger
+than most simulated values, leading to a likely rejection of
+alternative hypothesis.
+Note that because 99 permutations were used, the p-value cannot be
+lower than 0.01.
+In practice, more permutations should be used (like 9999 for results
+intended to be published).
+
+The same can be done with the local test, which here we do not expect
+to be significant:
+<<localrtest,fig=TRUE>>=
+myLtest <- local.rtest(obj$tab,mySpca$lw,nperm=99)
+myLtest
+plot(myLtest)
+@
+~\\
+
+
+Once we have an idea of which structures shall be interpreted, we can
+try to visualize spatial genetic patterns.
+There are several ways to do so.
+The first, most simple approach is through the function plot (see \texttt{?plot.spca}):
+<<plotspca,fig=TRUE>>=
+plot(mySpca)
+@
+\noindent This figure shows different information, that we detail from
+the top to bottom and from left to right.
+The first plot shows the connection network that was used to define
+spatial weightings.
+The second, third, and fourth plots are different representations of
+a score of entities in space, the first global score being the default (argument
+\texttt{axis}).
+In each, the values of scores (\texttt{\$li[,axis]} component of the
+\texttt{spca} object) are represented using black and white symbols
+(a variant being grey levels): white for negative values, and black
+for positive values.
+The second plot is a local interpolation of scores (function
+\texttt{s.image} in \textit{ade4}), using grey levels, with contour lines.
+The closer the contour lines are from each other, the stepest the
+genetic differentiation is.
+The third plot uses different sizes of squares to represent different
+absolute values (\texttt{s.value} in \textit{ade4}): large black squares are well differentiated from
+large white squares, but small squares are less differentiated.
+The fourth plot is a variant using grey levels (\texttt{s.value} in
+\textit{ade4}, with 'greylevel' method).
+Here, all the three representations of the first global score show
+that genotypes are splitted in two genetical clusters, one in the west
+(or left) and one in the east (right).
+The last two plots of the \texttt{plot.spca} function are the two
+already seen displays of eigenvalues.
+\\
+
+Another way of representing a score of sPCA is using the
+\texttt{colorplot} function.
+This function can show up to three scores at the same time by
+translating each score into a channel of color (red, green, and blue).
+The obtained values are used to compose a color using the RGB system.
+See \texttt{?colorplot} for details about this function.
+The original idea of such representation is due to \cite{tj179}.
+Despite the \texttt{colorplot} clearly is more powerful to represent
+more than one score on a single map, we can use it to represent the
+first global structure that was retained in \texttt{mySpca}:
+<<colorplot,fig=TRUE>>=
+colorplot(mySpca,cex=3,main="colorplot of mySpca, first global score")
+@
+\noindent See \texttt{example(colorplot)} and \texttt{example(spca)}
+for more examples of applications of colorplot to represent sPCA scores.
+
+
+So far, we assessed the spatial genetic structures existing in the data.
+We learned that a global structure existed, and we observed that it
+consisted in two east-west genetic clusters.
+Now, we may like to know how each allele contributes to a given score.
+To quantify such contribution, only the absolute value of loadings for
+a given structure is meaningful.
+However, it is more relevant to consider squared loadings, as their
+sum is always constrained to be unit (because $\|\m{v}\|^2=1$).
+We can look for the alleles contributing most to the first axis
+of sPCA, using the function \texttt{loadingplot} (see \texttt{?loadingplot} for
+a description of the arguments):
+<<fig=TRUE>>=
+myLoadings <- mySpca$c1[,1]^2
+names(myLoadings) <- rownames(mySpca$c1)
+loadingplot(myLoadings, xlab="Alleles",ylab="Weight of the alleles",main="Contribution of alleles \n to the first sPCA axis")
+@
+\noindent See \texttt{?loadingplot} for more information about this
+function, in particular for the definition of the threshold value
+above which alleles are annotated.
+Note that it is possible to also separate the alleles by markers,
+using the \texttt{fac} argument, to assess if all markers have
+comparable contributions to a given structure.
+In our case, we would only have to specify \texttt{fac=obj at loc.fac};
+also note that \texttt{loadingplot} invisibly returns information
+about the alleles whose contribution is above the threshold.
+For instance, to identify the 5\% of alleles with the greatest
+contributions to the first global structure in \texttt{mySpca}, we need:
+<<fig=TRUE>>=
+temp <- loadingplot(myLoadings, threshold=quantile(myLoadings, 0.95), xlab="Alleles",ylab="Weight of the alleles",main="Contribution of alleles \n to the first sPCA axis", fac=obj at loc.fac, cex.fac=0.6)
+temp
+@
+
+But to assess the average contribution of each marker, a traditional
+boxplot remains a better tool:
+<<boxplot,fig=TRUE>>=
+boxplot(myLoadings~obj$loc.fac, las=3, main="Contributions by markers \nto the first global score")
+@
+
+
+
+
+
+
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Case study: spatial genetic structure of the chamois in the Bauges mountains}
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+The chamois (\textit{Rupicapra rupicapra}) is a conserved species in France.
+The Bauges mountains is a protected area in which the species has been
+recently studied.
+One of the most important questions for conservation purposes relates to whether individuals
+from this area form a single reproductive unit, or whether they
+are structured into sub-groups, and if so, what causes are likely to
+induce this structuring.
+
+While field observations are very scarce and do not allow to answer
+this question, genetic data can be used to tackle the issue, as
+departure from panmixia should result in genetic structuring.
+The dataset \textit{rupica} contains 335 georeferenced genotypes of Chamois from the
+Bauges mountains for 9 microsatellite markers, which we propose to
+analyse in this exercise.
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{An overview of the data}
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+We first load the data:
+<<echo=TRUE>>=
+data(rupica)
+rupica
+@
+\texttt{rupica} is a typical \texttt{genind} object, which is the class
+of objects storing genotypes (as opposed to population data) in \textit{adegenet}.
+\texttt{rupica} also contains topographic information about the
+sampled area, which can be displayed by calling
+\texttt{rupica\$other\$showBauges}.
+For instance, the spatial distribution of the sampling can be
+displayed as follows:
+<<fig=TRUE,pdf=FALSE,png=TRUE,echo=TRUE>>=
+rupica$other$showBauges()
+points(rupica$other$xy, col="red",pch=20)
+@
+
+This spatial distribution is clearly not random, but seems arranged into
+loose clusters.
+However, superimposed samples can bias our visual assessment of the spatial clustering.
+Use a two-dimensional kernel density estimation (function \texttt{s.kde2d}) to overcome this possible
+issue.
+<<fig=TRUE,pdf=FALSE,png=TRUE,echo=TRUE>>=
+rupica$other$showBauges()
+s.kde2d(rupica$other$xy,add.plot=TRUE)
+points(rupica$other$xy, col="red",pch=20)
+@
+
+Is geographical clustering strong enough to assign safely each individual to a group?
+Accordingly, shall we analyse these data at individual or group level?
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Summarising the genetic diversity}
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+As a prior clustering of genotypes is not known, we cannot employ usual
+$F_{ST}$-based approaches to detect genetic structuring.
+However, genetic structure could still result in a deficit of
+heterozygosity.
+Use the \texttt{summary} of \texttt{genind} objects to compare expected and
+observed heterozygosity:
+<<fig=TRUE>>=
+rupica.smry <- summary(rupica)
+plot(rupica.smry$Hexp, rupica.smry$Hobs, main="Observed vs expected heterozygosity")
+abline(0,1,col="red")
+@
+The red line indicate identity between both quantities.
+What can we say about heterozygosity in this population?
+How can this be tested?
+The result below can be reproduced using a standard testing procedure:
+<<>>=
+t.test(rupica.smry$Hexp, rupica.smry$Hobs,paired=TRUE,var.equal=TRUE)
+@
+~\\
+
+We can seek a global picture of the genetic diversity among genotypes
+using a Principal Component Analysis (PCA, \cite{tj383,np137}, \texttt{dudi.pca} in \texttt{ade4}
+package).
+The analysis is performed on a table of standardised alleles
+frequencies, obtained by \texttt{scaleGen} (use the binomial scaling option).
+Remember to disable the scaling option when performing the PCA.
+The function \texttt{dudi.pca} displays a barplot of
+eigenvalues and asks for a number of retained principal components:
+<<results=hide, fig=TRUE>>=
+rupica.X <- scaleGen(rupica, method="binom")
+rupica.pca1 <- dudi.pca(rupica.X, cent=FALSE, scale=FALSE, scannf=FALSE, nf=2)
+barplot(rupica.pca1$eig)
+@
+
+\noindent The output produced by \texttt{dudi.pca} is a \texttt{dudi} object.
+A \texttt{dudi} object contains various information; in the case of
+PCA, principal axes (loadings), principal components (synthetic variable), and eigenvalues are respectively
+stored in \texttt{\$c1}, \texttt{\$li}, and \texttt{\$eig} slots.
+Here is the content of the PCA:
+<<>>=
+rupica.pca1
+@
+
+In general, eigenvalues represent the amount of genetic diversity --- as measured by
+the multivariate method being used --- represented by each principal component (PC).
+Verify that here, each eigenvalue is the variance of the corresponding PC.
+
+An abrupt decrease in eigenvalues is likely to indicate the boundary
+between true patterns and non-interpretable structures.
+In this case, how many PCs would you interprete?
+\\
+
+
+Use \texttt{s.label} to display to two first components of the analysis.
+Then, use a kernel density (\texttt{s.kde2d}) for a better
+assessment of the distribution of the genotypes onto the principal axes:
+<<fig=TRUE>>=
+s.label(rupica.pca1$li)
+s.kde2d(rupica.pca1$li, add.p=TRUE, cpoint=0)
+add.scatter.eig(rupica.pca1$eig,2,1,2)
+@
+
+What can we say about the genetic diversity among these genotypes as
+inferred by PCA?
+The function \texttt{loadingplot} allows to visualize the contribution
+of each allele, expressed as squared loadings, for a given principal component.
+Using this function, reproduce this figure:
+<<fig=TRUE>>=
+loadingplot(rupica.pca1$c1^2)
+@
+
+\noindent What do we observe?
+We can get back to the genotypes for the concerned markers (e.g.,
+Bm203) to check whether the highlighted genotypes are uncommon.
+\texttt{truenames} extracts the table of allele frequencies from a
+\texttt{genind} object (restoring original labels for markers, alleles, and individuals):
+<<echo=TRUE>>=
+X <- truenames(rupica)
+class(X)
+dim(X)
+bm203.221 <- X[,"Bm203.221"]
+table(bm203.221)
+@
+Only 4 genotypes possess one copy of the allele 221 of marker bm203 (the second result
+corresponds to a replaced missing data).
+Which individuals are they?
+<<>>=
+rownames(X)[bm203.221==0.5]
+@
+Conclusion?
+\\
+
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Mapping and testing PCA results}
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+A frequent practice in spatial genetics is mapping the first principal components (PCs) onto the geographic space.
+The function \texttt{s.value} is well-suited to do so, using black and
+white squares of variable size for positive and negative values.
+To give a legend for this type of representation:
+<<svaluedem,fig=TRUE, echo=TRUE>>=
+s.value(cbind(1:11,rep(1,11)), -5:5, cleg=0)
+text(1:11,rep(1,11), -5:5, col="red",cex=1.5)
+@
+
+\noindent Apply this graphical representation to the first
+two PCs of the PCA:
+<<fig=TRUE>>=
+showBauges <- rupica$other$showBauges
+showBauges()
+s.value(rupica$other$xy, rupica.pca1$li[,1], add.p=TRUE, cleg=0.5)
+title("PCA - first PC",col.main="yellow" ,line=-2, cex.main=2)
+@
+
+<<fig=TRUE>>=
+showBauges()
+s.value(rupica$other$xy, rupica.pca1$li[,2], add.p=TRUE, csize=0.7)
+title("PCA - second PC",col.main="yellow" ,line=-2, cex.main=2)
+@
+
+
+\noindent What can we say about spatial genetic structure as inferred by PCA?
+This visual assessment can be complemented by testing the spatial autocorrelation in the first PCs
+of PCA.
+This can be achieved using Moran's $I$ test.
+Use the function \texttt{moran.mc} in the package \emph{spdep} to perform these tests.
+You will need first to define the spatial connectivity between the sampled individuals.
+For these data, spatial connectivity is best defined as the overlap between home ranges of
+individuals.
+Home ranges will be modelled as disks with a radius of 1150m.
+Use \texttt{chooseCN} to create a connection network based on distance range (``neighbourhood by
+distance'').
+What threshold distance do you choose for individuals to be considered as neighbours?
+<<>>=
+rupica.graph <- chooseCN(rupica$other$xy,type=5,d1=0,d2=2300, plot=FALSE, res="listw")
+@
+The connection network should ressemble this:
+<<fig=TRUE,pdf=FALSE, png=TRUE>>=
+rupica.graph
+plot(rupica.graph, rupica$other$xy)
+title("rupica.graph")
+@
+
+Perform Moran's test for the first two PCs, and plot the results.
+The first test should be significant:
+<<fig=TRUE>>=
+pc1.mctest <- moran.mc(rupica.pca1$li[,1], rupica.graph, 999)
+plot(pc1.mctest)
+@
+
+\noindent Compare this result to the mapping of the first PC of PCA. What is wrong?
+When a test gives unexpected results, it is worth looking into the data in more details.
+Moran's plot (\texttt{moran.plot}) plots the tested variable against its lagged vector.
+Use it on the first PC:
+<<fig=TRUE>>=
+moran.plot(rupica.pca1$li[,1], rupica.graph)
+@
+
+\noindent Actual positive autocorrelation corresponds to a positive correlation between a variable
+and its lag vector.
+Is it the case here? How can we explain that Moran's test was significant?
+
+
+Repeat these analyses for the second PC. What are your conclusions?
+<<fig=TRUE>>=
+pc2.mctest <- moran.mc(rupica.pca1$li[,2], rupica.graph, 999)
+plot(pc2.mctest)
+@
+
+
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Multivariate tests of spatial structure}
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+So far, we have only tested the existence of spatial structures in the first two principal components
+of a PCA of the data.
+Therefore, these tests only describe one fragment of the data, and do not encompass the whole
+diversity in the data.
+As a complement, we can use Mantel test (\texttt{mantel.randtest}) to test spatial structures in the
+whole data, by assessing the correlation between genetic distances and geographic distances.
+Pairwise Euclidean distances are computed using \texttt{dist}.
+Perform Mantel test, using the scaled genetic data you used before in PCA, and the geographic coordinates.
+<<fig=TRUE>>=
+mtest <- mantel.randtest(dist(rupica.X), dist(rupica$other$xy))
+plot(mtest, nclass=30)
+@
+
+\noindent What is your conclusion? Shall we be looking for spatial structures?
+If so, how can we explain that PCA did not reveal them?
+Does the Mantel correlogram (\texttt{mantel.correlog} in \emph{vegan} package) bring any help solving the problem?
+
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{spatial Principal Component Analysis}
+%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+The spatial Principal Component Analysis (sPCA, function \texttt{spca} \cite{tjart04}) has been especially developed to
+investigate hidden or non-obvious spatial genetic patterns.
+Like Moran's $I$ test, sPCA first requires the spatial proximities between genotypes to be
+modeled.
+You will reuse the connection network defined previously using \texttt{chooseCN}, and pass it as the
+'cn' argument of the function \texttt{spca}.
+~\\
+
+Read the documentation of \texttt{spca}, and apply the function to the dataset \texttt{rupica}.
+The function will display a barplot of eigenvalues:
+<<echo=TRUE, fig=TRUE>>=
+rupica.spca1 <- spca(rupica, cn=rupica.graph,scannf=FALSE, nfposi=2,nfnega=0)
+barplot(rupica.spca1$eig, col=rep(c("red","grey"), c(2,1000)) )
+@
+This figure illustrates the fundamental difference between PCA and sPCA.
+Like \texttt{dudi.pca}, \texttt{spca} displays a barplot of
+eigenvalues, but unlike in PCA, eigenvalues of sPCA can also be negative.
+This is because the criterion optimized by the analysis can have
+positive and negative values, corresponding respectively to positive
+and negative autocorrelation.
+Positive spatial autocorrelation correspond to greater genetic similarity between geographically
+closer individuals.
+Conversely, negative spatial autocorrelation corresponds to greater dissimilarity between neighbours.
+The spatial autocorrelation of a variable is measured by Moran's $I$, and interpreted as follows:
+\begin{itemize}
+ \item $I_0 = -1/(n-1)$: no spatial autocorrelation ($x$ is randomly distributed across space)
+ \item $I > I_0$: positive spatial autocorrelation
+ \item $I < I_0$: negative spatial autocorrelation
+\end{itemize}
+
+Principal components of PCA ensure that ($\phi$ referring to one PC) $var(\phi)$ is maximum.
+By contrast, sPCA provides PC which decompose the quantity $var(\phi)I(\phi)$.
+In other words, PCA focuses on variability only, while sPCA is a compromise between variability
+($var(\phi)$) and spatial structure ($I(\phi)$).
+\\
+
+
+In this case, only the principal components associated with the two
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/adegenet -r 919
More information about the adegenet-commits
mailing list