[adegenet-commits] r920 - in pkg/inst/doc: . figs
noreply at r-forge.r-project.org
Mon Jun 20 19:52:28 CEST 2011
Author: jombart
Date: 2011-06-20 19:51:51 +0200 (Mon, 20 Jun 2011)
New Revision: 920
Added:
pkg/inst/doc/figs/spca-008.pdf
pkg/inst/doc/figs/spca-010.pdf
pkg/inst/doc/figs/spca-021.png
pkg/inst/doc/figs/spca-022.png
pkg/inst/doc/figs/spca-024.pdf
pkg/inst/doc/figs/spca-028.png
pkg/inst/doc/figs/spca-029.png
pkg/inst/doc/figs/spca-032.pdf
pkg/inst/doc/figs/spca-039.png
pkg/inst/doc/figs/spca-040.png
pkg/inst/doc/figs/spca-042.png
pkg/inst/doc/figs/spca-043.pdf
pkg/inst/doc/figs/spca-045.pdf
pkg/inst/doc/figs/spca-046.pdf
pkg/inst/doc/figs/spca-047.pdf
pkg/inst/doc/figs/spca-049.pdf
pkg/inst/doc/figs/spca-050.png
pkg/inst/doc/figs/spca-051.png
pkg/inst/doc/figs/spca-052.png
pkg/inst/doc/figs/spca-plotspca.png
Modified:
pkg/inst/doc/adegenet-spca.Rnw
pkg/inst/doc/adegenet-spca.pdf
pkg/inst/doc/adegenet-spca.tex
pkg/inst/doc/figs/spca-023.png
pkg/inst/doc/figs/spca-025.pdf
pkg/inst/doc/figs/spca-030.pdf
pkg/inst/doc/figs/spca-034.pdf
pkg/inst/doc/figs/spca-035.pdf
pkg/inst/doc/figs/spca-044.pdf
pkg/inst/doc/figs/spca-boxplot.pdf
pkg/inst/doc/figs/spca-colorplot.pdf
pkg/inst/doc/figs/spca-globalrtest.pdf
pkg/inst/doc/figs/spca-localrtest.pdf
pkg/inst/doc/figs/spca-plotspca.pdf
pkg/inst/doc/figs/spca-screeplot.pdf
pkg/inst/doc/figs/spca-svaluedem.pdf
Log:
spca vignette finished!
Modified: pkg/inst/doc/adegenet-spca.Rnw
===================================================================
--- pkg/inst/doc/adegenet-spca.Rnw 2011-06-20 15:24:11 UTC (rev 919)
+++ pkg/inst/doc/adegenet-spca.Rnw 2011-06-20 17:51:51 UTC (rev 920)
@@ -22,7 +22,7 @@
\newcommand{\code}[1]{{{\tt #1}}}
-\title{A tutorial for spatial Analysis of Principal Components (sPCA) using \textit{adegenet} \Sexpr{packageDescription("adegenet", fields = "Version")}}
+\title{A tutorial for the spatial Analysis of Principal Components (sPCA) using \textit{adegenet} \Sexpr{packageDescription("adegenet", fields = "Version")}}
\author{Thibaut Jombart}
\date{\today}
@@ -78,9 +78,9 @@
The purpose of this tutorial is to provide guidelines for the application of sPCA as well as to
illustrate its usefulness for the investigation of spatial genetic patterns.
After briefly going through the rationale of the method, we introduce the different tools
-implemented for sPCA in \texttt{adegenet}.
+implemented for sPCA in \textit{adegenet}.
This technical overview is then followed by the analysis of an empirical dataset which illustrates
-the advantage of sPCA over classical PCA.
+the advantage of sPCA over classical PCA for investigating spatial patterns.
@@ -88,8 +88,8 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Rationale of sPCA}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Mathematical notations used in this tutorial are those from \cite{tjart04}.
-The sPCA analyses a data matrix $\m{X}$ which contains genotypes or
+Mathematical notations used in this tutorial are identical to those of the original publication \cite{tjart04}.
+The sPCA analyses a matrix of relative allele frequencies $\m{X}$ which contains genotypes or
populations (later referred to as 'entities') in rows and alleles in columns.
Spatial information is stored inside a spatial weighting matrix
$\m{L}$ which contains positive terms corresponding to some measurement
@@ -117,8 +117,8 @@
\\
-However, Moran's index measures only spatial structures, does not take
-the variability of $\m{x}$ into account.
+However, since it is standardized by the variance of
+$\m{x}$, Moran's index measures only spatial structures and not genetic variability.
The sPCA defines the following function to measure both spatial
structure and variability in $\m{x}$:
\beq
@@ -132,7 +132,7 @@
This function is the criterion used in sPCA, which finds linear
combinations of the alleles of $\m{X}$ (denoted $\m{Xv}$) decomposing $C$ from its
maximum to its minimum value.
-Because $C(\m{Xv})$ is a product of variance and of autocorrelation,
+Because $C(\m{Xv})$ is a product of variance and autocorrelation,
it is important, when interpreting the results, to detail both
components and to compare their value with their range of variation
(maximum attainable variance, as well as maximum and minimum $I$ are
@@ -140,7 +140,7 @@
A structure with a low spatial autocorrelation can barely be
interpreted as a spatial pattern; similarly, a structure with a low
variance would likely not reflect any genetic structure.
-We will later see how these information can be retrieved in \texttt{adegenet}.
+We will later see how this information can be retrieved from \texttt{spca} results.
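As a rough illustration of this criterion (a sketch using \textit{spdep} functions, not the code used internally by \texttt{spca}), the quantity decomposed by sPCA is, up to constant factors, the product of the variance of a score and its Moran's $I$:
<<eval=FALSE>>=
## sketch: variability times spatial structure for a score x and a
## spatial weighting list lw of class listw (see the spdep package)
library(spdep)
spcaCrit <- function(x, lw){
    I <- moran(x, lw, length(x), Szero(lw))$I # Moran's I of x
    var(x) * I # variance multiplied by spatial autocorrelation
}
@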
@@ -151,11 +151,16 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%
The simulated dataset used to illustrate this section has been
analyzed in \cite{tjart04}, and corresponds to Figure 2A of the article.
-In \texttt{adegenet}, the matrix of alleles frequencies previously
+In \textit{adegenet}, the matrix of allele frequencies previously
denoted $\m{X}$ exactly corresponds to the \texttt{@tab} slot of \texttt{genind} or
\texttt{genpop} objects:
+<<print=FALSE,echo=FALSE,results=hide>>=
+library(adehabitat)
+library(spdep)
+@
<<>>=
library(adegenet)
+library(adehabitat)
data(spcaIllus)
obj <- spcaIllus$dat2A
obj
@@ -163,7 +168,9 @@
@
\noindent The object \texttt{obj} is a \texttt{genind} object; note
that here, we only displayed the table for the first locus (\texttt{loc="L01"}).
+\\
+
The function performing the sPCA is \texttt{spca}; it accepts a bunch
of arguments, but only the first two are mandatory to perform the
analysis (see \texttt{?spca} for further information):
@@ -178,34 +185,27 @@
'x' and 'y' coordinates in columns.
Alternatively, these coordinates can be stored inside the
\texttt{genind}/\texttt{genpop} object, preferably as
-\texttt{@other\$xy}, in which case the \texttt{spca} function will not
-require a \texttt{xy} argument.
-Basically, spatial information could be stored in any form and with
-any name in the \texttt{@other} slot, but the \texttt{spca} function
-would not recognize it directly.
+\texttt{@other\$xy}, in which case the \texttt{spca} function will detect and use this information
+without requiring an \texttt{xy} argument.
Note that \texttt{obj} already contains spatial coordinates at the
appropriate place.
-Hence, the following uses are valid (\texttt{ask} and \texttt{scannf}
+Hence, we can use the following command to run the sPCA (\texttt{ask} and \texttt{scannf}
are set to FALSE to avoid interactivity):
<<>>=
mySpca <- spca(obj, ask=FALSE, type=1, scannf=FALSE)
-mySpca2 <- spca(obj,xy=other(obj)$xy,ask=FALSE, type=1, scannf=FALSE)
-all.equal(mySpca, mySpca2)
-names(mySpca)[8]
@
-\noindent Both objects are the same: they only differ by their call.
-\\
+~\\
Note, however, that spatial coordinates are not directly used in sPCA:
the spatial information is included in the analysis by the spatial
weighting matrix $\m{L}$ derived from a connection network (eq. \ref{eqn:I} and \ref{eqn:C}).
-Technically, the \texttt{spca} function does not directly use a
-matrix of spatial weightings, but a connection network with the class
-\texttt{nb} or a list of spatial weights of class \texttt{listw},
-which are both implemented in the \texttt{spdep} package.
+Technically, the \texttt{spca} function can incorporate spatial weightings as a matrix (argument
+\texttt{matWeight}), or as a connection network of class
+\texttt{nb} or \texttt{listw} (argument \texttt{cn}), both implemented in the \texttt{spdep} package.
The function \texttt{chooseCN} is a wrapper for different functions
-spread in several packages implementing a variety of connection networks.
+scattered across several packages implementing a variety of connection networks.
If only spatial coordinates are provided to \texttt{spca},
\texttt{chooseCN} is called to construct an appropriate graph.
See \texttt{?chooseCN} for more information.
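A connection network can also be built explicitly with \texttt{chooseCN} and passed to \texttt{spca} through its \texttt{cn} argument; here is a minimal sketch (the argument values are only illustrative):
<<eval=FALSE>>=
## sketch: build a Delaunay graph (type=1) by hand, then pass it to spca
myCn <- chooseCN(other(obj)$xy, type=1, plot.nb=FALSE, res="listw")
mySpcaCn <- spca(obj, cn=myCn, scannf=FALSE, nfposi=1, nfnega=0)
@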
@@ -213,7 +213,7 @@
for \texttt{chooseCN}: \texttt{type}, \texttt{ask}, \texttt{plot.nb},
\texttt{edit.nb}, \texttt{d1}, \texttt{d2}, \texttt{k}, \texttt{a}, and \texttt{dmin}.
For instance, the command:
-<<>>=
+<<eval=FALSE>>=
mySpca <- spca(obj,type=1,ask=FALSE,scannf=FALSE)
@
\noindent performs a sPCA using the Delaunay triangulation as
@@ -223,8 +223,8 @@
mySpca <- spca(obj,type=5,d1=0,d2=2,scannf=FALSE)
@
\noindent computes a sPCA using a connection network which defines
-neighbouring entities from their distances (\texttt{type=5}),
-considering as neighbours two entities whose distance between 0 (\texttt{d1=0}) and 2(\texttt{d2=2}).
+neighbouring entities based on pairwise geographic distances (\texttt{type=5}),
+considering as neighbours two entities whose distance lies between 0 (\texttt{d1=0}) and 2 (\texttt{d2=2}).
\\
@@ -242,15 +242,16 @@
neighbours) as a connection network.
\\
-After providing a genetic dataset along with a spatial information,
-the \texttt{spca} function displays a barplot of eigenvalues and asks
-for a number of positive axes ('first number of axes') and negative
-axes ('second number of axes') to be retained (unless \texttt{scannf} is set to \texttt{FALSE}).
+
+When used interactively (\texttt{scannf=TRUE}), \texttt{spca} displays a barplot of eigenvalues and
+asks the user for a number of positive axes ('first number of axes') and negative
+axes ('second number of axes') to be retained.
For the object \texttt{mySpca}, this barplot would be (here we
indicate in red the retained eigenvalue):
<<fig=TRUE,pdf=TRUE>>=
barplot(mySpca$eig,main="Eigenvalues of sPCA", col=rep(c("red","grey"),c(1,100)))
@
+
\noindent Positive eigenvalues (on the left) correspond to global
structures, while negative eigenvalues (on the right) indicate local patterns.
Actual structures should result in more extreme (positive or
@@ -260,7 +261,7 @@
interactively, the arguments \texttt{nfposi} (number of retained
factors with positive eigenvalues) and \texttt{nfnega} (number of
retained factors with negative eigenvalues) can be used.
-Once these information have been provided to \texttt{spca}, the
+Once this information has been provided to \texttt{spca}, the
analysis is computed and stored inside an object with the class \texttt{spca}.
@@ -271,7 +272,7 @@
\subsection{Contents of a \texttt{spca} object}
%%%%%%%%%%%%%%%%%%%%%%%%%%%
Let us consider a \texttt{spca} object resulting from the analysis of
-the object \texttt{obj}, using a Delaunay triangulation as connection network:
+the object \texttt{obj}, using a Delaunay triangulation (\texttt{type=1}) as connection network:
<<>>=
mySpca <- spca(obj,type=1,scannf=FALSE,plot.nb=FALSE,nfposi=1,nfnega=0)
class(mySpca)
@@ -291,7 +292,9 @@
head(mySpca$eig)
tail(mySpca$eig)
length(mySpca$eig)
-plot(mySpca$eig, type="h", lwd=2, main="A variant of the plot\n of sPCA eigenvalues")
+myPal <- colorRampPalette(c("red","grey","blue"))
+barplot(mySpca$eig, main="A variant of the plot\n of sPCA eigenvalues", col=myPal(length(mySpca$eig)))
+legend("topright", fill=c("red","blue"), leg=c("Global structures", "Local structures"))
abline(h=0,col="grey")
@
@@ -299,7 +302,6 @@
are stored as columns inside the \texttt{\$c1} component.
Each column contains loadings for all the alleles:
<<>>=
-mySpca$c1
head(mySpca$c1)
tail(mySpca$c1)
dim(mySpca$c1)
@@ -323,11 +325,11 @@
dim(mySpca$ls)
@
-\noindent Lastly, we can compare the axes of an ordinary, 'classical'
+\noindent Lastly, we can compare the axes of a classical
PCA (denoted $\m{u}$ in the paper) to the axes of the sPCA ($\m{v}$).
This is achieved by projecting $\m{u}$ onto $\m{v}$, but this
projection is a particular one: because both $\m{u}$ and $\m{v}$ are
-centred to mean zero and scaled to unit variance, this value of the
+centred to mean zero and scaled to unit variance, the value of the
projection simply is the correlation between both axes.
This information is stored inside the \texttt{\$as} component:
<<>>=
@@ -357,8 +359,8 @@
@
\noindent The resulting figure represents eigenvalues of sPCA (denoted
-$\lambda_i$ with $i=1,\ldots,n-1$, where $\lambda_1$ is the strongest
-global eigenvalue, and $\lambda_{n-1}$ is the strongest negative
+$\lambda_i$ with $i=1,\ldots,r$, where $\lambda_1$ is the highest
+positive eigenvalue, and $\lambda_{r}$ is the most negative
eigenvalue) according to their variance and Moran's $I$ components.
These eigenvalues are contained inside a rectangle indicated in dashed
lines.
@@ -395,10 +397,10 @@
and indicates the observed statistics by a black dot and a segment.
Here, the plot clearly shows that the observed test statistic is larger
than most simulated values, leading to a likely rejection of
-alternative hypothesis.
+the null hypothesis of absence of spatial structure.
Note that because 99 permutations were used, the p-value cannot be
lower than 0.01.
-In practice, more permutations should be used (like 9999 for results
+In practice, more permutations should be used (like 999 or 9999 for results
intended to be published).
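For instance, assuming the global test shown above was obtained with \texttt{global.rtest} applied to the table of allele frequencies and to the spatial weights stored in \texttt{mySpca} (a sketch, not necessarily the exact call used in the vignette):
<<eval=FALSE>>=
## sketch: re-run the global test with more permutations (see ?global.rtest)
myGtest <- global.rtest(obj$tab, mySpca$lw, nperm=999)
plot(myGtest)
@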
The same can be done with the local test, which here we do not expect
@@ -415,11 +417,13 @@
try to visualize spatial genetic patterns.
There are several ways to do so.
The first, most simple approach is through the function plot (see \texttt{?plot.spca}):
-<<plotspca,fig=TRUE>>=
+<<plotspca,fig=TRUE,results=hide,print=FALSE,pdf=FALSE,png=TRUE>>=
plot(mySpca)
@
-\noindent This figure shows different information, that we detail from
-the top to bottom and from left to right.
+
+
+\noindent This figure displays several pieces of information, which we detail from
+top to bottom and from left to right (also see \texttt{?plot.spca}).
The first plot shows the connection network that was used to define
spatial weightings.
The second, third, and fourth plots are different representations of
@@ -445,9 +449,12 @@
already seen displays of eigenvalues.
\\
-Another way of representing a score of sPCA is using the
+
+While the default \texttt{plot} function for \texttt{spca} objects provides a useful summary of the
+results, more flexible tools are needed, for instance to map the principal components onto the geographic space.
+This can be achieved using the
\texttt{colorplot} function.
-This function can show up to three scores at the same time by
+This function can summarize up to three scores at the same time by
translating each score into a channel of color (red, green, and blue).
The obtained values are used to compose a color using the RGB system.
See \texttt{?colorplot} for details about this function.
@@ -458,26 +465,63 @@
<<colorplot,fig=TRUE>>=
colorplot(mySpca,cex=3,main="colorplot of mySpca, first global score")
@
-\noindent See \texttt{example(colorplot)} and \texttt{example(spca)}
+
+\noindent See examples in \texttt{?colorplot} and \texttt{?spca}
for more examples of applications of colorplot to represent sPCA scores.
+\\
+Another common practice is interpolating principal components to get maps of genetic clines.
+Note that it is crucial to perform this interpolation after the analysis, not before: interpolating
+the raw data would add artefactual structures.
+Interpolation is easy to perform using \texttt{interp} from the \texttt{akima} package, with
+\texttt{image} or \texttt{filled.contour} to display the results:
+<<>>=
+library(akima)
+x <- other(obj)$xy[,1]
+y <- other(obj)$xy[,2]
+@
+<<fig=TRUE,png=TRUE,pdf=FALSE>>=
+temp <- interp(x, y, mySpca$li[,1])
+image(temp)
+@
-So far, we assessed the spatial genetic structures existing in the data.
-We learned that a global structure existed, and we observed that it
-consisted in two east-west genetic clusters.
-Now, we may like to know how each allele contributes to a given score.
-To quantify such contribution, only the absolute value of loadings for
-a given structure is meaningful.
-However, it is more relevant to consider squared loadings, as their
-sum is always constrained to be unit (because $\|\m{v}\|^2=1$).
-We can look for the alleles contributing most to the first axis
+\noindent Note that for better clarity, we can use the lagged principal scores (\texttt{\$ls})
+rather than the original scores (\texttt{\$li}); we also achieve a better resolution using specific
+interpolated coordinates:
+<<fig=TRUE,pdf=FALSE,png=TRUE>>=
+interpX <- seq(min(x),max(x),le=200)
+interpY <- seq(min(y),max(y),le=200)
+temp <- interp(x, y, mySpca$ls[,1], xo=interpX, yo=interpY)
+image(temp)
+@
+
+\noindent Alternatively, \texttt{filled.contour} can be used for the display, and a customized color
+palette can be specified:
+<<fig=TRUE,pdf=FALSE,png=TRUE>>=
+myPal <- colorRampPalette(c("firebrick2", "white", "lightslateblue"))
+annot <- function(){
+ title("sPCA - interpolated map of individual scores")
+ points(x,y)
+}
+filled.contour(temp, color.pal=myPal, nlev=50, key.title=title("lagged \nscore 1"), plot.title=annot())
+@
+~\\
+
+
+Besides assessing spatial patterns, it is sometimes valuable to identify which alleles actually
+exhibit the structure of interest.
+In sPCA, the contribution of an allele to a specific structure is given by its squared loading.
+We can look for the alleles contributing most to, for instance, the first axis
of sPCA, using the function \texttt{loadingplot} (see \texttt{?loadingplot} for
a description of the arguments):
<<fig=TRUE>>=
myLoadings <- mySpca$c1[,1]^2
names(myLoadings) <- rownames(mySpca$c1)
-loadingplot(myLoadings, xlab="Alleles",ylab="Weight of the alleles",main="Contribution of alleles \n to the first sPCA axis")
+loadingplot(myLoadings, xlab="Alleles",
+ ylab="Weight of the alleles",
+ main="Contribution of alleles \n to the first sPCA axis")
@
+
\noindent See \texttt{?loadingplot} for more information about this
function, in particular for the definition of the threshold value
above which alleles are annotated.
@@ -490,14 +534,18 @@
For instance, to identify the 5\% of alleles with the greatest
contributions to the first global structure in \texttt{mySpca}, we need:
<<fig=TRUE>>=
-temp <- loadingplot(myLoadings, threshold=quantile(myLoadings, 0.95), xlab="Alleles",ylab="Weight of the alleles",main="Contribution of alleles \n to the first sPCA axis", fac=obj@loc.fac, cex.fac=0.6)
+temp <- loadingplot(myLoadings, threshold=quantile(myLoadings, 0.95),
+ xlab="Alleles",ylab="Weight of the alleles",
+ main="Contribution of alleles \n to the first sPCA axis",
+ fac=obj$loc.fac, cex.fac=0.6)
temp
@
-But to assess the average contribution of each marker, a traditional
-boxplot remains a better tool:
+But to assess the average contribution of each marker, a
+boxplot is probably a better tool:
<<boxplot,fig=TRUE>>=
-boxplot(myLoadings~obj$loc.fac, las=3, main="Contributions by markers \nto the first global score")
+boxplot(myLoadings~obj$loc.fac, las=3, ylab="Contribution", xlab="Marker",
+ main="Contributions by markers \nto the first global score", col="grey")
@
@@ -528,7 +576,7 @@
departure from panmixia should result in genetic structuring.
The dataset \textit{rupica} contains 335 georeferenced genotypes of Chamois from the
Bauges mountains for 9 microsatellite markers, which we propose to
-analyse in this exercise.
+analyse.
@@ -540,19 +588,19 @@
data(rupica)
rupica
@
-\texttt{rupica} is a typical \texttt{genind} object, which is the class
+\texttt{rupica} is a \texttt{genind} object, that is, the class
of objects storing genotypes (as opposed to population data) in \textit{adegenet}.
\texttt{rupica} also contains topographic information about the
sampled area, which can be displayed by calling
\texttt{rupica\$other\$showBauges}.
-For instance, the spatial distribution of the sampling can be
-displayed as follows:
+Altitude maps are displayed using the \textit{adehabitat} package \cite{tj440}.
+The spatial distribution of the sampling can be displayed as follows:
<<fig=TRUE,pdf=FALSE,png=TRUE,echo=TRUE>>=
rupica$other$showBauges()
points(rupica$other$xy, col="red",pch=20)
@
-This spatial distribution is clearly not random, but seems arranged into
+\noindent This spatial distribution is clearly not random, but seems arranged into
loose clusters.
However, superimposed samples can bias our visual assessment of the spatial clustering.
Use a two-dimensional kernel density estimation (function \texttt{s.kde2d}) to overcome this possible
@@ -563,13 +611,13 @@
points(rupica$other$xy, col="red",pch=20)
@
-Is geographical clustering strong enough to assign safely each individual to a group?
-Accordingly, shall we analyse these data at individual or group level?
+Unfortunately, geographical clustering is not strong enough to unambiguously assign each individual to a group.
+Therefore, we need to carry out all analyses at the individual level, which precludes the use of most
+population genetics tools.
-
%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Summarising the genetic diversity}
%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -581,30 +629,32 @@
observed heterozygosity:
<<fig=TRUE>>=
rupica.smry <- summary(rupica)
-plot(rupica.smry$Hexp, rupica.smry$Hobs, main="Observed vs expected heterozygosity")
+plot(rupica.smry$Hobs, rupica.smry$Hexp, main="Observed vs expected heterozygosity")
abline(0,1,col="red")
@
-The red line indicate identity between both quantities.
-What can we say about heterozygosity in this population?
-How can this be tested?
-The result below can be reproduced using a standard testing procedure:
+
+\noindent The red line indicates identity between the two quantities.
+Observed heterozygosity does not seem to deviate massively from theoretical expectations.
+This is confirmed by a classical paired $t$-test:
<<>>=
t.test(rupica.smry$Hexp, rupica.smry$Hobs,paired=TRUE,var.equal=TRUE)
@
~\\
We can seek a global picture of the genetic diversity among genotypes
-using a Principal Component Analysis (PCA, \cite{tj383,np137}, \texttt{dudi.pca} in \texttt{ade4}
+using a Principal Component Analysis (PCA, function \texttt{dudi.pca} in the \texttt{ade4}
package).
-The analysis is performed on a table of standardised alleles
+The analysis is performed on a table of standardized allele
frequencies, obtained by \texttt{scaleGen} (use the binomial scaling option).
-Remember to disable the scaling option when performing the PCA.
+Note that we disable the scaling option when performing the PCA, as it would otherwise re-scale the
+data and erase the previous scaling performed by \texttt{scaleGen}.
The function \texttt{dudi.pca} displays a barplot of
eigenvalues and asks for a number of retained principal components:
<<results=hide, fig=TRUE>>=
rupica.X <- scaleGen(rupica, method="binom")
rupica.pca1 <- dudi.pca(rupica.X, cent=FALSE, scale=FALSE, scannf=FALSE, nf=2)
-barplot(rupica.pca1$eig)
+barplot(rupica.pca1$eig, main="Rupica dataset - PCA eigenvalues",
+ col=heat.colors(length(rupica.pca1$eig)))
@
\noindent The output produced by \texttt{dudi.pca} is a \texttt{dudi} object.
@@ -618,16 +668,14 @@
In general, eigenvalues represent the amount of genetic diversity --- as measured by
the multivariate method being used --- represented by each principal component (PC).
-Verify that here, each eigenvalue is the variance of the corresponding PC.
-
An abrupt decrease in eigenvalues is likely to indicate the boundary
between true patterns and non-interpretable structures.
-In this case, how many PCs would you interprete?
+In this case, the first two PCs may contain some relevant biological signal.
\\
-Use \texttt{s.label} to display to two first components of the analysis.
-Then, use a kernel density (\texttt{s.kde2d}) for a better
+We can use \texttt{s.label} to display the first two components of the analysis.
+Kernel density estimation (\texttt{s.kde2d}) is used for a better
assessment of the distribution of the genotypes onto the principal axes:
<<fig=TRUE>>=
s.label(rupica.pca1$li)
@@ -635,17 +683,14 @@
add.scatter.eig(rupica.pca1$eig,2,1,2)
@
-What can we say about the genetic diversity among these genotypes as
-inferred by PCA?
-The function \texttt{loadingplot} allows to visualize the contribution
-of each allele, expressed as squared loadings, for a given principal component.
+\noindent This scatterplot shows that the only structure identified by PCA points to a few outliers.
+\texttt{loadingplot} confirms that this corresponds to the possession of a few uncommon alleles:
Using this function, reproduce this figure:
<<fig=TRUE>>=
loadingplot(rupica.pca1$c1^2)
@
-\noindent What do we observe?
-We can get back to the genotypes for the concerned markers (e.g.,
+\noindent We can go back to the genotypes for the concerned markers (e.g.,
Bm203) to check whether the highlighted genotypes are uncommon.
\texttt{truenames} extracts the table of allele frequencies from a
\texttt{genind} object (restoring original labels for markers, alleles, and individuals):
@@ -662,7 +707,9 @@
<<>>=
rownames(X)[bm203.221==0.5]
@
-Conclusion?
+These are indeed our outliers.
+From the point of view of PCA, this would be the only structure in the data.
+However, further analyses show that there is more to be seen...
\\
@@ -674,7 +721,7 @@
\subsection{Mapping and testing PCA results}
%%%%%%%%%%%%%%%%%%%%%%%%%%
A frequent practice in spatial genetics is mapping the first principal components (PCs) onto the geographic space.
-The function \texttt{s.value} is well-suited to do so, using black and
+\textit{ade4}'s function \texttt{s.value} is well-suited to do so, using black and
white squares of variable size for positive and negative values.
To give a legend for this type of representation:
<<svaluedem,fig=TRUE, echo=TRUE>>=
@@ -682,34 +729,30 @@
text(1:11,rep(1,11), -5:5, col="red",cex=1.5)
@
-\noindent Apply this graphical representation to the first
+\noindent We apply this graphical representation to the first
two PCs of the PCA:
-<<fig=TRUE>>=
+<<fig=TRUE, pdf=FALSE, png=TRUE>>=
showBauges <- rupica$other$showBauges
showBauges()
s.value(rupica$other$xy, rupica.pca1$li[,1], add.p=TRUE, cleg=0.5)
title("PCA - first PC",col.main="yellow" ,line=-2, cex.main=2)
@
-<<fig=TRUE>>=
+<<fig=TRUE, pdf=FALSE, png=TRUE>>=
showBauges()
s.value(rupica$other$xy, rupica.pca1$li[,2], add.p=TRUE, csize=0.7)
title("PCA - second PC",col.main="yellow" ,line=-2, cex.main=2)
@
-\noindent What can we say about spatial genetic structure as inferred by PCA?
-This visual assessment can be complemented by testing the spatial autocorrelation in the first PCs
-of PCA.
+\noindent As we can see, none of these PCs seems to display a particular spatial pattern.
+This visual assessment can be complemented by a test of spatial autocorrelation in these PCs.
This can be achieved using Moran's $I$ test.
-Use the function \texttt{moran.mc} in the package \emph{spdep} to perform these tests.
-You will need first to define the spatial connectivity between the sampled individuals.
+We use \textit{spdep}'s function \texttt{moran.mc} to perform these two tests.
+We first need to define the spatial connectivity between the sampled individuals.
For these data, spatial connectivity is best defined as the overlap between home ranges of
-individuals.
-Home ranges will be modelled as disks with a radius of 1150m.
-Use \texttt{chooseCN} to create a connection network based on distance range (``neighbourhood by
-distance'').
-What threshold distance do you choose for individuals to be considered as neighbours?
+individuals, modelled as disks with a radius of 1150m.
+\texttt{chooseCN} is used to create the corresponding connection network:
<<>>=
rupica.graph <- chooseCN(rupica$other$xy,type=5,d1=0,d2=2300, plot=FALSE, res="listw")
@
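The Monte Carlo version of Moran's test can then be applied to the first PC; the \texttt{pc1.mctest} object plotted below is assumed to be produced by a call of this kind (the number of permutations is an assumption):
<<eval=FALSE>>=
## sketch: Moran's I randomization test of the first PC, using rupica.graph
pc1.mctest <- moran.mc(rupica.pca1$li[,1], rupica.graph, nsim=999)
@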
@@ -727,20 +770,19 @@
plot(pc1.mctest)
@
-\noindent Compare this result to the mapping of the first PC of PCA. What is wrong?
-When a test gives unexpected results, it is worth looking into the data in more details.
-Moran's plot (\texttt{moran.plot}) plots the tested variable against its lagged vector.
-Use it on the first PC:
+\noindent This result is surprisingly significant. Why is this?
+Moran's plot (\texttt{moran.plot}) represents the tested variable against its lagged vector; we
+apply it to the first PC:
<<fig=TRUE>>=
moran.plot(rupica.pca1$li[,1], rupica.graph)
@
-\noindent Actual positive autocorrelation corresponds to a positive correlation between a variable
+\noindent Positive autocorrelation corresponds to a positive correlation between a variable
and its lag vector.
-Is it the case here? How can we explain that Moran's test was significant?
-
-
-Repeat these analyses for the second PC. What are your conclusions?
+Here, we can see that this relation is entirely driven by the previously identified outliers, which
+turn out to be neighbours.
+This is therefore a fairly trivial and uninteresting pattern.
+Results obtained on the second PC are, less surprisingly, non-significant:
<<fig=TRUE>>=
pc2.mctest <- moran.mc(rupica.pca1$li[,2], rupica.graph, 999)
plot(pc2.mctest)
@@ -768,58 +810,26 @@
plot(mtest, nclass=30)
@
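For reference, a Mantel test of this kind can be set up with \textit{ade4}'s \texttt{mantel.randtest}; a sketch assuming Euclidean distances on the scaled allele frequencies and on the spatial coordinates (not necessarily the exact call that produced \texttt{mtest}):
<<eval=FALSE>>=
## sketch: Mantel test between genetic and geographic distances
mtest <- mantel.randtest(dist(rupica.X), dist(rupica$other$xy), nrepet=999)
@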
-\noindent What is your conclusion? Shall we be looking for spatial structures?
-If so, how can we explain that PCA did not reveal them?
-Does the Mantel correlogram (\texttt{mantel.correlog} in \emph{vegan} package) bring any help solving the problem?
+\noindent Interestingly, this test turns out to be marginally significant, and would encourage us to
+look for spatial patterns. This is the role of the spatial Principal Component Analysis.
-
%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{spatial Principal Component Analysis}
%%%%%%%%%%%%%%%%%%%%%%%%%%
-The spatial Principal Component Analysis (sPCA, function \texttt{spca} \cite{tjart04}) has been especially developed to
-investigate hidden or non-obvious spatial genetic patterns.
-Like Moran's $I$ test, sPCA first requires the spatial proximities between genotypes to be
-modeled.
-You will reuse the connection network defined previously using \texttt{chooseCN}, and pass it as the
-'cn' argument of the function \texttt{spca}.
-~\\
-
-Read the documentation of \texttt{spca}, and apply the function to the dataset \texttt{rupica}.
-The function will display a barplot of eigenvalues:
+We apply an sPCA to the \texttt{rupica} dataset, using the connection network defined previously for
+Moran's $I$ tests:
<<echo=TRUE, fig=TRUE>>=
rupica.spca1 <- spca(rupica, cn=rupica.graph,scannf=FALSE, nfposi=2,nfnega=0)
-barplot(rupica.spca1$eig, col=rep(c("red","grey"), c(2,1000)) )
+barplot(rupica.spca1$eig, col=rep(c("red","grey"), c(2,1000)), main="rupica dataset - sPCA eigenvalues")
@
-This figure illustrates the fundamental difference between PCA and sPCA.
-Like \texttt{dudi.pca}, \texttt{spca} displays a barplot of
-eigenvalues, but unlike in PCA, eigenvalues of sPCA can also be negative.
-This is because the criterion optimized by the analysis can have
-positive and negative values, corresponding respectively to positive
-and negative autocorrelation.
-Positive spatial autocorrelation correspond to greater genetic similarity between geographically
-closer individuals.
-Conversely, negative spatial autocorrelation corresponds to greater dissimilarity between neighbours.
-The spatial autocorrelation of a variable is measured by Moran's $I$, and interpreted as follows:
-\begin{itemize}
- \item $I_0 = -1/(n-1)$: no spatial autocorrelation ($x$ is randomly distributed across space)
- \item $I > I_0$: positive spatial autocorrelation
- \item $I < I_0$: negative spatial autocorrelation
-\end{itemize}
-Principal components of PCA ensure that ($\phi$ referring to one PC) $var(\phi)$ is maximum.
-By contrast, sPCA provides PC which decompose the quantity $var(\phi)I(\phi)$.
-In other words, PCA focuses on variability only, while sPCA is a compromise between variability
-($var(\phi)$) and spatial structure ($I(\phi)$).
-\\
-
-
-In this case, only the principal components associated with the two
-first positive eigenvalues (in red) shall be retained.
+\noindent The principal components associated with the first two
+positive eigenvalues (in red) shall be retained.
The printing of \texttt{spca} objects is more explicit than
\texttt{dudi} objects, but named with the same conventions:
<<>>=
@@ -844,44 +854,55 @@
\\
-Try visualising the sPCA results as you did before with the PCA results.
-To clarify the possible spatial patterns, you can map the lagged PC (\$ls) instead of the PC (\$li), which are a
+We map the sPCA results using \texttt{s.value} and lagged scores (\texttt{\$ls}) instead of the PC (\texttt{\$li}), which are a
'denoisified' version of the PCs.
+<<fig=TRUE,pdf=FALSE, png=TRUE>>=
+showBauges()
+s.value(rupica$other$xy, rupica.spca1$ls[,1], add.p=TRUE, csize=0.7)
+title("sPCA - first PC",col.main="yellow" ,line=-2, cex.main=2)
+@
-First, map the first principal component of sPCA.
-How would you interprete this result?
-How does it compare to the first PC of PCA?
-What inferrence can we make about the way the landscape
-influences gene flow in this population of Chamois?
+\noindent This first PC shows a remarkably clear structure opposing two high-altitude areas
+separated by a valley, which is thought to be an obstacle to the dispersal of Chamois (due to higher
+exposure to predation in low-altitude areas).
-Do the same with the second PC of sPCA.
-Some field observations suggest that this pattern is not artefactual.
-How would you interprete this second structure?
+The second PC of sPCA shows an equally interesting structure:
+<<fig=TRUE,pdf=FALSE, png=TRUE>>=
+showBauges()
+s.value(rupica$other$xy, rupica.spca1$ls[,2], add.p=TRUE, csize=0.7)
+title("sPCA - second PC",col.main="yellow" ,line=-2, cex.main=2)
+@
+
+\noindent The smaller clusters appearing on this map correspond to social units identified by
+direct observation in the field.
+Therefore, this genetic structure is merely a reflection of the social behaviour of these individuals.
\\
-To finish, you can try representing both structures at the same time
-using the color coding introduced by \cite{tj177} (\texttt{?colorplot}).
+
+Both genetic structures can be represented together using \texttt{colorplot}.
The final figure should resemble this (although colors may change from one computer to another):
-<<eval=FALSE>>=
+<<fig=TRUE,pdf=FALSE, png=TRUE>>=
showBauges()
-colorplot(rupica$other$xy, rupica.spca1$ls, axes=1:2, transp=TRUE, add=TRUE, cex=2)
+colorplot(rupica$other$xy, rupica.spca1$ls, axes=1:2, transp=TRUE, add=TRUE, cex=3)
title("sPCA - colorplot of PC 1 and 2\n(lagged scores)", col.main="yellow", line=-2, cex=2)
@
-\begin{center}
-\includegraphics[width=.6\textwidth]{figs/rupicacolors.png}
-\end{center}
+%% \begin{center}
+%% \includegraphics[width=.6\textwidth]{figs/rupicacolors.png}
+%% \end{center}
-
-
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/adegenet -r 920