[adegenet-commits] r189 - pkg/misc/bug-report.1.2-2.01 www www/files
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Fri Oct 10 20:07:16 CEST 2008
Author: jombart
Date: 2008-10-10 20:07:16 +0200 (Fri, 10 Oct 2008)
New Revision: 189
Added:
pkg/misc/bug-report.1.2-2.01/secondData.stru
www/files/tutorial-spca.1.3.rnw
www/files/tutorial-spca.pdf
www/files/tutorial.1.5.rnw
Modified:
www/documentation.html
www/download.html
Log:
Added sPCA tutorial.
Added: pkg/misc/bug-report.1.2-2.01/secondData.stru
===================================================================
--- pkg/misc/bug-report.1.2-2.01/secondData.stru (rev 0)
+++ pkg/misc/bug-report.1.2-2.01/secondData.stru 2008-10-10 18:07:16 UTC (rev 189)
@@ -0,0 +1,5 @@
+ CSDR247_1 HSC_1 ILSTS005_1 ILSTS011_1 ILSTS029_1 MAF209_1 MAF65_1 MCM527_1 OarFCB20_1 TGLA53_1 INRA172_1 OarFCB48_1 ILSTS087_1 MICROBETA_1 SR-CRSP3_1
+34913 2 245 245 284 282 180 176 278 280 163 176 101 103 120 120 154 164 099 101 136 139 144 147 156 160 141 141 167 172 119 119
+34915 2 243 243 282 282 180 176 278 280 166 168 101 103 120 135 154 169 105 105 139 139 156 156 158 160 141 141 164 167 109 117
+34916 2 233 235 284 289 180 176 278 280 165 168 101 103 133 122 154 166 96 099 136 139 142 142 154 158 148 141 164 172 109 119
+
Modified: www/documentation.html
===================================================================
--- www/documentation.html 2008-10-09 22:20:48 UTC (rev 188)
+++ www/documentation.html 2008-10-10 18:07:16 UTC (rev 189)
@@ -28,8 +28,7 @@
documentation of the package.<br>
- the adegenet <span style="font-weight: bold;">tutorial</span> (<a
href="files/tutorial.pdf">pdf</a>), which goes through the main
-functionalities of the package.<img style="width: 80px; height: 37px;"
- alt="" src="images/new.png"><br>
+functionalities of the package. Regularly updated.<br>
- <span style="font-weight: bold;">files</span> <span
style="font-weight: bold;">required</span> for some parts of the
tutorial<br>
@@ -41,7 +40,10 @@
a toy dataset reproducing triploid data <br>
- <a href="files/worksOnlyInDiploids.txt">worksOnyInDiploids.txt</a>:
the list of <span style="font-weight: bold;">functions</span> working <span
- style="font-weight: bold;">only for diploid</span> genotypes <img
+ style="font-weight: bold;">only for diploid</span> genotypes <br>
+- a <span style="font-weight: bold;">sPCA tutorial</span> (<a
+ href="files/tutorial-spca.pdf">pdf</a>), showing how to use the
+spatial Principal Component Analysis (Jombart et al. 2008)<img
style="width: 80px; height: 37px;" alt="" src="images/new.png"><br>
<br>
<span style="text-decoration: underline;">Related publications:</span><br>
Modified: www/download.html
===================================================================
--- www/download.html 2008-10-09 22:20:48 UTC (rev 188)
+++ www/download.html 2008-10-10 18:07:16 UTC (rev 189)
@@ -26,7 +26,8 @@
<br>
<img alt="" src="images/bullet.png" style="width: 10px; height: 10px;">
The
-<span style="font-weight: bold;">devel version</span> is also available
+<span style="font-weight: bold;">devel version</span> (adegenet_1.2-2)
+is also available
from <a href="https://r-forge.r-project.org/scm/?group_id=120"
target="_top">R-Forge's
daily snapshots</a>.<br>
Added: www/files/tutorial-spca.1.3.rnw
===================================================================
--- www/files/tutorial-spca.1.3.rnw (rev 0)
+++ www/files/tutorial-spca.1.3.rnw 2008-10-10 18:07:16 UTC (rev 189)
@@ -0,0 +1,506 @@
+% On lui donne l'en-tete
+\input{enTeteTJ}
+
+\setboolean{tutorial}{true}
+\newcommand{\titrefiche}{A tutorial for the spatial Principal
+ Component Analysis using the R package \emph{adegenet\_1.2-2 }}
+\newcommand{\shorttitle}{\emph{adegenet} tutorial}
+\newcommand{\sommairefiche}{ }
+\newcommand{\auteurfiche}{T. JOMBART}
+\newcommand{\rcmd}[1]{\textcolor{red}{\texttt{#1}}}
+
+%infos générales du document
+\title{\titrefiche}
+\author{\auteurfiche}
+\date{October 2008 - adegenet\_1.2-2}
+
+
+% debut du document proprement dit
+\begin{document}
+\input{styleTJ}
+\selectlanguage{english}
+
+%% Les options par défaut de Sweave, ici on lui dit de mettre les figures dans le
+%% dossier "figs" avec le préfixe, histoire de ne pas avoir trop de
+%% fichiers dans le dossier de travail. On lui dit aussi qu'il y a une figure par défaut,
+%% qu'on ne veut les figures en EPS, et on lui donne la taille par défaut des figures
+%% (en pouces) pour R, mais PAS pour le document final.
+
+\SweaveOpts{prefix.string = figs/figure, fig = FALSE, eps = FALSE, pdf = TRUE, png = FALSE, width = 6, height = 6, size=.6}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+% DEBUT DU DOCUMENT
+
+\maketitle
+% Pour éviter les problèmes de couleur...
+\color{black}
+\noindent
+{\small
+This tutorial still is in its beta form: it may (and will surely) need completion and improvements.
+You can contribute to its improvement it by posting questions on
+the \textit{adegenet} forum:
+\url{adegenet-forum at lists.r-forge.r-project.org}, or by sending
+comments to me at \url{jombart at biomserv.univ-lyon1.fr}.
+}
+\newpage
+\tableofcontents
+\newpage
+\color{black}
+
+\newpage
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Introduction}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+This tutorial goes through the \emph{spatial Principal Component
+ Analysis} \citep[sPCA,][]{tjart04}, a multivariate method devoted to
+the multivariate analysis of genetic markers.
+The sPCA is implemented inside the \emph{adegenet} package
+\citep{tjart05} for the R software \citep{tj400,np145}.
+Parts of this implementation relies on functions from the \emph{ade4}
+package \citep{tj311,tj521}.
+Reading of the original paper describing the sPCA is assumed.
+The purpose of this tutorial is to provide guidelines for practical
+issues performing a sPCA and for interpreting the results.
+After recalling basics of the sPCA, we detail the different tools in \emph{adegenet}
+that are related to the method.
+We conclude by going through the analysis of an empirical dataset.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{The sPCA in \emph{adegenet}}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Some basics about sPCA}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Mathematical notations used in this tutorial are those from \citet{tjart04}.
+The sPCA analyses a data matrix $\m{X}$ which contains genotypes or
+populations (later refered to as 'entities') in rows and alleles in columns.
+Spatial information is stored inside a spatial weighting matrix
+$\m{L}$ which contains positive terms corresponding to some measurement
+(often binary) of spatial proximity among entities.
+Most often, these terms can be derived from a connection network built
+upon a given algorithm \citep[for instance, ][pp.572-576]{tj88}.
+This matrix is row-standardized (\textit{i.e.}, each of its rows sums
+to one), and all its diagonal terms are zero.
+$\m{L}$ can be used to compute the spatial autocorrelation of a
+given centred variable $\m{x}$ (\textit{i.e.}, with mean zero) with $n$ observations ($\m{x} \in
+\R^n$) using Moran's $I$ \citep{tj223,tj222,tj436}:
+\beq
+I(\m{x}) = \frac{\m{x}^T\m{Lx}}{\m{x}^T\m{x}}
+\label{eqn:I}
+\eeq
+
+In the case of genetic data, $\m{x}$ contains frequencies of an allele.
+Moran's $I$ can be used to measure spatial structure among the values
+of $\m{x}$: it is highly positive when values of $\m{x}$ observed at
+neighbouring sites tend to be similar (positive spatial
+autocorrelation, referred to as \emph{global structures}), while it is
+strongly negative when values of $\m{x}$ observed at
+neighbouring sites tend to be dissimilar (negative spatial
+autocorrelation, referred to as \emph{local structures}).
+
+However, Moran's index measures only spatial structures, and does not take
+the variability of $\m{x}$ into account.
+The sPCA defines the following function to measure both spatial
+structure and variability in $\m{x}$:
+\beq
+C(\m{x}) = \mbox{var}(\m{x})I(\m{x}) = \frac{1}{n}\m{x}^T\m{Lx}
+\label{eqn:C}
+\eeq
+
+$C(\m{x})$ is highly positive when $\m{x}$ has a large variance and
+exhibits a global structure; conversely, it is largely negative
+when $\m{x}$ has a high variance and displays a local structure.
+This function is the criterion used in sPCA, which finds linear
+combinations of the alleles of $\m{X}$ (denoted $\psi=\m{Xv}$) decomposing $C$ from its
+maximum to its minimum value.
+Because $C(\m{Xv})$ is a product of variance and of autocorrelation,
+it is important, when interpreting the results, to detail both
+components and to compare their value with their range of variation
+(maximum attainable variance, as well as maximum and minimum $I$ are
+known analytically).
+A structure with a low spatial autocorrelation can barely be
+interpreted as a spatial pattern; similarly, a structure with a low
+variance would likely not reflect any genetic structure.
+We will later see how these information can be retrieved in \emph{adegenet}.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{The \texttt{spca} function}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The simulated dataset used to illustrate this section has been
+analyzed in \citet{tjart04}, and corresponds to the Figure 2A of the article.
+In \emph{adegenet}, the matrix of alleles frequencies previously
+denoted $\m{X}$ exactly corresponds to the \texttt{@tab} slot of \texttt{genind} or
+\texttt{genpop} objects:
+<<>>=
+data(spcaIllus)
+obj <- spcaIllus$dat2A
+obj
+head(truenames(obj[loc="L01"])$tab)
+@
+\noindent The object \texttt{obj} is a \texttt{genind} object; note
+that here, we only displayed the table for the first locus (\texttt{loc="L01"}).
+
+The function performing the sPCA is \rcmd{spca}; it accepts a bunch
+of arguments, but only the first two are mandatory to perform the
+analysis (see \texttt{?spca} for further information):
+<<>>=
+args(spca)
+@
+The argument \texttt{obj} is a \texttt{genind}/\texttt{genpop} object.
+By definition in sPCA, the studied entities are georeferenced.
+The spatial information can be provided to the function \texttt{spca}
+in several ways, the first being through the \texttt{xy} argument,
+which is a matrix of spatial coordinates with
+'x' and 'y' coordinates in columns.
+Alternatively, these coordinates can be stored inside the
+\texttt{genind}/\texttt{genpop} object, preferably as
+\texttt{@other\$xy}, in which case the \texttt{spca} function will not
+require a \texttt{xy} argument.
+Basically, spatial information could be stored in any form and with
+any name in the \texttt{@other} slot, but the \texttt{spca} function
+would not recognize it directly.
+Note that \texttt{obj} already contains spatial coordinates at the
+appropriate place.
+Hence, the following uses are valid (\texttt{ask} and \texttt{scannf}
+are set to FALSE to avoid interactivity):
+<<>>=
+mySpca <- spca(obj,ask=FALSE,scannf=FALSE)
+mySpca2 <- spca(obj,xy=obj at other$xy,ask=FALSE,scannf=FALSE)
+all.equal(mySpca, mySpca2)
+names(mySpca)[8]
+@
+\noindent Both objects are the same: they only differ by their call.
+\\
+
+
+Note, however, that spatial coordinates are not directly used in sPCA:
+the spatial information is included in the analysis by the spatial
+weighting matrix $\m{L}$ derived from a connection network (eq. \ref{eqn:I} and \ref{eqn:C}).
+Technically, the \texttt{spca} function does not directly use a
+matrix of spatial weightings, but a connection network with the class
+\texttt{nb} or a list of spatial weights of class \texttt{listw},
+which are both implemented by Roger Bivand's package \texttt{spdep}.
+The function \texttt{chooseCN} is a wrapper for different functions
+spread in several packages implementing a variety of connection networks.
+If only spatial coordinates are provided to \texttt{spca},
+\texttt{chooseCN} is called to construct an appropriate graph.
+See \texttt{?chooseCN} for more information.
+Note that many of the \texttt{spca} arguments are in fact arguments
+for \texttt{chooseCN}: \texttt{type}, \texttt{ask}, \texttt{plot.nb},
+\texttt{edit.nb}, \texttt{d1}, \texttt{d2}, \texttt{k}, \texttt{a}, and \texttt{dmin}.
+For instance, the command:
+<<>>=
+mySpca <- spca(obj,type=1,ask=FALSE,scannf=FALSE)
+@
+\noindent performs a sPCA using the Delaunay triangulation as
+connection network (\texttt{type=1}, see \texttt{?chooseCN}), while
+the command:
+<<>>=
+mySpca <- spca(obj,type=5,d1=0,d2=2,scannf=FALSE)
+@
+\noindent computes a sPCA using a connection network which defines
+neighbouring entities from their distances (\texttt{type=5}),
+considering as neighbours two entities whose distance between 0
+(\texttt{d1=0}) and 2 (\texttt{d2=2}).
+
+Another possibility is of course to provide directly a connection
+network (\texttt{nb} object) or a list of spatial weights
+(\texttt{listw} object) to the \texttt{spca} function; this can be done via the \texttt{cn} argument.
+For instance:
+<<>>=
+myCn <- chooseCN(obj$other$xy, type=6, k=10, plot=FALSE)
+myCn
+class(myCn)
+mySpca2 <- spca(obj,cn=myCn,scannf=FALSE)
+@
+\noindent produces a sPCA using \texttt{myCn} ($k=10$ nearest
+neighbours) as a connection network.
+\\
+
+After providing a genetic dataset along with a spatial information,
+the \texttt{spca} function displays a barplot of eigenvalues and asks
+for a number of positive axes ('first number of axes') and negative
+axes ('second number of axes') to be retained (unless \texttt{scannf} is set to \texttt{FALSE}).
+For the object \texttt{mySpca}, this barplot would be (here we
+indicate in red the retained eigenvalue):
+<<fig=TRUE,pdf=TRUE,size=.5>>=
+barplot(mySpca$eig,main="Eigenvalues of sPCA", col=rep(c("red","grey"),c(1,100)))
+@
+\noindent Positive eigenvalues (on the left) correspond to global
+structures, while negative eigenvalues (on the right) indicate local patterns.
+Actual structures should result in more extreme (positive or
+negative) eigenvalues; for instance, the object \texttt{mySpca} likely
+contains one single global structure, and no local structure.
+If one does not want to choose the number of retained axes
+interactively, the arguments \texttt{nfposi} (number of retained
+factors with positive eigenvalues) and \texttt{nfnega} (number of
+retained factors with negative eigenvalues) can be used.
+Once these information have been provided to \texttt{spca}, the
+analysis is computed and stored inside an object with the class \texttt{spca}.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{A \texttt{spca} object}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Let us consider a \texttt{spca} object resulting from the analysis of
+the object \texttt{obj}, using a Delaunay triangulation as connection network:
+<<>>=
+mySpca <- spca(obj,type=1,scannf=FALSE,plot.nb=FALSE,nfposi=1,nfnega=0)
+class(mySpca)
+mySpca
+@
+
+\noindent An \texttt{spca} object is a list containing all required
+information about a performed sPCA.
+Details about the different components of such a list can be found in
+the \texttt{spca} documentation (\texttt{?spca}).
+The purpose of this section is to show how the elements described
+in \citet{tjart04} are stored inside a \texttt{spca} object.
+
+First, eigenvalues of the analysis are stored inside the
+\texttt{\$eig} component as a numeric vector stored in decreasing order:
+<<fig=TRUE,pdf=TRUE,png=FALSE>>=
+head(mySpca$eig)
+tail(mySpca$eig)
+length(mySpca$eig)
+plot(mySpca$eig, type="h", lwd=2, main="A variant of the plot\n of sPCA eigenvalues")
+abline(h=0,col="grey")
+@
+
+\noindent The axes of the analysis, denoted $\m{v}$ in \citet[][eq.(4)]{tjart04} are stored as columns inside the \texttt{\$c1} component.
+Each columns contains loadings for all the alleles:
+<<>>=
+head(mySpca$c1)
+tail(mySpca$c1)
+dim(mySpca$c1)
+@
+
+
+\noindent The entity scores ($\psi = \m{Xv}$), are stored
+in columns in the \texttt{\$li} component:
+<<>>=
+head(mySpca$li)
+tail(mySpca$li)
+dim(mySpca$li)
+@
+
+\noindent The lag vectors of the scores can be displayed graphically
+instead of basic scores so as to better perceive global structures.
+Lag vectors are stored in the \texttt{\$ls} component:
+<<>>=
+head(mySpca$ls)
+tail(mySpca$ls)
+dim(mySpca$ls)
+@
+
+\noindent Lastly, we can compare the axes of an ordinary, 'classical'
+PCA (denoted $\m{u}$ in the paper) to the axes of the sPCA ($\m{v}$).
+This is achieved by projecting $\m{u}$ onto $\m{v}$, but this
+projection is a particular one: because both $\m{u}$ and $\m{v}$ are
+centred to mean zero and scaled to unit variance, the value of the
+projection simply is the correlation between both axes.
+This information is stored inside the \texttt{\$as} component:
+<<>>=
+mySpca$as
+@
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Representing the information}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+The information contained inside a \texttt{spca} object can be displayed
+in several ways.
+While we have seen that a simple barplot of sPCA eigenvalues can give a first idea of the
+global and local structures to be retained, we have also seen that
+each eigenvalue can be decomposed into a \textit{variance} and a
+\textit{spatial autocorrelation} (Moran's $I$) component.
+This information is provided by the \texttt{summary} function, but it
+can also be represented graphically.
+The corresponding function is \rcmd{screeplot}, and can be used on any
+\texttt{spca} object:
+<<screeplot,fig=TRUE>>=
+screeplot(mySpca)
+@
+\noindent The resulting figure represents eigenvalues of sPCA (denoted
+$\lambda_i$ with $i=1,\ldots,n-1$, where $\lambda_1$ is the strongest
+global eigenvalue, and $\lambda_{n-1}$ is the strongest local eigenvalue) according the their variance and Moran's $I$ components.
+These eigenvalues are contained inside a rectangle indicated in dashed
+lines.
+The maximum attainable variance by a linear combination of alleles is
+the one from an ordinary PCA, indicated by the vertical dashed line onto
+the right.
+The two horizontal dashed lines indicate the range of variation of
+Moran's $I$, given the spatial weighting matrix that was used.
+This figure is useful to assess whether a given score of entities contains
+relatively enough variability and spatial structuring to be interpreted.
+For instance, here, $\lambda_1$ clearly is the largest eigenvalue in
+terms of variance and of spatial autocorrelation, and can be well
+distinguished from all the other eigenvalues.
+Hence, only the first global structure, associated to $\lambda_1$, should be interpreted.
+\\
+
+
+The global and local tests proposed in \citet{tjart04} can
+be used to reinforce the decision of interpreting or not
+interpreting global and local structures.
+Each test can detect the presence of one kind of structure.
+We can apply them to the object \texttt{obj}, used in our sPCA:
+<<globalrtest,fig=TRUE>>=
+myGtest <- global.rtest(obj$tab,mySpca$lw,nperm=99)
+myGtest
+plot(myGtest)
+@
+
+\noindent The produced object is a \texttt{randtest} object (see
+\texttt{?randtest}), which is the class of object for Monte-Carlo
+tests in the \textit{ade4} package.
+As shown, such object can be plotted using a \texttt{plot} function:
+the resulting figure shows an histogram of permuted test statistics
+and indicates the observed statistics by a black dot and a segment.
+Here, the plot clearly shows that the oberved test statistic is larger
+than most simulated values, leading to a likely rejection of
+alternative hypothesis.
+Note that because 99 permutations were used, the p-value cannot be
+lower than 0.01.
+In practice, more permutations should be used (like 9999 for results
+intended to be published).
+
+The same can be done with the local test, which here we do not expect
+to be significant:
+<<localrtest,fig=TRUE>>=
+myLtest <- local.rtest(obj$tab,mySpca$lw,nperm=99)
+myLtest
+plot(myLtest)
+@
+~\\
+
+
+Once we have an idea of which structures shall be interpreted, we can
+try to visualize spatial genetic patterns.
+There are several ways to do so.
+The first, most simple approach is through the function plot (see \texttt{?plot.spca}):
+<<plotspca,fig=TRUE,size=.8>>=
+plot(mySpca)
+@
+\noindent This figure shows different information, that we detail from
+the top to the bottom and from the left to the right.
+The first plot shows the connection network that was used to define
+spatial weightings.
+The second, third, and fourth plots are different representations of
+entities scores onto one axis in space, the first global score being the default (argument
+\texttt{axis}).
+In each, the values of scores (\texttt{\$li[,axis]} component of the
+\texttt{spca} object) are represented using black and white symbols
+(a variant being grey levels): white for negative values, and black
+for positive values.
+The second plot is a local interpolation of scores (function
+\rcmd{s.image} in \textit{ade4}), using grey levels, with contour lines.
+The closer the contour lines are from each other, the stepest the
+genetic differentiation is.
+The third plot uses different sizes of squares to represent different
+absolute values (\rcmd{s.value} in \textit{ade4}): large black squares are well differentiated from
+large white squares, but small squares are less differentiated.
+The fourth plot is a variant using grey levels (\rcmd{s.value} in
+\textit{ade4}, with 'greylevel' method).
+Here, all the three representations of the first global score show
+that genotypes are splitted in two genetical clusters, one in the west
+(or left) and one in the east (right).
+The last two plots of the \texttt{plot.spca} function are the two
+already seen displays of eigenvalues.
+\\
+
+Another way of representing a score of sPCA is using the
+\rcmd{colorplot} function.
+This function can show up to three scores at the same time by
+translating each score into a channel of color (red, green, and blue).
+The obtained values are used to compose a color using the RGB system.
+See \texttt{?colorplot} for details about this function.
+The original idea of such representation is due to \citet{tj179}.
+Despite the \texttt{colorplot} clearly is more powerful to represent
+more than one set of scores on a single map, we can use it to represent the
+first global structure that was retained in \texttt{mySpca}:
+<<colorplot,fig=TRUE>>=
+colorplot(mySpca,cex=3,main="colorplot of mySpca, first global score")
+@
+\noindent See \texttt{example(colorplot)} and \texttt{example(spca)}
+for more examples of applications of colorplot to represent sPCA scores.
+
+
+So far, we assessed the spatial genetic structures existing in the data.
+We learned that a global structure existed, and we observed that it
+consisted in two east-west genetic clusters.
+Now, we may like to know how each allele contributes to a given set of
+scores.
+To quantify such contribution, the absolute value of loadings for
+a given structure can be used.
+However, it is more relevant to consider squared loadings, as their
+sum is always constrained to be unit (because $\|\m{v}\|^2=1$).
+We can look for the alleles contributing most to the first axis
+of sPCA, using the function \texttt{loadingplot} (see \texttt{?loadingplot} for
+a description of the arguments):
+<<fig=TRUE>>=
+myLoadings <- mySpca$c1[,1]^2
+names(myLoadings) <- rownames(mySpca$c1)
+loadingplot(myLoadings, xlab="Alleles",ylab="Weight of the alleles",main="Contribution of alleles \n to the first sPCA axis")
+@
+\noindent See \texttt{?loadingplot} for more information about this
+function, in particular for the definition of the threshold value
+above which alleles are annotated.
+Note that it is possible to separate alleles by markers,
+using the \texttt{fac} argument, to assess if all markers have
+comparable contributions to a given structure.
+In our case, we would only have to specify \texttt{fac=obj at loc.fac};
+also note that \texttt{loadingplot} invisibly returns information
+about the alleles whose contribution is above the threshold.
+For instance, to identify the 5\% of alleles with the greatest
+contributions to the first global structure in \texttt{mySpca}, we need:
+<<fig=TRUE, width=8, size=.9>>=
+temp <- loadingplot(myLoadings, threshold=quantile(myLoadings, 0.95), xlab="Alleles",ylab="Weight of the alleles",main="Contribution of alleles \n to the first sPCA axis", fac=obj at loc.fac, cex.lab=1, cex.fac=.8)
+temp
+@
+
+But to assess the average contribution of each marker, a traditional
+boxplot remains a better tool:
+<<boxplot,fig=TRUE, keep.source=TRUE>>=
+boxplot(myLoadings~obj$loc.fac, las=3,
+ main="Contributions by markers \nto the first global score")
+@
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Diving into data: analysis of a real dataset}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%
+To come...
+
+
+\clearpage
+\bibliography{biblioTJ}
+\bibliographystyle{jtbnew}
+
+\end{document}
Property changes on: www/files/tutorial-spca.1.3.rnw
___________________________________________________________________
Name: svn:executable
+ *
Added: www/files/tutorial-spca.pdf
===================================================================
(Binary files differ)
Property changes on: www/files/tutorial-spca.pdf
___________________________________________________________________
Name: svn:mime-type
+ application/octet-stream
Added: www/files/tutorial.1.5.rnw
===================================================================
--- www/files/tutorial.1.5.rnw (rev 0)
+++ www/files/tutorial.1.5.rnw 2008-10-10 18:07:16 UTC (rev 189)
@@ -0,0 +1,964 @@
+% On lui donne l'en-tete
+\input{enTeteTJ}
+
+\setboolean{tutorial}{true}
+\newcommand{\titrefiche}{A tutorial for the R package
+ \texttt{adegenet\_1.2-0 }}
+\newcommand{\shorttitle}{\texttt{adegenet} tutorial}
+\newcommand{\sommairefiche}{ }
+\newcommand{\auteurfiche}{T. JOMBART}
+\newcommand{\rcmd}[1]{\textcolor{red}{\texttt{#1}}}
+
+%infos générales du document
+\title{\titrefiche}
+\author{\auteurfiche}
+\date{June 2008 - adegenet\_1.2-0}
+
+
+% debut du document proprement dit
+\begin{document}
+\input{styleTJ}
+\selectlanguage{english}
+
+%% Les options par défaut de Sweave, ici on lui dit de mettre les figures dans le
+%% dossier "figs" avec le préfixe, histoire de ne pas avoir trop de
+%% fichiers dans le dossier de travail. On lui dit aussi qu'il y a une figure par défaut,
+%% qu'on ne veut les figures en EPS, et on lui donne la taille par défaut des figures
+%% (en pouces) pour R, mais PAS pour le document final.
+
+\SweaveOpts{prefix.string = figs/figure, fig = FALSE, eps = FALSE, pdf = TRUE, png = FALSE, width = 6, height = 6}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+% DEBUT DU DOCUMENT
+
+\maketitle
+% Pour éviter les problèmes de couleur...
+\color{black}
+\noindent
+{\small For any questions or comments, please send an email to: \\ \url{jombart at biomserv.univ-lyon1.fr}.}
+\newpage
+\tableofcontents
+\SweaveOpts{prefix.string = figs/analyse, fig = FALSE, eps = FALSE, pdf = TRUE, png = FALSE, width = 6, height = 6}
+\newpage
+\color{black}
+
+\newpage
+\section{Introduction}
+This tutorial proposes a short visit through functionalities of the \rcmd{adegenet} package for R \citep{tj400,np145}.
+The purpose of this package is to facilitate the multivariate analysis of molecular marker data, especially using the \rcmd{ade4} package \citep{tj311}.
+Data can be imported from popular softwares like GENETIX, or converted from simple data frame of genotypes.
+\rcmd{adegenet} also aims at providing a platform from which to use easily methods provided by other R packages \citep[e.g., ][]{tj510}.
+Indeed, if it is possible to perform various genetic data analyses using R, data formats often differ from one package to another, and conversions are sometimes far from easy and straightforward.
+\\
+
+In this tutorial, I first present the two object classes used in \rcmd{adegenet}, namely \rcmd{genind} (genotypes of individuals) and \rcmd{genpop} (genotypes grouped by populations).
+Then, several topics will be tackled using reproductible examples.
+
+\section{First steps}
+\subsection{Installing the package}
+Current version of the package is 1.1-0, and is compatible with R 2.6.2.
+Here the \rcmd{adegenet} package is installed along with other recommended packages.
+<<inst,eval=F>>=
+install.packages("adegenet",dep=TRUE)
+@
+Then the first step is to load the package:
+<<>>=
+library(adegenet)
+@
+
+\subsection{Object classes}
+Two classes of objects are defined, depending on the scale at which the genetic information is stored:
+\rcmd{genind} is used for individual genotypes, whereas \rcmd{genpop} is used for alleles numbers counted by populations.
+Note that the term 'population', here and later, is employed in a broad sense: it simply refers to any grouping of individuals.
+
+\subsubsection{genind objects}
+These objects can be obtained by importation from foreign softwares,
+from a \rcmd{data.frame} of genotypes, or by conversion from a table of allelic frequencies (see 'importing data').
+<<genind>>=
+data(nancycats)
+is.genind(nancycats)
+nancycats
+@
+A \rcmd{genind} object is formal S4 object with several slots,
+accessed using the '\rcmd{@}' operator (see \rcmd{class?genind}).
+Note that the '\rcmd{\$}' was also implemented for adegenet objects,
+so that slots can be accessed as if they were components of a list.
+The main slot in \rcmd{genind} is a table of allelic frequencies of individuals (in rows) for every alleles in every loci.
+Being frequencies, data sum to one per locus, giving the score of 1 for an homozygote and 0.5 for an heterozygote.
+For instance:
+<<>>=
+nancycats$tab[10:18,1:10]
+@
+Individual '010' is an homozygote for the allele 09 at locus 1, while '018' is an heterozygote with alleles 06 and 09.
+As user-defined labels are not always valid (for instance, there can
+be duplicated), generic labels are used for individuals, markers, alleles and eventually population.
+The true names are stored in the object (components \rcmd{\$[...].names} where ... can be 'ind', 'loc', 'all' or 'pop').
+For instance :
+<<>>=
+nancycats$loc.names
+@
+gives the true marker names, and
+<<>>=
+nancycats$all.names[[3]]
+@
+gives the allele names for marker 3.
+\\
+
+Optional components are also allowed.
+The slot \rcmd{@other} is a list that can include any additionnal information.
+The optional slot \rcmd{@pop} (a factor giving a grouping of individuals) is particular in that the behaviour of many functions will check automatically for it and behave accordingly.
+In fact, each time an argument 'pop' is required by a function, it is first seeked in \rcmd{@pop}.
+For instance, using the function \rcmd{genind2genpop} to convert \rcmd{nancycats} to a \rcmd{genpop} object, there is no need to give a 'pop' argument as it exists in the \rcmd{genind} object:
+<<>>=
+table(nancycats$pop)
+catpop <- genind2genpop(nancycats)
+catpop
+@
+
+Other additional components can be stored (like here, spatial coordinates of populations in \$xy) but will not be passed during any conversion (\rcmd{catpop} has no \$other\$xy).
+
+Finally, a \rcmd{genind} object generally contains its matched call, \textit{i.e.} the instruction that created itself.
+This is not the case, however, for objects loaded using \rcmd{data}.
+When call is available, it can be used to regenerate an object.
+<<>>=
+obj <- read.genetix(system.file("files/nancycats.gtx",package="adegenet"))
+obj$call
+toto <- eval(obj$call)
+identical(obj,toto)
+@
+
+\subsubsection{genpop objects}
+We use the previously built \rcmd{genpop} object:
+<<>>=
+catpop
+is.genpop(catpop)
+catpop$tab[1:5,1:10]
+@
+The matrix \$tab contains alleles counts per population (here, cat colonies).
+These objects are otherwise very similar to \rcmd{genind} in their
+structure, and possess generic names, true names, the matched call and
+an \rcmd{@other} slot.
+
+
+
+
+
+
+
+
+
+
+\section{Various topics}
+\subsection{Importing data}
+Data can be read from the softwares GENETIX (.gtx), STRUCTURE (.str or
+.stru), FSTAT (.dat) and Genepop (.gen) files, using the corresponding
+\rcmd{read} function: \rcmd{read.genetix}, \rcmd{read.structure},
+\rcmd{read.fstat}, and \rcmd{read.genepop}.
+In all cases, a \rcmd{genind} object will be produced.
+Alternatively, one can use the function \rcmd{import2genind} which detects a file format from its extension and uses the appropriate routine.
+For instance:
+<<import>>=
+obj1 <- read.genetix(system.file("files/nancycats.gtx",package="adegenet"))
+obj2 <- import2genind(system.file("files/nancycats.gtx", package="adegenet"))
+all.equal(obj1,obj2)
+
+@
+
+\noindent The only difference between \rcmd{obj1} and \rcmd{obj2} is
+their call (which is normal as they were obtained from different
+command lines).
+However, it happens that data are available in other formats.
+Most of these can be read using \rcmd{df2genind}, which transform at
+\rcmd{data.frame} (imported in R using \rcmd{read.table}, for
+instance) into a \rcmd{genind}.
+The \rcmd{data.frame} must contain genotypes in rows, markers in
+column, and each of its terms must be a character string coding the alleles.
+There is no restriction about the ploidy of the data.
+Missing data can be series of '0' or NAs.
+Since version 1.2-0 of adegenet, \rcmd{df2genind} handles two different cases:
+\begin{itemize}
+\item a separator is used between the alleles (argument 'sep')
+\item no separator is used; in this case, all alleles should be coded by the same number of characters.
+ For instance, if some alleles are coded by two characters, then there
+ should be no '1','2','3'... but only '01', '02', and '03'.
+\end{itemize}
+
+%% In all cases, it should be possible to store data in an individuals x markers table where each element is a character string coding 2 alleles.
+%% Such data are interpretable when all strings contain 2,4 or 6 characters.
+%% For instance, "11" will be an homozygote 1/1, "1209" will be an heterozygote 12/09.
+%% The function \rcmd{df2genind} converts such data to a \rcmd{genind}.
+%% One has to read data into R, using for instance \rcmd{read.table}, and then
+%% use \rcmd{df2genind}.
+
+Here, I provide an example of using \rcmd{df2genind} to read a data
+set from the hierfstat package.
+Data are diploid, alleles are coded by one character, and there are no
+separator between alleles.
+<<>>=
+library(hierfstat)
+toto <- read.fstat.data(paste(.path.package("hierfstat"),"/data/diploid.dat",sep="",collapse=""),nloc=5)
+head(toto)
+@
+\rcmd{toto} is a data frame containing genotypes and a population factor.
+<<>>=
+obj <- df2genind(X=toto[,-1],pop=toto[,1])
+obj
+head(genind2df(obj))
+@
+
+
+Lastly, \rcmd{genind} or \rcmd{genpop} objects can be obtained from a data matrix similar to the \rcmd{\$tab} component (respectively, alleles frequencies and alleles counts).
+Such action is achieved by the constructors \rcmd{genind} (or \rcmd{as.genind}) and \rcmd{genpop}
+(or \rcmd{as.genpop}).
+The table passed as argument to these constructors must have correct
+names: rownames identify the genotypes/populations, while colnames
+have the form '[marker].[allele]'
+Here is an example for \rcmd{genpop} using dataset from ade4:
+<<>>=
+library(ade4)
+data(microsatt)
+microsatt$tab[10:15,12:15]
+@
+\rcmd{microsatt\$tab} contains alleles counts, and can therefore be used to make a \rcmd{genpop} object.
+<<>>=
+toto <- genpop(microsatt$tab)
+toto
+@
+
+
+
+
+\subsection{Exporting data}
+Genotypes in \rcmd{genind} format can be exported to the R packages
+\emph{genetics} (using \rcmd{genind2genotype}) and \emph{hierfstat} (using \rcmd{genind2hierfstat}).
+The package \emph{genetics} is now deprecated, but the implemented
+class \rcmd{genotype} is still used by various packages.
+The package \emph{hierfstat} does not define a class, but requires
+data to be formated in a particular way.
+Here are examples of how to use these functions:
+<<genind2genotype>>=
+obj <- genind2genotype(nancycats)
+class(obj)
+obj[1:4,1:5]
+class(obj$fca8)
+@
+
+<<genind2hierfstat>>=
+obj <- genind2hierfstat(nancycats)
+class(obj)
+obj[1:4,1:5]
+@
+
+\noindent Now we can ues the function \rcmd{varcomp.glob} from
+\emph{hierfstat} to compute 'variance' components:
+<<>>=
+varcomp.glob(obj$pop,obj[,-1])
+@
+
+
+A more generic way to export data is to produce a data.frame of genotypes
+coded by character strings.
+This is done by \rcmd{genind2df}:
+<<genind2df>>=
+obj <- genind2df(nancycats)
+obj[1:5,1:5]
+@
+
+\noindent However, some softwares will require alleles to be
+separated.
+The argument \rcmd{sep} allows one to specify any separator.
+For instance:
+<<>>=
+genind2df(nancycats,sep="|")[1:5,1:5]
+@
+
+Note that tabulations can be obtained as follows using '{\backslash}t' character.
+
+
+
+
+
+\subsection{Manipulating data}
+Data manipulation is meant to be easy in \rcmd{adegenet} (if it is
+not, complain!).
+First, as \rcmd{genind} and \rcmd{genpop} objects are basically formed
+by a data matrix (the \rcmd{@tab} slot), it is natural to subset these objects like it is done
+with a matrix.
+The \rcmd{$[$} operator does this, forming a new object with the retained genotypes/populations and alleles:
+<<>>=
+titi <- toto[1:3,]
+toto$pop.names
+titi
+titi$pop.names
+@
+
+\noindent The object \rcmd{toto} has been subsetted, keeping only the
+first three populations.
+Of course, any subsetting available for a matrix can be used with \rcmd{genind} and \rcmd{genpop} objects.
+For instance, we can subset \rcmd{titi} to keep only the third marker:
+<<>>=
+titi <- titi[,titi$loc.fac=="L3"]
+titi
+@
+
+\noindent Now, \rcmd{titi} only contains the 11 alleles of the third
+marker of \rcmd{toto}.
+\\
+
+To simplify the task of separating data by marker, the function
+\rcmd{seploc} can be used.
+It returns a list of objects (optionnaly, of data matrices), each
+corresponding to a marker:
+<<seploc>>=
+sepCats <- seploc(nancycats)
+class(sepCats)
+names(sepCats)
+sepCats$fca45
+@
+
+\noindent The object \rcmd{sepCats\$fca45} only contains data of the
+marker fca45.
+\\
+
+Following the same idea, \rcmd{seppop} allows one to separate genotypes
+in a \rcmd{genind} object by population.
+For instance, we can separate genotype of cattles in the dataset \rcmd{microbov}
+by breed:
+<<seppop>>=
+data(microbov)
+obj <- seppop(microbov)
+class(obj)
+names(obj)
+obj$Borgou
+@
+
+\noindent The returned object \rcmd{obj} is a list of \rcmd{genind}
+objects each containing genotypes of a given breed.
+
+A last, rather vicious trick is to separate data by population and by marker.
+This is easy using \rcmd{lapply}; one can first separate population
+then markers, or the contrary.
+Here, we separate markers inside each breed in \rcmd{obj}
+<<sepultim>>=
+obj <- lapply(obj,seploc)
+names(obj)
+class(obj$Borgou)
+names(obj$Borgou)
+obj$Borgou$INRA63
+@
+
+For instance, \rcmd{obj\$Borgou\$INRA63} contains genotypes of the
+breed Borgou for the marker INRA63.
+\\
+
+Lastly, one may want to pool genotypes in different datasets, but having
+the same markers, into a single dataset.
+This is more than just merging the \rcmd{@tab} components of all
+datasets, because alleles can differ (they almost always do) and
+markers are not necessarily sorted the same way.
+The function \rcmd{repool} is designed to avoid these problems.
+It can merge any \rcmd{genind} provided as arguments as soon as the
+same markers are used.
+For instance, it can be used after a \rcmd{seppop} to retain only some populations:
+<<repool>>=
+obj <- seppop(microbov)
+names(obj)
+newObj <- repool(obj$Borgou, obj$Charolais)
+newObj
+newObj$pop.names
+@
+Done !
+
+
+
+
+\subsection{Using summaries}
+Both \rcmd{genind} and \rcmd{genpop} objects have a summary providing basic information about data.
+Informations are both printed and invisibly returned as a list.
+
+<<summary,fig=T,png=FALSE,pdf=T,size=.8>>=
+toto <- summary(nancycats)
+names(toto)
+
+par(mfrow=c(2,2))
+
+plot(toto$pop.eff,toto$pop.nall,xlab="Colonies sample size",ylab="Number of alleles",main="Alleles numbers and sample sizes",type="n")
+text(toto$pop.eff,toto$pop.nall,lab=names(toto$pop.eff))
+
+barplot(toto$loc.nall,ylab="Number of alleles", main="Number of alleles per locus")
+
+barplot(toto$Hexp-toto$Hobs,main="Heterozygosity: expected-observed",ylab="Hexp - Hobs")
+
+barplot(toto$pop.eff,main="Sample sizes per population",ylab="Number of genotypes",las=3)
+@
+
+Is mean observed H significantly lower than mean expected H ?
+
+<<>>=
+bartlett.test(list(toto$Hexp,toto$Hobs))
+t.test(toto$Hexp,toto$Hobs,pair=T,var.equal=TRUE,alter="greater")
+@
+Yes, it is.
+
+
+\subsection{Testing for structuration among populations}
+The G-statistic test \citep{tj511} is implemented for \rcmd{genind} objects and produces a \rcmd{randtest} object (package ade4).
+The function to use is \rcmd{gstat.randtest}, and requires the package \emph{hierfstat}.:
+<<>>=
+library(ade4)
+toto <- gstat.randtest(nancycats,nsim=99)
+toto
+plot(toto)
+@
+Now that the test is performed, one can ask for F statistics.
+To get these, data are first converted to be used in the hierfstat package:
+<<>>=
+library(hierfstat)
+toto <- genind2hierfstat(nancycats)
+head(toto)
+varcomp.glob(toto$pop,toto[,-1])
+@
+F statistics are provided in \$F; for instance, here, $F_{st}$ is $0.083$.
+
+\subsection{Testing for Hardy-Weinberg equilibrium}
+The Hardy-Weinberg equilibrium test is implemented for \rcmd{genind} objects.
+The function to use is \rcmd{HWE.test.genind}, and requires the package \emph{genetics}.
+Here we first produce a matrix of p-values (\rcmd{res="matrix"}) using parametric test.
+Monte Carlo procedure are more reliable but also more computer-intensive (use \rcmd{permut=TRUE}).
+<<HWE>>=
+toto <- HWE.test.genind(nancycats,res="matrix")
+dim(toto)
+@
+One test is performed per locus and population, \textit{i.e.} 153 tests in this case.
+Thus, the first question is: which tests are highly significant?
+<<>>=
+colnames(toto)
+which(toto<0.0001,TRUE)
+@
+Here only 4 tests indicate departure from HW.
+Rows give populations, columns give markers.
+Now complete tests are returned, but the significant ones are already known.
+<<>>=
+toto <- HWE.test.genind(nancycats,res="full")
+toto$fca23$P06
+toto$fca90$P10
+toto$fca96$P10
+toto$fca37$P13
+@
+
+
+\subsection{Performing a Principal Component Analysis on \rcmd{genind} objects}
+The tables contained in \rcmd{genind} objects can be submitted to a Principal Component Analysis (PCA) to seek a typology of individuals.
+Such analysis is straightforward using \textit{adegenet} to prepare data and \textit{ade4} for the analysis \textit{per se}.
+One has first to replace missing data.
+Putting each missing observation at the mean of the concerned allele frequency seems the best choice (NA will be stuck at the origin).
+<<pcaexpl>>=
+data(microbov)
+any(is.na(microbov$tab))
+sum(is.na(microbov$tab))
+@
+There are 6325 missing data.
+Assuming that these are evenly distributed (for illustration purpose
+only!), we replace them using \rcmd{na.replace}.
+As we intend to use a PCA, the appropriate replacement method is to
+put each NA at the mean of the corresponding allele (argument 'method'
+set to 'mean').
+<<>>=
+obj <- na.replace(microbov,method="mean")
+@
+
+\noindent Done. Now, the analysis can be performed. Data are centred but not scaled as 'units' are the same.
+<<fig=T,size=.4>>=
+pca1 <- dudi.pca(obj$tab,cent=TRUE,scale=FALSE,scannf=FALSE,nf=3)
+barplot(pca1$eig[1:50],main="Eigenvalues")
+@
+Here we represent the genotypes and 95\% inertia ellipses for populations.
+<<fig=T,size=.6>>=
+s.class(pca1$li,obj$pop,lab=obj$pop.names,sub="PCA 1-2",csub=2)
+add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=2,posi="top")
+@
+
+\noindent This plane shows that the main structuring is between African an French breeds, the second structure reflecting genetic diversity among African breeds.
+The third axis reflects the diversity among French breeds:
+Overall, all breeds seem well differentiated.
+<<fig=T,size=.6>>=
+s.class(pca1$li,obj$pop,xax=1,yax=3,lab=obj$pop.names,sub="PCA 1-3",csub=2)
+add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=3,posi="top")
+@
+
+
+\subsection{Performing a Correspondance Analysis on \rcmd{genpop} objects}
+Being contingency tables, the \rcmd{@tab} in \rcmd{genpop} objects can be submitted to a Correspondance Analysis (CA) to seek a typology of populations.
+The approach is very similar to the previous one for PCA.
+Missing data are first replaced during convertion from \rcmd{genind},
+but one could create a \rcmd{genpop} with NAs and then use
+\rcmd{na.replace} to get rid of missing observations.
+<<caexpl,fig=T,size=.4>>=
+data(microbov)
+obj <- genind2genpop(microbov,missing="chi2")
+ca1 <- dudi.coa(as.data.frame(obj$tab),scannf=FALSE,nf=3)
+barplot(ca1$eig,main="Eigenvalues")
+@
+
+Now we display the resulting typologies:
+<<fig=T,size=.6>>=
+s.label(ca1$li,lab=obj$pop.names,sub="CA 1-2",csub=2)
+add.scatter.eig(ca1$eig,nf=3,xax=1,yax=2,posi="top")
+@
+
+<<fig=T,size=.6>>=
+s.label(ca1$li,xax=1,yax=3,lab=obj$pop.names,sub="CA 1-3",csub=2)
+add.scatter.eig(ca1$eig,nf=3,xax=2,yax=3,posi="bottomright")
+@
+
+\noindent Once again, axes are to be interpreted separately in terms of continental differentiation, a among-breed diversities.
+
+
+\subsection{Analyzing a single locus}
+Here the emphasis is put on analyzing a single locus using different methods.
+Any marker can be isolated using the \rcmd{seploc} instruction.
+<<pca>>=
+data(nancycats)
+toto <- seploc(nancycats,truenames=TRUE, res.type="matrix")
+X <- toto$fca90
+@
+\rcmd{fca90.ind} is a matrix containing only genotypes for the marker fca90.
+It can be analyzed, for instance, using an inter-class PCA.
+This analyzis provides a typology of individuals having maximal inter-colonies variance.
+<<fig=T,png=F,pdf=T, size=.6>>=
+library(ade4)
+
+pcaX <- dudi.pca(X,cent=T,scale=F,scannf=FALSE)
+pcabetX <- between(pcaX,nancycats$pop,scannf=FALSE)
+s.arrow(pcabetX$c1,xlim=c(-.9,.9))
+s.class(pcabetX$ls,nancycats$pop,cell=0,cstar=0,add.p=T)
+sunflowerplot(X %*% as.matrix(pcabetX$c1),add=T)
+add.scatter.eig(pcabetX$eig,xax=1,yax=2,posi="bottomright")
+@
+Here the differences between individuals are mainly expressed by three alleles: 199, 197 and 193.
+However, there is no clear structuration to be seen at an individual level.
+Is $F_{st}$ significant taking only this marker into account?
+We perform the G-statistic test and enventually compute the corresponding F statistics.
+Note that we use the constructor \rcmd{genind} to generate an object
+of this class from \rcmd{X}:
+<<>>=
+fca90.ind <- genind(X,pop=nancycats$pop)
+gstat.randtest(fca90.ind,nsim=999)
+F <- varcomp(genind2hierfstat(fca90.ind))$F
+rownames(F) <- c("tot","pop")
+colnames(F) <- c("pop","ind")
+F
+@
+In this case the information is best summarized by F statistics than by an ordination method.
+It is likely because all colonies are differentiated but none forming clusters of related colonies.
+
+\subsection{Testing for isolation by distance}
+Isolation by distance (IBD) is tested using Mantel test between a matrix of genetic distances and a matrix of geographic distances.
+It can be tested using individuals as well as populations.
+This example uses cat colonies.
+We use Edwards' distance \textit{versus} Euclidean distances between colonies.
+<<ibd,fig=T,size=.35>>=
+data(nancycats)
+toto <- genind2genpop(nancycats,miss="0")
+Dgen <- dist.genpop(toto,method=2)
+Dgeo <- dist(nancycats$other$xy)
+library(ade4)
+ibd <- mantel.randtest(Dgen,Dgeo)
+ibd
+plot(ibd)
+@
+Isolation by distance is clearly not significant.
+
+
+
+
+\subsection{Using Monmonier's algorithm to define genetic boundaries}
+Monmonier's algorithm \citep{tj433} was originally designed to find boundaries of maximum differences between contiguous polygons of a tesselation.
+As such, the method was basically used in geographical analysis.
+More recently, \cite{np120} suggested that this algorithm could be employed to detect genetic boundaries among georeferecend genotypes (or populations).
+This algorithm is implemented using a more general approach than the initial one in \rcmd{adegenet}.
+
+Instead of using Voronoi tesselation as in original version, the functions \rcmd{monmonier} and \rcmd{optimize.monmonier} can handle various neighbouring graphs such as Delaunay triangulation, Gabriel's graph, Relative Neighbours graph, etc.
+These graphs defined spatial connectivity among 'points' (genotypes or populations), any couple of points being neighbours (if connected) or not.
+Another information is given by a set of markers which define genetic distances among these 'points'.
+The aim of Monmonier's algorithm is to find the path through the strongest genetic distances between neighbours.
+A more complete description of the principle of this algorithm will be found in the documentation of \rcmd{monmonier}.
+Indeed, the very purpose of this tutorial is simply to show how it can be used on genetic data.
+\\
+
+Let's take the example from the function's manpage and detail it.
+The dataset used is \rcmd{sim2pop}.
+
+<<mon1,fig=TRUE,size=.6>>=
+data(sim2pop)
+sim2pop
+summary(sim2pop$pop)
+
+temp <- sim2pop$pop
+levels(temp) <- c(17,19)
+temp <- as.numeric(as.character(temp))
+plot(sim2pop$other$xy,pch=temp,cex=1.5,xlab='x',ylab='y')
+legend("topright",leg=c("Pop A", "Pop B"),pch=c(17,19))
+@
+
+There are two sampled populations in this dataset, with inequal sample sizes (100 and 30).
+Twenty microsatellite-like loci are available for all genotypes (no missing data).
+So, what do \rcmd{monmonier} ask for?
+<<mon2>>=
+args(monmonier)
+@
+
+\noindent The first argument (\rcmd{xy}) is a matrix of geographic coordinates, already stored in \rcmd{sim2pop}.
+Next argument is an object of class \rcmd{dist}, which is basically a distance matrix cut in half.
+For now, we will use the classical Euclidean distance among alleles frequencies of genotypes.
+This is obtained by:
+<<mon3>>=
+D <- dist(sim2pop$tab)
+@
+
+\noindent The next argument (\rcmd{cn}) is a connection network.
+As existing routines to build such networks are spread over several packages, the function \rcmd{chooseCN} will help you choose one.
+This is an interactive function, so difficult to demonstrate here (see \rcmd{?chooseCN}).
+Here we ask the function not to ask for a choice (\rcmd{ask=FALSE}) and select the second type of graph which is the one of Gabriel (\rcmd{type=2}).
+<<mon4,fig=TRUE>>=
+gab <- chooseCN(sim2pop$other$xy,ask=FALSE,type=2)
+@
+
+\noindent The obtained network is automatically plotted by the function.
+It seems we are now ready to proceed to the algorithm.
+<<mon5,eval=FALSE>>=
+ mon1 <- monmonier(sim2pop$other$xy,D,gab$cn)
+@
+\begin{center}
+\includegraphics[width=.5\textwidth]{figs/monthres1.png}
+\end{center}
+
+\noindent This plot shows all local differences sorted in decreasing order.
+The idea behind this is that a significant boundary would cause local differences to decrease abruptly after the boundary.This should be used to choose the \emph{threshold} difference for the algorithm to stop.
+Here, no boundary is visible: we stop.
+\\
+
+Why do the algorithm fail to find a boundary?
+Either because there is no genetic differentiation to be found, or because the signal differentiating both populations is too weak to overcome the random noise in genetic distances.
+What is the $F_{st}$ between the two samples?
+<<>>=
+library(hierfstat)
+temp <- genind2hierfstat(sim2pop)
+varcomp.glob(temp[,1],temp[,-1])$F
+@
+
+\noindent This value is somewhat moderate ($F_{st}=0.038$).
+Is it significant?
+
+<<fig=TRUE>>=
+gtest <- gstat.randtest(sim2pop)
+gtest
+plot(gtest)
+@
+
+\noindent Yes, it is very significant.
+The two samples are indeed genetically differenciated.
+So, can Monmonier's algorithm find a boundary between the two populations?
+Yes, if we get rid of the random noise.
+This can be achieved using simple ordination method like Principal Coordinates Analysis.
+
+<<mon6,fig=TRUE>>=
+library(ade4)
+pco1 <- dudi.pco(D,scannf=FALSE,nf=1)
+barplot(pco1$eig,main="Eigenvalues")
+@
+
+\noindent We retain only the first eigenvalue.
+The corresponding coordinates are used to redefine the genetic distances among genotypes.
+The algorithm is then rerunned.
+<<mon7>>=
+D <- dist(pco1$li)
+@
+<<mon8,eval=FALSE>>=
+mon1 <- monmonier(sim2pop$other$xy,D,gab$cn)
+@
+\begin{center}
+\includegraphics[width=.5\textwidth]{figs/monthres2.png}
+\end{center}
+
+<<mon9,echo=FALSE>>=
+mon1 <- monmonier(sim2pop$other$xy,D,gab,scan=FALSE)
+mon1
+@
+
+\noindent This may take some time... but never more than five minutes on an 'ordinary' personnal computer.
+The object \rcmd{mon1} contains the whole information about the boundaries found.
+As several boundaries can be seeked at the same time (argument \rcmd{nrun}), you have to specify about which run and which direction you want to get informations (values of differences or path coordinates).
+For instance:
+<<mon10>>=
+names(mon1)
+names(mon1$run1)
+mon1$run1$dir1
+@
+
+\noindent It can also be useful to identify which points are crossed
+by the barrier; this can be done using \rcmd{coords.monmonier}:
+<<>>=
+coords.monmonier(mon1)
+@
+
+\noindent The returned dataframe contains, in this order, the $x$ and
+$y$ coordinates of the points of the barrier, and the identifiers of
+the two 'parent' points, that is, the points whose barycenter is the
+point of the barrier.
+
+Finally, you can plot very simply the obtained boundary using the method \rcmd{plot}:
+<<fig=TRUE>>=
+plot(mon1)
+@
+
+\noindent see arguments in \rcmd{?plot.monmonier} to customize this representation.
+Last, we can compare the infered boundary with the actual distribution of populations:
+<<fig=TRUE,size=.7>>=
+plot(mon1,add.arrows=FALSE,bwd=8)
+
+temp <- sim2pop$pop
+levels(temp) <- c(17,19)
+temp <- as.numeric(as.character(temp))
+points(sim2pop$other$xy,pch=temp,cex=1.3)
+legend("topright",leg=c("Pop A", "Pop B"),pch=c(17,19))
+@
+
+\noindent Not too bad...
+
+
+
+
+\subsection{How to simulate hybridization?}
+The function \rcmd{hybridize} allows to simulate hybridization between
+individuals from two distinct genetic pools, or more broadly between
+two \rcmd{genind} objects.
+Here, we use the example from the manpage of the function, to go a
+little further.
+Please have a look at the documentation, especially at the different
+possible outputs (outputs for the software STRUCTURE is available!).
+<<hybr>>=
+temp <- seppop(microbov)
+names(temp)
+salers <- temp$Salers
+zebu <- temp$Zebu
+zebler <- hybridize(salers, zebu, n=40, pop="zebler")
+@
+
+\noindent A first generation (F1) of hybrids 'zebler' is obtained.
+Is it possible to perform a backcross, say, with 'salers' population?
+Yes, here it is:
+<<>>=
+F2 <- hybridize(salers, zebler, n=40)
+F3 <- hybridize(salers, F2, n=40)
+F4 <- hybridize(salers, F3, n=40)
+@
+
+\noindent and so on...
+Are these hybrids still genetically distinct?
+Let's merge all hybrids in a single dataset and test for genetic differentiation:
+<<fig=TRUE,size=.4>>=
+dat <- repool(zebler,F2,F3,F4)
+test <- gstat.randtest(dat)
+plot(test)
+temp <- genind2hierfstat(dat)
+varcomp.glob(temp[,1],temp[,-1])$F
+@
+
+\noindent The $F_{st}$ is not very strong (0.013) but still very
+significant: hybrids are still pretty well differentiated.
+
+
+
+\subsection{Reading AFLP data}
+Adegenet was primarly suited to handle codominant markers like microsatellites.
+However, dominant markers like AFLP can be used as well.
+This is a particular case of genind object where each locus possesses
+only one allele, and where values in the \rcmd{@tab} slot are no longer frequencies, but presence/absence indications.
+\\
+
+Here is an example using a toy dataset 'AFLP.txt' that can be downloaded
+from the adegenet website, section 'Documentation':
+<<aflpread>>=
+dat <- read.table("AFLP.txt",header=TRUE)
+dat
+@
+\noindent The \rcmd{genind} constructor is used to build a genind object:
+<<>>=
+obj <- genind(dat)
+obj
+truenames(obj)
+@
+
+\noindent To continue with the toy example, we can proceed to a simple PCA.
+NAs are first replaced:
+<<>>=
+objNoNa <- na.replace(obj,met=0)
+objNoNa
+@
+
+\noindent Now the PCA is performed:
+<<pcaaflp,fig=TRUE>>=
+library(ade4)
+pca1 <- dudi.pca(objNoNa,scannf=FALSE,scale=FALSE)
+scatter(pca1)
+@
+
+\noindent More generally, multivariate analyses from ade4, the sPCA (\rcmd{spca}), the
+global and local tests (\rcmd{global.rtest}, \rcmd{local.rtest}), or
+the Monmonier's algorithm (\rcmd{monmonier}) will work without problem
+with AFLP data.
+See more information about non-diploid data in adegenet in the next section.
+
+
+
+
+
+\subsection{Handling non-diploid data}
+As said above, adegenet was primarly suited to handle codominant markers like microsatellites.
+Version 1.2-0 handles different levels of ploidy.
+This is reflected by the fact that \rcmd{genind} objects now have a
+\rcmd{@ploidy} slot, which contains an integer indicating the number
+of copies of each gene.
+Note that to change this value, an integer must be assigned, and not a
+numeric.
+For instance:
+<<changePlo>>=
+obj <- new("genind")
+obj
+is.integer(1)
+obj at ploidy <- as.integer(1)
+obj at ploidy
+@
+
+Basically, most features of adegenet are now available for different
+levels of ploidy.
+The list of functions that do not work with non-diploid data is provided on the
+adegenet website (section 'Documents').
+In general, these functions do not work because they rely on other
+softwares that do not support different levels of ploidy.
+For instance, there will not be any \rcmd{read.genetix} for
+non-diploid data because such data are not handled by the GENETIX software.
+
+
+\textbf{Reading non-diploid} genotypes is done using \rcmd{df2genind}:
+data are read from a genotype x loci data.frame, where each element is
+a character string indicating alleles, possibly separated by a
+particular character.
+This can be illustrated using the toy datasets \texttt{haplo.txt} and
+\texttt{triplo.txt}, available from the adegenet website (section 'Documents').
+
+First we read haploid data.
+<<haplo>>=
+haplo <- read.table("haplo.txt",header=TRUE,colClasses="character")
+haplo
+@
+\noindent Now we build a genind object:
+<<>>=
+obj <- df2genind(haplo,ploidy=1)
+obj
+summary(obj)
+@
+\noindent The warning rightfully tells that entirely non-typed
+individuals and markers that were removed.
+To check that the reading was correct, we can do the converse operation:
+<<>>=
+genind2df(obj)
+@
+\noindent Note that some alleles were wrongly coded (\rcmd{haplo[1,1]}
+was '3' instead of '03') and were automatically recoded cleanly by
+\rcmd{df2genind}.
+\\
+
+The same example can be adapted to triploid data (file
+\texttt{triplo.txt}).
+<<haplo>>=
+triplo <- read.table("triplo.txt",header=TRUE,colClasses="character")
+triplo
+@
+\noindent Note how nasty this file is: some alleles are miscoded ('3'
+instead of '03'), NAs are sometimes repeated ('NA/NA/NA' instead of
+simply 'NA'), etc.
+As far as the right ploidy level and the right separator are specified,
+\rcmd{df2genind} should overcome these problems:
+<<>>=
+obj <- df2genind(triplo,ploidy=3,sep="/")
+obj
+summary(obj)
+genind2df(obj,sep="-")
+@
+\noindent It worked!
+
+
+
+
+
+\subsection{Searching scales of spatial genetic variation: Mantel correlogram}
+One way to investigate the scales of spatial genetic patterns is to
+compute a Mantel correlogram.
+This approach consist in computing Mantel correlation at different
+distance classes.
+The obtained values can be tested using a Monte Carlo procedure, like
+the classical Mantel test (see \rcmd{mantel.randtest} and above
+section about isolation-by-distance).
+\\
+
+The Mantel correlogram is implemented by the function \rcmd{mgram} of the package \texttt{ecodist}.
+We illustrate the procedure using the dataset \texttt{spcaIllus}.
+<<mantCor>>=
+library(ecodist)
+data(spcaIllus)
+@
+
+\noindent First two distance matrices are constructed, from genetic
+data and geographic coordinates:
+<<>>=
+Dgen <- dist(spcaIllus$dat2A$tab)
+Dgeo <- dist(spcaIllus$dat2A$other$xy)
+@
+
+\noindent Now we compute the Mantel correlogram:
+<<>>=
+mantCor <- mgram(Dgen,Dgeo,nperm=499,nclass=8)
+mantCor
+@
+
+\noindent The function \rcmd{mgram} offers several options, that
+should be looked at into details for real application (\rcmd{?mgram}).
+Here, the argument \texttt{nperm} specifies the number of permutations
+of data used in the Monte Carlo testing procedure, and \rcmd{nclass}
+gives the number of distance classes to be used.
+Now we can look at the correlogram:
+<<fig=TRUE>>=
+plot(mantCor)
+@
+
+Plain dots indicate significant correlations.
+The Mantel correlogram shows that genotypes are relatively
+similar at short distance, while this similarity decreases with distance.
+The negative correlation around a distance of 10 indicates the genetic
+differenciation between some patches.
+
+This is consistent with the actual structuring:
+<<fig=TRUE>>=
+xy <- spcaIllus$dat2A$other$xy
+pop <- spcaIllus$dat2A$pop
+levels(pop) <- c("x","o","+")
+col <- pop
+levels(col) <- c("red","blue","green")
+
+plot(xy,pch=as.character(pop), col=as.character(col))
+legend("topright",pch=c("x","o","+"), col=c("red","blue","green"), legend=paste("pop",1:3), bg="grey")
+@
+
+
+
+
+
+\bibliography{biblioTJ}
+\bibliographystyle{jtbnew}
+
+\end{document}
More information about the adegenet-commits
mailing list