[adegenet-commits] r914 - pkg/inst/doc

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Thu Jun 16 16:31:17 CEST 2011


Author: jombart
Date: 2011-06-16 16:31:16 +0200 (Thu, 16 Jun 2011)
New Revision: 914

Modified:
   pkg/inst/doc/adegenet-basics.Rnw
Log:
Moving forward. Redone new part on PCA in basics.


Modified: pkg/inst/doc/adegenet-basics.Rnw
===================================================================
--- pkg/inst/doc/adegenet-basics.Rnw	2011-06-15 16:18:02 UTC (rev 913)
+++ pkg/inst/doc/adegenet-basics.Rnw	2011-06-16 14:31:16 UTC (rev 914)
@@ -444,7 +444,7 @@
 
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Inporting data from other software}
+\subsection{Importing data from other software}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Genetic markers data can most of the time be stored as a table with individuals in row and markers
 in column, where each entry is a character string coding the alleles possessed at one locus.
@@ -518,7 +518,7 @@
 from the adegenet website, section 'Documentation':
 <<aflpread>>=
 dat <- read.table("http://adegenet.r-forge.r-project.org/files/AFLP.txt",header=TRUE)
- dat
+dat
 @
 \noindent The function \texttt{df2genind} is used to obtain a genind object:
 <<>>=
@@ -951,14 +951,47 @@
 
 
 
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Testing for Hardy-Weinberg equilibrium}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The Hardy-Weinberg equilibrium test is implemented for \texttt{genind} objects.
+The function to use is \texttt{HWE.test.genind}, and requires the package \emph{genetics}.
+Here we first produce a matrix of p-values (\texttt{res="matrix"}) using parametric test.
+Monte Carlo procedure are more reliable but also more computer-intensive (use \texttt{permut=TRUE}).
+<<HWE>>=
+toto <- HWE.test.genind(nancycats,res="matrix")
+dim(toto)
+@
+One test is performed per locus and population, \textit{i.e.} 153 tests in this case.
+Thus, the first question is: which tests are highly significant?
+<<>>=
+colnames(toto)
+which(toto<0.0001,TRUE)
+@
+Here, only 4 tests indicate departure from HW.
+Rows give populations, columns give markers.
+Now complete tests are returned, but the significant ones are already known.
+<<>>=
+toto <- HWE.test.genind(nancycats,res="full")
+toto$fca23$P06
+toto$fca90$P10
+toto$fca96$P10
+toto$fca37$P13
+@
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Measuring and testing population structure (a.k.a F statistics)}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 Population structure is traditionally measured and tested using F statistics, in particular Fst.
 Since version 2.13.0 of R, the package \emph{hierfstat}, which implemented most F statistics and
 related tests, has been removed from CRAN for maintenance issues.
 As a consequence, \emph{adegenet} has lost a few functionalities, namely general F statistics
-(function \texttt{fstat}), a test of overall population structure (\texttt{gstat.randtest}).
+(function \texttt{fstat}) and a test of overall population structure (\texttt{gstat.randtest}).
 \\
 
 
@@ -1033,36 +1066,8 @@
 
 
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Testing for Hardy-Weinberg equilibrium}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-The Hardy-Weinberg equilibrium test is implemented for \texttt{genind} objects.
-The function to use is \texttt{HWE.test.genind}, and requires the package \emph{genetics}.
-Here we first produce a matrix of p-values (\texttt{res="matrix"}) using parametric test.
-Monte Carlo procedure are more reliable but also more computer-intensive (use \texttt{permut=TRUE}).
-<<HWE>>=
-toto <- HWE.test.genind(nancycats,res="matrix")
-dim(toto)
-@
-One test is performed per locus and population, \textit{i.e.} 153 tests in this case.
-Thus, the first question is: which tests are highly significant?
-<<>>=
-colnames(toto)
-which(toto<0.0001,TRUE)
-@
-Here, only 4 tests indicate departure from HW.
-Rows give populations, columns give markers.
-Now complete tests are returned, but the significant ones are already known.
-<<>>=
-toto <- HWE.test.genind(nancycats,res="full")
-toto$fca23$P06
-toto$fca90$P10
-toto$fca96$P10
-toto$fca37$P13
-@
 
 
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Estimating inbreeding}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -1147,6 +1152,91 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{General overview}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+Multivariate analysis consists in summarising a strongly multivariate information into a few
+synthetic variables.
+In genetics, such approaches are useful to get a simplified picture of the genetic diversity
+obersved amongst individuals or populations.
+A review of multivariate analysis in population genetics can be found in \cite{tjart10}.
+Here, we aim at providing an overview of some applications using methods implemented in \textit{ade4}
+and \textit{adegenet}.
+\\
+
+
+Useful functions include:
+\begin{itemize}
+  \item \texttt{scaleGen} (\textit{adegenet}): centre/scale allele frequencies and replaces missing
+    data; useful, among other things, before running a principal component analysis (PCA).
+  \item \texttt{dudi.pca} (\textit{ade4}): implements PCA; can be used on transformed allele
+    frequencies of individuals or populations.
+  \item \texttt{dudi.ca} (\textit{ade4}): implements Correspondance Analysis (CA); can be used on raw
+    allele counts of populations (\texttt{@tab} slot in \texttt{genpop} objects).
+  \item \texttt{dist.genpop} (\textit{adegenet}): implements 5 pairwise genetic distances between populations
+  \item \texttt{pairwise.fst} (\textit{adegenet}): implements pairwise $F_{ST}$, which is also a
+    Euclidean distance between populations.
+  \item \texttt{dist} (\textit{stats}): computes pairwise distances between multivariate
+    observations; can be used on raw or transformed allele frequencies.
+  \item \texttt{dudi.pco} (\textit{ade4}): implements Principal Coordinates Analysis (PCoA); this
+    methods finds synthetic variables which summarize a Euclidean distance matrix as best as
+    possible; can be used on outputs of \texttt{dist}, \texttt{dist.genpop}, and \texttt{pairwise.fst}.
+  \item \texttt{is.euclid} (\textit{ade4}): tests whether a distance matrix is Euclidean, which is a
+    pre-requisite of PCoA.
+  \item \texttt{cailliez} (\textit{ade4}): renders a non-Euclidean distance matrix Euclidean by
+    adding a constant to all entries.
+  \item \texttt{dapc} (\textit{adegenet}): implements the Discriminant Analysis of Principal
+    Components (DAPC \cite{tjart19}), a new standard method for the analysis of population genetic structures; see
+    dedicated vignette (\textit{adegenet-dapc}).
+  \item \texttt{sPCA} (\textit{adegenet}): implements the spatial Principal Component Analysis
+    (sPCA \cite{tjart04}), a method for the analysis of spatial genetic structures; see dedicated vignette (\textit{adegenet-dapc}).
+  \item \texttt{glPca} (\textit{adegenet}): implements PCA for genome-wide SNP data stored as
+    \texttt{genlight} objects; see dedicated vignette (\textit{adegenet-genomics}).
+    %% \item \texttt{} (\textit{}):
+\end{itemize}
+
+Besides the procedures themselves, graphic functions are also often of the utmost importance; these include:
+\begin{itemize}
+  \item \texttt{scatter} (\textit{ade4,adegenet}): generic function to display multivariate
+    analyses; in practice, the only useful application for genetic data is the one implemented in
+    \texttt{adegenet} for DAPC results.
+  \item \texttt{s.label} (\textit{ade4}): function used for basic display of principal components.
+  \item \texttt{loadingplot} (\textit{adegenet}): function used to display the loadings (i.e.,
+    contribution to a given structure) of alleles for a given principal component; annotates and returns the most
+    contributing alleles.
+  \item \texttt{s.class} (\textit{ade4}): displays two quantitative variables with known groups of
+    observations, using inertia ellipses for the groups; useful to represent principal components
+    when groups are known.
+  \item \texttt{s.chull} (\textit{ade4}): same as \texttt{s.class}, except convex polygons are used
+    rather than ellipses.
+  \item \texttt{s.value} (\textit{ade4}): graphical display of a quantitative variable distributed
+    over a two-dimensional space; useful to map principal components or allele frequencies over a
+    geographic area.
+  \item \texttt{colorplot} (\textit{adegenet}): graphical display of 1 to 3 quantitative variables distributed
+    over a two-dimensional space; useful for combined representations of principal components over a geographic area.
+    Can also be used to produce color versions of traditional scatterplots.
+  \item \texttt{transp} (\textit{adegenet}): auxiliary function making colors transparent.
+  \item \texttt{num2col} (\textit{adegenet}): auxiliary function transforming a quantitative
+    variable into colors using a given palette.
+  \item \texttt{assignplot} (\textit{adegenet}): specific plot of group membership probabilities for
+    DAPC; see dedicated vignette (\textit{adegenet-dapc}).
+  \item \texttt{compoplot} (\textit{adegenet}): specific 'STRUCTURE-like' plot of group membership probabilities for
+    DAPC; see dedicated vignette (\textit{adegenet-dapc}).
+  \item \texttt{add.scatter} (\textit{ade4}): add inset plots to an existing figure.
+  %% \item \texttt{} (\textit{}): .
+  %% \item \texttt{} (\textit{}): .
+  %% \item \texttt{} (\textit{}): .
+  %% \item \texttt{} (\textit{}): .
+  %% \item \texttt{} (\textit{}): .
+\end{itemize}
+
+\noindent In the sections below, we briefly illustrate how these tools can be combined to extract information from
+genetic data.
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Performing a Principal Component Analysis on \texttt{genind} objects}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 The tables contained in \texttt{genind} objects can be submitted to a Principal Component Analysis (PCA) to seek a typology of individuals.
@@ -1168,33 +1258,94 @@
 obj <- na.replace(microbov,method="mean")
 @
 
-\noindent Done. Now, the analysis can be performed. Data are centred but not scaled as 'units' are the same.
+\noindent Note that alternatively, we could have used \texttt{scaleGen} to obtain centred/scaled
+allele frequencies and replace NA.
+\\
+
+The analysis can now be performed.
+Data are centred but not scaled as 'units' are the same.
+Note: in practice, retained axes can be chosen interactively by removing the arguments \texttt{scannf=FALSE,nf=3}.
 <<fig=T>>=
 pca1 <- dudi.pca(obj$tab,cent=TRUE,scale=FALSE,scannf=FALSE,nf=3)
-barplot(pca1$eig[1:50],main="Eigenvalues")
+barplot(pca1$eig[1:50],main="PCA eigenvalues", col=heat.colors(50))
 @
-Here we represent the genotypes and 95\% inertia ellipses for populations.
+<<>>=
+pca1
+@
+The output object \texttt{pca1} is a list containing various information; of particular interest are:
+\begin{itemize}
+\item \texttt{\$eig}: the eigenvalues of the analysis, indicating the amount of variance represented
+  by each principal component (PC).
+\item \texttt{\$li}: the principal components of the analysis; these are the synthetic variables
+  summarizing the genetic diversity, usually visualized using scatterplots.
+\item \texttt{\$c1}: the allele loadings, used to compute linear combinations forming the PCs;
+  squared, they represent the contribution to each PCs.
+\end{itemize}
+
+The basic scatterplot for this analysis can be obtained by:
+<<fig=TRUE>>=
+s.label(pca1$li)
+title("PCA of microbov dataset\naxes 1-2")
+add.scatter.eig(pca1$eig[1:20], 3,1,2)
+@
+
+However, this figure can largely be improved.
+First, we can use \texttt{s.class} to represent both the genotypes and inertia ellipses for populations.
 <<fig=T>>=
-s.class(pca1$li,obj$pop,lab=obj$pop.names,sub="PCA 1-2",csub=2)
-add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=2,posi="top")
+s.class(pca1$li, pop(obj),lab=obj$pop.names)
+title("PCA of microbov dataset\naxes 1-2")
+add.scatter.eig(pca1$eig[1:20], 3,1,2)
 @
 
 \noindent This plane shows that the main structuring is between African an French breeds, the second structure reflecting genetic diversity among African breeds.
 The third axis reflects the diversity among French breeds:
-Overall, all breeds seem well differentiated.
 <<fig=T>>=
-s.class(pca1$li,obj$pop,xax=1,yax=3,lab=obj$pop.names,sub="PCA 1-3",csub=2)
-add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=3,posi="top")
+s.class(pca1$li,pop(obj),xax=1,yax=3,lab=obj$pop.names,sub="PCA 1-3",csub=2)
+title("PCA of microbov dataset\naxes 1-3")
+add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=3)
 @
+Overall, all breeds seem well differentiated.
+~\\
 
+However, we can yet improve these scatterplots, which are fortunately easy to customize.
+For instance, we can remove the grid, choose different colors for the groups, use larger dots and transparency to
+better assess the density of points, and remove internal segments of the ellipses:
+<<fig=TRUE>>=
+col <- rainbow(length(levels(pop(obj))))
+s.class(pca1$li, pop(obj),xax=1,yax=3, col=transp(col,.6), axesell=FALSE, cstar=0, cpoint=3, grid=FALSE)
+@
+~\\
 
 
+Let us now assume that we ignore the group memberships.
+We can still use color in an informative way.
+For instance, we can recode the principal components represented in the scatterplot on the RGB
+scale:
+<<fig=TRUE>>=
+colorplot(pca1$li, pca1$li, transp=TRUE, cex=3, xlab="PC 1", ylab="PC 2")
+title("PCA of microbov dataset\naxes 1-2")
+abline(v=0,h=0,col="grey", lty=2)
+@
+Colors are based on the first three PCs of the PCA, recoded respectively on the red, green, and blue channel.
+In this figure, the genetic diversity is represented in two redundant ways: by the distances
+(further away = more genetically different), and by the colors (more different colors = more
+genetically different).
 
+We can represent the diversity on the third axis similarly:
+<<fig=TRUE>>=
+colorplot(pca1$li[c(1,3)], pca1$li, transp=TRUE, cex=3, xlab="PC 1", ylab="PC 3")
+title("PCA of microbov dataset\naxes 1-3")
+abline(v=0,h=0,col="grey", lty=2)
+@
 
+
+
+
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \subsection{Performing a Correspondance Analysis on \texttt{genpop} objects}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Being contingency tables, the \texttt{@tab} in \texttt{genpop} objects can be submitted to a Correspondance Analysis (CA) to seek a typology of populations.
+Being contingency tables, the \texttt{@tab} slot in \texttt{genpop} objects can be submitted to a Correspondance Analysis (CA) to seek a typology of populations.
 The approach is very similar to the previous one for PCA.
 Missing data are first replaced during convertion from \texttt{genind},
 but one could create a \texttt{genpop} with NAs and then use
@@ -1203,70 +1354,75 @@
 data(microbov)
 obj <- genind2genpop(microbov,missing="chi2")
 ca1 <- dudi.coa(as.data.frame(obj$tab),scannf=FALSE,nf=3)
-barplot(ca1$eig,main="Eigenvalues")
+barplot(ca1$eig,main="Correspondance Analysis eigenvalues", col=heat.colors(length(ca1$eig)))
 @
 
-Now we display the resulting typologies:
+Now we display the resulting typology using a basic scatterplot:
 <<fig=T>>=
 s.label(ca1$li,lab=obj$pop.names,sub="CA 1-2",csub=2)
-add.scatter.eig(ca1$eig,nf=3,xax=1,yax=2,posi="top")
+add.scatter.eig(ca1$eig,nf=3,xax=1,yax=2,posi="bottomright")
 @
 
 <<fig=T>>=
 s.label(ca1$li,xax=1,yax=3,lab=obj$pop.names,sub="CA 1-3",csub=2)
-add.scatter.eig(ca1$eig,nf=3,xax=2,yax=3,posi="bottomright")
+add.scatter.eig(ca1$eig,nf=3,xax=2,yax=3,posi="topleft")
 @
 
-\noindent Once again, axes are to be interpreted separately in terms of continental differentiation, a among-breed diversities.
+\noindent As in the PCA above, axes are to be interpreted separately in terms of continental differentiation, and between-breeds diversity.
+Importantly, as in any analysis carried out at a population level, all information about the
+diversity within populations is lost in this analysis.
+See the vignette on DAPC for an individual-based approach which is nontheless optimal in terms of
+group separation (\textit{adegenet-dapc}).
 
 
 
 
+\SweaveOpts{eval=FALSE}
+%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% \subsection{Analyzing a single locus}
+%% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%% Here the emphasis is put on analyzing a single locus using different methods.
+%% Any marker can be isolated using the \texttt{seploc} instruction.
+%% <<pca>>=
+%% data(nancycats)
+%% toto <- seploc(nancycats,truenames=TRUE, res.type="matrix")
+%% X <- toto$fca90
+%% @
+%% \texttt{fca90.ind} is a matrix containing only genotypes for the marker fca90.
+%% It can be analyzed, for instance, using an inter-class PCA.
+%% This analyzis provides a typology of individuals having maximal inter-colonies variance.
+%% <<fig=T,png=F,pdf=T>>=
+%% library(ade4)
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\subsection{Analyzing a single locus}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-Here the emphasis is put on analyzing a single locus using different methods.
-Any marker can be isolated using the \texttt{seploc} instruction.
-<<pca>>=
-data(nancycats)
-toto <- seploc(nancycats,truenames=TRUE, res.type="matrix")
-X <- toto$fca90
-@
-\texttt{fca90.ind} is a matrix containing only genotypes for the marker fca90.
-It can be analyzed, for instance, using an inter-class PCA.
-This analyzis provides a typology of individuals having maximal inter-colonies variance.
-<<fig=T,png=F,pdf=T>>=
-library(ade4)
+%% pcaX <- dudi.pca(X,cent=T,scale=F,scannf=FALSE)
+%% pcabetX <- between(pcaX,nancycats$pop,scannf=FALSE)
+%% s.arrow(pcabetX$c1,xlim=c(-.9,.9))
+%% s.class(pcabetX$ls,nancycats$pop,cell=0,cstar=0,add.p=T)
+%% sunflowerplot(X %*% as.matrix(pcabetX$c1),add=T)
+%% add.scatter.eig(pcabetX$eig,xax=1,yax=2,posi="bottomright")
+%% @
+%% Here the differences between individuals are mainly expressed by three alleles: 199, 197 and 193.
+%% However, there is no clear structuration to be seen at an individual level.
+%% Is $F_{st}$ significant taking only this marker into account?
+%% We perform the G-statistic test and enventually compute the corresponding F statistics.
+%% Note that we use the constructor \texttt{genind} to generate an object
+%% of this class from \texttt{X}:
+%% <<>>=
+%% fca90.ind <- genind(X,pop=nancycats$pop)
+%% gstat.randtest(fca90.ind,nsim=999)
+%% F <- varcomp(genind2hierfstat(fca90.ind))$F
+%% rownames(F) <- c("tot","pop")
+%% colnames(F) <- c("pop","ind")
+%% F
+%% @
+%% In this case the information is best summarized by F statistics than by an ordination method.
+%% It is likely because all colonies are differentiated but none forming clusters of related colonies.
 
-pcaX <- dudi.pca(X,cent=T,scale=F,scannf=FALSE)
-pcabetX <- between(pcaX,nancycats$pop,scannf=FALSE)
-s.arrow(pcabetX$c1,xlim=c(-.9,.9))
-s.class(pcabetX$ls,nancycats$pop,cell=0,cstar=0,add.p=T)
-sunflowerplot(X %*% as.matrix(pcabetX$c1),add=T)
-add.scatter.eig(pcabetX$eig,xax=1,yax=2,posi="bottomright")
-@
-Here the differences between individuals are mainly expressed by three alleles: 199, 197 and 193.
-However, there is no clear structuration to be seen at an individual level.
-Is $F_{st}$ significant taking only this marker into account?
-We perform the G-statistic test and enventually compute the corresponding F statistics.
-Note that we use the constructor \texttt{genind} to generate an object
-of this class from \texttt{X}:
-<<>>=
-fca90.ind <- genind(X,pop=nancycats$pop)
-gstat.randtest(fca90.ind,nsim=999)
-F <- varcomp(genind2hierfstat(fca90.ind))$F
-rownames(F) <- c("tot","pop")
-colnames(F) <- c("pop","ind")
-F
-@
-In this case the information is best summarized by F statistics than by an ordination method.
-It is likely because all colonies are differentiated but none forming clusters of related colonies.
 
 
+\SweaveOpts{eval=TRUE}
 
 
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Spatial analysis}
@@ -1528,6 +1684,14 @@
   statistical computing. R Foundation for Statistical Computing,
   Vienna, Austria. ISBN 3-900051-07-0.
 
+\bibitem{tjart10}
+  Jombart T, Pontier D and Dufour A-B (2009) Genetic markers in the playground of multivariate analysis.
+  \textit{Heredity} 102: 330-341.
+
+\bibitem{tjart04}
+Jombart T, Devillard S, Dufour A-B and Pontier D (2008) Revealing cryptic spatial patterns in genetic variability by a new multivariate method.
+\textit{Heredity} 101: 92-103.
+
 \end{thebibliography}
 
 \end{document}



More information about the adegenet-commits mailing list