[adegenet-commits] r910 - in pkg/inst/doc: . figs

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Tue Jun 14 21:52:28 CEST 2011


Author: jombart
Date: 2011-06-14 21:52:28 +0200 (Tue, 14 Jun 2011)
New Revision: 910

Added:
   pkg/inst/doc/figs/genomics-027.pdf
   pkg/inst/doc/figs/genomics-030.pdf
   pkg/inst/doc/figs/genomics-032.pdf
   pkg/inst/doc/figs/genomics-033.pdf
   pkg/inst/doc/figs/genomics-037.pdf
   pkg/inst/doc/figs/genomics-038.pdf
   pkg/inst/doc/figs/genomics-039.pdf
   pkg/inst/doc/figs/genomics-044.pdf
   pkg/inst/doc/figs/genomics-053.pdf
   pkg/inst/doc/figs/genomics-055.pdf
   pkg/inst/doc/figs/genomics-056.pdf
   pkg/inst/doc/figs/genomics-062.pdf
   pkg/inst/doc/figs/genomics-063.pdf
   pkg/inst/doc/figs/genomics-065.pdf
Modified:
   pkg/inst/doc/adegenet-basics.Rnw
   pkg/inst/doc/adegenet-basics.tex
Log:
Started new version of the adegenet-basics vignette.


Modified: pkg/inst/doc/adegenet-basics.Rnw
===================================================================
--- pkg/inst/doc/adegenet-basics.Rnw	2011-06-14 19:08:49 UTC (rev 909)
+++ pkg/inst/doc/adegenet-basics.Rnw	2011-06-14 19:52:28 UTC (rev 910)
@@ -1,6 +1,6 @@
 \documentclass{article}
-% \VignettePackage{adegenet}
-% \VignetteIndexEntry{adegenet basics: analysing genetic data}
+% \VignettePackage{dapc}
+% \VignetteIndexEntry{An introduction to the adegenet package}
 
 \usepackage{graphicx}
 \usepackage[colorlinks=true,urlcolor=blue]{hyperref}
@@ -8,8 +8,20 @@
 \usepackage{color}
 
 \usepackage[utf8]{inputenc} % for UTF-8/single quotes from sQuote()
+
+
+% for bold symbols in mathmode
+\usepackage{bm}
+
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\beq}{\begin{equation}}
+\newcommand{\eeq}{\end{equation}}
+\newcommand{\m}[1]{\mathbf{#1}}
+
+
+
 \newcommand{\code}[1]{{{\tt #1}}}
-\title{adegenet basics: analysing genetic data}
+\title{An introduction to \textit{adegenet} \Sexpr{packageDescription("adegenet", fields = "Version")}}
 \author{Thibaut Jombart}
 \date{\today}
 
@@ -23,7 +35,9 @@
 \begin{document}
 
 
+\SweaveOpts{prefix.string = figs/base, echo=TRUE, eval=TRUE, fig = FALSE, eps = FALSE, pdf = TRUE}
 
+
 \definecolor{Soutput}{rgb}{0,0,0.56}
 \definecolor{Sinput}{rgb}{0.56,0,0}
 \DefineVerbatimEnvironment{Sinput}{Verbatim}
@@ -34,9 +48,1461 @@
 \color{black}
 
 \maketitle
+
+\begin{abstract}
+  This vignette provides an introductory tutorial to the \textit{adegenet} package \cite{tjart05}
+  for the R software \cite{np145}. This package implements tools to handle, analyse and simulate
+  genetic data.  Originally developped for multiallelic, codominant markers such as microsatellites,
+  \textit{adegenet} now also handles dominant markers, allows for any ploidy in the data, and
+  implements the most memory-efficient storage and handling of genome-wide SNP data. This vignette
+  introduces basic functionalities of the package. Other vignettes are dedicated to specific topics
+  (see Introduction below).
+\end{abstract}
+
+
+\newpage
 \tableofcontents
 
 
 
 
+\newpage
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Introduction}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+This tutorial introduces some basic functionalities of the \textit{adegenet} package for R \cite{np145}.
+The purpose of this package is to provide tools for handling, analysing and simulating genetic
+markers data, with an emphasis on multivariate approaches and exploratory methods.
+Standard multivariate analyses are implemented in the \textit{ade4} package \cite{tj311}, of which
+\textit{adegenet} was originally an extension.
+However, the package has since grown methods of its own such as the Discriminant Analysis of
+Principal Components (DAPC, \cite{tjart19}), the spatial Principal Components Analysis (sPCA,
+\cite{tjart04}), or the \textit{SeqTrack} algorithm \cite{tjart20}.
+
+
+Data can be imported from a wide range of formats, including those of
+popular software (GENETIX, STRUCTURE, Fstat, Genepop), or from simple dataframes of genotypes.
+Polymorphic sites can be extracted from both nucleotide and amino-acid sequences, with special
+methods for handling genome-wide SNPs data with maximum efficiency.
+\\
+
+In this tutorial, we first introduce the \texttt{genind} and \texttt{genpop} classes used to store
+multiallelic markers, and then show how to extract information from these objects using a variety of
+tools.
+Other vignettes are dedicated to some specific topics:
+\begin{itemize}
+\item DAPC: type \texttt{vignette("adegenet-dapc",package='adegenet')} in R to access this vignette.
+\item sPCA: type \texttt{vignette("adegenet-spca",package='adegenet')}
+\item genome-wide SNPs handling and analysis: type \texttt{vignette("adegenet-genomics",package='adegenet')}
+\end{itemize}
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{First steps}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Installing the package}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Current version of the package is \Sexpr{packageDescription("adegenet", fields = "Version")}.
+Please make sure to be using the latest version of R and adegenet
+before sending question about missing functions to the mailing list.
+
+Here, the \textit{adegenet} package is installed along with other recommended packages.
+<<inst,eval=F>>=
+install.packages("adegenet",dep=TRUE)
+@
+Then the first step is to load the package:
+<<>>=
+library(adegenet)
+@
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Object classes}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Two classes of objects are defined, depending on the level at which the genetic information is stored:
+\texttt{genind} is used for individual genotypes, whereas \texttt{genpop} is used for alleles numbers counted by populations.
+Note that the term 'population', here and later, is employed in a broad sense: it simply refers to any grouping of individuals.
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{genind objects}
+% % % % % % % % % % % % % % % % % %
+These objects can be obtained by reading data files from other software,
+from a \texttt{data.frame} of genotypes, by conversion from a table of
+allelic frequencies, or even from aligned DNA sequences (see 'importing data').
+<<genind>>=
+data(nancycats)
+is.genind(nancycats)
+nancycats
+@
+A \texttt{genind} object is formal S4 object with several slots,
+accessed using the '\texttt{@}' operator (see \texttt{class?genind}).
+Note that the '\texttt{\$}' was also implemented for adegenet objects,
+so that slots can be accessed as if they were components of a list.
+The main slot in \texttt{genind} is a table of allelic frequencies of individuals (in rows) for every alleles in every loci.
+Being frequencies, data sum to one per locus, giving the score of 1 for an homozygote and 0.5 for an heterozygote.
+The particular case of presence/absence data will is described in an
+ad-hoc section (see 'Handling presence/absence data').
+For instance:
+<<>>=
+nancycats$tab[10:18,1:10]
+@
+Individual '010' is an homozygote for the allele 09 at locus 1, while '018' is an heterozygote with alleles 06 and 09.
+As user-defined labels are not always valid (for instance, they can
+be duplicated), generic labels are used for individuals, markers, alleles and eventually population.
+The true names are stored in the object (components \texttt{\$[...].names} where ... can be 'ind', 'loc', 'all' or 'pop').
+For instance :
+<<>>=
+nancycats$loc.names
+@
+gives the true marker names, and
+<<>>=
+nancycats$all.names[[3]]
+@
+gives the allele names for marker 3.
+Alternatively, one can use the accessor \texttt{locNames}:
+<<>>=
+locNames(nancycats)
+head(locNames(nancycats, withAlleles=TRUE), 10)
+@
+
+
+\noindent The slot 'ploidy' is an integer giving the level of ploidy
+of the considered organisms (defaults to 2).
+This parameter is essential, in particular when switching from
+individual frequencies (genind object) to allele counts per
+populations (genpop).
+
+\noindent
+The slot 'type' describes the type of marker used: codominant ('codom', e.g. microsatellites) or presence/absence ('PA', e.g. AFLP).
+By default, adegenet considers that markers are codominant.
+Note that actual handling of presence/absence markers has been made available since version 1.2-3.
+See the dedicated section for more information about presence/absence markers.
+
+Optional components are also allowed.
+The slot \texttt{@other} is a list that can include any additionnal information.
+The optional slot \texttt{@pop} (a factor giving a grouping of individuals) is particular in that the behaviour of many functions will check automatically for it and behave accordingly.
+In fact, each time an argument 'pop' is required by a function, it is first seeked in \texttt{@pop}.
+For instance, using the function \texttt{genind2genpop} to convert \texttt{nancycats} to a \texttt{genpop} object, there is no need to give a 'pop' argument as it exists in the \texttt{genind} object:
+<<>>=
+table(nancycats$pop)
+catpop <- genind2genpop(nancycats)
+catpop
+@
+Other additional components can be stored (like here, spatial coordinates of populations in \$xy) but will not be passed during any conversion (\texttt{catpop} has no \$other\$xy).
+
+\noindent Note that the slot 'pop' can be retrieved and set using the \texttt{pop} function:
+<<>>=
+obj <- nancycats[sample(1:50,10)]
+pop(obj)
+pop(obj) <- rep("newPop",10)
+pop(obj)
+@
+
+
+Finally, a \texttt{genind} object generally contains its matched call, \textit{i.e.} the instruction that created it.
+This is not the case, however, for objects loaded using \texttt{data}.
+When call is available, it can be used to regenerate an object.
+<<>>=
+obj <- read.genetix(system.file("files/nancycats.gtx",package="adegenet"))
+obj$call
+toto <- eval(obj$call)
+identical(obj,toto)
+@
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{genpop objects}
+% % % % % % % % % % % % % % % % % %
+We use the previously built \texttt{genpop} object:
+<<>>=
+catpop
+is.genpop(catpop)
+catpop$tab[1:5,1:10]
+@
+The matrix \$tab contains alleles counts per population (here, cat colonies).
+These objects are otherwise very similar to \texttt{genind} in their
+structure, and possess generic names, true names, the matched call and
+an \texttt{@other} slot.
+
+
+
+
+
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Various topics}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Importing data}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{From GENETIX, STRUCTURE, FSTAT, Genepop}
+% % % % % % % % % % % % % % % % % %
+
+Data can be read from the software GENETIX (.gtx), STRUCTURE (.str or
+.stru), FSTAT (.dat) and Genepop (.gen) files, using the corresponding
+\texttt{read} function: \texttt{read.genetix},  \texttt{read.structure},
+\texttt{read.fstat}, and  \texttt{read.genepop}.
+These functions take as main argument the path (as a string character) to an input file, and produce a \texttt{genind} object.
+Alternatively, one can use the function \texttt{import2genind} which detects a file format from its extension and uses the appropriate routine.
+For instance:
+<<import>>=
+obj1 <- read.genetix(system.file("files/nancycats.gtx",package="adegenet"))
+obj2 <- import2genind(system.file("files/nancycats.gtx", package="adegenet"))
+all.equal(obj1,obj2)
+
+@
+
+\noindent The only difference between \texttt{obj1} and \texttt{obj2} is
+their call (which is normal as they were obtained from different
+command lines).
+
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{From other software}
+% % % % % % % % % % % % % % % % % %
+Genetic markers data can most of the time be stored as a table with individuals in row and markers
+in column, where each entry is a character string coding the alleles possessed at one locus.
+Such data are easily imported into R as a \texttt{data.frame}, using for instance \texttt{read.table}
+for text files or \texttt{read.csv} for comma-separated text files.
+Then, the obtained \texttt{data.frame} can be converted into a \texttt{genind} object using \texttt{df2genind}.
+
+There are only a few pre-requisite the data should meet for this conversion to be possible. The
+easiest and clearest way of coding data is using a separator between alleles. For instance,
+"80/78'', "80|78", or "80,78'' are different ways of coding a genotype at a microsatellite locus
+with alleles '80' and 78".
+Note that for haploid data, no separator shall be used.
+As a consequence, SNP data should consist of the raw nucleotides.
+The only contraint when using a separator is that the same separator is used in all the
+dataset. There are no contraints as to i) the type of separator used or ii) the ploidy of the data.
+These parameters can be set in \texttt{df2genind} through arguments 'sep' and 'ploidy', respectively.
+
+Alternatively, no separator may be used provided a fixed number of characters is used to code any allele.
+For instance, in a diploid organism, "0101" is an homozygote 1/1 while "1209" is a heterozygote
+12/09 in a two-character per allele coding scheme.
+In a tetraploid system with one character per allele, "1209" will be understood as 1/2/0/9.
+
+Here, we provide an example using a data set from the library hierfstat.
+<<>>=
+library(hierfstat)
+toto <- read.fstat.data(paste(.path.package("hierfstat"),"/data/diploid.dat",sep="",collapse=""),nloc=5)
+head(toto)
+@
+\texttt{toto} is a data frame containing genotypes and a population factor.
+<<>>=
+obj <- df2genind(X=toto[,-1],pop=toto[,1])
+obj
+@
+\texttt{obj} is a \texttt{genind} containing the same information, but recoded as a matrix of allele
+frequencies (\texttt{\$tab} slot).
+
+
+
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{SNPs data}
+% % % % % % % % % % % % % % % % % %
+In adegenet, SNP data are handled as other codominant markers such as microsatellites.
+The most convenient way to convert SNPs into a \texttt{genind} is using \texttt{df2genind}, which is
+described in the previous section.
+Let \texttt{dat} be an input matrix, as can be read into R using \texttt{read.table} or \texttt{read.csv},
+with genotypes in row and SNP loci in columns.
+<<>>=
+dat <- matrix(sample(c("a","t","g","c"), 15, replace=TRUE),nrow=3)
+rownames(dat) <- paste("genot.", 1:3)
+colnames(dat) <- 1:5
+dat
+obj <- df2genind(dat, ploidy=1)
+truenames(obj)
+@
+
+\texttt{obj} is a \texttt{genind} containing the SNPs information, which can be used for further
+analysis in adegenet.
+
+
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{DNA sequences}
+% % % % % % % % % % % % % % % % % %
+DNA sequences can be read into R using the ape package \cite{tj527,np83}, and
+imported into adegenet using \texttt{DNAbin2genind}.
+There are several ways ape can be used to read in DNA sequences.
+The easiest one is reading data from a usual format such as FASTA or Clustal using \texttt{read.dna}.
+Other options include reading data directly from GenBank using \texttt{read.GenBank}, or from other
+public databases using the seqinr package and transforming the \texttt{alignment} object into a
+\texttt{DNAbin} using \texttt{as.DNAbin}.
+
+Here, we illustrate this approach by re-using the example of \texttt{read.GenBank}. A connection to
+the internet is required, as sequences are read directly from a distant database.
+<<>>=
+library(ape)
+ref <- c("U15717", "U15718", "U15719", "U15720",
+         "U15721", "U15722", "U15723", "U15724")
+myDNA <- read.GenBank(ref)
+myDNA
+class(myDNA)
+summary(myDNA)
+@
+In adegenet, only polymorphic loci are conserved; importing data from a DNA sequence to adegenet
+therefore consist in extracting SNPs from the aligned sequences.
+This conversion is achieved by \texttt{DNAbin2genind}.
+This function allows one to specify a threshold for polymorphism; for instance, one could retain
+only SNPs for which the second largest allele frequency is greater than 1\% (using the \texttt{polyThres} argument).
+This is achieved using:
+<<>>=
+obj <- DNAbin2genind(myDNA, polyThres=0.01)
+obj
+@
+Here, out of the 1045 nucleotides of the sequences, 318 SNPs where extracted and stored as a
+\texttt{genind} object.
+
+
+
+
+
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{Proteic sequences}
+% % % % % % % % % % % % % % % % % %
+Alignments of proteic sequences can be exploited in adegenet in the same way as DNA sequences (see
+section above).
+Alignments are scanned for polymorphic sites, and only those are retained to form a \texttt{genind} object.
+Loci correspond to the position of the residue in the alignment, and alleles correspond to the
+different amino-acids (AA).
+Aligned proteic sequences are stored as objects of class \texttt{alignment} in the \emph{seqinr}
+package \cite{np160}.
+See \texttt{?as.alignment} for a description of this class.
+The function extracting polymorphic sites from \texttt{alignment} objects is \texttt{alignment2genind}
+
+Its use is fairly simple. It is here illustrated using a small dataset of aligned proteic sequences:
+<<seqinr1>>=
+library(seqinr)
+mase.res   <- read.alignment(file = system.file("sequences/test.mase",package = "seqinr"), format = "mase")
+mase.res
+x <- alignment2genind(mase.res)
+x
+@
+The six aligned protein sequences (\texttt{mase.res}) have been scanned for polymorphic sites, and
+these have been extracted to form the \texttt{genind} object \texttt{x}.
+Note that several settings such as the characters corresponding to missing values (i.e., gaps) and
+the for polymorphism threshold for a site to be retained can be specified through the function's
+arguments (see \texttt{?alignment2genind}).
+
+The names of the loci directly provides the indices of polymorphic sites:
+<<>>=
+locNames(x)
+@
+The table of polymorphic sites can be reconstructed easily by:
+<<>>=
+tabAA <- genind2df(x)
+dim(tabAA)
+tabAA[, 1:20]
+@
+The global AA composition of the polymorphic sites is given by:
+<<>>=
+table(unlist(tabAA))
+@
+
+Now that polymorphic sites have been converted into a genind object, simple distances can be
+computed between the sequences.
+Note that adegenet does not implement specific distances for protein sequences, we only use the
+simple Euclidean distance.
+Fancier protein distances are implemented in R; see for instance \texttt{dist.alignment} in the
+\emph{seqinr} package, and \texttt{dist.ml} in the \emph{phangorn} package.
+
+<<>>=
+D <- dist(truenames(x))
+D
+@
+This matrix of distances is small enough for one to interprete the raw numbers.
+However, it is also very straightforward to represent these distances as a tree or in a reduced space.
+We first build a Neighbor-Joining tree using the \emph{ape} package:
+<<njAA, fig=TRUE>>=
+library(ape)
+tre <- nj(D)
+plot(tre, type="unrooted", edge.w=2)
+edgelabels(tex=round(tre$edge.length,1), bg=rgb(.8,.8,1,.8))
+@
+
+The best possible planar representation of these Euclidean distances is achieved by Principal
+Coordinate Analyses (PCoA), which in this case will give identical results to PCA of the original
+(centred, non-scaled) data:
+<<fig=TRUE>>=
+pco1 <- dudi.pco(D, scannf=FALSE,nf=2)
+scatter(pco1, posi="bottomright")
+title("Principal Coordinate Analysis\n-based on proteic distances-")
+@
+
+
+
+
+% % % % % % % % % % % % % % % % % %
+\subsubsection{Using genind/genpop constructors}
+% % % % % % % % % % % % % % % % % %
+Lastly, \texttt{genind} or \texttt{genpop} objects can be constructed from data matrices similar to the \texttt{\$tab} component (respectively, alleles frequencies and alleles counts).
+This is achieved by the constructors \texttt{genind} (or \texttt{as.genind})  and \texttt{genpop}
+(or \texttt{as.genpop}).
+However, these low-level functions are first meant for internal use, and are called for instance by
+functions such as \texttt{read.genetix}.
+Consequently, there is much less control on the arguments and improper specification can lead to
+creating improper \texttt{genind}/\texttt{genpop} objects without issuing a warning or an error, by
+leading to meaningless subsequent analysis.
+
+Therefore, one should use these functions with additional care as to how information is coded.
+The table passed as argument to these constructors must have correct
+names: unique rownames identifying genotypes/populations, and unique colnames
+having the form '[marker].[allele]'.
+
+Here is an example for \texttt{genpop} using a dataset from ade4:
+<<>>=
+library(ade4)
+data(microsatt)
+microsatt$tab[10:15,12:15]
+@
+\texttt{microsatt\$tab} contains alleles counts per populations, and can therefore be used to make a \texttt{genpop} object.
+Moreover, column names are set as required, and row names are unique.
+It is therefore safe to convert these data into a \texttt{genpop} using the constructor:
+<<>>=
+toto <- genpop(microsatt$tab)
+toto
+summary(toto)
+@
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Exporting data}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Genotypes in \texttt{genind} format can be exported to the R packages
+\emph{genetics} (using \texttt{genind2genotype}) and \emph{hierfstat} (using \texttt{genind2hierfstat}).
+The package \emph{genetics} is now deprecated, but the implemented
+class \texttt{genotype} is still used in various packages.
+The package \emph{hierfstat} does not define a class, but requires
+data to be formated in a particular way.
+Here are examples of how to use these functions:
+<<genind2genotype>>=
+obj <- genind2genotype(nancycats)
+class(obj)
+obj[1:4,1:5]
+class(obj$fca8)
+@
+
+<<genind2hierfstat>>=
+obj <- genind2hierfstat(nancycats)
+class(obj)
+obj[1:4,1:5]
+@
+
+\noindent Now we can use the function \texttt{varcomp.glob} from
+\emph{hierfstat} to compute 'variance' components:
+<<echo=FALSE>>=
+detach("package:ape")
+@
+<<>>=
+varcomp.glob(obj$pop,obj[,-1])
+@
+
+
+A more generic way to export data is to produce a data.frame of genotypes
+coded by character strings.
+This is done by \texttt{genind2df}:
+<<genind2df>>=
+obj <- genind2df(nancycats)
+obj[1:5,1:5]
+@
+
+\noindent However, some software will require alleles to be
+separated.
+The argument \texttt{sep} allows one to specify any separator.
+For instance:
+<<>>=
+genind2df(nancycats,sep="|")[1:5,1:5]
+@
+
+Note that tabulations can be obtained as follows using '{$\backslash$}t' character.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Manipulating data}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Data manipulation is meant to be easy in \textit{adegenet} (if it is
+not, complain!).
+First, as \texttt{genind} and \texttt{genpop} objects are basically formed
+by a data matrix (the \texttt{@tab} slot), it is natural to subset these objects like it is done
+with a matrix.
+The \texttt{$[$} operator does this, forming a new object with the retained genotypes/populations and alleles:
+<<>>=
+titi <- toto[1:3,]
+toto$pop.names
+titi
+titi$pop.names
+@
+
+\noindent The object \texttt{toto} has been subsetted, keeping only the
+first three populations.
+Of course, any subsetting available for a matrix can be used with \texttt{genind} and \texttt{genpop} objects.
+For instance, we can subset \texttt{titi} to keep only the third marker:
+<<>>=
+titi <- titi[,titi$loc.fac=="L3"]
+titi
+@
+
+\noindent Now, \texttt{titi} only contains the 11 alleles of the third
+marker of \texttt{toto}.
+\\
+
+To simplify the task of separating data by marker, the function
+\texttt{seploc} can be used.
+It returns a list of objects (optionnaly, of data matrices), each
+corresponding to a marker:
+<<seploc>>=
+sepCats <- seploc(nancycats)
+class(sepCats)
+names(sepCats)
+sepCats$fca45
+@
+
+\noindent The object \texttt{sepCats\$fca45} only contains data of the
+marker fca45.
+\\
+
+Following the same idea, \texttt{seppop} allows one to separate genotypes
+in a \texttt{genind} object by population.
+For instance, we can separate genotype of cattles in the dataset \texttt{microbov}
+by breed:
+<<seppop>>=
+data(microbov)
+obj <- seppop(microbov)
+class(obj)
+names(obj)
+obj$Borgou
+@
+
+\noindent The returned object \texttt{obj} is a list of \texttt{genind}
+objects each containing genotypes of a given breed.
+
+A last, rather vicious trick is to separate data by population and by marker.
+This is easy using \texttt{lapply}; one can first separate population
+then markers, or the contrary.
+Here, we separate markers inside each breed in \texttt{obj}
+<<sepultim>>=
+obj <- lapply(obj,seploc)
+names(obj)
+class(obj$Borgou)
+names(obj$Borgou)
+obj$Borgou$INRA63
+@
+
+For instance, \texttt{obj\$Borgou\$INRA63} contains genotypes of the
+breed Borgou for the marker INRA63.
+\\
+
+Lastly, one may want to pool genotypes in different datasets, but having
+the same markers, into a single dataset.
+This is more than just merging the \texttt{@tab} components of all
+datasets, because alleles can differ (they almost always do) and
+markers are not necessarily sorted the same way.
+The function \texttt{repool} is designed to avoid these problems.
+It can merge any \texttt{genind} provided as arguments as soon as the
+same markers are used.
+For instance, it can be used after a \texttt{seppop} to retain only some populations:
+<<repool>>=
+obj <- seppop(microbov)
+names(obj)
+newObj <- repool(obj$Borgou, obj$Charolais)
+newObj
+newObj$pop.names
+@
+Done !
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Using summaries}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Both \texttt{genind} and \texttt{genpop} objects have a summary providing basic information about data.
+Informations are both printed and invisibly returned as a list.
+
+<<sumry,fig=TRUE,png=FALSE,pdf=TRUE,size=.8>>=
+toto <- summary(nancycats)
+names(toto)
+
+par(mfrow=c(2,2))
+
+plot(toto$pop.eff,toto$pop.nall,xlab="Colonies sample size",ylab="Number of alleles",main="Alleles numbers and sample sizes",type="n")
+text(toto$pop.eff,toto$pop.nall,lab=names(toto$pop.eff))
+
+barplot(toto$loc.nall,ylab="Number of alleles", main="Number of alleles per locus")
+
+barplot(toto$Hexp-toto$Hobs,main="Heterozygosity: expected-observed",ylab="Hexp - Hobs")
+
+barplot(toto$pop.eff,main="Sample sizes per population",ylab="Number of genotypes",las=3)
+@
+
+Is mean observed H significantly lower than mean expected H ?
+
+<<>>=
+bartlett.test(list(toto$Hexp,toto$Hobs))
+t.test(toto$Hexp,toto$Hobs,pair=T,var.equal=TRUE,alter="greater")
+@
+Yes, it is.
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Measuring and testing population structure (a.k.a F statistics)}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Population structure is traditionally measured and tested using F statistics, in particular Fst.
+\emph{adegenet} proposes different tools in this respect: general F statistics (\texttt{fstat}), a test of overall
+population structure (\texttt{gstat.randtest}), and pairwise $Fst$ between all pairs of populations in a
+dataset (\texttt{pairwise.fst}).
+The first two are wrappers for functions implemented in the \emph{hierfstat} package; pairwise Fst
+is implemented in \emph{adegenet}.
+
+We illustrate their use using the dataset of microsatellite of cats from Nancy:
+<<>>=
+library(hierfstat)
+data(nancycats)
+fstat(nancycats)
+@
+This table provides the three F statistics $Fst$ (pop/total), $Fit$ (Ind/total), and $Fis$ (ind/pop).
+These are overall measures which take into account all genotypes and all loci.
+
+Is the structure between populations significant?
+This question can be addressed using the G-statistic test \cite{tj511}; it is implemented for \texttt{genind} objects and produces a \texttt{randtest} object (package ade4).
+<<fig=TRUE>>=
+library(ade4)
+toto <- gstat.randtest(nancycats,nsim=99)
+toto
+plot(toto)
+@
+
+\noindent Yes, it is (the observed value is indicated on the right, while histograms correspond to
+the permuted values).
+Note that \emph{hierfstat} allows for more ellaborated tests, in particular when different levels of
+hierarchical clustering are available.
+Such tests are better done directly in \emph{hierfstat}; for this, \texttt{genind} objects can be
+converted to the adequat format using \texttt{genind2hierfstat}.
+For instance:
+<<>>=
+toto <- genind2hierfstat(nancycats)
+head(toto)
+varcomp.glob(toto$pop,toto[,-1])
+@
+F statistics are provided in \$F; for instance, here, $F_{st}$ is $0.083$.
+
+
+~\\
+Lastly, pairwise $Fst$ is frequently used as a measure of distance between populations.
+The function \texttt{pairwise.fst} computes Nei's estimator \cite{tj814} of pairwise $Fst$, computed as:
+$$
+Fst(A,B) = \frac{H_t - (n_AH_s(A) + n_BH_s(B))/(n_A + n_B)}{Ht}
+$$
+where A and B refer to the two populations of sample size $n_A$ and $n_B$ and respective
+expected heterozygosity $H_s(A)$ and $H_s(B)$, and $H_t$ is the expected heterozygosity in the whole dataset.
+For a given locus, expected heterozygosity is computed as $1 - \sum p_i^2$, where $p_i$ is the
+frequency of the $i$th allele, and the $\sum$ represents summation over all alleles.
+For multilocus data, the heterozygosity is simply averaged over all loci.
+These computations are achieved for all pairs of populations by the function \texttt{pairwise.fst}; we
+illustrate this on a subset of individuals of \texttt{nancycats} (computations for the whole dataset
+would take a few tens of seconds):
+<<>>=
+matFst <- pairwise.fst(nancycats[1:50, treatOther=FALSE])
+matFst
+@
+
+The resulting matrix is Euclidean when there are no missing values:
+<<>>=
+is.euclid(matFst)
+@
+
+It can therefore be used in a Principal Coordinate Analysis (which requires Euclideanity), used to
+build trees, etc.
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Testing for Hardy-Weinberg equilibrium}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The Hardy-Weinberg equilibrium test is implemented for \texttt{genind} objects.
+The function to use is \texttt{HWE.test.genind}, and requires the package \emph{genetics}.
+Here we first produce a matrix of p-values (\texttt{res="matrix"}) using parametric test.
+Monte Carlo procedure are more reliable but also more computer-intensive (use \texttt{permut=TRUE}).
+<<HWE>>=
+toto <- HWE.test.genind(nancycats,res="matrix")
+dim(toto)
+@
+One test is performed per locus and population, \textit{i.e.} 153 tests in this case.
+Thus, the first question is: which tests are highly significant?
+<<>>=
+colnames(toto)
+which(toto<0.0001,TRUE)
+@
+Here, only 4 tests indicate departure from HW.
+Rows give populations, columns give markers.
+Now complete tests are returned, but the significant ones are already known.
+<<>>=
+toto <- HWE.test.genind(nancycats,res="full")
+toto$fca23$P06
+toto$fca90$P10
+toto$fca96$P10
+toto$fca37$P13
+@
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Performing a Principal Component Analysis on \texttt{genind} objects}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The tables contained in \texttt{genind} objects can be submitted to a Principal Component Analysis (PCA) to seek a typology of individuals.
+Such analysis is straightforward using \textit{adegenet} to prepare data and \textit{ade4} for the analysis \textit{per se}.
+One has first to replace missing data.
+Putting each missing observation at the mean of the concerned allele frequency seems the best choice (NA will be stuck at the origin).
+<<pcaexpl>>=
+data(microbov)
+any(is.na(microbov$tab))
+sum(is.na(microbov$tab))
+@
+There are 6325 missing data.
+Assuming that these are evenly distributed (for illustration purpose
+only!), we replace them using \texttt{na.replace}.
+As we intend to use a PCA, the appropriate replacement method is to
+put each NA at the mean of the corresponding allele (argument 'method'
+set to 'mean').
+<<>>=
+obj <- na.replace(microbov,method="mean")
+@
+
+\noindent Done. Now, the analysis can be performed. Data are centred but not scaled as 'units' are the same.
+<<fig=T,size=.4>>=
+pca1 <- dudi.pca(obj$tab,cent=TRUE,scale=FALSE,scannf=FALSE,nf=3)
+barplot(pca1$eig[1:50],main="Eigenvalues")
+@
+Here we represent the genotypes and 95\% inertia ellipses for populations.
+<<fig=T,size=.6>>=
+s.class(pca1$li,obj$pop,lab=obj$pop.names,sub="PCA 1-2",csub=2)
+add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=2,posi="top")
+@
+
+\noindent This plane shows that the main structuring is between African an French breeds, the second structure reflecting genetic diversity among African breeds.
+The third axis reflects the diversity among French breeds:
+Overall, all breeds seem well differentiated.
+<<fig=T,size=.6>>=
+s.class(pca1$li,obj$pop,xax=1,yax=3,lab=obj$pop.names,sub="PCA 1-3",csub=2)
+add.scatter.eig(pca1$eig[1:20],nf=3,xax=1,yax=3,posi="top")
+@
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\subsection{Performing a Correspondance Analysis on \texttt{genpop} objects}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Being contingency tables, the \texttt{@tab} in \texttt{genpop} objects can be submitted to a Correspondance Analysis (CA) to seek a typology of populations.
+The approach is very similar to the previous one for PCA.
+Missing data are first replaced during convertion from \texttt{genind},
+but one could create a \texttt{genpop} with NAs and then use
+\texttt{na.replace} to get rid of missing observations.
+<<caexpl,fig=T,size=.4>>=
+data(microbov)
+obj <- genind2genpop(microbov,missing="chi2")
+ca1 <- dudi.coa(as.data.frame(obj$tab),scannf=FALSE,nf=3)
+barplot(ca1$eig,main="Eigenvalues")
+@
+
+Now we display the resulting typologies:
+<<fig=T,size=.6>>=
+s.label(ca1$li,lab=obj$pop.names,sub="CA 1-2",csub=2)
+add.scatter.eig(ca1$eig,nf=3,xax=1,yax=2,posi="top")
+@
+
+<<fig=T,size=.6>>=
+s.label(ca1$li,xax=1,yax=3,lab=obj$pop.names,sub="CA 1-3",csub=2)
+add.scatter.eig(ca1$eig,nf=3,xax=2,yax=3,posi="bottomright")
+@
+
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/adegenet -r 910


More information about the adegenet-commits mailing list