[adegenet-commits] r894 - pkg/inst/doc
noreply at r-forge.r-project.org
Tue May 31 12:57:11 CEST 2011
Author: jombart
Date: 2011-05-31 12:57:11 +0200 (Tue, 31 May 2011)
New Revision: 894
Modified:
pkg/inst/doc/adegenet-genomics.Rnw
Log:
SNPbin part done.
Modified: pkg/inst/doc/adegenet-genomics.Rnw
===================================================================
--- pkg/inst/doc/adegenet-genomics.Rnw 2011-05-31 09:46:39 UTC (rev 893)
+++ pkg/inst/doc/adegenet-genomics.Rnw 2011-05-31 10:57:11 UTC (rev 894)
@@ -106,18 +106,123 @@
getClassDef("SNPbin")
@
+The slots respectively contain:
+\begin{description}
+ \item \texttt{snp}: SNP data with specific internal coding.
+ \item \texttt{n.loc}: the number of SNPs stored in the object.
+ \item \texttt{NA.posi}: position of the missing data (NAs).
+ \item \texttt{label}: an optional label for the individual.
+ \item \texttt{ploidy}: the ploidy level of the genome.
+\end{description}
+New objects are created using \texttt{new}, with these slots as arguments.
+If no argument is provided, an empty object is created:
+<<>>=
+new("SNPbin")
+@
+In practice, only the \texttt{snp} information and possibly the ploidy have to be provided; various
+formats are accepted for the \texttt{snp} component, but the simplest is a vector of integers (or
+numerics) indicating the number of copies of the second allele at each locus.
+The argument \texttt{snp}, if provided alone, does not have to be named:
+<<>>=
+x <- new("SNPbin", c(0,1,1,2,0,0,1))
+x
+@
+If not provided, the ploidy is detected from the data and determined as the largest number in the
+input vector. Obviously, in many cases this will not be adequate, but ploidy can always be rectified
+afterwards; for instance:
+<<>>=
+x
+ploidy(x) <- 3
+x
+@
+\noindent The internal coding of the objects is cryptic, and not meant to be accessed directly:
+<<>>=
+x@snp
+@
+Fortunately, data are easily converted back into integers:
+<<>>=
+as.integer(x)
+@
+~\\
+
+The main interest of this representation is its efficiency in terms of storage.
+For instance:
+<<>>=
+dat <- sample(0:1, 1e6, replace=TRUE)
+print(object.size(dat), units="auto")
+x <- new("SNPbin", dat)
+print(object.size(x), units="auto")
+@
+Here, we converted a million SNPs into a \texttt{SNPbin} object, which turns out to be
+\Sexpr{round(object.size(dat)/object.size(x))} times smaller than the original data.
+However, the information in \texttt{dat} and \texttt{x} is strictly identical:
+<<>>=
+identical(as.integer(x),dat)
+@
+This storage is therefore extremely compact, which makes it possible to analyse big
+datasets using standard computers.
+Obviously, usual computations require the data to be coded as numeric values (as opposed to bits) at some point.
+However, in most cases we can proceed by converting only one or two genomes back to numeric values
+at a time, thereby keeping RAM requirements low, albeit at a possible increase in computational time.
+This increase is, however, minimized in two ways: i) conversion routines are optimized for speed using C code;
+ii) smaller objects are handled, which reduces the possibly high computational time taken by memory allocation.
+\\
+
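+As a rough illustration of why bit-level storage is so compact, the following sketch packs a vector
+of 0/1 values into bytes using only base R (\texttt{packBits} and \texttt{rawToBits}).
+This is not the actual internal coding of \texttt{SNPbin}, merely a simplified analogue which assumes
+haploid data and a number of SNPs that is a multiple of 8; the object names are arbitrary.
+<<>>=
+## illustration only: base R bit-packing, not adegenet's internal code
+bits <- sample(0:1, 1e6, replace=TRUE)
+## pack 8 SNPs per byte (the input length must be a multiple of 8)
+packed <- packBits(bits, type="raw")
+length(packed)
+print(object.size(bits), units="auto")
+print(object.size(packed), units="auto")
+## unpacking restores the original integers
+identical(as.integer(rawToBits(packed)), bits)
+@
+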
+While \texttt{SNPbin} objects are the very means by which we store data efficiently, in practice one
+wishes to analyze several genomes at a time.
+This is made possible by the class \texttt{genlight}, which relies on \texttt{SNPbin} but allows for
+storing data from several genomes at a time.
+
+
+
+
%%%%%%%%%%%%%%%%
\subsection{\code{genlight}: storage of multiple genomes}
%%%%%%%%%%%%%%%%
+Like \texttt{SNPbin}, \texttt{genlight} is a formal S4 class.
+The slots of instances of this class are described by:
+<<>>=
+getClassDef("genlight")
+@
+As can be seen, these objects allow for storing further information in addition to vectors of SNP frequencies.
+More precisely, their content is:
+\begin{description}
+ \item \texttt{gen}: SNP data for different individuals, each stored as a \texttt{SNPbin}; loci
+ have to be identical across all individuals.
+ \item \texttt{n.loc}: the number of SNPs stored in the object.
+ \item \texttt{ind.names}: (optional) labels for the individuals.
+ \item \texttt{loc.names}: (optional) labels for the loci.
+ \item \texttt{loc.all}: (optional) alleles of the loci separated by '/' (e.g. 'a/t', 'g/c', etc.).
+ \item \texttt{chromosome}: (optional) a factor indicating the chromosome to which the SNPs belong.
+ \item \texttt{position}: (optional) the position of each SNP on its chromosome.
+ \item \texttt{ploidy}: (optional) the ploidy of each individual.
+ \item \texttt{pop}: (optional) a factor grouping individuals into 'populations'.
+ \item \texttt{other}: (optional) a list containing any supplementary information to be stored with
+ the data.
+\end{description}
+\noindent Like \texttt{SNPbin} objects, \texttt{genlight} objects are created using the constructor \texttt{new},
+providing content for the slots above as arguments.
+When none is provided, an empty object is created:
+<<>>=
+new("genlight")
+@
+The most important information to provide is obviously the genotypes (argument \texttt{gen}); these
+can be provided as:
+\begin{description}
+\item
+\end{description}
+
+
+
%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%
\section{In practice}