[Vegan-commits] r1369 - in pkg/vegan/inst: . doc

Sun Nov 14 19:11:09 CET 2010

Author: jarioksa
Date: 2010-11-14 19:10:59 +0100 (Sun, 14 Nov 2010)
New Revision: 1369

Modified:
   pkg/vegan/inst/ChangeLog
   pkg/vegan/inst/doc/decision-vegan.Rnw
Log:
updated decision vignette to r1342 and CANOCO 4
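
Note on the vignette text changed in this revision: the relations it states between singular value decomposition and eigenanalysis of the covariance matrix can be checked in base R. The following is only a minimal sketch and is not part of the commit. The matrix `X` is made-up random data, and all variable names are illustrative assumptions.

```r
# Illustrative data only: a small random matrix, not taken from the vignette
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
n <- nrow(X)

# Centre the columns, as the vignette assumes throughout
Xc <- scale(X, center = TRUE, scale = FALSE)

# Singular value decomposition of the centred matrix: Xc = U diag(sigma) V'
s <- svd(Xc)

# Eigenanalysis of the covariance matrix S
e <- eigen(cov(X), symmetric = TRUE)

# Eigenvalues from singular values: lambda_k = sigma_k^2 / (n - 1)
lambda_from_svd <- s$d^2 / (n - 1)
all.equal(lambda_from_svd, e$values)

# The sum of all eigenvalues equals the sum of the column variances
all.equal(sum(e$values), sum(apply(X, 2, var)))

# The full decomposition reproduces the centred matrix exactly
Xhat <- s$u %*% diag(s$d) %*% t(s$v)
all.equal(Xhat, Xc, check.attributes = FALSE)

# prcomp reports the square roots of the same eigenvalues
all.equal(prcomp(X)$sdev^2, lambda_from_svd)
```

Each `all.equal()` call is expected to return `TRUE` up to numerical tolerance. The singular vectors themselves may differ in sign between methods, as the vignette text notes.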

Modified: pkg/vegan/inst/ChangeLog
===================================================================

--- pkg/vegan/inst/ChangeLog	2010-11-12 10:59:50 UTC (rev 1368)
+++ pkg/vegan/inst/ChangeLog	2010-11-14 18:10:59 UTC (rev 1369)
@@ -4,6 +4,10 @@
 
 Version 1.18-16 (opened November 9, 2010)
 
+	* vignette on design decision: updated for changes in 'const' in
+	scores.rda() in 1.18-15 and for Canoco 4. Now explains 'const'
+	more thoroughly.
+
 	* pcnm: gained argument 'dist.ret' to return the distance matrix
 	on which PCNMs were based.
 

Modified: pkg/vegan/inst/doc/decision-vegan.Rnw
===================================================================
--- pkg/vegan/inst/doc/decision-vegan.Rnw	2010-11-12 10:59:50 UTC (rev 1368)
+++ pkg/vegan/inst/doc/decision-vegan.Rnw	2010-11-14 18:10:59 UTC (rev 1369)
@@ -192,49 +192,50 @@
 
 This chapter discusses the scaling of scores (results) in redundancy
 analysis and principal component analysis performed by function
-\texttt{rda} in the \texttt{vegan} library.  Principal component
-analysis, and hence redundancy analysis, is a variant of singular
-value decomposition (\textsc{svd}).  Functions \texttt{rda} and
-\texttt{prcomp} (library \texttt{mva}) even use \textsc{svd}
-internally in their algorithm.  In \textsc{svd} a centred data matrix
-is decomposed into orthogonal components so that $x_{ij} = \sum_k
-\sigma_k u_{ik} v_{jk}$, where $u_{ik}$ and $v_{jk}$ are orthonormal
-coefficient matrices and $\sigma_k$ are singular values.
-Orthonormality means that sum of squared columns is one and their
-cross-product is zero, or $\sum_i u_{ik}^2 = \sum_j v_{jk}^2 = 1$, and
-$\sum_i u_{ik} u_{il} = \sum_j v_{jk} v_{jl} = 0$ for $k \neq l$. This
-is a decomposition, and the original matrix is found exactly from the
-singular vectors and corresponding singular values, and first two
-singular components give the best rank $=2$ least squares estimate of
-the original matrix.
+\texttt{rda} in the \texttt{vegan} library.  
 
+Principal component analysis, and hence redundancy analysis, is a case
+of singular value decomposition (\textsc{svd}).  Functions
+\texttt{rda} and \texttt{prcomp} even use \textsc{svd} internally in
+their algorithm.
+
+In \textsc{svd} a centred data matrix is decomposed into orthogonal
+components so that $x_{ij} = \sum_k \sigma_k u_{ik} v_{jk}$, where
+$u_{ik}$ and $v_{jk}$ are orthonormal coefficient matrices and
+$\sigma_k$ are singular values.  Orthonormality means that the sum of
+squares of each column is one and the cross-products of different
+columns are zero, or $\sum_i u_{ik}^2 = \sum_j v_{jk}^2 = 1$, and
+$\sum_i u_{ik} u_{il} = \sum_j v_{jk} v_{jl} = 0$ for $k \neq l$.
+This is a decomposition: the original matrix is recovered exactly
+from the singular vectors and singular values, and the first two
+components give the best rank 2 least squares estimate of it.
+
 Principal component analysis is often presented (and performed in
 legacy software) as an eigenanalysis of covariance matrices.  Instead
-of data matrix, we analyse a matrix of covariances and variances
-$\mathbf{S}$.  The result will be orthonormal coefficient matrix
+of a data matrix, we analyse a matrix of covariances and variances
+$\mathbf{S}$.  The result is an orthonormal coefficient matrix
 $\mathbf{U}$ and eigenvalues $\mathbf{\Lambda}$.  The coefficients
 $u_{ik}$ are identical to \textsc{svd} (except for possible sign
 changes), and eigenvalues $\lambda_k$ are related to the corresponding
 singular values by $\lambda_k = \sigma_k^2 /(n-1)$.  With classical
 definitions, the sum of all eigenvalues equals the sum of variances of
 species, or $\sum_k \lambda_k = \sum_j s_j^2$, and it is often said
-that first axes explain a certain maximized proportion of total
-variance in the data.  The other orthonormal matrix $\mathbf{V}$ can
-be found indirectly as well, so that we have the same components in
-both methods.
+that the first axes explain a certain proportion of total variance in the
+data.  The orthonormal matrix $\mathbf{V}$ of \textsc{svd} can be
+found indirectly as well, so that we have the same components in both
+methods.
 
-The coefficients $u_{ik}$ and $v_{jk}$ are of the same (unit) length
-for all axes $k$, but singular values $\sigma_k$ or eigenvalues
-$\lambda_k$ give the information of the importance of axes, or the
-`axis lengths.'  Instead of the orthonormal coefficients, or equal
-length axes, it is customary to use eigenvalues to scale at least one
-of the alternative scores to reflect the importance of axes or
-describe the true configuration of points.  Table \ref{tab:scales}
-shows some alternative scalings used in various software.  These
-alternatives apply to principal components analysis in all cases, and
-in redundancy analysis, they apply to species scores and constraints or
-linear combination scores; weighted averaging scores have somewhat
-wider dispersion.
+The coefficients $u_{ik}$ and $v_{jk}$ are scaled similarly for all
+axes $k$. Singular values $\sigma_k$ or eigenvalues $\lambda_k$ give
+information on the importance of axes, or the `axis lengths.'
+Instead of the orthonormal coefficients, or equal length axes, it is
+customary to scale species (column) or site (row) scores or both by
+eigenvalues to display the importance of axes and to describe the true
+configuration of points.  Table \ref{tab:scales} shows some
+alternative scalings.  These alternatives apply to principal
+components analysis in all cases, and in redundancy analysis, they
+apply to species scores and constraints or linear combination scores;
+weighted averaging scores have somewhat wider dispersion.
 
 \begin{table}
   \caption{\label{tab:scales} Alternative scalings for \textsc{rda} used
@@ -246,7 +247,7 @@
     species standard deviations ($s_j$). In \texttt{rda},
     $\mathrm{const} = \sqrt[4]{(n-1) \sum \lambda_k}$.  Corresponding
     negative scaling in \texttt{vegan}
-    and corresponding positive scaling in \texttt{Canoco} is derived
+    and corresponding positive scaling in \texttt{Canoco 3} is derived by
     dividing each  species by its standard deviation $s_j$ (possibly
     with some additional constant multiplier).  }
 \begin{tabular}{lcc}
@@ -269,14 +270,14 @@
 $u_{ik}^*$ &
 $\sqrt{\sum \lambda_k /(n-1)} s_j^{-1} v_{jk}^*$
 \\
-\texttt{Canoco, scaling=-1} &
+\texttt{Canoco 3, scaling=-1} &
 $u_{ik} \sqrt{n} \sqrt{\lambda_k / \sum \lambda_k}$ &
 $v_{jk} \sqrt{n}$ \\
-\texttt{Canoco, scaling=-2} &
+\texttt{Canoco 3, scaling=-2} &
 $u_{ik} \sqrt{n}$ &
 $v_{jk} \sqrt{n} \sqrt{\lambda_k / \sum \lambda_k}$
 \\
-\texttt{Canoco, scaling=-3} &
+\texttt{Canoco 3, scaling=-3} &
 $u_{ik} \sqrt{n} \sqrt[4]{\lambda_k / \sum \lambda_k}$ &
 $v_{jk} \sqrt{n} \sqrt[4]{\lambda_k / \sum \lambda_k}$
 \end{tabular}
@@ -288,38 +289,61 @@
 is called a biplot.  The graph is a biplot if the transformed scores
 satisfy $x_{ij} = c \sum_k u_{ik}^* v_{jk}^*$ where $c$ is a scaling
 constant.  In functions \texttt{princomp}, \texttt{prcomp} and
-\texttt{rda}, $c=1$ or the plotting scores are the straight biplot
-scores so that the singular values (or eigenvalues) are expressed for
-sites, and species are left unscaled.  For \texttt{Canoco} $c = n^{-1}
-\sqrt{n-1} \sqrt{\sum \lambda_k}$ with positive \texttt{Canoco}
-scaling values. All these $c$ are constants for a matrix, so these are
-all biplots with different internal scaling of species and site scores
+\texttt{rda}, $c=1$ and the plotted scores are a biplot so that the
+singular values (or eigenvalues) are expressed for sites, and species
+are left unscaled.  For \texttt{Canoco 3} $c = n^{-1} \sqrt{n-1}
+\sqrt{\sum \lambda_k}$ with negative \texttt{Canoco} scaling
+values. All these $c$ are constants for a matrix, so these are all
+biplots with different internal scaling of species and site scores
 with respect to each other.  For \texttt{Canoco} with positive scaling
 values and \texttt{vegan} with negative scaling values, no constant
 $c$ can be found, but the correction is dependent on species standard
-deviations $s_j$, so this alternative does not define a biplot.
+deviations $s_j$, and these scores do not define a biplot.
 
 There is no natural way of scaling species and site scores to each
-other, but all functions and programs above selected different
-strategies.  The eigenvalues in redundancy and principal components
-analysis are scale dependent and change when the the data are
+other.  The eigenvalues in redundancy and principal components
+analysis are scale-dependent and change when the data are
 multiplied by a constant.  If we have percent cover data, the
 eigenvalues are typically very high, and the scores scaled by
 eigenvalues will have much wider dispersion than the orthonormal set.
-If we express the percentages as proportions, or divide the matrix by
+If we express the percentages as proportions, and divide the matrix by
 $100$, the eigenvalues will be reduced by factor $100^2$, and the
-scores scaled by eigenvalues will have much narrower dispersion than
-the orthonormal set.  For graphical biplots we should be able to fix
-the relation and make it invariant for scale changes.  The solution
-adoption in the R standard function \texttt{biplot.princomp} is to
-scale site and species scores independently, and typically very
-differently, but plot each with separate scales so that both sets fill
-the graph area.  The solution in \texttt{Canoco} and \texttt{rda} is
-to use proportional eigenvalues $\lambda_k / \sum \lambda_k$ instead
-of original eigenvalues.  These proportions are invariant with scale
-changes, and typically they have a nice range for plotting two data
-sets in the same graph.
+scores scaled by eigenvalues will have a narrower dispersion.  For
+graphical biplots we should be able to fix the relations of row and
+column scores to be invariant against scaling of data.  The solution
+in the standard R function \texttt{biplot} is to scale site and
+species scores independently, and typically very differently, but
+plot each independently to fill the graph area.  The solution in
+\texttt{Canoco} and \texttt{rda} is to use proportional eigenvalues
+$\lambda_k / \sum \lambda_k$ instead of original eigenvalues.  These
+proportions are invariant under scale changes, and typically they have
+a nice range for plotting two data sets in the same graph.
 
+The \textbf{vegan} package uses a scaling constant $c = \sqrt[4]{(n-1)
+  \sum \lambda_k}$ in order to be able to use scaling by proportional
+eigenvalues (like in \texttt{Canoco}) and still have a biplot
+scaling. Because of this, the scaling of \texttt{rda} scores is
+non-standard. However, the \texttt{scores} function lets you set the
+scaling constant to any desired value. It is also possible to have
+two separate scaling constants: the first for the species, and the
+second for sites and related scores. This makes it possible to
+reproduce the scores of other software or R functions (Table \ref{tab:rdaconst}).
+\begin{table}
+  \caption{\label{tab:rdaconst} Values of the \texttt{const} argument in
+    \textbf{vegan} to get the scores that are equal to those from
+    other functions and software. Number of sites (rows) is $n$, 
+    the number of species (columns) is $m$, and the sum of all
+    eigenvalues is $\sum_k \lambda_k$ (this is saved as the item
+    \texttt{tot.chi} in the \texttt{rda} result).}
+\begin{tabular}{lccc}
+& \textbf{Scaling} &\textbf{Species constant} & \textbf{Site constant} \\
+\texttt{vegan} & any  & $\sqrt[4]{(n-1) \sum \lambda_k}$ & $\sqrt[4]{(n-1) \sum \lambda_k}$\\
+\texttt{prcomp}, \texttt{princomp} & \texttt{1} & $1$ & $\sqrt{(n-1) \sum_k \lambda_k}$\\
+\texttt{Canoco 3} & \texttt{-1, -2, -3} & $\sqrt{n-1}$ & $\sqrt{n}$\\
+\texttt{Canoco 4} & \texttt{-1, -2, -3} & $\sqrt{m}$ & $\sqrt{n}$
+\end{tabular}
+\end{table}
+
 In this chapter, I used always centred data matrices.  In principle
 \textsc{svd} could be done with original, non-centred data, but
 there is no option for this in \texttt{rda}, because I think that