[adegenet-commits] r553 - pkg/man
noreply at r-forge.r-project.org
Mon Feb 8 15:39:53 CET 2010
Author: jombart
Date: 2010-02-08 15:39:53 +0100 (Mon, 08 Feb 2010)
New Revision: 553
Modified:
pkg/man/find.clusters.Rd
Log:
Almost done with find.clusters documentation.
Modified: pkg/man/find.clusters.Rd
===================================================================
--- pkg/man/find.clusters.Rd 2010-02-08 12:19:19 UTC (rev 552)
+++ pkg/man/find.clusters.Rd 2010-02-08 14:39:53 UTC (rev 553)
@@ -4,88 +4,128 @@
\alias{find.clusters.data.frame}
\alias{find.clusters.matrix}
\alias{find.clusters.genind}
-\title{}
-\description{ == IN PROGRESS ==
- These functions implement the Discriminant Analysis of Principal Components
- (FIND.CLUSTERS). See 'details' section for a succint description of the method. FIND.CLUSTERS
- implementation calls upon \code{dudi.pca} from the \code{ade4} package and
- \code{lda} from the \code{MASS} package.
+\alias{.find.sub.clusters}
+\title{find.clusters: cluster identification using successive K-means}
+\description{
+ These functions implement the clustering procedure used in Discriminant
+ Analysis of Principal Components (DAPC, Jombart et al. submitted). This
+ procedure consists in running successive K-means with an increasing number of
+ clusters (\code{k}), after transforming data using a principal component
+ analysis (PCA). For each model, a statistical measure of goodness of fit (by
+ default, BIC) is computed, which allows the optimal \code{k} to be chosen. See
+ \code{details} for a description of how to select the optimal \code{k}.
- \code{find.clusters} performs the FIND.CLUSTERS on a \code{data.frame}, a \code{matrix}, or a
- \code{\linkS4class{genind}} object, and returns an object with class
- \code{find.clusters}. If data are stored in a \code{data.frame} or a \code{matrix},
- these have to be quantitative data (i.e., \code{numeric} or \code{integers}),
- as opposed to \code{characters} or \code{factors}.
+ Optionally, hierarchical clustering can be sought by providing a prior
+ clustering of individuals (argument \code{clust}). In such a case, clusters will
+ be sought within each prior group.
+ \code{.find.sub.clusters} is a hidden function called in some instances of
+ \code{find.clusters}, and should not be called directly by the user.
+
+ The K-means procedure used in \code{find.clusters} is the \code{kmeans} function
+ from the \code{stats} package. The PCA function is \code{dudi.pca} from the
+ \code{ade4} package.
}
\usage{
-\method{find.clusters}{data.frame}()
+\method{find.clusters}{data.frame}(x, clust=NULL, n.pca=NULL, n.clust=NULL, stat=c("BIC",
+ "AIC", "WSS"), choose.n.clust=TRUE,
+ criterion=c("min","diff", "conserv"),
+ max.n.clust=round(nrow(x)/10), n.iter=1e3,
+ n.start=10, center=TRUE, scale=TRUE)
-\method{find.clusters}{matrix}()
+\method{find.clusters}{matrix}(x, \ldots)
-\method{find.clusters}{genind}()
+\method{find.clusters}{genind}(x, clust=NULL, n.pca=NULL, n.clust=NULL, stat=c("BIC",
+ "AIC", "WSS"), choose.n.clust=TRUE, criterion=c("min","diff",
+ "conserv"), max.n.clust=round(nrow(x@tab)/10), n.iter=1e3,
+ n.start=10, scale=FALSE, scale.method=c("sigma", "binom"),
+ truenames=TRUE, \ldots)
}
\arguments{
\item{x}{a \code{data.frame}, \code{matrix}, or \code{\linkS4class{genind}}
object. For the \code{data.frame} and \code{matrix} arguments, only
quantitative variables should be provided.}
-\item{grp,pop}{a \code{factor} indicating the group membership of individuals}
+\item{clust}{an optional \code{factor} indicating a prior group membership of
+ individuals. If provided, sub-clusters will be sought within each prior
+ group.}
\item{n.pca}{an \code{integer} indicating the number of axes retained in the
- Principal Component Analysis (PCA) step. If \code{NULL}, interactive selection is triggered.}
-\item{n.da}{an \code{integer} indicating the number of axes retained in the
- Discriminant Analysis step. If \code{NULL}, interactive selection is triggered.}
+ Principal Component Analysis (PCA) step. If \code{NULL}, interactive selection
+ is triggered.}
+\item{n.clust}{ an optional \code{integer} indicating the number of clusters to
+ be sought. If provided, the function will only run K-means once, for this
+ number of clusters. If left as \code{NULL}, several K-means are run for a
+ range of k (number of clusters) values.}
+\item{stat}{ a \code{character} string matching 'BIC', 'AIC', or 'WSS', which
+ indicates the statistic to be computed for each model (i.e., for each value of
+ \code{k}). BIC: Bayesian Information Criterion. AIC: Akaike's Information
+ Criterion. WSS: within-groups sum of squares, that is, residual variance.}
+\item{choose.n.clust}{ a \code{logical} indicating whether the number of
+clusters should be chosen by the user (TRUE, default), or automatically, based
+on a given criterion (argument \code{criterion}). IT IS HIGHLY RECOMMENDED to
+choose the number of clusters interactively, as automatic procedures are being
+evaluated.}
+\item{criterion}{ a \code{character} string matching "min", "diff", or
+ "conserv", indicating the criterion for automatic selection of the optimal
+ number of clusters. See \code{details}.}
+\item{max.n.clust}{ an \code{integer} indicating the maximum number of clusters
+ to be tried. Values of \code{k} will be picked between 1 and \code{max.n.clust}.}
+\item{n.iter}{ an \code{integer} indicating the maximum number of iterations to
+ be used in each run of the K-means algorithm. Corresponds to \code{iter.max} of
+ the \code{kmeans} function.}
+\item{n.start}{ an \code{integer} indicating the number of randomly chosen
+ starting points to be used in each run of the K-means algorithm. Using more
+ starting points makes convergence to a good solution more likely. Corresponds to
+ \code{nstart} of the \code{kmeans} function.}
\item{center}{a \code{logical} indicating whether variables should be centred to
-mean 0 (TRUE, default) or not (FALSE). Always TRUE for \linkS4class{genind} objects.}
+mean 0 (TRUE, default) or not (FALSE). Always TRUE for \linkS4class{genind}
+objects.}
\item{scale}{a \code{logical} indicating whether variables should be scaled
- (TRUE) or not (FALSE, default). Scaling consists in dividing variables by their
- (estimated) standard deviation to account for trivial differences in
+ (TRUE) or not (FALSE, default). Scaling consists in dividing variables by
+ their (estimated) standard deviation to account for trivial differences in
variances. Further scaling options are available for \linkS4class{genind}
objects (see argument \code{scale.method}).}
-\item{var.contrib,all.contrib}{a \code{logical} indicating whether the
- contribution of original variables (alleles, for \linkS4class{genind} objects)
- should be provided (TRUE) or not (FALSE, default). Such output can be useful,
- but can also create huge matrices when there the original size of the dataset
- is huge.}
-\item{pca.select}{a \code{character} indicating the mode of selection of PCA
- axes, matching approximately "nbEig" or "percVar". For "nbEig", the user
- has to specify the number of axes retained (interactively, or via
- \code{n.pca}). For "percVar", the user has to specify the minimum amount of
- the total variance to be preserved by the retained axes, expressed as a
- percentage (interactively, or via \code{perc.pca}). }
-\item{perc.pca}{a \code{numeric} value between 0 and 100 indicating the
- minimal percentage of the total variance of the data to be expressed by the
- retained axes of PCA.}
-\item{\ldots}{further arguments to be passed to other functions. For
- \code{find.clusters.matrix}, arguments are to match those of \code{find.clusters.data.frame}.}
\item{scale.method}{a \code{character} specifying the scaling method to be used
for allele frequencies, which must match "sigma" (usual estimate of standard
- deviation) or "binom" (based on binomial distribution). See \code{\link{scaleGen}} for
- further details.}
+ deviation) or "binom" (based on binomial distribution). See
+ \code{\link{scaleGen}} for further details.}
\item{truenames}{a \code{logical} indicating whether true (i.e., user-specified)
labels should be used in object outputs (TRUE, default) or not (FALSE).}
-\item{xax,yax}{\code{integers} specifying which principal components of FIND.CLUSTERS
- should be shown in x and y axes. }
-\item{col}{a suitable color to be used for groups. Not that the specified vector
-should match the number of groups, not the number of individuals.}
-\item{posi,bg,ratio,csub}{arguments used to customize the inset in scatterplots
- of FIND.CLUSTERS results. See \code{\link[pkg:ade4]{add.scatter}} documentation in the
- ade4 package for
- more details.}
-\item{only.grp}{a \code{character} vector indicating which groups should be
- displayed. Values should match values of \code{x$grp}. If \code{NULL}, all
- results are displayed}
-\item{subset}{\code{integer} or \code{logical} vector indicating which
- individuals should be displayed. If \code{NULL}, all
- results are displayed}
-\item{cex.lab}{a \code{numeric} indicating the size of labels.}
-\item{pch}{a \code{numeric} indicating the type of point to be used to indicate
- the prior group of individuals (see \code{\link{points}} documentation for
- more details).}
+\item{\ldots}{further arguments to be passed to other functions. For
+ \code{find.clusters.matrix}, arguments are to match those of \code{find.clusters.data.frame}.}
}
\details{
+ === ON THE SELECTION OF K ===
+ (where K is the 'optimal' number of clusters)
-}
+ So far, the analysis of data simulated under various population genetics
+ models (see reference) suggested an ad hoc rule for the selection of the
+ optimal number of clusters. First, BIC seems more efficient than AIC and WSS
+ to select the appropriate number of clusters. The rule of thumb consists in
+ increasing K until it no longer leads to an appreciable improvement of fit (i.e.,
+ decrease of BIC). In the simplest models (island models), BIC decreases
+ until it reaches the optimal K, and then increases. In these cases, our rule
+ amounts to choosing the K with the lowest BIC. In other models such as stepping stones, the
+ decrease of BIC often continues after the optimal K, but is much less steep.
+
+
+ An alternative approach that we do not recommend is automatic selection based
+ on a fixed criterion. For this, set \code{choose.n.clust} to FALSE and specify
+ the \code{criterion} you want to use, from the following values:
+
+ - "min": the model with the minimum statistic (as specified by the \code{stat}
+ argument) is retained. This is likely to work for simple island models with BIC.
+
+ - "diff": model selection based on successive improvement of the test
+ statistic. This procedure attempts to increase K until the model improvement
+ (difference in successive BIC, AIC, or WSS) is no longer important. May be
+ more appropriate to models relating to stepping stones.
+
+ - "conserv": another criterion meant to be conservative, in that it seeks a good
+ fit with a minimum number of clusters. Unlike "diff", it does not rely on
+ differences between successive statistics, but rather on absolute fit. It
+ selects the model with the smallest K so that the overall fit is above a given
+ threshold. }
\value{
The class \code{find.clusters} is a list with the following
components:\cr
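
The "min" and "diff" criteria described in the details section of this diff can be illustrated with a small base-R sketch. This is not the code committed in r553: `choose_k` and its `tol` argument are hypothetical names introduced here, and the stopping rule used for "diff" (stop once the gain falls below a fraction of the largest gain) is only an assumption about what "no longer important" could mean. The BIC values are made up for illustration.

```r
# Hypothetical helper, NOT part of adegenet: choose k from a vector of
# statistic values (e.g. BIC), where stat[i] is the value obtained for k = i.
choose_k <- function(stat, criterion = c("min", "diff"), tol = 0.1) {
  criterion <- match.arg(criterion)
  if (criterion == "min") {
    # "min": retain the model with the smallest statistic
    return(which.min(stat))
  }
  # "diff": gain[i] is the improvement obtained when going from k = i to k = i + 1
  gain <- -diff(stat)
  # Assumed rule: stop at the first k whose next step gains less than
  # tol times the largest gain observed
  small <- which(gain < tol * max(gain))
  if (length(small) == 0) length(stat) else small[1]
}

# Made-up BIC curve of the "stepping stone" kind: it keeps decreasing after
# k = 3, but much less steeply
bic_stepping <- c(500, 420, 350, 345, 343, 342)
k_min  <- choose_k(bic_stepping, "min")
k_diff <- choose_k(bic_stepping, "diff")

# Made-up BIC curve of the "island model" kind: it decreases, then increases
bic_island <- c(500, 400, 300, 310, 330)
k_island_min  <- choose_k(bic_island, "min")
k_island_diff <- choose_k(bic_island, "diff")
```

On the island-like curve both criteria pick k = 3. On the stepping-stone-like curve "min" picks the largest k tried, while the assumed "diff" rule stops at k = 3, which is the behaviour the log text describes.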