[Gtdb-commits] r18 - in pkg/gt.db: R man
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Sun Oct 18 01:57:21 CEST 2009
Author: dahinds
Date: 2009-10-18 01:57:20 +0200 (Sun, 18 Oct 2009)
New Revision: 18
Added:
pkg/gt.db/man/ibd.gt.data.Rd
Modified:
pkg/gt.db/R/relate.R
pkg/gt.db/man/ibd.dataset.Rd
Log:
- Added missing help page for ibd.gt.data
- rearranged page for ibd.dataset
- changed ibd.gt.data and ibd.dataset to exclude non-autosomal loci
Modified: pkg/gt.db/R/relate.R
===================================================================
--- pkg/gt.db/R/relate.R 2009-10-16 00:47:48 UTC (rev 17)
+++ pkg/gt.db/R/relate.R 2009-10-17 23:57:20 UTC (rev 18)
@@ -61,6 +61,10 @@
{
nc <- max(nchar(gt.data$genotype))
m0 <- m1 <- m2 <- matrix(0, nc, nc)
+ if (any(gt.data$ploidy != 'A')) {
+ gt.data <- subset(gt.data, ploidy=='A')
+ warning('non-autosomal data will be excluded')
+ }
# remove outlier samples
gt <- with(summary.gt.data(gt.data, by.sample=TRUE), AA+AB+BB)
@@ -119,6 +123,7 @@
{
g <- fetch.gt.data(dataset.name, part=part, parts=parts,
by='position', binsz=binsz)
+ g <- subset(g, ploidy=='A')
g <- with(summary.gt.data(g), subset(g, gt.rate>=gt.rate.min &
pmin(freq.a,freq.b)>=maf.min & hw.p.value>=hw.p.min))
ibd.gt.data(g, binsz=binsz, ...)
Modified: pkg/gt.db/man/ibd.dataset.Rd
===================================================================
--- pkg/gt.db/man/ibd.dataset.Rd 2009-10-16 00:47:48 UTC (rev 17)
+++ pkg/gt.db/man/ibd.dataset.Rd 2009-10-17 23:57:20 UTC (rev 18)
@@ -39,55 +39,16 @@
\item{\dots}{additional arguments for \code{\link{ibd.gt.data}}.}
}
\details{
- Pairwise identity-by-descent (IBD) is estimated using data from a
- series of intervals of size \code{binsz}, equally spaced across the
- genome. The algorithm determines the minimum number of alleles
- identical by state (IBS) for SNPs within each bin, and averages
- these values across bins. The minimum IBS value gives an upper
- limit on IBD for that bin, and approaches IBD as the number of
- assayed SNPs increases.
-
- Determining IBD for close relatives requires only a fraction of the
- available data from a whole genome scan. Very accurate estimates of
- IBD are not particularly useful, because of intrinsic variability in
- the recombination process. The default values for \code{binsz},
- \code{part}, and \code{parts} should not need to be changed.
-
- Bins with few SNPs may not be sufficiently informative for IBD, if
- there is a substantial probability of unrelated samples sharing the
- same haplotype by chance. Rare genotyping errors can also impact
- apparent IBD, since a single error within a bin will change the
- minimum IBS. The \code{min.snps} setting ensures reasonable
- informativeness, and \code{max.snps}, \code{maf.min},
- \code{hw.p.min}, and \code{gt.rate.min} control the expected error
- rate.
-
- The estimated genomic proportion with IBD=1 is biased upwards
- because a small proportion of bins are not sufficiently informative
- to distinguish IBD=0 from IBD=1. The estimate for IBD=2 seems to
- be less biased. The following table shows expected genomic IBD
- proportions for common familial relationships.
-
- \tabular{cccl}{
- IBD=0 \tab IBD=1 \tab IBD=2 \tab Relationship \cr
- 1.00 \tab 0.00 \tab 0.00 \tab unrelated \cr
- 0.75 \tab 0.25 \tab 0.00 \tab first cousins \cr
- 0.50 \tab 0.50 \tab 0.00 \tab half siblings \cr
- 0.00 \tab 1.00 \tab 0.00 \tab parent-child \cr
- 0.25 \tab 0.50 \tab 0.25 \tab full siblings \cr
- 0.00 \tab 0.00 \tab 1.00 \tab duplicate samples \cr
- }
-
- It is important to note that the actual IBD proportions for most
- relative pairs (except for parent-child) show substantial natural
- variation, independent of the variation due to the estimation
- procedure.
+ This is a wrapper around \code{\link{ibd.gt.data}} that helps to
+ select a suitable genome-wide autosomal subset of a genotype dataset
+ for IBD analysis. The \code{maf.min}, \code{hw.p.min}, and
+ \code{gt.rate.min} filters improve the IBD estimates by excluding
+ problematic data that may have elevated error rates.
}
\value{
- A list with three elements. The first two elements are matrices,
- containing the estimated proportions of the genome with IBD=1 and
- IBD=2, for all pairs of samples in the dataset. The third element
- is an ordered list of sample names from the GTPUB dataset.
+ A list of two square matrices with row and column names set to sample
+ names in the input data, containing the estimated proportions of the
+ genome with IBD=1 and IBD=2, for all pairs of samples in the dataset.
}
\seealso{
\code{\link{ibd.plot}},
Added: pkg/gt.db/man/ibd.gt.data.Rd
===================================================================
--- pkg/gt.db/man/ibd.gt.data.Rd (rev 0)
+++ pkg/gt.db/man/ibd.gt.data.Rd 2009-10-17 23:57:20 UTC (rev 18)
@@ -0,0 +1,99 @@
+%
+% Copyright (C) 2009, Perlegen Sciences, Inc.
+%
+% Written by David A. Hinds <dhinds at sonic.net>
+%
+% This is free software; you can redistribute it and/or modify it
+% under the terms of the GNU General Public License as published by
+% the Free Software Foundation; either version 3 of the license, or
+% (at your option) any later version.
+%
+% This program is distributed in the hope that it will be useful,
+% but WITHOUT ANY WARRANTY; without even the implied warranty of
+% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+% GNU General Public License for more details.
+%
+% You should have received a copy of the GNU General Public License
+% along with this program. If not, see <http://www.gnu.org/licenses/>
+%
+\name{ibd.gt.data}
+\alias{ibd.gt.data}
+\title{Estimate Pairwise Identity by Descent for Genotype Data}
+\description{
+ Estimate identity-by-descent for all sample pairs in a set of
+ genotype data.
+}
+\usage{
+ibd.gt.data(gt.data, binsz=1e6, min.snps=25, max.snps=50,
+ min.gt=0.8, ibs.limit=2)
+}
+\arguments{
+ \item{gt.data}{a data frame of genotypes from \code{\link{fetch.gt.data}}.}
+ \item{binsz}{Size (in base pairs) of bins to use for estimating IBD.}
+ \item{min.snps}{The minimum number of SNPs to consider when estimating
+ IBD. Bins with fewer SNPs will be excluded.}
+ \item{max.snps}{The maximim number of SNPs to consider. Additional
+ SNPs in a bin will be ignored.}
+ \item{min.gt}{minimum sample call rate for inclusion in the analysis.}
+ \item{ibs.limit}{bins with average IBS \code{ibs.limit} fold higher
+ than the median IBS are excluded due to low information content.}
+}
+\details{
+ Pairwise identity-by-descent (IBD) is estimated by subdividing the
+ input data into intervals of intervals of size \code{binsz}, equally
+ spaced across the genome. The algorithm determines the minimum number
+ of alleles identical by state (IBS) for SNPs within each bin, and
+ averages these values across bins. The minimum IBS value gives an
+ upper limit on IBD for that bin, and approaches IBD as the number of
+ assayed SNPs increases.
+
+ Determining IBD for close relatives requires only a fraction of the
+ available data from a whole genome scan. Very accurate estimates of
+ IBD are not particularly useful, because of intrinsic variability in
+ the recombination process. The default values for \code{binsz},
+ \code{part}, and \code{parts} should not need to be changed.
+
+ Bins with few SNPs may not be sufficiently informative for IBD, if
+ there is a substantial probability of unrelated samples sharing the
+ same haplotype by chance. Rare genotyping errors can also impact
+ apparent IBD, since a single error within a bin will change the
+ minimum IBS. The \code{min.snps} setting ensures reasonable
+ informativeness, and \code{max.snps} helps to limit the error rate.
+
+ The estimated genomic proportion with IBD=1 is biased upwards
+ because a small proportion of bins are not sufficiently informative
+ to distinguish IBD=0 from IBD=1. The estimate for IBD=2 seems to
+ be less biased. The following table shows expected genomic IBD
+ proportions for common familial relationships.
+
+ \tabular{cccl}{
+ IBD=0 \tab IBD=1 \tab IBD=2 \tab Relationship \cr
+ 1.00 \tab 0.00 \tab 0.00 \tab unrelated \cr
+ 0.75 \tab 0.25 \tab 0.00 \tab first cousins \cr
+ 0.50 \tab 0.50 \tab 0.00 \tab half siblings \cr
+ 0.00 \tab 1.00 \tab 0.00 \tab parent-child \cr
+ 0.25 \tab 0.50 \tab 0.25 \tab full siblings \cr
+ 0.00 \tab 0.00 \tab 1.00 \tab duplicate samples \cr
+ }
+
+ It is important to note that the actual IBD proportions for most
+ relative pairs (except for parent-child) show substantial natural
+ variation, independent of the variation due to the estimation
+ procedure.
+}
+\value{
+ A list of two square matrices with row and column names set to sample
+ names in the input data, containing the estimated proportions of the
+ genome with IBD=1 and IBD=2, for all pairs of samples in the dataset.
+}
+\seealso{
+ \code{\link{ibd.plot}},
+ \code{\link{ibd.summary}},
+ \code{\link{ibd.outliers}}.
+}
+\examples{
+gt.demo.check()
+gt <- fetch.gt.data('Demo_2')
+ibd <- ibd.gt.data(subset(gt, ploidy=='A'))
+}
+\keyword{manip}
More information about the Gtdb-commits
mailing list