[IPSUR-commits] r130 - pkg/IPSUR/inst/doc

noreply at r-forge.r-project.org
Fri Jan 8 22:52:48 CET 2010


Author: gkerns
Date: 2010-01-08 22:52:47 +0100 (Fri, 08 Jan 2010)
New Revision: 130

Modified:
   pkg/IPSUR/inst/doc/IPSUR.Rnw
Log:
some changes


Modified: pkg/IPSUR/inst/doc/IPSUR.Rnw
===================================================================
--- pkg/IPSUR/inst/doc/IPSUR.Rnw	2010-01-08 16:34:22 UTC (rev 129)
+++ pkg/IPSUR/inst/doc/IPSUR.Rnw	2010-01-08 21:52:47 UTC (rev 130)
@@ -15269,7 +15269,7 @@
 
 Computers have changed the face of statistics. Their quick computational
 speed and flawless accuracy, coupled with large datasets acquired
-by the researcher, make them indispensable for any modern analysis.
+by the researcher, make them indispensable for many modern analyses.
 In particular, resampling methods (due in large part to Bradley Efron)
 have gained prominence in the modern statistician's repertoire. Let
 us look at a classical problem to get some insight into why.
@@ -15368,15 +15368,15 @@
 illustrate the bootstrap procedure in the special case that the statistic
 $S$ is the standard error.
 \begin{example}
-\textbf{Bootstrapping the standard error of the mean.\label{exa:Bootstrap-se-mean}}
+\textbf{Standard error of the mean.\label{exa:Bootstrap-se-mean}}
 In this example we illustrate the bootstrap by estimating the standard
 error of the sample mean. We do this in the special case when the
 underlying population is $\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1)$. 
 
-Of course, for this example we do not need a bootstrap distribution
-because from Section \ref{sec:Sampling-from-Normal} we know that
-$\Xbar\sim\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1/\sqrt{n})$.
-We will use what we already know to see how the bootstrap method performs.
+Of course, we do not really need a bootstrap distribution here because
+from Section \ref{sec:Sampling-from-Normal} we know that $\Xbar\sim\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1/\sqrt{n})$,
+but we will investigate how the bootstrap performs when we know what
+the answer should be ahead of time.
 
 We will take a random sample of size $n=25$ from the population.
 Then we will \emph{resample} the data 1000 times to get 1000 resamples
@@ -15409,32 +15409,69 @@
 curve(dnorm(x, 3, 0.2), add = TRUE)  # overlay true normal density
 @
 
-Take a look at this:
-\end{example}
+We have overlain what we know to be the true sampling distribution
+of $\Xbar$, namely, a $\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1/\sqrt{25})$
+distribution. The histogram matches the true sampling distribution
+pretty well with respect to shape and spread\ldots{}but notice how
+the histogram is off-center a little bit. This is not a coincidence
+-- in fact, it can be shown that the mean of the bootstrap distribution
+is exactly the mean of the original sample, that is, the value of
+the statistic that we originally observed. Let us calculate the mean
+of the bootstrap distribution and compare it to the mean of the original
+sample:
+
 <<>>=
 mean(xbarstar)
 mean(srs)
+mean(xbarstar) - mean(srs)
 @
 
-Now what we originally wanted.
+\end{example}
+Notice how close the two values are. The difference between them is
+an estimate of how biased the original statistic is, the so-called
+\emph{bootstrap estimate of bias}. Since the estimate is so small
+we would expect our original statistic ($\Xbar$) to have small bias,
+but this is no surprise to us because we already knew from Section
+BLANK that $\Xbar$ is an unbiased estimator of the population mean.
 
+Now, back to our original problem: we would like to estimate the standard
+error of $\Xbar$. Looking at the histogram, we see that the spread
+of the bootstrap distribution is similar to the spread of the sampling
+distribution. Therefore, it stands to reason that we could estimate
+the standard error of $\Xbar$ with the sample standard deviation
+of the resample statistics. Let us try it and see.
+
 <<>>=
 sd(xbarstar)
 @
 
+We know from theory that the true standard error is $1/\sqrt{25}=0.20$.
+Our bootstrap estimate is not very far from the theoretical value.
 
+
+\begin{rem}
+What would happen if we took more resamples? Instead of 1000 resamples,
+we could increase to, say, 2000, 3000, or even 4000\ldots{}would
+it help? The answer is both yes and no. Keep in mind that with resampling
+methods there are two sources of randomness: that from the original
+sample, and that from the subsequent resampling procedure. An increased
+number of resamples would reduce the variation due to the second part,
+but would be powerless to reduce the variation due to the first part.
+We only took an original sample of size $n=25$, and resampling more
+and more would never generate more information about the population
+than was already there. In this sense, the statistician is limited
+by the information contained in the original sample.
+\end{rem}
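+
+As a quick check of this point, one could fix a single original sample
+and compare the bootstrap standard error computed from 1000 resamples
+with one computed from 4000 resamples: the two estimates agree closely,
+while a fresh original sample of size $n=25$ would move the estimate
+around much more. A small sketch, assuming the sample \inputencoding{latin9}\lstinline[showstringspaces=false]!srs!\inputencoding{utf8}
+from the example above is still in the workspace:
+
+<<eval = FALSE>>=
+sd(replicate(1000, mean(sample(srs, 25, TRUE))))  # bootstrap SE, 1000 resamples
+sd(replicate(4000, mean(sample(srs, 25, TRUE))))  # nearly the same estimate
+@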
 \begin{example}
-\textbf{Bootstrapping the Standard Error of the Median.\label{exa:Bootstrap-se-median}}
-In this example we extend our study to include more complicated statistics
-and data where we do not know the answer ahead of time. This example
-uses the \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}\index{Data sets!rivers@\texttt{rivers}}
+\textbf{Standard error of the median.\label{exa:Bootstrap-se-median}}
+We look at an example where we do not know the answer ahead of time. This
+example uses the \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}\index{Data sets!rivers@\texttt{rivers}}
 dataset. Recall the stemplot on page \vpageref{ite:stemplot-rivers}
 that we made for these data, which shows them to be markedly right-skewed,
 so a natural estimate of center would be the sample median. Unfortunately,
 its sampling distribution falls out of our reach. We use the bootstrap
 to help us with this problem, and the modifications to the last example
 are trivial.
-\end{example}
+
 <<>>=
 resamps <- replicate(1000, sample(rivers, 141, TRUE), simplify = FALSE)
 medstar <- sapply(resamps, median, simplify = TRUE)
@@ -15461,15 +15498,22 @@
 <<eval = FALSE, keep.source = TRUE>>=
 hist(medstar, breaks = 40, prob = TRUE)
 @
+
+<<>>=
+median(rivers)
+mean(medstar)
+mean(medstar) - median(rivers)
+@
+
+\end{example}
+
 \begin{example}
 The boot package in \texttt{R}. It turns out that there are many bootstrap
 procedures and commands already distributed with \texttt{R}, in the
 \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}
 package. Further, inside the \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}
 package there is even a function called \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}\index{boot@\texttt{boot}}.
-The basic syntax is of the form:
-
-\inputencoding{latin9}
+The basic syntax is of the form:\inputencoding{latin9}
 \begin{lstlisting}[showstringspaces=false]
 boot(data, statistic, R)
 \end{lstlisting}
@@ -15486,7 +15530,7 @@
 <<>>=
 library(boot)
 mean_fun <- function(x, indices) mean(x[indices])
-boot(data = rnorm(25, mean = 2), statistic = mean_fun, R = 1000)
+boot(data = srs, statistic = mean_fun, R = 1000)
 @
 
 For the standard error of the median (Example \ref{exa:Bootstrap-se-median}):
@@ -15555,8 +15599,7 @@
 We then plug \inputencoding{latin9}\lstinline[showstringspaces=false]!data.boot!\inputencoding{utf8}
 into the function \inputencoding{latin9}\lstinline[showstringspaces=false]!boot.ci!\inputencoding{utf8}.
 \begin{example}
-Please see the handout, {}``Bootstrapping Confidence Intervals for
-the Median''.
+Confidence interval for the expected value of the median.
 
 <<>>=
 btsamps <- replicate(2000, sample(stack.loss, 21, TRUE), simplify = FALSE)
@@ -15569,8 +15612,8 @@
 \end{example}
 
 \begin{example}
-Please see the handout, {}``Bootstrapping Confidence Intervals for
-the Median, $2^{\mathrm{nd}}$ try.''
+Confidence interval for the expected value of the median, $2^{\mathrm{nd}}$
+try.
 
 <<>>=
 library(boot)
@@ -15586,8 +15629,8 @@
 The idea is to use confidence intervals that we already know and let
 the bootstrap help us when we get into trouble. We know that a $100(1-\alpha)\%$
 confidence interval for the mean of a $SRS(n)$ from a normal distribution
-is given by \begin{equation}
-\Xbar\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-1)\frac{S}{\sqrt{n}}\end{equation}
+is \begin{equation}
+\Xbar\pm\mathsf{t}_{\alpha/2}(\mathtt{df}=n-1)\frac{S}{\sqrt{n}},\end{equation}
 where $\mathsf{t}_{\alpha/2}(\mathtt{df}=n-1)$ is the appropriate
 critical value from Student's $t$ distribution, and we remember that
 an estimate for the standard error of $\Xbar$ is $S/\sqrt{n}$. Of
@@ -15602,7 +15645,7 @@
 $\E(\mathrm{statistic})$.
 \begin{example}
 We will use the t-interval method to find the bootstrap CI for the
-Median. We have looked at the bootstrap distribution; it appears to
+median. We have looked at the bootstrap distribution; it appears to
 be symmetric and approximately mound shaped. Further, we may check
 that the bias is approximately 40, which on the scale of these data
 is practically negligible. Thus, we may consider looking at the $t$-intervals.
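+
+A rough sketch of this $t$-interval, assuming the \inputencoding{latin9}\lstinline[showstringspaces=false]!medstar!\inputencoding{utf8}
+resamples from Example \ref{exa:Bootstrap-se-median} are still in the
+workspace (recall \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}
+has $n=141$ observations, so $\mathtt{df}=140$):
+
+<<eval = FALSE>>=
+# 95% t-interval for the median, with sd(medstar) as the standard error
+median(rivers) + c(-1, 1) * qt(0.975, df = 140) * sd(medstar)
+@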


