[IPSUR-commits] r149 - pkg/IPSUR/inst/doc

Tue Jan 19 15:53:16 CET 2010

Author: gkerns
Date: 2010-01-19 15:53:15 +0100 (Tue, 19 Jan 2010)
New Revision: 149

Modified:
   pkg/IPSUR/inst/doc/IPSUR.Rnw
Log:
small changes


Modified: pkg/IPSUR/inst/doc/IPSUR.Rnw
===================================================================

--- pkg/IPSUR/inst/doc/IPSUR.Rnw	2010-01-18 20:38:11 UTC (rev 148)
+++ pkg/IPSUR/inst/doc/IPSUR.Rnw	2010-01-19 14:53:15 UTC (rev 149)
@@ -224,7 +224,7 @@
 <<echo = FALSE>>=
 seed <- 42
 set.seed(seed)
-options(width = 70)
+options(width = 75)
 #library(random)
 #i_seed <- randomNumbers(n = 624, col = 1, min = -1e+09, max = 1e+09)
 #.Random.seed[2:626] <- as.integer(c(1, i_seed))
@@ -2888,10 +2888,12 @@
 \subsection{Standardizing variables}
 
 It is sometimes useful to compare data sets with each other on a scale
-that is independent of the measurement units.
+that is independent of the measurement units. The \inputencoding{latin9}\lstinline[showstringspaces=false]!scale!\inputencoding{utf8}
+function will rescale a numeric vector (or data frame) by subtracting
+the sample mean from each value (column) and/or dividing by the sample standard deviation.
 
 
-\section{Multivariate Data and Data Frames\label{sec:Multivariate-Data}}
+\section{Multivariate Data and Data Frames\label{sec:multivariate-data}}
 
 We have had experience with vectors of data, which are long lists
 of numbers. Typically, each entry in the vector is a single measurement
@@ -13915,7 +13917,7 @@
 \begin{xca}
 Prove the ANOVA equality, Equation \ref{eq:anovaeq}. \emph{Hint}:
 show that\[
-\sum\]
+\sum_{i=1}^{n}(Y_{i}-\hat{Y_{i}})(\hat{Y_{i}}-\Ybar)=0.\]
 
 \end{xca}
 
@@ -15549,7 +15551,8 @@
 methods has given us:
 \begin{description}
 \item [{Fewer~assumptions.}] We are no longer required to assume the population
-is normal or the sample size is large.
+is normal or the sample size is large (though, as before, the larger
+the sample the better).
 \item [{Greater~accuracy.}] Many classical methods are based on rough
 upper bounds or Taylor expansions. The bootstrap procedures can be
 iterated long enough to give results accurate to several decimal places,
@@ -15587,24 +15590,17 @@
 
 Since the bootstrap distribution gives us information about a statistic's
 sampling distribution, we can use the bootstrap distribution to estimate
-properties of the statistic. of We have seen a procedure to help us
-gain information about the sampling distribution of a statistic of
-interest, and in this section we bring that information to bear to
-help us with estimation.Once we have a bootstrap distribution the
-next question is, what are we going to do with it?One statistic whose
-sampling distribution is often of interest is the sampling We will
-illustrate the bootstrap procedure in the special case that the statistic
-$S$ is the standard error
+properties of the statistic. We will illustrate the bootstrap procedure
+in the special case that the statistic $S$ is a standard error.
 \begin{example}
 \textbf{Standard error of the mean.\label{exa:Bootstrap-se-mean}}
 In this example we illustrate the bootstrap by estimating the standard
-error of the sample mean. We do this in the special case when the
-underlying population is $\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1)$. 
-
+error of the sample mean, and we will do it in the special case that
+the underlying population is $\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1)$.
 Of course, we do not really need a bootstrap distribution here because
 from Section \ref{sec:sampling-from-normal-dist} we know that $\Xbar\sim\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1/\sqrt{n})$,
-but we will investigate how the bootstrap performs when we know what
-the answer should be ahead of time.
+but we proceed anyway to investigate how the bootstrap performs when
+we know what the answer should be ahead of time.
 
 We will take a random sample of size $n=25$ from the population.
 Then we will \emph{resample} the data 1000 times to get 1000 resamples
@@ -15628,6 +15624,19 @@
 
 \caption{Bootstrapping the standard error of the mean, simulated data\label{fig:Bootstrap-se-mean}}
 
+
+{\small ~}{\small \par}
+
+{\small The original data were 25 observations generated from a $\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=1)$
+distribution. We next resampled to get 1000 resamples, each of size
+25, and calculated the sample mean for each resample. A histogram
+of the 1000 values of $\xbar$ is shown above. Also shown (with a
+solid line) is the true sampling distribution of $\Xbar$, which is
+a $\mathsf{norm}(\mathtt{mean}=3,\,\mathtt{sd}=0.2)$ distribution.
+Note that the histogram is centered at the sample mean of the original
+data, while the true sampling distribution is centered at the true
+value of $\mu=3$. The shape and spread of the histogram are similar
+to the shape and spread of the true sampling distribution.}
 \end{figure}
 A histogram of the 1000 values of $\xbar$ is shown in Figure \ref{fig:Bootstrap-se-mean},
 and was produced by the following code.
@@ -15685,7 +15694,7 @@
 methods there are two sources of randomness: that from the original
 sample, and that from the subsequent resampling procedure. An increased
 number of resamples would reduce the variation due to the second part,
-but would be powerless to reduce the variation due to the first part.
+but would do nothing to reduce the variation due to the first part.
 We only took an original sample of size $n=25$, and resampling more
 and more would never generate more information about the population
 than was already there. In this sense, the statistician is limited
@@ -15722,7 +15731,7 @@
 
 
 The graph is shown in Figure \ref{fig:Bootstrapping-se-median}, and
-was produced by the following.
+was produced by the following code.
 
 <<eval = FALSE, keep.source = TRUE>>=
 hist(medstar, breaks = 40, prob = TRUE)
@@ -15737,9 +15746,9 @@
 \end{example}
 
 \begin{example}
-The boot package in \texttt{R}. It turns out that there are many bootstrap
-procedures and commands already built into base \texttt{R}, in the
-\inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}
+\textbf{The boot package in }\texttt{\textbf{R}}\textbf{.} It turns
+out that there are many bootstrap procedures and commands already
+shipped with a standard \texttt{R} installation, in the \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}
 package. Further, inside the \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}
 package there is even a function called \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}\index{boot@\texttt{boot}}.
 The basic syntax is of the form:\inputencoding{latin9}
@@ -15828,7 +15837,10 @@
 We then plug \inputencoding{latin9}\lstinline[showstringspaces=false]!data.boot!\inputencoding{utf8}
 into the function \inputencoding{latin9}\lstinline[showstringspaces=false]!boot.ci!\inputencoding{utf8}.
 \begin{example}
-Confidence interval for expected value of the median.
+\label{exa:percentile-interval-median-first}\textbf{Percentile interval
+for the expected value of the median.} We will try the naive approach
+where we generate the resamples and calculate the percentile interval
+by hand.
 
 <<>>=
 btsamps <- replicate(2000, sample(stack.loss, 21, TRUE), simplify = FALSE)
@@ -15842,7 +15854,8 @@
 
 \begin{example}
 Confidence interval for expected value of the median, $2^{\mathrm{nd}}$
-try.
+try. Now we will do it the right way with the \inputencoding{latin9}\lstinline[showstringspaces=false]!boot!\inputencoding{utf8}
+function.
 
 <<>>=
 library(boot)
@@ -16458,7 +16471,8 @@
 
 See \inputencoding{latin9}\lstinline[showstringspaces=false]!?read.spss!\inputencoding{utf8}
 for the available options to customize the file import. Note that
-the R Commander
+the \textsf{R} Commander will import many of the common file types with a
+menu-driven interface.
 
 
 \subsection{Importing a Data Frame}
@@ -16472,7 +16486,7 @@
 
 Using \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!scan!\inputencoding{utf8}
 
-Using R Commander.
+Using the \textsf{R} Commander.
 
 
 \section{Editing Data\label{sec:Editing-Data-Sets}}
@@ -16489,7 +16503,57 @@
 
 \subsection{Sorting Data}
 
+We can sort a vector with the \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!sort!\inputencoding{utf8}
+function. 
 
+Normally we have a data frame of several columns (variables) and many,
+many rows (observations). The goal is to rearrange the rows so that
+they are ordered by the values of one or more columns. This is done
+with the \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!order!\inputencoding{utf8}
+function. 
+
+For example, we may sort all of the rows of the \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!Puromycin!\inputencoding{utf8}
+data (in ascending order) by the variable \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!conc!\inputencoding{utf8}
+with the following:
+
+<<>>=
+Tmp <- Puromycin[order(Puromycin$conc), ]
+head(Tmp)
+@
+
+We can accomplish the same thing with the command 
+
+<<eval = FALSE>>=
+with(Puromycin, Puromycin[order(conc), ])
+@
+
+We can sort by more than one variable. To sort first by \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!state!\inputencoding{utf8}
+and then by \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!conc!\inputencoding{utf8}
+do
+
+<<eval = FALSE>>=
+with(Puromycin, Puromycin[order(state, conc), ])
+@
+
+If we would like to sort a numeric variable in descending order then
+we put a minus sign in front of it.
+
+<<>>=
+Tmp <- with(Puromycin, Puromycin[order(-conc), ])
+head(Tmp)
+@
+
+If we would like to sort by a character (or factor) variable in descending
+order then we can use the \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!xtfrm!\inputencoding{utf8}
+function, which produces a numeric vector that sorts in the same order as the
+character vector.
+
+<<>>=
+Tmp <- with(Puromycin, Puromycin[order(-xtfrm(state)), ])
+head(Tmp)
+@
+
+
 \section{Exporting Data\label{sec:Exporting-a-Data}}
 
 The basic function is \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!write.table!\inputencoding{utf8}
@@ -16500,15 +16564,15 @@
 
 
 \section{Reshaping Data\label{sec:Reshaping-a-Data}}
+\begin{itemize}
+\item Aggregation
+\item Convert Tables to Data Frames and back
+\end{itemize}
+\inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!rbind!\inputencoding{utf8} 
 
-Aggregation
-
-Convert Tables to Data Frames and back
-
-\inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!rbind!\inputencoding{utf8}
 \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!cbind!\inputencoding{utf8}
 
-ab{[}order(ab{[},1{]}),{]}
+\inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!ab[order(ab[ ,1]), ]!\inputencoding{utf8}
 
 \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!complete.cases!\inputencoding{utf8}
 
@@ -16516,19 +16580,7 @@
 
 \inputencoding{latin9}\lstinline[showstringspaces=false,tabsize=2]!stack!\inputencoding{utf8}
 
-\# sorting examples using built-in mtcars data set
 
-\# sort by mpg newdata <- mtcars{[}order(mpg),{]}
-
-\# sort by mpg and cyl newdata <- mtcars{[}order(mpg, cyl),{]}
-
-\#sort by mpg (ascending) and cyl (descending) newdata <- mtcars{[}order(mpg,
--cyl),{]} 
-
-
-\section{Chapter Exercises}
-
-
 \chapter{Mathematical Machinery\label{cha:Mathematical-Machinery}}
 
 This appendix houses many of the standard definitions and theorems