[IPSUR-commits] r170 - pkg/IPSUR/inst/doc

Sun Jan 31 00:12:36 CET 2010

Author: gkerns
Date: 2010-01-31 00:12:35 +0100 (Sun, 31 Jan 2010)
New Revision: 170

Added:
   pkg/IPSUR/inst/doc/IPSURsolutions.Rnw
Modified:
   pkg/IPSUR/inst/doc/IPSUR.Rnw
Log:
added main branch and separated answers/solutions


Modified: pkg/IPSUR/inst/doc/IPSUR.Rnw
===================================================================

--- pkg/IPSUR/inst/doc/IPSUR.Rnw	2010-01-29 19:49:01 UTC (rev 169)
+++ pkg/IPSUR/inst/doc/IPSUR.Rnw	2010-01-30 23:12:35 UTC (rev 170)
@@ -175,6 +175,13 @@
 morestring=[b]"
 }
 
+% Turn on questions and answers
+\newcommand{\question}[1]{#1}
+\newcommand{\answer}[1]{#1}
+% Turn off questions and answers
+%\newcommand{\question}[1]{}
+%\newcommand{\answer}[1]{}
+
 \@ifundefined{showcaptionsetup}{}{%
  \PassOptionsToPackage{caption=false}{subfig}}
 \usepackage{subfig}
@@ -415,7 +422,7 @@
 
 \tableofcontents{}
 
-\cleardoublepage
+\noindent \cleardoublepage
 \phantomsection
 \addcontentsline{toc}{chapter}{Preface}
 
@@ -729,10 +736,10 @@
 
 \pagenumbering{arabic} 
 
-This chapter has proved to be the hardest to write, by far. The trouble
-is that there is so much to say -- and so many people have already
-said it so much better than I could. When I get something I like I
-will release it here.
+\noindent This chapter has proved to be the hardest to write, by far.
+The trouble is that there is so much to say -- and so many people
+have already said it so much better than I could. When I get something
+I like I will release it here.
 
 In the meantime, there is a lot of information already available to
 a person with an Internet connection. I recommend starting at Wikipedia,
@@ -781,14 +788,22 @@
 This book is devoted mostly to the frequentist viewpoint because that
 is how I was trained, with the conspicuous exception of Sections \ref{sec:Bayes'-Rule}
 and \ref{sec:Conditional-Distributions}. I plan to add more Bayesian
-material in later editions of this book. 
+material in later editions of this book.
 
+\pagebreak{}
 
-\chapter{An Introduction to \textsf{R\label{cha:An-Introduction-to-R}}}
 
+\section*{Chapter Exercises}
 
-\section{Downloading and Installing \textsf{R\label{sec:Downloading-and-Installing-R}}}
+\addcontentsline{toc}{section}{Chapter Exercises}
+\setcounter{thm}{0}
 
+
+\chapter{An Introduction to \textsf{R\label{cha:introduction-to-R}}}
+
+
+\section{Downloading and Installing \textsf{R\label{sec:download-install-R}}}
+
 The instructions for obtaining \textsf{R} largely depend on the user's
 hardware and operating system. The \textsf{R} Project has written
 an \textsf{R} Installation and Administration manual with complete,
@@ -806,8 +821,8 @@
 \item [{MacOS:}] \url{http://cran.r-project.org/bin/macosx/}
 \item [{Linux:}] \url{http://cran.r-project.org/bin/linux/}
 \end{description}
-On MS-Windows, click the \inputencoding{latin9}\lstinline[showstringspaces=false]!.exe!\inputencoding{utf8}
-program file to start installation. When it asks for \textquotedbl{}Customized
+On Microsoft Windows, click the \inputencoding{latin9}\lstinline[showstringspaces=false]!R-x.y.z.exe!\inputencoding{utf8}
+installer to start installation. When it asks for \textquotedbl{}Customized
 startup options\textquotedbl{}, specify \textsf{Yes}. In the next
 window, be sure to select the SDI (single document interface) option;
 this is useful later when we discuss three-dimensional plots with
@@ -854,7 +869,7 @@
 will no longer be pointing to the right place. 
 
 
-\subsection{Installing and Loading Add-on Packages\label{sub:Installing-and-Loading-packages}}
+\subsection{Installing and Loading Add-on Packages\label{sub:installing-loading-packages}}
 
 There are \emph{base} packages (which come with \textsf{R} automatically),
 and \emph{contributed} packages (which must be downloaded for installation).
@@ -1524,8 +1539,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 
@@ -2837,7 +2853,7 @@
 out to be 36.6).
 
 
-\subsection{Hinges and the Five Number Summary\label{sub:Hinges-and-the} }
+\subsection{Hinges and the Five Number Summary\label{sub:hinges-and-5NS} }
 
 Given a data set $x_{1}$, $x_{2}$, \ldots{}, $x_{n}$, the hinges
 are found by the following method: 
@@ -2865,7 +2881,7 @@
 function.
 
 
-\subsection{Boxplots\label{sub:Boxplots} }
+\subsection{Boxplots\label{sub:boxplots} }
 
 A boxplot is essentially a graphical representation of the $5NS$.
 It can be a handy alternative to a stripchart when the sample size
@@ -2931,11 +2947,41 @@
 
 \subsection{How to do it with \textsf{R}}
 
+The quickest way to visually identify outliers is with a boxplot,
+described above. Another way is with the \inputencoding{latin9}\lstinline[showstringspaces=false]!boxplot.stats!\inputencoding{utf8}
+function.
+\begin{example}
+The \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}
+data. We will look for potential outliers in the \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}
+data.
 
+<<>>=
+boxplot.stats(rivers)$out
+@
+
+We may change the \inputencoding{latin9}\lstinline[showstringspaces=false]!coef!\inputencoding{utf8}
+argument to 3 (it is 1.5 by default) to identify suspected outliers.
+
+<<>>=
+boxplot.stats(rivers, coef = 3)$out
+@
+
+\end{example}
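+
+As a rough sketch of what the \inputencoding{latin9}\lstinline[showstringspaces=false]!coef!\inputencoding{utf8}
+argument controls, write $h_{L}$ and $h_{U}$ for the lower and upper
+hinges and $c$ for the value of \inputencoding{latin9}\lstinline[showstringspaces=false]!coef!\inputencoding{utf8}
+(these symbols are introduced only for this illustration). An observation
+$x$ is then reported in \inputencoding{latin9}\lstinline[showstringspaces=false]!$out!\inputencoding{utf8}
+when\[
+x<h_{L}-c\,(h_{U}-h_{L})\quad\mbox{or}\quad x>h_{U}+c\,(h_{U}-h_{L}).\]
+With $c=1.5$ the rule flags potential outliers, and with $c=3$ it
+flags only the more extreme, suspected outliers.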
+
 \subsection{Standardizing variables}
 
 It is sometimes useful to compare data sets with each other on a scale
-that is independent of the measurement units. The \inputencoding{latin9}\lstinline[showstringspaces=false]!scale!\inputencoding{utf8}
+that is independent of the measurement units. Given a set of observed
+data $x_{1}$, $x_{2}$, \ldots{}, $x_{n}$ we get $z$ scores, denoted
+$z_{1}$, $z_{2}$, \ldots{}, $z_{n}$, by means of the following
+formula\[
+z_{i}=\frac{x_{i}-\xbar}{s},\quad i=1,\,2,\,\ldots,\, n.\]
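+
+As a small worked illustration (the four data values are made up for
+this sketch, and $s$ is assumed to be the sample standard deviation
+with divisor $n-1$), take $x_{1}=5$, $x_{2}=6$, $x_{3}=7$, $x_{4}=8$.
+Then $\xbar=6.5$ and\[
+s=\sqrt{\frac{(-1.5)^{2}+(-0.5)^{2}+(0.5)^{2}+(1.5)^{2}}{3}}=\sqrt{5/3}\approx1.291,\]
+so that\[
+z_{1}=\frac{5-6.5}{1.291}\approx-1.162,\quad z_{2}\approx-0.387,\quad z_{3}\approx0.387,\quad z_{4}\approx1.162.\]
+The $z$ scores carry no units, they sum to zero, and their sample standard
+deviation is one.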
+
+
+
+\subsection{How to do it with \textsf{R}}
+
+The \inputencoding{latin9}\lstinline[showstringspaces=false]!scale!\inputencoding{utf8}
 function will rescale a numeric vector (or data frame) by subtracting
 the sample mean from each value (column) and/or by dividing each observation
 by the sample standard deviation.
@@ -2955,10 +3001,10 @@
 the measured information in a rectangular array in which each row
 corresponds to a subject, and the columns contain the measurements
 for each respective variable. For instance, if one were to measure
-the height and weight of each of 11 persons in a research study, the
-information could be represented with a rectangular array. There would
-be 11 rows. Each row would have the person's height in the first column
-and weight in the second column.
+the height, weight, and hair color of each of 11 persons in a research
+study, the information could be represented with a rectangular array.
+There would be 11 rows. Each row would have the person's height in
+the first column, weight in the second column, and hair color in the
+third column.
 
 The corresponding objects in \textsf{R} are called \emph{data frames},
 and they can be constructed with the \inputencoding{latin9}\lstinline[showstringspaces=false]!data.frame!\inputencoding{utf8}
@@ -2967,14 +3013,15 @@
 Suppose we have two vectors \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
 and \inputencoding{latin9}\lstinline[showstringspaces=false]!y!\inputencoding{utf8}
 and we want to make a data frame out of them.
-\end{example}
+
 <<>>=
 x <- 5:8
 y <- letters[3:6]
-data.frame(x,y)
+A <- data.frame(v1 = x, v2 = y)
 @
 
-Notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
+\end{example}
+Notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
 and \inputencoding{latin9}\lstinline[showstringspaces=false]!y!\inputencoding{utf8}
 are the same length. This is \emph{necessary}. Also notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
 is a numeric vector and \inputencoding{latin9}\lstinline[showstringspaces=false]!y!\inputencoding{utf8}
@@ -2986,7 +3033,36 @@
 (numeric) and \inputencoding{latin9}\lstinline[showstringspaces=false]!gender!\inputencoding{utf8}
 (character or factor) information in the same column.
 
+Indexing of data frames is similar to indexing of vectors. To get
+the entry in row $i$ and column $j$ do \inputencoding{latin9}\lstinline[showstringspaces=false]!A[i,j]!\inputencoding{utf8}.
+We can get entire rows and columns by omitting the other index. 
 
+<<>>=
+A[3,]
+A[ ,1]
+A[ ,2]
+@
+
+There are several things happening above. Notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!A[3,]!\inputencoding{utf8}
+gave a data frame (with the same entries as the third row of \inputencoding{latin9}\lstinline[showstringspaces=false]!A!\inputencoding{utf8})
+yet \inputencoding{latin9}\lstinline[showstringspaces=false]!A[ ,1]!\inputencoding{utf8}
+is a numeric vector. \inputencoding{latin9}\lstinline[showstringspaces=false]!A[ ,2]!\inputencoding{utf8}
+is a factor vector because the default setting for \inputencoding{latin9}\lstinline[showstringspaces=false]!data.frame!\inputencoding{utf8}
+is \inputencoding{latin9}\lstinline[showstringspaces=false]!stringsAsFactors = TRUE!\inputencoding{utf8}
+(this is the default in versions of \textsf{R} before 4.0.0; from
+4.0.0 on the default is \inputencoding{latin9}\lstinline[showstringspaces=false]!FALSE!\inputencoding{utf8}
+and the column comes back as a character vector).
+
+Data frames have a \inputencoding{latin9}\lstinline[showstringspaces=false]!names!\inputencoding{utf8}
+attribute and the names may be extracted with the \inputencoding{latin9}\lstinline[showstringspaces=false]!names!\inputencoding{utf8}
+function. Once we have the names we may extract given columns by way
+of the dollar sign.
+
+<<>>=
+names(A)
+A$v1
+@
+
+The above is identical to \inputencoding{latin9}\lstinline[showstringspaces=false]!A[ ,1]!\inputencoding{utf8}. 
+
+
 \subsection{Bivariate Data\label{sub:Bivariate-Data}}
 \begin{itemize}
 \item Introduce the sample correlation coefficient.
@@ -3005,7 +3081,7 @@
 or in \textsf{R} Commander by following \textsf{Statistics} \textsf{$\triangleright$}
 \textsf{Contingency Tables} \textsf{$\triangleright$} \textsf{Multi-way
 Tables}.
-\item Scatterplot Matrix. used for displaying pairwise scatterplots simultaneously.
+\item Scatterplot matrix. Used for displaying pairwise scatterplots simultaneously.
 Again, look for linear association and correlation.
 \item 3D Scatterplot. See Figure \pageref{fig:3D-scatterplot-trees}
 \item \inputencoding{latin9}\lstinline[showstringspaces=false]!plot(state.region, state.division)!\inputencoding{utf8} 
@@ -3179,8 +3255,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 
@@ -3221,11 +3298,6 @@
 summary statistics for each variable.
 
 
-\paragraph*{Answers:}
-
-<<"Find summary statistics">>=
-summary(RcmdrTestDrive)
-@
 \end{xca}
 
 \begin{xca}
@@ -3240,28 +3312,8 @@
 \end{enumerate}
 \end{xca}
 
-\paragraph*{Solution:}
 
-First we will make a table of the \emph{race} variable with the \inputencoding{latin9}\lstinline[showstringspaces=false]!table!\inputencoding{utf8}
-function.
 
-<<>>=
-table(race)
-@
-\begin{enumerate}
-\item For these data, \Sexpr{names(table(race))[which(table(race)==max(table(race)))]}
-has the highest frequency.
-\item For these data, \Sexpr{names(table(race))[which(table(race)==min(table(race)))]}
-has the lowest frequency.
-\item The graph is shown below.
-\end{enumerate}
-\begin{center}
-<<echo = FALSE, fig=true, height = 4, width = 6>>=
-barplot(table(RcmdrTestDrive$race), main="", xlab="race", ylab="Frequency", legend.text=FALSE, col=NULL) 
-@
-\par\end{center}
-
-
 \begin{xca}
 Calculate the average \emph{salary} by the factor \emph{gender}. Do
 this with \textsf{Statistics} \textsf{$\triangleright$ Summaries}
@@ -3287,78 +3339,9 @@
 
 \end{enumerate}
 \end{xca}
+\noindent 
 
-\paragraph*{Solution:}
 
-We can generate a table listing the average salaries by gender with
-two methods. The first uses \inputencoding{latin9}\lstinline[showstringspaces=false]!tapply!\inputencoding{utf8}:
-
-<<keep.source = TRUE>>=
-x <- tapply(salary, list(gender = gender), mean)
-x
-@
-
-The second method uses the \inputencoding{latin9}\lstinline[showstringspaces=false]!by!\inputencoding{utf8}
-function:
-
-<<keep.source = TRUE>>=
-by(salary, gender, mean, na.rm = TRUE)
-@
-
-Now to answer the questions:
-\begin{enumerate}
-\item Which gender has the highest mean salary? 
-
-
-We can answer this by looking above. For these data, the gender with
-the highest mean salary is \Sexpr{names(x)[which(x==max(x))]}.
-
-\item Report the highest mean salary.
-
-
-Depending on our answer above, we would do something like \inputencoding{latin9}
-\begin{lstlisting}[showstringspaces=false]
-mean(salary[gender == Male])
-\end{lstlisting}
-\inputencoding{utf8} for example. For these data, the highest mean salary is 
-
-<<>>=
-x[which(x==max(x))]
-@
-
-\item Compare the spreads for the genders by calculating the standard deviation
-of \emph{salary} by \emph{gender}. Which gender has the biggest standard
-deviation?
-
-
-<<>>=
-y <- tapply(salary, list(gender = gender), sd)
-y
-@
-
-For these data, the the largest standard deviation is approximately
-\Sexpr{round(y[which(y==max(y))],2)} which was attained by the \Sexpr{names(y)[which(y==max(y))]}
-gender.
-
-\item Make boxplots of \emph{salary} by \emph{gender}. How does the boxplot
-compare to your answers to (1) and (3)?
-
-
-The graph is shown below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-boxplot(salary~gender, xlab="salary", ylab="gender", main="", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive) 
-@
-\par\end{center}
-
-Answers will vary. There should be some remarks that the center of
-the box is farther to the right for the \Sexpr{names(x)[which(x==max(x))]}
-gender, and some recognition that the box is wider for the \Sexpr{names(y)[which(y==max(y))]}
-gender.\end{enumerate}
-
-
-
 \begin{xca}
 For this problem we will study the variable \emph{reduction}.
 \begin{enumerate}
@@ -3382,46 +3365,8 @@
 \end{enumerate}
 \end{xca}
 
-\paragraph*{Answers:}
 
-<<echo = FALSE, results = hide>>=
-x = sort(reduction)
-@
 
-<<>>=
-x[137]
-IQR(x)
-fivenum(x)
-fivenum(x)[4] - fivenum(x)[2]
-@
-
-\noindent Compare your answers (3) and (5). Are they the same? If
-not, are they close?
-
-Yes, they are close, within \Sexpr{abs(IQR(x)-(fivenum(x)[4] - fivenum(x)[2]))}
-of each other.
-
-\noindent The boxplot of \emph{reduction} is below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4, width = 6>>=
-boxplot(reduction, xlab="reduction", main="", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive) 
-@
-\par\end{center}
-
-<<>>=
-temp <- fivenum(x)
-inF <- 1.5 * (temp[4] - temp[2]) + temp[4]
-outF <- 3 * (temp[4] - temp[2]) + temp[4]
-which(x > inF)
-which(x > outF)
-@
-
-Observations \Sexpr{which(x > inF)} would be considered potential
-outliers, while observation(s) \Sexpr{which(x > outF)} would be considered
-a suspected outlier.
-
-
 \begin{xca}
 In this problem we will compare the variables \emph{before} and \emph{after}.
 Don't forget \inputencoding{latin9}\lstinline[showstringspaces=false]!library(e1071)!\inputencoding{utf8}.
@@ -3444,146 +3389,6 @@
 \end{enumerate}
 \end{xca}
 
-\paragraph*{Solution:}
-\begin{enumerate}
-\item Examine the two measures of center for both variables that you found
-in problem 1. Judging from these measures, which variable has a higher
-center?
-
-
-We may take a look at the \inputencoding{latin9}\lstinline[showstringspaces=false]!summary(RcmdrTestDrive)!\inputencoding{utf8}
-output from Exercise \ref{xca:summary-RcmdrTestDrive}. Here we will
-repeat the relevant summary statistics.
-
-<<>>=
-c(mean(before), median(before))
-c(mean(after), median(after))
-@
-
-The idea is to look at the two measures and compare them to make a
-decision. In a nice world, both the mean and median of one variable
-will be larger than the other which sends a nice message. If We get
-a mixed message, then we should look for other information, such as
-extreme values in one of the variables, which is one of the reasons
-for the next part of the problem.
-
-\item Which measure of center is more appropriate for \emph{before}? (You
-may want to look at a boxplot.) Which measure of center is more appropriate
-for \emph{after}?
-
-
-The boxplot of \emph{before} is shown below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-boxplot(before, xlab="before", main="", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive) 
-@
-\par\end{center}
-
-We want to watch out for extreme values (shown as circles separated
-from the box) or large departures from symmetry. If the distribution
-is fairly symmetric then the mean and median should be approximately
-the same. But if the distribution is highly skewed with extreme values
-then we should be skeptical of the sample mean, and fall back to the
-median which is resistant to extremes. By design, the before variable
-is set up to have a fairly symmetric distribution.
-
-A boxplot of \emph{after} is shown next.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-boxplot(after, xlab="after", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive) 
-@
-\par\end{center}
-
-The same remarks apply to the \emph{after} variable. The \emph{after}
-variable has been designed to be left-skewed\ldots{} thus, the median
-would likely be a good choice for this variable.
-
-\item Based on your answer to (2), choose an appropriate measure of spread
-for each variable, calculate it, and report its value. Which variable
-has the biggest spread? (Note that you need to make sure that your
-measures are on the same scale.) 
-
-
-Since \emph{before} has a symmetric, mound shaped distribution, an
-excellent measure of center would be the sample standard deviation.
-And since \emph{after} is left-skewed, we should use the median absolute
-deviation. It is also acceptable to use the IQR, but we should rescale
-it appropriately, namely, by dividing by 1.349. The exact values are
-shown below.
-
-<<>>=
-sd(before)
-mad(after)
-IQR(after)/1.349
-@
-
-Judging from the values above, we would decide which variable has
-the higher spread. Look at how close the \inputencoding{latin9}\lstinline[showstringspaces=false]!mad!\inputencoding{utf8}
-and the \inputencoding{latin9}\lstinline[showstringspaces=false]!IQR!\inputencoding{utf8}
-(after suitable rescaling) are; it goes to show why the rescaling
-is important.
-
-\item Calculate and report the skewness and kurtosis for \emph{before}.
-Based on these values, how would you describe the shape of \emph{before}?
-
-
-The values of these descriptive measures are shown below.
-
-<<>>=
-library(e1071)
-skewness(before)
-kurtosis(before)
-@
-
-We should take the sample skewness value and compare it to $2\sqrt{6/n}\approx$\Sexpr{round(2*sqrt(6/length(before)),3)}
-in absolute value to see if it is substantially different from zero.
-The direction of skewness is decided by the sign (positive or negative)
-of the skewness value. 
-
-We should take the sample kurtosis value and compare it to $2\cdot\sqrt{24/168}\approx$\Sexpr{round(4*sqrt(6/length(before)),3)}),
-in absolute value to see if the excess kurtosis is substantially different
-from zero. And take a look at the sign to see whether the distribution
-is platykurtic or leptokurtic.
-
-\item Calculate and report the skewness and kurtosis for \emph{after}. Based
-on these values, how would you describe the shape of \emph{after}?
-
-
-The values of these descriptive measures are shown below.
-
-<<>>=
-skewness(after)
-kurtosis(after)
-@
-
-We should do for this one just like we did previously. We would again
-compare the sample skewness and kurtosis values (in absolute value)
-to \Sexpr{round(2*sqrt(6/length(after)),3)} and \Sexpr{round(4*sqrt(6/length(after)),3)},
-respectively.
-
-\item Plot histograms of \emph{before} and \emph{after} and compare them
-to your answers to (4) and (5).
-
-
-The graphs are shown below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-hist(before, xlab="before", data=RcmdrTestDrive) 
-@
-\par\end{center}
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-hist(after, xlab="after", data=RcmdrTestDrive) 
-@
-\par\end{center}
-
-Answers will vary. We are looking for visual consistency in the histograms
-to our statements above.\end{enumerate}
-
 \begin{xca}
 Describe the following data sets just as if you were communicating
 with an alien, but one who has had a statistics class. Mention the
@@ -5374,7 +5179,7 @@
 
 \begin{example}
 We saw the \inputencoding{latin9}\lstinline[showstringspaces=false]!RcmdrTestDrive!\inputencoding{utf8}
-data set in Chapter \ref{cha:An-Introduction-to-R} in which a two-way
+data set in Chapter \ref{cha:introduction-to-R} in which a two-way
 table of the smoking status versus the gender was
 
 <<echo = FALSE>>=
@@ -5979,8 +5784,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 <<echo = FALSE, results = hide>>=
@@ -5993,22 +5799,12 @@
 (\emph{Hint}: think about Pascal's triangle.)
 \end{xca}
 
-\paragraph*{Answer:}
 
-The events must satisfy the product equalities two at a time, of which
-there are ${n \choose 2}$, then they must satisfy an additional ${n \choose 3}$
-conditions three at a time, and so on, until they satisfy the ${n \choose n}=1$
-condition including all $n$ events. In total, there are \[
-{n \choose 2}+{n \choose 3}+\cdots+{n \choose n}=\sum_{k=0}^{n}{n \choose k}-\left[{n \choose 0}+{n \choose 1}\right]\]
-conditions to be satisfied, but the binomial series in the expression
-on the right is the sum of the entries of the $n$$^{\text{th}}$
-row of Pascal's triangle, which is $2^{n}$.
 
 
 
 
 
-
 \chapter{Discrete Distributions\label{cha:Discrete-Distributions}}
 
 In this chapter we introduce discrete random variables, those who
@@ -6020,7 +5816,7 @@
 generating functions.
 
 We give special attention to the empirical distribution since it plays
-such a fundamental role with respect to re sampling and Chapter \ref{cha:Resampling-Methods};
+such a fundamental role with respect to resampling and Chapter \ref{cha:resampling-methods};
 it will also be needed in Section \ref{sub:Kolmogorov-Smirnov-Goodness-of-Fit-Test}
 where we discuss the Kolmogorov-Smirnov test. Following this is a
 section in which we introduce a catalogue of discrete random variables
@@ -6776,7 +6572,7 @@
 usually used by itself in this form. More commonly it is
 used as an intermediate step in a more complicated calculation, for
 instance, in hypothesis testing (see Chapter \ref{cha:Hypothesis-Testing})
-or resampling (see Chapter \ref{cha:Resampling-Methods}). It is nevertheless
+or resampling (see Chapter \ref{cha:resampling-methods}). It is nevertheless
 instructive to see what the \inputencoding{latin9}\lstinline[showstringspaces=false]!ecdf!\inputencoding{utf8}
 looks like, and there is a special plot method for \inputencoding{latin9}\lstinline[showstringspaces=false]!ecdf!\inputencoding{utf8}
 objects.
@@ -6829,7 +6625,7 @@
 As we hinted above, the empirical distribution is significant more
 because of how and where it appears in more sophisticated applications.
 We will explore some of these in later chapters -- see, for instance,
-Chapter \ref{cha:Resampling-Methods}.
+Chapter \ref{cha:resampling-methods}.
 
 
 \section{Other Discrete Distributions\label{sec:other-discrete-distributions}}
@@ -7356,15 +7152,15 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
-\begin{enumerate}
-\item A recent national study showed that approximately 44.7\% of college
+\begin{xca}
+A recent national study showed that approximately 44.7\% of college
 students have used Wikipedia as a source in at least one of their
 term papers. Let $X$ equal the number of students in a random sample
 of size $n=31$ who have used Wikipedia as a source. 
-
 \begin{enumerate}
 \item How is $X$ distributed? \[
 X\sim\mathsf{binom}(\mathtt{size}=31,\,\mathtt{prob}=0.447)\]
@@ -7468,7 +7264,7 @@
 @
 
 \end{enumerate}
-\end{enumerate}
+\end{xca}
 <<echo = FALSE, results = hide>>=
 rnorm(1)
 @
@@ -8570,8 +8366,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 
@@ -9447,7 +9244,7 @@
 
 \end{prop}
 There are a few things to note about Proposition \ref{pro:mvnorm-cond-dist}
-which will be important in Chapter \ref{cha:Simple-Linear-Regression}.
+which will be important in Chapter \ref{cha:simple-linear-regression}.
 First, the conditional mean of $Y|x$ is linear in $x$, with slope\begin{equation}
 \rho\,\frac{\sigma_{Y}}{\sigma_{X}}.\label{eq:population-slope-slr}\end{equation}
 Second, the conditional variance of $Y|x$ is independent of $x$. 
@@ -9727,7 +9524,7 @@
 f_{\mathbf{X}}(\mathbf{x})=\frac{1}{(2\pi)^{n/2}\left|\Sigma\right|^{1/2}}\exp\left\{ -\frac{1}{2}\left(\mathbf{x}-\upmu\right)^{\top}\Sigma^{-1}\left(\mathbf{x}-\upmu\right)\right\} ,\end{equation}
 and the MGF is\begin{equation}
 M_{\mathbf{X}}(\mathbf{t})=\exp\left\{ \upmu^{\top}\mathbf{t}+\frac{1}{2}\mathbf{t}^{\top}\Sigma\mathbf{t}\right\} .\end{equation}
-We will need the following in Chapter \ref{cha:Multiple-Linear-Regression}.
+We will need the following in Chapter \ref{cha:multiple-linear-regression}.
 \begin{thm}
 \label{thm:mvnorm-dist-matrix-prod}If $\mathbf{X}\sim\mathsf{mvnorm}(\mathtt{mean}=\upmu,\,\mathtt{sigma}=\Sigma)$
 and $\mathbf{A}$ is any matrix, then the random vector $\mathbf{Y}=\mathbf{AX}$
@@ -9878,8 +9675,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 
@@ -10480,8 +10278,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 <<echo = FALSE, results = hide>>=
@@ -11609,8 +11408,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 \begin{xca}
 Let $X_{1}$, $X_{2}$, \ldots{}, $X_{n}$ be an $SRS(n)$ from a
@@ -12205,7 +12005,7 @@
 \item The equal variance assumption can be relaxed as long as both sample
 sizes $n$ and $m$ are large. However, if one (or both) samples is
 small, then the test does not perform well; we should instead use
-the methods of Chapter \ref{cha:Resampling-Methods}.
+the methods of Chapter \ref{cha:resampling-methods}.
 \end{itemize}
 \end{rem}
 For a nonparametric alternative to the two-sample $F$ test see Chapter
@@ -12403,12 +12203,13 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 
-\chapter{\label{cha:Simple-Linear-Regression}Simple Linear Regression}
+\chapter{Simple Linear Regression\label{cha:simple-linear-regression}}
 
 
 \paragraph*{What do I want them to know?}
@@ -12901,7 +12702,7 @@
 $b_{1}$ and $b_{0}$.
 
 To that end, we can see from Equation \ref{eq:regline-slope-formula}
-(and it is made clear in Chapter \ref{cha:Multiple-Linear-Regression})
+(and it is made clear in Chapter \ref{cha:multiple-linear-regression})
 that $b_{1}$ is just a linear combination of normally distributed
 random variables, so $b_{1}$ is normally distributed too. Further,
 it can be shown that\begin{equation}
@@ -12925,7 +12726,7 @@
 
 It is also sometimes of interest to construct a confidence interval
 for $\beta_{0}$ in which case we will need the sampling distribution
-of $b_{0}$. It is shown in Chapter \ref{cha:Multiple-Linear-Regression}
+of $b_{0}$. It is shown in Chapter \ref{cha:multiple-linear-regression}
 that\begin{equation}
 b_{0}\sim\mathsf{norm}\left(\mathtt{mean}=\beta_{0},\,\mathtt{sd}=\sigma_{b_{0}}\right),\end{equation}
 where $\sigma_{b_{0}}$ is given by\begin{equation}
@@ -13280,7 +13081,7 @@
 \inputencoding{latin9}\lstinline[showstringspaces=false]!summary(cars.lm)!\inputencoding{utf8}
 output where it was called {}``\inputencoding{latin9}\lstinline[breaklines=true,showstringspaces=false]!Multiple R-squared!\inputencoding{utf8}''.
 Listed right beside it is the \inputencoding{latin9}\lstinline[breaklines=true,showstringspaces=false]!Adjusted R-squared!\inputencoding{utf8}
-which we will discuss in Chapter \ref{cha:Multiple-Linear-Regression}.
+which we will discuss in Chapter \ref{cha:multiple-linear-regression}.
 
 For the \inputencoding{latin9}\lstinline[showstringspaces=false]!cars!\inputencoding{utf8}
 data, we find $r$ to be
@@ -13317,7 +13118,7 @@
 $t$ statistic and be done with it? The answer is that the $F$ statistic
 has a more complicated interpretation and plays a more important role
 in the multiple linear regression model which we will study in Chapter
-\ref{cha:Multiple-Linear-Regression}. See Section \ref{sub:mlr-Overall-F-Test}
+\ref{cha:multiple-linear-regression}. See Section \ref{sub:mlr-Overall-F-Test}
 for details.
 
 
@@ -13959,8 +13760,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 \begin{xca}
 Prove the ANOVA equality, Equation \ref{eq:anovaeq}. \emph{Hint}:
@@ -13986,7 +13788,7 @@
 
 \end{xca}
 
-\chapter{\label{cha:Multiple-Linear-Regression}Multiple Linear Regression}
+\chapter{Multiple Linear Regression\label{cha:multiple-linear-regression}}
 
 We know a lot about simple linear regression models, and a next step
 is to study multiple regression models that have more than one independent
@@ -14029,7 +13831,7 @@
 1 & x_{1n} & x_{2n} & \cdots & x_{pn}\end{bmatrix}.\end{equation}
 The vector $\mathbf{Y}$ is called the \emph{response vector\index{response vector}}
 and the matrix $\mathbf{X}$ is called the \emph{model matrix}\index{model matrix}.
-As in Chapter \ref{cha:Simple-Linear-Regression}, the most general
+As in Chapter \ref{cha:simple-linear-regression}, the most general
 assumption that relates $\mathbf{Y}$ to $\mathbf{X}$ is\begin{equation}
 \mathbf{Y}=\mu(\mathbf{X})+\upepsilon,\end{equation}
 where $\mu$ is some function (the \emph{signal}) and $\upepsilon$
@@ -15360,7 +15162,7 @@
 percentile are extreme. 
 \end{description}
 Note that plugging the value $p=1$ into the formulas will recover
-all of the ones we saw in Chapter \ref{cha:Simple-Linear-Regression}.
+all of the ones we saw in Chapter \ref{cha:simple-linear-regression}.
 
 
 \section{Additional Topics\label{sec:Additional-Topics-MLR}}
@@ -15503,7 +15305,7 @@
 \item What to do when data are not normal
 
 \begin{itemize}
-\item Bootstrap (see Chapter \ref{cha:Resampling-Methods}).
+\item Bootstrap (see Chapter \ref{cha:resampling-methods}).
 \end{itemize}
 \end{itemize}
 
@@ -15516,8 +15318,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 \begin{xca}
 \label{xca:anova-equality}Use Equations \ref{eq:mlr-sse-matrix},
@@ -15527,7 +15330,7 @@
 
 \end{xca}
 
-\chapter{Resampling Methods\label{cha:Resampling-Methods}}
+\chapter{Resampling Methods\label{cha:resampling-methods}}
 
 Computers have changed the face of statistics. Their quick computational
 speed and flawless accuracy, coupled with large data sets acquired
@@ -16139,8 +15942,9 @@
 \newpage{}
 
 
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
 
+\addcontentsline{toc}{section}{Chapter Exercises}
 \setcounter{thm}{0}
 
 
@@ -16187,8 +15991,517 @@
 
 \appendix
 
-\chapter{Data\label{cha:Data}}
+\chapter{\textsf{R} Session Information\label{cha:R-Session-Information}}
 
+If you ever write the \textsf{R} help mailing list with a question,
+then you should include your session information in the email; it
+makes the reader's job easier and is requested by the Posting Guide.
+Here is how to do that, and below is what the output looks like. 
+
+<<keep.source = TRUE>>=
+sessionInfo()
+@
+
+\vfill{}
+
+
+
+\chapter{GNU Free Documentation License\label{cha:GNU-Free-Documentation}}
+
+\begin{center}
+\textbf{\large Version 1.3, 3 November 2008}\bigskip{}
+
+\par\end{center}
+
+\noindent Copyright (C) 2000, 2001, 2002, 2007, 2008 Free Software
+Foundation, Inc.
+
+\begin{center}
+\url{http://fsf.org/}
+\par\end{center}
+
+\noindent Everyone is permitted to copy and distribute verbatim copies
+of this license document, but changing it is not allowed.
+
+
+\section*{0. PREAMBLE}
+
+The purpose of this License is to make a manual, textbook, or other
+functional and useful document \textquotedbl{}free\textquotedbl{}
+in the sense of freedom: to assure everyone the effective freedom
+to copy and redistribute it, with or without modifying it, either
+commercially or noncommercially. Secondarily, this License preserves
+for the author and publisher a way to get credit for their work, while
+not being considered responsible for modifications made by others.
+
+This License is a kind of \textquotedbl{}copyleft\textquotedbl{},
+which means that derivative works of the document must themselves
+be free in the same sense. It complements the GNU General Public License,
+which is a copyleft license designed for free software.
+
+We have designed this License in order to use it for manuals for free
+software, because free software needs free documentation: a free program
+should come with manuals providing the same freedoms that the software
+does. But this License is not limited to software manuals; it can
+be used for any textual work, regardless of subject matter or whether
+it is published as a printed book. We recommend this License principally
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/ipsur -r 170