[IPSUR-commits] r170 - pkg/IPSUR/inst/doc
noreply at r-forge.r-project.org
Sun Jan 31 00:12:36 CET 2010
Author: gkerns
Date: 2010-01-31 00:12:35 +0100 (Sun, 31 Jan 2010)
New Revision: 170
Added:
pkg/IPSUR/inst/doc/IPSURsolutions.Rnw
Modified:
pkg/IPSUR/inst/doc/IPSUR.Rnw
Log:
added main branch and separated answers/solutions
Modified: pkg/IPSUR/inst/doc/IPSUR.Rnw
===================================================================
--- pkg/IPSUR/inst/doc/IPSUR.Rnw 2010-01-29 19:49:01 UTC (rev 169)
+++ pkg/IPSUR/inst/doc/IPSUR.Rnw 2010-01-30 23:12:35 UTC (rev 170)
@@ -175,6 +175,13 @@
morestring=[b]"
}
+% Turn on questions and answers
+\newcommand{\question}[1]{#1}
+\newcommand{\answer}[1]{#1}
+% Turn off questions and answers
+%\newcommand{\question}[1]{}
+%\newcommand{\answer}[1]{}
+
\@ifundefined{showcaptionsetup}{}{%
\PassOptionsToPackage{caption=false}{subfig}}
\usepackage{subfig}
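The \question and \answer definitions added in this hunk act as a switch: with the one-argument identity definitions the wrapped text is typeset, and with the empty definitions it is dropped. A minimal sketch of how an exercise might use them is given here. The exercise wording and the choice to wrap the solution in \answer are illustrative assumptions, not taken from the committed files.

```latex
% Hypothetical exercise body; only \question and \answer come from the preamble.
\begin{xca}
\question{Find the sample mean of the \texttt{rivers} data.}
\answer{\paragraph*{Answer:} Use \texttt{mean(rivers)}.}
\end{xca}
```

Verbatim material, such as the output of a Sweave chunk, generally cannot be passed as a macro argument, so this sketch keeps the answer to plain text.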
@@ -415,7 +422,7 @@
\tableofcontents{}
-\cleardoublepage
+\noindent \cleardoublepage
\phantomsection
\addcontentsline{toc}{chapter}{Preface}
@@ -729,10 +736,10 @@
\pagenumbering{arabic}
-This chapter has proved to be the hardest to write, by far. The trouble
-is that there is so much to say -- and so many people have already
-said it so much better than I could. When I get something I like I
-will release it here.
+\noindent This chapter has proved to be the hardest to write, by far.
+The trouble is that there is so much to say -- and so many people
+have already said it so much better than I could. When I get something
+I like I will release it here.
In the meantime, there is a lot of information already available to
a person with an Internet connection. I recommend starting at Wikipedia,
@@ -781,14 +788,22 @@
This book is devoted mostly to the frequentist viewpoint because that
is how I was trained, with the conspicuous exception of Sections \ref{sec:Bayes'-Rule}
and \ref{sec:Conditional-Distributions}. I plan to add more Bayesian
-material in later editions of this book.
+material in later editions of this book.
+\pagebreak{}
-\chapter{An Introduction to \textsf{R\label{cha:An-Introduction-to-R}}}
+\section*{Chapter Exercises}
-\section{Downloading and Installing \textsf{R\label{sec:Downloading-and-Installing-R}}}
+\addcontentsline{toc}{section}{Chapter Exercises}
+\setcounter{thm}{0}
+
+\chapter{An Introduction to \textsf{R\label{cha:introduction-to-R}}}
+
+
+\section{Downloading and Installing \textsf{R\label{sec:download-install-R}}}
+
The instructions for obtaining \textsf{R} largely depend on the user's
hardware and operating system. The \textsf{R} Project has written
an \textsf{R} Installation and Administration manual with complete,
@@ -806,8 +821,8 @@
\item [{MacOS:}] \url{http://cran.r-project.org/bin/macosx/}
\item [{Linux:}] \url{http://cran.r-project.org/bin/linux/}
\end{description}
-On MS-Windows, click the \inputencoding{latin9}\lstinline[showstringspaces=false]!.exe!\inputencoding{utf8}
-program file to start installation. When it asks for \textquotedbl{}Customized
+On Microsoft Windows, click the \inputencoding{latin9}\lstinline[showstringspaces=false]!R-x.y.z.exe!\inputencoding{utf8}
+installer to start installation. When it asks for \textquotedbl{}Customized
startup options\textquotedbl{}, specify \textsf{Yes}. In the next
window, be sure to select the SDI (single document interface) option;
this is useful later when we discuss three dimensional plots with
@@ -854,7 +869,7 @@
will no longer be pointing to the right place.
-\subsection{Installing and Loading Add-on Packages\label{sub:Installing-and-Loading-packages}}
+\subsection{Installing and Loading Add-on Packages\label{sub:installing-loading-packages}}
There are \emph{base} packages (which come with \textsf{R} automatically),
and \emph{contributed} packages (which must be downloaded for installation).
@@ -1524,8 +1539,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
@@ -2837,7 +2853,7 @@
out to be 36.6).
-\subsection{Hinges and the Five Number Summary\label{sub:Hinges-and-the} }
+\subsection{Hinges and the Five Number Summary\label{sub:hinges-and-5NS} }
Given a data set $x_{1}$, $x_{2}$, \ldots{}, $x_{n}$, the hinges
are found by the following method:
@@ -2865,7 +2881,7 @@
function.
-\subsection{Boxplots\label{sub:Boxplots} }
+\subsection{Boxplots\label{sub:boxplots} }
A boxplot is essentially a graphical representation of the $5NS$.
It can be a handy alternative to a stripchart when the sample size
@@ -2931,11 +2947,41 @@
\subsection{How to do it with \textsf{R}}
+The quickest way to visually identify outliers is with a boxplot,
+described above. Another way is with the \inputencoding{latin9}\lstinline[showstringspaces=false]!boxplot.stats!\inputencoding{utf8}
+function.
+\begin{example}
+The \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}
+data. We will look for potential outliers in the \inputencoding{latin9}\lstinline[showstringspaces=false]!rivers!\inputencoding{utf8}
+data.
+<<>>=
+boxplot.stats(rivers)$out
+@
+
+We may change the \inputencoding{latin9}\lstinline[showstringspaces=false]!coef!\inputencoding{utf8}
+argument to 3 (it is 1.5 by default) to identify suspected outliers.
+
+<<>>=
+boxplot.stats(rivers, coef = 3)$out
+@
+
+\end{example}
+
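To make the role of coef concrete, the outliers reported by boxplot.stats can be reproduced by hand from the hinges. The following is a minimal sketch; the helper name flag_outliers is hypothetical and is introduced only for illustration.

```r
# Reproduce boxplot.stats(x, coef = k)$out by hand.
# `flag_outliers` is a hypothetical helper name used only for this sketch.
flag_outliers <- function(x, coef = 1.5) {
  h <- fivenum(x)                 # min, lower hinge, median, upper hinge, max
  spread <- h[4] - h[2]           # distance between the hinges
  lower <- h[2] - coef * spread   # lower fence
  upper <- h[4] + coef * spread   # upper fence
  x[x < lower | x > upper]        # values beyond either fence
}

potential <- flag_outliers(rivers)            # coef = 1.5
suspected <- flag_outliers(rivers, coef = 3)  # coef = 3

# Both comparisons should be TRUE
identical(potential, boxplot.stats(rivers)$out)
identical(suspected, boxplot.stats(rivers, coef = 3)$out)
```

Every suspected outlier is also a potential outlier, since the fences for coef = 3 lie outside those for coef = 1.5.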
\subsection{Standardizing variables}
It is sometimes useful to compare data sets with each other on a scale
-that is independent of the measurement units. The \inputencoding{latin9}\lstinline[showstringspaces=false]!scale!\inputencoding{utf8}
+that is independent of the measurement units. Given a set of observed
+data $x_{1}$, $x_{2}$, \ldots{}, $x_{n}$ we get $z$ scores, denoted
+$z_{1}$, $z_{2}$, \ldots{}, $z_{n}$, by means of the following
+formula\[
+z_{i}=\frac{x_{i}-\xbar}{s},\quad i=1,\,2,\,\ldots,\, n.\]
+
+
+
+\subsection{How to do it with \textsf{R}}
+
+The \inputencoding{latin9}\lstinline[showstringspaces=false]!scale!\inputencoding{utf8}
function will rescale a numeric vector (or data frame) by subtracting
the sample mean from each value (column) and/or by dividing each observation
by the sample standard deviation.
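As a quick check that scale matches the $z$ score formula, the sketch below compares the two on a small made-up vector (the data values are an assumption chosen only for illustration).

```r
# Small made-up data vector, used only for illustration
x <- c(2, 4, 4, 5, 7, 9)

# z-scores from the formula: subtract the sample mean, divide by the sample sd
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix with attributes; as.vector() drops them
z_scale <- as.vector(scale(x))

all.equal(z_manual, z_scale)   # the two agree
mean(z_scale)                  # approximately 0
sd(z_scale)                    # 1 up to rounding

# Centering only: skip the division by the standard deviation
as.vector(scale(x, scale = FALSE))
```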
@@ -2955,10 +3001,10 @@
the measured information in a rectangular array in which each row
corresponds to a subject, and the columns contain the measurements
for each respective variable. For instance, if one were to measure
-the height and weight of each of 11 persons in a research study, the
-information could be represented with a rectangular array. There would
-be 11 rows. Each row would have the person's height in the first column
-and weight in the second column.
+the height, weight, and hair color of each of 11 persons in a research
+study, the information could be represented with a rectangular array.
+There would be 11 rows. Each row would have the person's height in
+the first column, weight in the second column, and hair color in the
+third column.
The corresponding objects in \textsf{R} are called \emph{data frames},
and they can be constructed with the \inputencoding{latin9}\lstinline[showstringspaces=false]!data.frame!\inputencoding{utf8}
@@ -2967,14 +3013,15 @@
Suppose we have two vectors \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
and \inputencoding{latin9}\lstinline[showstringspaces=false]!y!\inputencoding{utf8}
and we want to make a data frame out of them.
-\end{example}
+
<<>>=
x <- 5:8
y <- letters[3:6]
-data.frame(x,y)
+A <- data.frame(v1 = x, v2 = y)
@
-Notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
+\end{example}
+Notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
and \inputencoding{latin9}\lstinline[showstringspaces=false]!y!\inputencoding{utf8}
are the same length. This is \emph{necessary}. Also notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!x!\inputencoding{utf8}
is a numeric vector and \inputencoding{latin9}\lstinline[showstringspaces=false]!y!\inputencoding{utf8}
@@ -2986,7 +3033,36 @@
(numeric) and \inputencoding{latin9}\lstinline[showstringspaces=false]!gender!\inputencoding{utf8}
(character or factor) information in the same column.
+Indexing of data frames is similar to indexing of vectors. To get
+the entry in row $i$ and column $j$ do \inputencoding{latin9}\lstinline[showstringspaces=false]!A[i,j]!\inputencoding{utf8}.
+We can get entire rows and columns by omitting the other index.
+<<>>=
+A[3,]
+A[ ,1]
+A[ ,2]
+@
+
+There are several things happening above. Notice that \inputencoding{latin9}\lstinline[showstringspaces=false]!A[3,]!\inputencoding{utf8}
+gave a data frame (with the same entries as the third row of \inputencoding{latin9}\lstinline[showstringspaces=false]!A!\inputencoding{utf8})
+yet \inputencoding{latin9}\lstinline[showstringspaces=false]!A[ ,1]!\inputencoding{utf8}
+is a numeric vector. \inputencoding{latin9}\lstinline[showstringspaces=false]!A[ ,2]!\inputencoding{utf8}
+is a factor vector because the default setting for \inputencoding{latin9}\lstinline[showstringspaces=false]!data.frame!\inputencoding{utf8}
+is \inputencoding{latin9}\lstinline[showstringspaces=false]!stringsAsFactors = TRUE!\inputencoding{utf8}.
+
+Data frames have a \inputencoding{latin9}\lstinline[showstringspaces=false]!names!\inputencoding{utf8}
+attribute and the names may be extracted with the \inputencoding{latin9}\lstinline[showstringspaces=false]!names!\inputencoding{utf8}
+function. Once we have the names we may extract given columns by way
+of the dollar sign.
+
+<<>>=
+names(A)
+A$v1
+@
+
+The above is identical to \inputencoding{latin9}\lstinline[showstringspaces=false]!A[ ,1]!\inputencoding{utf8}.
+
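The equivalence between the dollar sign and bracket indexing can be checked directly. This is a minimal sketch that rebuilds a small data frame so that it runs on its own; the drop = FALSE variant is included as a related point that the text does not cover.

```r
# Rebuild the small data frame from the example so the sketch is self-contained
A <- data.frame(v1 = 5:8, v2 = letters[3:6])

# Three ways to pull out the first column as a plain vector
identical(A$v1, A[ , 1])
identical(A$v1, A[["v1"]])

# A single row is still a data frame, because its columns may differ in type
is.data.frame(A[3, ])

# drop = FALSE keeps a single column as a one-column data frame
is.data.frame(A[ , 1, drop = FALSE])
```

Each of the four expressions should evaluate to TRUE.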
+
\subsection{Bivariate Data\label{sub:Bivariate-Data}}
\begin{itemize}
\item Introduce the sample correlation coefficient.
@@ -3005,7 +3081,7 @@
or in \textsf{R} Commander by following \textsf{Statistics} \textsf{$\triangleright$}
\textsf{Contingency Tables} \textsf{$\triangleright$} \textsf{Multi-way
Tables}.
-\item Scatterplot Matrix. used for displaying pairwise scatterplots simultaneously.
+\item Scatterplot matrix. Used for displaying pairwise scatterplots simultaneously.
Again, look for linear association and correlation.
\item 3D Scatterplot. See Figure \pageref{fig:3D-scatterplot-trees}
\item \inputencoding{latin9}\lstinline[showstringspaces=false]!plot(state.region, state.division)!\inputencoding{utf8}
@@ -3179,8 +3255,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
@@ -3221,11 +3298,6 @@
summary statistics for each variable.
-\paragraph*{Answers:}
-
-<<"Find summary statistics">>=
-summary(RcmdrTestDrive)
-@
\end{xca}
\begin{xca}
@@ -3240,28 +3312,8 @@
\end{enumerate}
\end{xca}
-\paragraph*{Solution:}
-First we will make a table of the \emph{race} variable with the \inputencoding{latin9}\lstinline[showstringspaces=false]!table!\inputencoding{utf8}
-function.
-<<>>=
-table(race)
-@
-\begin{enumerate}
-\item For these data, \Sexpr{names(table(race))[which(table(race)==max(table(race)))]}
-has the highest frequency.
-\item For these data, \Sexpr{names(table(race))[which(table(race)==min(table(race)))]}
-has the lowest frequency.
-\item The graph is shown below.
-\end{enumerate}
-\begin{center}
-<<echo = FALSE, fig=true, height = 4, width = 6>>=
-barplot(table(RcmdrTestDrive$race), main="", xlab="race", ylab="Frequency", legend.text=FALSE, col=NULL)
-@
-\par\end{center}
-
-
\begin{xca}
Calculate the average \emph{salary} by the factor \emph{gender}. Do
this with \textsf{Statistics} \textsf{$\triangleright$ Summaries}
@@ -3287,78 +3339,9 @@
\end{enumerate}
\end{xca}
+\noindent
-\paragraph*{Solution:}
-We can generate a table listing the average salaries by gender with
-two methods. The first uses \inputencoding{latin9}\lstinline[showstringspaces=false]!tapply!\inputencoding{utf8}:
-
-<<keep.source = TRUE>>=
-x <- tapply(salary, list(gender = gender), mean)
-x
-@
-
-The second method uses the \inputencoding{latin9}\lstinline[showstringspaces=false]!by!\inputencoding{utf8}
-function:
-
-<<keep.source = TRUE>>=
-by(salary, gender, mean, na.rm = TRUE)
-@
-
-Now to answer the questions:
-\begin{enumerate}
-\item Which gender has the highest mean salary?
-
-
-We can answer this by looking above. For these data, the gender with
-the highest mean salary is \Sexpr{names(x)[which(x==max(x))]}.
-
-\item Report the highest mean salary.
-
-
-Depending on our answer above, we would do something like \inputencoding{latin9}
-\begin{lstlisting}[showstringspaces=false]
-mean(salary[gender == Male])
-\end{lstlisting}
-\inputencoding{utf8} for example. For these data, the highest mean salary is
-
-<<>>=
-x[which(x==max(x))]
-@
-
-\item Compare the spreads for the genders by calculating the standard deviation
-of \emph{salary} by \emph{gender}. Which gender has the biggest standard
-deviation?
-
-
-<<>>=
-y <- tapply(salary, list(gender = gender), sd)
-y
-@
-
-For these data, the largest standard deviation is approximately
-\Sexpr{round(y[which(y==max(y))],2)} which was attained by the \Sexpr{names(y)[which(y==max(y))]}
-gender.
-
-\item Make boxplots of \emph{salary} by \emph{gender}. How does the boxplot
-compare to your answers to (1) and (3)?
-
-
-The graph is shown below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-boxplot(salary~gender, xlab="salary", ylab="gender", main="", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive)
-@
-\par\end{center}
-
-Answers will vary. There should be some remarks that the center of
-the box is farther to the right for the \Sexpr{names(x)[which(x==max(x))]}
-gender, and some recognition that the box is wider for the \Sexpr{names(y)[which(y==max(y))]}
-gender.\end{enumerate}
-
-
-
\begin{xca}
For this problem we will study the variable \emph{reduction}.
\begin{enumerate}
@@ -3382,46 +3365,8 @@
\end{enumerate}
\end{xca}
-\paragraph*{Answers:}
-<<echo = FALSE, results = hide>>=
-x = sort(reduction)
-@
-<<>>=
-x[137]
-IQR(x)
-fivenum(x)
-fivenum(x)[4] - fivenum(x)[2]
-@
-
-\noindent Compare your answers (3) and (5). Are they the same? If
-not, are they close?
-
-Yes, they are close, within \Sexpr{abs(IQR(x)-(fivenum(x)[4] - fivenum(x)[2]))}
-of each other.
-
-\noindent The boxplot of \emph{reduction} is below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4, width = 6>>=
-boxplot(reduction, xlab="reduction", main="", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive)
-@
-\par\end{center}
-
-<<>>=
-temp <- fivenum(x)
-inF <- 1.5 * (temp[4] - temp[2]) + temp[4]
-outF <- 3 * (temp[4] - temp[2]) + temp[4]
-which(x > inF)
-which(x > outF)
-@
-
-Observations \Sexpr{which(x > inF)} would be considered potential
-outliers, while observation(s) \Sexpr{which(x > outF)} would be considered
-a suspected outlier.
-
-
\begin{xca}
In this problem we will compare the variables \emph{before} and \emph{after}.
Don't forget \inputencoding{latin9}\lstinline[showstringspaces=false]!library(e1071)!\inputencoding{utf8}.
@@ -3444,146 +3389,6 @@
\end{enumerate}
\end{xca}
-\paragraph*{Solution:}
-\begin{enumerate}
-\item Examine the two measures of center for both variables that you found
-in problem 1. Judging from these measures, which variable has a higher
-center?
-
-
-We may take a look at the \inputencoding{latin9}\lstinline[showstringspaces=false]!summary(RcmdrTestDrive)!\inputencoding{utf8}
-output from Exercise \ref{xca:summary-RcmdrTestDrive}. Here we will
-repeat the relevant summary statistics.
-
-<<>>=
-c(mean(before), median(before))
-c(mean(after), median(after))
-@
-
-The idea is to look at the two measures and compare them to make a
-decision. In a nice world, both the mean and median of one variable
-will be larger than the other which sends a nice message. If we get
-a mixed message, then we should look for other information, such as
-extreme values in one of the variables, which is one of the reasons
-for the next part of the problem.
-
-\item Which measure of center is more appropriate for \emph{before}? (You
-may want to look at a boxplot.) Which measure of center is more appropriate
-for \emph{after}?
-
-
-The boxplot of \emph{before} is shown below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-boxplot(before, xlab="before", main="", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive)
-@
-\par\end{center}
-
-We want to watch out for extreme values (shown as circles separated
-from the box) or large departures from symmetry. If the distribution
-is fairly symmetric then the mean and median should be approximately
-the same. But if the distribution is highly skewed with extreme values
-then we should be skeptical of the sample mean, and fall back to the
-median which is resistant to extremes. By design, the before variable
-is set up to have a fairly symmetric distribution.
-
-A boxplot of \emph{after} is shown next.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-boxplot(after, xlab="after", notch=FALSE, varwidth=TRUE, horizontal=TRUE, data=RcmdrTestDrive)
-@
-\par\end{center}
-
-The same remarks apply to the \emph{after} variable. The \emph{after}
-variable has been designed to be left-skewed\ldots{} thus, the median
-would likely be a good choice for this variable.
-
-\item Based on your answer to (2), choose an appropriate measure of spread
-for each variable, calculate it, and report its value. Which variable
-has the biggest spread? (Note that you need to make sure that your
-measures are on the same scale.)
-
-
-Since \emph{before} has a symmetric, mound shaped distribution, an
-excellent measure of spread would be the sample standard deviation.
-And since \emph{after} is left-skewed, we should use the median absolute
-deviation. It is also acceptable to use the IQR, but we should rescale
-it appropriately, namely, by dividing by 1.349. The exact values are
-shown below.
-
-<<>>=
-sd(before)
-mad(after)
-IQR(after)/1.349
-@
-
-Judging from the values above, we would decide which variable has
-the higher spread. Look at how close the \inputencoding{latin9}\lstinline[showstringspaces=false]!mad!\inputencoding{utf8}
-and the \inputencoding{latin9}\lstinline[showstringspaces=false]!IQR!\inputencoding{utf8}
-(after suitable rescaling) are; it goes to show why the rescaling
-is important.
-
-\item Calculate and report the skewness and kurtosis for \emph{before}.
-Based on these values, how would you describe the shape of \emph{before}?
-
-
-The values of these descriptive measures are shown below.
-
-<<>>=
-library(e1071)
-skewness(before)
-kurtosis(before)
-@
-
-We should take the sample skewness value and compare it to $2\sqrt{6/n}\approx$\Sexpr{round(2*sqrt(6/length(before)),3)}
-in absolute value to see if it is substantially different from zero.
-The direction of skewness is decided by the sign (positive or negative)
-of the skewness value.
-
-We should take the sample kurtosis value and compare it to $2\cdot\sqrt{24/168}\approx$\Sexpr{round(4*sqrt(6/length(before)),3)},
-in absolute value to see if the excess kurtosis is substantially different
-from zero. And take a look at the sign to see whether the distribution
-is platykurtic or leptokurtic.
-
-\item Calculate and report the skewness and kurtosis for \emph{after}. Based
-on these values, how would you describe the shape of \emph{after}?
-
-
-The values of these descriptive measures are shown below.
-
-<<>>=
-skewness(after)
-kurtosis(after)
-@
-
-We should do for this one just like we did previously. We would again
-compare the sample skewness and kurtosis values (in absolute value)
-to \Sexpr{round(2*sqrt(6/length(after)),3)} and \Sexpr{round(4*sqrt(6/length(after)),3)},
-respectively.
-
-\item Plot histograms of \emph{before} and \emph{after} and compare them
-to your answers to (4) and (5).
-
-
-The graphs are shown below.
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-hist(before, xlab="before", data=RcmdrTestDrive)
-@
-\par\end{center}
-
-\begin{center}
-<<echo = FALSE, fig=true, height = 4.5, width = 6>>=
-hist(after, xlab="after", data=RcmdrTestDrive)
-@
-\par\end{center}
-
-Answers will vary. We are looking for visual consistency in the histograms
-to our statements above.\end{enumerate}
-
\begin{xca}
Describe the following data sets just as if you were communicating
with an alien, but one who has had a statistics class. Mention the
@@ -5374,7 +5179,7 @@
\begin{example}
We saw the \inputencoding{latin9}\lstinline[showstringspaces=false]!RcmdrTestDrive!\inputencoding{utf8}
-data set in Chapter \ref{cha:An-Introduction-to-R} in which a two-way
+data set in Chapter \ref{cha:introduction-to-R} in which a two-way
table of the smoking status versus the gender was
<<echo = FALSE>>=
@@ -5979,8 +5784,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
<<echo = FALSE, results = hide>>=
@@ -5993,22 +5799,12 @@
(\emph{Hint}: think about Pascal's triangle.)
\end{xca}
-\paragraph*{Answer:}
-The events must satisfy the product equalities two at a time, of which
-there are ${n \choose 2}$, then they must satisfy an additional ${n \choose 3}$
-conditions three at a time, and so on, until they satisfy the ${n \choose n}=1$
-condition including all $n$ events. In total, there are \[
-{n \choose 2}+{n \choose 3}+\cdots+{n \choose n}=\sum_{k=0}^{n}{n \choose k}-\left[{n \choose 0}+{n \choose 1}\right]\]
-conditions to be satisfied, but the binomial series in the expression
-on the right is the sum of the entries of the $n$$^{\text{th}}$
-row of Pascal's triangle, which is $2^{n}$.
-
\chapter{Discrete Distributions\label{cha:Discrete-Distributions}}
In this chapter we introduce discrete random variables, those who
@@ -6020,7 +5816,7 @@
generating functions.
We give special attention to the empirical distribution since it plays
-such a fundamental role with respect to re sampling and Chapter \ref{cha:Resampling-Methods};
+such a fundamental role with respect to resampling and Chapter \ref{cha:resampling-methods};
it will also be needed in Section \ref{sub:Kolmogorov-Smirnov-Goodness-of-Fit-Test}
where we discuss the Kolmogorov-Smirnov test. Following this is a
section in which we introduce a catalogue of discrete random variables
@@ -6776,7 +6572,7 @@
usually used by itself in this form. More commonly it is
used as an intermediate step in a more complicated calculation, for
instance, in hypothesis testing (see Chapter \ref{cha:Hypothesis-Testing})
-or resampling (see Chapter \ref{cha:Resampling-Methods}). It is nevertheless
+or resampling (see Chapter \ref{cha:resampling-methods}). It is nevertheless
instructive to see what the \inputencoding{latin9}\lstinline[showstringspaces=false]!ecdf!\inputencoding{utf8}
looks like, and there is a special plot method for \inputencoding{latin9}\lstinline[showstringspaces=false]!ecdf!\inputencoding{utf8}
objects.
@@ -6829,7 +6625,7 @@
As we hinted above, the empirical distribution is significant more
because of how and where it appears in more sophisticated applications.
We will explore some of these in later chapters -- see, for instance,
-Chapter \ref{cha:Resampling-Methods}.
+Chapter \ref{cha:resampling-methods}.
\section{Other Discrete Distributions\label{sec:other-discrete-distributions}}
@@ -7356,15 +7152,15 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
-\begin{enumerate}
-\item A recent national study showed that approximately 44.7\% of college
+\begin{xca}
+A recent national study showed that approximately 44.7\% of college
students have used Wikipedia as a source in at least one of their
term papers. Let $X$ equal the number of students in a random sample
of size $n=31$ who have used Wikipedia as a source.
-
\begin{enumerate}
\item How is $X$ distributed? \[
X\sim\mathsf{binom}(\mathtt{size}=31,\,\mathtt{prob}=0.447)\]
@@ -7468,7 +7264,7 @@
@
\end{enumerate}
-\end{enumerate}
+\end{xca}
<<echo = FALSE, results = hide>>=
rnorm(1)
@
@@ -8570,8 +8366,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
@@ -9447,7 +9244,7 @@
\end{prop}
There are a few things to note about Proposition \ref{pro:mvnorm-cond-dist}
-which will be important in Chapter \ref{cha:Simple-Linear-Regression}.
+which will be important in Chapter \ref{cha:simple-linear-regression}.
First, the conditional mean of $Y|x$ is linear in $x$, with slope\begin{equation}
\rho\,\frac{\sigma_{Y}}{\sigma_{X}}.\label{eq:population-slope-slr}\end{equation}
Second, the conditional variance of $Y|x$ is independent of $x$.
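A quick simulation can make the slope claim concrete. The sketch below assumes illustrative parameter values (none of them come from the text) and builds a bivariate normal pair from two independent standard normals; the least-squares slope of y on x should then land near $\rho\,\sigma_{Y}/\sigma_{X}$.

```r
set.seed(1)                       # arbitrary seed, for reproducibility
n <- 1e5
mu_x <- 1; mu_y <- -2             # illustrative means
sigma_x <- 2; sigma_y <- 3        # illustrative standard deviations
rho <- 0.6                        # illustrative correlation

z1 <- rnorm(n); z2 <- rnorm(n)    # independent standard normals
x <- mu_x + sigma_x * z1
y <- mu_y + sigma_y * (rho * z1 + sqrt(1 - rho^2) * z2)

fit <- lm(y ~ x)
slope_theory <- rho * sigma_y / sigma_x   # 0.9 for these values
c(theory = slope_theory, fitted = unname(coef(fit)["x"]))

# Residual spread should be close to sigma_y * sqrt(1 - rho^2), whatever x is
c(theory = sigma_y * sqrt(1 - rho^2), fitted = summary(fit)$sigma)
```

The fitted values will differ slightly from the theoretical ones because of sampling variability.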
@@ -9727,7 +9524,7 @@
f_{\mathbf{X}}(\mathbf{x})=\frac{1}{(2\pi)^{n/2}\left|\Sigma\right|^{1/2}}\exp\left\{ -\frac{1}{2}\left(\mathbf{x}-\upmu\right)^{\top}\Sigma^{-1}\left(\mathbf{x}-\upmu\right)\right\} ,\end{equation}
and the MGF is\begin{equation}
M_{\mathbf{X}}(\mathbf{t})=\exp\left\{ \upmu^{\top}\mathbf{t}+\frac{1}{2}\mathbf{t}^{\top}\Sigma\mathbf{t}\right\} .\end{equation}
-We will need the following in Chapter \ref{cha:Multiple-Linear-Regression}.
+We will need the following in Chapter \ref{cha:multiple-linear-regression}.
\begin{thm}
\label{thm:mvnorm-dist-matrix-prod}If $\mathbf{X}\sim\mathsf{mvnorm}(\mathtt{mean}=\upmu,\,\mathtt{sigma}=\Sigma)$
and $\mathbf{A}$ is any matrix, then the random vector $\mathbf{Y}=\mathbf{AX}$
@@ -9878,8 +9675,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
@@ -10480,8 +10278,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
<<echo = FALSE, results = hide>>=
@@ -11609,8 +11408,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
\begin{xca}
Let $X_{1}$, $X_{2}$, \ldots{}, $X_{n}$ be an $SRS(n)$ from a
@@ -12205,7 +12005,7 @@
\item The equal variance assumption can be relaxed as long as both sample
sizes $n$ and $m$ are large. However, if one (or both) samples is
small, then the test does not perform well; we should instead use
-the methods of Chapter \ref{cha:Resampling-Methods}.
+the methods of Chapter \ref{cha:resampling-methods}.
\end{itemize}
\end{rem}
For a nonparametric alternative to the two-sample $F$ test see Chapter
@@ -12403,12 +12203,13 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
-\chapter{\label{cha:Simple-Linear-Regression}Simple Linear Regression}
+\chapter{Simple Linear Regression\label{cha:simple-linear-regression}}
\paragraph*{What do I want them to know?}
@@ -12901,7 +12702,7 @@
$b_{1}$ and $b_{0}$.
To that end, we can see from Equation \ref{eq:regline-slope-formula}
-(and it is made clear in Chapter \ref{cha:Multiple-Linear-Regression})
+(and it is made clear in Chapter \ref{cha:multiple-linear-regression})
that $b_{1}$ is just a linear combination of normally distributed
random variables, so $b_{1}$ is normally distributed too. Further,
it can be shown that\begin{equation}
@@ -12925,7 +12726,7 @@
It is also sometimes of interest to construct a confidence interval
for $\beta_{0}$ in which case we will need the sampling distribution
-of $b_{0}$. It is shown in Chapter \ref{cha:Multiple-Linear-Regression}
+of $b_{0}$. It is shown in Chapter \ref{cha:multiple-linear-regression}
that\begin{equation}
b_{0}\sim\mathsf{norm}\left(\mathtt{mean}=\beta_{0},\,\mathtt{sd}=\sigma_{b_{0}}\right),\end{equation}
where $\sigma_{b_{0}}$ is given by\begin{equation}
@@ -13280,7 +13081,7 @@
\inputencoding{latin9}\lstinline[showstringspaces=false]!summary(cars.lm)!\inputencoding{utf8}
output where it was called {}``\inputencoding{latin9}\lstinline[breaklines=true,showstringspaces=false]!Multiple R-squared!\inputencoding{utf8}''.
Listed right beside it is the \inputencoding{latin9}\lstinline[breaklines=true,showstringspaces=false]!Adjusted R-squared!\inputencoding{utf8}
-which we will discuss in Chapter \ref{cha:Multiple-Linear-Regression}.
+which we will discuss in Chapter \ref{cha:multiple-linear-regression}.
For the \inputencoding{latin9}\lstinline[showstringspaces=false]!cars!\inputencoding{utf8}
data, we find $r$ to be
@@ -13317,7 +13118,7 @@
$t$ statistic and be done with it? The answer is that the $F$ statistic
has a more complicated interpretation and plays a more important role
in the multiple linear regression model which we will study in Chapter
-\ref{cha:Multiple-Linear-Regression}. See Section \ref{sub:mlr-Overall-F-Test}
+\ref{cha:multiple-linear-regression}. See Section \ref{sub:mlr-Overall-F-Test}
for details.
@@ -13959,8 +13760,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
\begin{xca}
Prove the ANOVA equality, Equation \ref{eq:anovaeq}. \emph{Hint}:
@@ -13986,7 +13788,7 @@
\end{xca}
-\chapter{\label{cha:Multiple-Linear-Regression}Multiple Linear Regression}
+\chapter{Multiple Linear Regression\label{cha:multiple-linear-regression}}
We know a lot about simple linear regression models, and a next step
is to study multiple regression models that have more than one independent
@@ -14029,7 +13831,7 @@
1 & x_{1n} & x_{2n} & \cdots & x_{pn}\end{bmatrix}.\end{equation}
The vector $\mathbf{Y}$ is called the \emph{response vector\index{response vector}}
and the matrix $\mathbf{X}$ is called the \emph{model matrix}\index{model matrix}.
-As in Chapter \ref{cha:Simple-Linear-Regression}, the most general
+As in Chapter \ref{cha:simple-linear-regression}, the most general
assumption that relates $\mathbf{Y}$ to $\mathbf{X}$ is\begin{equation}
\mathbf{Y}=\mu(\mathbf{X})+\upepsilon,\end{equation}
where $\mu$ is some function (the \emph{signal}) and $\upepsilon$
@@ -15360,7 +15162,7 @@
percentile are extreme.
\end{description}
Note that plugging the value $p=1$ into the formulas will recover
-all of the ones we saw in Chapter \ref{cha:Simple-Linear-Regression}.
+all of the ones we saw in Chapter \ref{cha:simple-linear-regression}.
\section{Additional Topics\label{sec:Additional-Topics-MLR}}
@@ -15503,7 +15305,7 @@
\item What to do when data are not normal
\begin{itemize}
-\item Bootstrap (see Chapter \ref{cha:Resampling-Methods}).
+\item Bootstrap (see Chapter \ref{cha:resampling-methods}).
\end{itemize}
\end{itemize}
@@ -15516,8 +15318,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
\begin{xca}
\label{xca:anova-equality}Use Equations \ref{eq:mlr-sse-matrix},
@@ -15527,7 +15330,7 @@
\end{xca}
-\chapter{Resampling Methods\label{cha:Resampling-Methods}}
+\chapter{Resampling Methods\label{cha:resampling-methods}}
Computers have changed the face of statistics. Their quick computational
speed and flawless accuracy, coupled with large data sets acquired
@@ -16139,8 +15942,9 @@
\newpage{}
-\section{Chapter Exercises}
+\section*{Chapter Exercises}
+\addcontentsline{toc}{section}{Chapter Exercises}
\setcounter{thm}{0}
@@ -16187,8 +15991,517 @@
\appendix
-\chapter{Data\label{cha:Data}}
+\chapter{\textsf{R} Session Information\label{cha:R-Session-Information}}
+If you ever write the \textsf{R} help mailing list with a question,
+then you should include your session information in the email; it
+makes the reader's job easier and is requested by the Posting Guide.
+Here is how to do that, and below is what the output looks like.
+
+<<keep.source = TRUE>>=
+sessionInfo()
+@
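The object returned by sessionInfo can also be inspected piece by piece, or captured as plain text for pasting into an email. This is a minimal sketch; the variable names si and info_lines are assumptions made for illustration.

```r
si <- sessionInfo()

# A few pieces of the returned object that are often worth quoting
si$R.version$version.string   # the R version in use
si$platform                   # the platform R was built for
names(si$otherPkgs)           # attached add-on packages (NULL if none)

# Plain-text lines suitable for pasting into an email
info_lines <- capture.output(print(si))
head(info_lines, 3)
```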
+
+\vfill{}
+
+
+
+\chapter{GNU Free Documentation License\label{cha:GNU-Free-Documentation}}
+
+\begin{center}
+\textbf{\large Version 1.3, 3 November 2008}\bigskip{}
+
+\par\end{center}
+
+\noindent Copyright (C) 2000, 2001, 2002, 2007, 2008 Free Software
+Foundation, Inc.
+
+\begin{center}
+\url{http://fsf.org/}
+\par\end{center}
+
+\noindent Everyone is permitted to copy and distribute verbatim copies
+of this license document, but changing it is not allowed.
+
+
+\section*{0. PREAMBLE}
+
+The purpose of this License is to make a manual, textbook, or other
+functional and useful document \textquotedbl{}free\textquotedbl{}
+in the sense of freedom: to assure everyone the effective freedom
+to copy and redistribute it, with or without modifying it, either
+commercially or noncommercially. Secondarily, this License preserves
+for the author and publisher a way to get credit for their work, while
+not being considered responsible for modifications made by others.
+
+This License is a kind of \textquotedbl{}copyleft\textquotedbl{},
+which means that derivative works of the document must themselves
+be free in the same sense. It complements the GNU General Public License,
+which is a copyleft license designed for free software.
+
+We have designed this License in order to use it for manuals for free
+software, because free software needs free documentation: a free program
+should come with manuals providing the same freedoms that the software
+does. But this License is not limited to software manuals; it can
+be used for any textual work, regardless of subject matter or whether
+it is published as a printed book. We recommend this License principally
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/ipsur -r 170