[Genabel-commits] r1152 - tutorials/GenABEL_general
noreply at r-forge.r-project.org
Sat Mar 16 00:02:44 CET 2013
Author: lckarssen
Date: 2013-03-16 00:02:40 +0100 (Sat, 16 Mar 2013)
New Revision: 1152
Modified:
tutorials/GenABEL_general/introR.Rnw
Log:
Tutorial: Again a few small spelling fixes and some minor updates.
Modified: tutorials/GenABEL_general/introR.Rnw
===================================================================
--- tutorials/GenABEL_general/introR.Rnw 2013-03-15 18:19:59 UTC (rev 1151)
+++ tutorials/GenABEL_general/introR.Rnw 2013-03-15 23:02:40 UTC (rev 1152)
@@ -648,21 +648,21 @@
rm(list=ls())
load("RData/assocbase.RData")
@
-A \emph{data frame} is a class of R data, which, basically,
+A \emph{data frame} is a class of R data which, basically,
is a data table. In such tables,
it is usually assumed that rows correspond to subjects (observations)
and columns correspond to variables (characteristics) measured on
these subjects. A nice feature of data frames is
that columns (variables) have names, and the data can be addressed by
referencing to these names\footnote{This
-may also be true for matrices; more fundamental
-difference is though that a matrix \emph{always} contains variables
-of the same data type, \eg character or numeric, while a data frame
+may also be true for matrices; however, a more fundamental
+difference is that a matrix \emph{always} contains variables
+of the same data type, \eg character or numeric, whereas a data frame
may contain variables of different types}.
\index{data frame}
-We will explore R data frames using example data set \texttt{assoc}.
-Start R with double-click on the file named \texttt{assocbase.RData}.
+We will explore R data frames using the example data set \texttt{assoc}.
+Start R with a double-click on the file named \texttt{assocbase.RData}.
You can see the names of the loaded objects by using the ''list'' command:
<<>>=
ls()
@@ -686,8 +686,8 @@
who are characterised by \Sexpr{dim(assoc)[2]} variables each.
-Let us now figure out what are the names of the \Sexpr{dim(assoc)[2]}
-variables present in the data frame. To see what are the variable names,
+Let us now figure out what the names are of the \Sexpr{dim(assoc)[2]}
+variables present in the data frame. To see what the variable names are,
use the command \texttt{names()}:
<<>>=
names(assoc)
@@ -698,22 +698,22 @@
to the personal identifier (ID, variable \texttt{subj}), sex, affection status,
quantitative trait \texttt{qt} and several SNPs.
Each variable can have its own type (numeric, character,
-logic), but all variables must have the same length -- thus forming
+logical), but all variables must have the same length -- thus forming
a matrix-like data structure.
A variable from a data frame
(say, \texttt{fram}), which has some name (say, \texttt{nam}) can be
accessed through \texttt{fram\$nam}. This will return a conventional
vector, containing the values of the variable.
-For example to see the affection status (\texttt{aff}) in the
+For example, to see the affection status (\texttt{aff}) in the
data frame \texttt{assoc}, use
<<>>=
assoc$aff
@
-The \texttt{aff} (affected) variable here codes for a case/control status,
-conventinally, the cases are coded as ''1'' and controls as ''0''.
-You can also see several ''NA''s, which stays for missing observation.
+The \texttt{aff} (affected) variable here codes for a case/control status.
+Conventionally, cases are coded as \texttt{1} and controls as \texttt{0}.
+You can also see several ''NA''s, which denote missing observations.
%In a case we need to fWe can easily check how many people are described in the data set
@@ -724,25 +724,29 @@
%@
\begin{Exercise}[title=Exploring \texttt{assoc}]
-\Question Investigate types of the variables presented in data frame \texttt{assoc}.
+\Question Investigate the types of the variables present in data frame \texttt{assoc}.
For each variable, write down the class.
\end{Exercise}
\begin{Answer}
-Here is an automatic script which explores the classes of variables in \texttt{assoc}:
+Here is a script which automatically explores the classes of variables in
+\texttt{assoc}:
<<>>=
- for (i in names(assoc)) {
- cat("Variable '",i,"' has class '",class(assoc[,i]),"'\n",sep="")
- }
+for (i in names(assoc)) {
+ cat("Variable '", i, "' has class '", class(assoc[, i]), "'\n", sep="")
+}
@
-
+In this so-called for-loop the variable \texttt{i} cycles through all names
+in \texttt{assoc} and for each of them it uses the \texttt{cat} function to
+print the name of the variable and its class. The \texttt{\textbackslash n}
+is the code for a new line.
\end{Answer}
-Data frame may be thought of as a matrix which is a collection
+A data frame may be thought of as a matrix which is a collection
of (potentially different-type) vectors.
All sub-setting operations discussed before for matrices are
applicable to a data frame, while all operations discussed
-for vectors are applicable to data frame's variables.
+for vectors are applicable to a data frame's variables.
Thus, as any particular variable present in a data frame is a conventional
vector, its elements can be accessed using the vector's indices.
@@ -766,50 +770,50 @@
@
\label{dat515}
-The result is actually a new data frame containing data only on people with index from 5 to 15:
+The result is actually a new data frame containing data only on people with index ranging from 5 to 15:
<<>>=
-x<-assoc[5:15,]
+x <- assoc[5:15,]
class(x)
dim(x)
@
As well as with matrices and vectors, it is possible to sub-set elements of a data frame
based on (a combination of) logical conditions. For example, if you are interested
-in people who have the \texttt{qt} values over 1.4, you can find out what are the indices
-of these people
+in people who have \texttt{qt} values over 1.4, you can find out what the indices of
+these people are:
<<>>=
vec <- which(assoc$qt>1.4)
vec
@
-and then show the compelte data with
+and then show the complete data with
<<>>=
assoc[vec, ]
@
At the same time, if you only want to
-check what are the IDs of these people, try
+check what the IDs of these people are, try
<<>>=
assoc$subj[vec]
@
-Or, if we are interested to find what are the IDs and what are the SNP genotypes
+Or, if we are interested to find what the IDs and the SNP genotypes are
of these people, we can try
<<>>=
-assoc[vec,c(1,5,6,7)]
+assoc[vec, c(1, 5, 6, 7)]
@
here, we select people identified by \texttt{vec} in the first
-dimension (subjects), and by \texttt{c(1,5,6,7)} we select first,
+dimension (subjects), and by \texttt{c(1, 5, 6, 7)} we select the first,
fifth, sixth and seventh column (variable).
-The same result can be obtained using variables' names insted of
-the variables' indices. To remind you the variables' names:
+The same result can be obtained using variable names instead of
+the variables' indices. As a reminder, the variable names can be found with:
<<>>=
names(assoc)
@
-And now make a vector of the variables' names of interest and
+And now make a vector of the variable names of interest and
filter the data based on it:
<<>>=
-namstoshow <- c("subj","snp4","snp5","snp6")
-assoc[vec,namstoshow]
+namstoshow <- c("subj", "snp4", "snp5", "snp6")
+assoc[vec, namstoshow]
@
A more convenient way to access data presented in a data frame is
@@ -828,7 +832,7 @@
elements using the assignment (''\texttt{<-}'') operation,
you can also explore and modify the data contained in a data frame\footnote{and also
a matrix} by
-using \texttt{fix()} command (\eg try \texttt{fix(assoc)}).
+using the \texttt{fix()} command (\eg try \texttt{fix(assoc)}).
However, normally this is not necessary.
@@ -874,15 +878,15 @@
With attached data frames, a possible complication is that
later on you may have several
-data frames which contain the variables with the same names.
+data frames which contain variables with the same names.
The variable which will be used when you directly use the name
would be the one from the data frame attached last. You can use
-\texttt{detach()} function to remove a certain data frame from
+the \texttt{detach()} function to remove a certain data frame from
the search path, \eg after
<<>>=
detach(assoc)
@
-we can not use direct reference to the name (try \texttt{subj[75]})
+we cannot use a direct reference to the name (try \texttt{subj[75]})
anymore, but have to use the full path instead:
<<>>=
assoc$subj[75]
@@ -901,23 +905,23 @@
\begin{summary}
\item The list of available objects can be viewed with \texttt{ls()};
-a class of some object \texttt{obj} can be interrogated with
+the class of some object \texttt{obj} can be examined with
\texttt{class(obj)}.
\item Simple summary statistics for numeric variables can be
-generated by using \texttt{summary} function
-\item Histogram for some variable \texttt{var} can be generated
-by \texttt{hist(var)}
+generated by using the \texttt{summary} function
+\item A histogram for some variable \texttt{var} can be generated
+by \texttt{hist(var)}.
\item A variable with name \texttt{name} from a data frame
\texttt{frame}, can be
accessed through \texttt{frame\$name}.
-\item You can attach the data frame to the search path by
+\item You can attach a data frame to the search path by
\texttt{attach(frame)}. Then the variables contained in this
data frame may be accessed directly. To detach the data
-frame (because, \eg, you are now interested in other data
+frame (because, \eg, you are now interested in another data
frame), use \texttt{detach(frame)}.
\end{summary}
-\begin{Exercise}[title=Explore phenotypic part of \texttt{srdta}]%\label{ex3}
+\begin{Exercise}[title=Explore the phenotypic part of \texttt{srdta}]%\label{ex3}
Load the \texttt{srdta} data object supplied with
GenABEL by loading the package with
\texttt{library(GenABEL)} and then loading the
@@ -926,11 +930,11 @@
with phenotypes. This data frame may be accessed through
\texttt{phdata(srdta)}. Explore this data frame and
answer the questions
-\Question What is the value of the 4th variable for the subject
+\Question What is the value of the $4^\text{th}$ variable for subject
number 75?
-\Question What is the value of variable 1 for person 75? Check what is
-the value of this variable for the first ten people.
-Can you guess what first variable is?
+\Question What is the value of variable 1 for person 75? Check the value
+of this variable for the first ten people.
+Can you guess what the first variable is?
\Question What is the sum of variable 2? Can you guess what data variable 2
contains?
\end{Exercise}
@@ -938,28 +942,29 @@
Load the data and look at the few first rows of the
phenotypic data frame:
<<>>=
- data(srdta)
- phdata(srdta)[1:5,]
+data(srdta)
+phdata(srdta)[1:5, ]
@
-Value of the 4th variable of person 75:
+The value of the $4^\text{th}$ variable of person 75:
<<>>=
-phdata(srdta)[75,4]
+phdata(srdta)[75, 4]
@
-Value for the variable 1 is
+The value for variable 1 is
<<>>=
-phdata(srdta)[75,1]
+phdata(srdta)[75, 1]
@
-Also, if we check first 10 elements we see
+Also, if we check the first 10 elements we see
<<>>=
-phdata(srdta)[1:10,1]
+phdata(srdta)[1:10, 1]
@
-This is personal ID.
+This is the individual ID.
The sum for variable 2 is
<<>>=
-sum(phdata(srdta)[,2])
+sum(phdata(srdta)[, 2])
@
-This is sex variable -- so there are \Sexpr{sum(phdata(srdta)[,2])} males in the data set.
+This is the sex variable -- so there are \Sexpr{sum(phdata(srdta)[,2])}
+males in the data set.
\end{Answer}
@@ -975,11 +980,11 @@
Let us first check how many of the subjects are males. In the
\texttt{sex} variable, males are coded with ''1'' and females with ''0''.
-Therefore to see the numer of males, you can use
+Therefore to see the total number of males, you can use
<<>>=
sum(sex==1)
@
-and to determine what is male sex proportion you can use
+and to determine the proportion of males you can use
<<>>=
sum(sex==1)/length(sex)
@
@@ -996,7 +1001,7 @@
\eg with ''1'' for males and ''2'' for females.
Let us now try to find out the mean of the quantitative trait \texttt{qt}.
-By definition, the mean of a variable, say $x$ (with i-th element denoted
+By definition, the mean of a variable, say $x$ (with the $i$-th element denoted
as $x_i$) is
$$
\bar{x} = \frac{\Sigma_{i=1}^{N} x_i}{N}
@@ -1004,35 +1009,35 @@
where $N$ is the number of measurements.
If we try to find out the mean of \texttt{qt} by direct use of this formula, we first
-need to find out the sum of the \texttt{qt}'s elements. The \texttt{sum()}
+need to find out the sum of the elements of \texttt{qt}. The \texttt{sum()}
function of R precisely does the operation we need. However, if we try it
<<>>=
sum(qt)
@
this returns ''NA''. The problem is that the \texttt{qt} variable contains
''NA''s (try \texttt{qt} to see these)
-and by default the ''NA'' is returned. We can, however, instruct the \texttt{sum()}
+and then, by default, ''NA'' is returned. We can, however, instruct the \texttt{sum()}
function to remove ''NA''s from consideration:
<<>>=
-sum(qt,na.rm=T)
+sum(qt, na.rm=TRUE)
@
-where \texttt{na.rm=T} tells R that missing variables should be be removed
+where \texttt{na.rm=TRUE} tells R that missing values should be removed
(NonAvailable.ReMove=True)\footnote{The same argument works for
a number of R statistical functions such as \texttt{mean}, \texttt{median},
-\texttt{var}, etc}.
+\texttt{var}, etc.}.
We can now try to compute the mean with
<<>>=
-sum(qt,na.rm=T)/length(qt)
+sum(qt, na.rm=TRUE)/length(qt)
@
This result, however, is not correct. The \texttt{length()} function returns
the total length of a vector, which includes ''NA''s as well. Thus we need to
-compute the number of the \texttt{qt}'s elements, which are not missing.
+compute the number of elements in \texttt{qt} that are not missing.
For this, we can use R function \texttt{is.na()}. This function returns
-\texttt{TRUE} if supplied argument is missing (\texttt{NA}) and \texttt{FALSE}
-otherwise.
+\texttt{TRUE} if the supplied argument is missing (\texttt{NA}) and
+\texttt{FALSE} otherwise.
\index{is.na()}
Let us apply this function to the vector \texttt{assoc\$qt}:
<<>>=
@@ -1049,22 +1054,22 @@
Thus the number of elements which are not missing\footnote{A hidden
trick here is that arithmetic operations treat \texttt{TRUE} as one
-and \texttt{FALSE} as zero} is
+and \texttt{FALSE} as zero.} is
<<>>=
sum(!is.na(qt))
@
Finally, we can compute the mean of the \texttt{qt} with
<<>>=
-sum(qt,na.rm=T)/sum(!is.na(qt))
+sum(qt, na.rm=TRUE)/sum(!is.na(qt))
@
While this way of computing the mean is enlightening in the sense of
-how to treat the missing values, the same correct result should be normally
-achieved by supplying the \texttt{na.rm=T} argument to the \texttt{mean()}
+how missing values are treated, the same correct result should be normally
+achieved by supplying the \texttt{na.rm=TRUE} argument to the \texttt{mean()}
function:
<<>>=
-mean(qt,na.rm=T)
+mean(qt, na.rm=TRUE)
@
@@ -1075,7 +1080,7 @@
@
which, again, tells us that there are \Sexpr{table(assoc$sex)[2]} males and
\Sexpr{table(assoc$sex)[1]} females in this data set. This function excludes
-missing observations form consideration.
+missing observations from consideration.
Tables of other qualitative variables, such as affection and SNPs, can
be generated in the same manner.
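
Note (not part of the committed diff): the missing-value handling described in the tutorial text above can be tried without the \texttt{assoc} data. The following is a minimal sketch in plain R; the vector \texttt{x} and its values are made up for illustration and merely stand in for a trait such as \texttt{qt}.

```r
# Hypothetical toy vector with two missing values (illustration only,
# not taken from the assoc data set).
x <- c(1.2, NA, 0.7, 2.1, NA, 1.0)

# is.na() flags missing entries; arithmetic treats TRUE as 1 and FALSE as 0,
# so summing the negated flags counts the non-missing values.
n_obs <- sum(!is.na(x))

# Dividing by length(x) also counts the NAs and therefore underestimates the mean.
naive <- sum(x, na.rm = TRUE) / length(x)

# Dividing by the number of non-missing values gives the intended mean.
manual <- sum(x, na.rm = TRUE) / n_obs

# mean() with na.rm = TRUE does the same in one step.
builtin <- mean(x, na.rm = TRUE)

# The manual and built-in results agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(manual, builtin)))
```

Comparing \texttt{naive} with \texttt{manual} shows why the denominator has to exclude the missing values.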