[Genabel-commits] r1153 - tutorials/GenABEL_general

Sat Mar 16 13:33:49 CET 2013

Author: lckarssen
Date: 2013-03-16 13:33:49 +0100 (Sat, 16 Mar 2013)
New Revision: 1153

Modified:
   tutorials/GenABEL_general/introR.Rnw
Log:
GenABEL tutorial: More small fixes. Finished the introR.Rnw chapter.

Modified: tutorials/GenABEL_general/introR.Rnw
===================================================================

--- tutorials/GenABEL_general/introR.Rnw	2013-03-15 23:02:40 UTC (rev 1152)
+++ tutorials/GenABEL_general/introR.Rnw	2013-03-16 12:33:49 UTC (rev 1153)
@@ -613,7 +613,7 @@
 Among the 4$^\text{th}$ hundred subjects there are 
 \Sexpr{sum(male(srdta)[301:400]==0)} females:
 <<>>=
-100-sum(male(srdta)[301:400])
+100 - sum(male(srdta)[301:400])
 @
 The male proportion among the first 1000 people is 
 <<>>=
@@ -1080,9 +1080,9 @@
 @
 which, again, tells us that there are \Sexpr{table(assoc$sex)[2]} males and 
 \Sexpr{table(assoc$sex)[1]} females in this data set. This function excludes
-missing observations from consideration.
+missing observations.
 
-Tables of other qualitative variables, such as affection and SNPs, can 
+Tables of other qualitative variables, such as affection status and SNPs, can 
 be generated in the same manner.
 
 As with arithmetic operations and mathematical functions, most of the 
@@ -1096,7 +1096,7 @@
 @
 
 \begin{tip}
-On \texttt{R} command line pressing the "up-arrow" button 
+On the R command line pressing the ``up-arrow'' button 
 makes the last typed command re-appear (pressing it one more time will 
 bring you to the one before the last, so on). This is very handy when 
 you have to repeat the same analysis of different variables
@@ -1105,7 +1105,7 @@
 
 \begin{Exercise}[title=Explore \texttt{assoc}]
 %\label{exas1}
-Explore phenotypic variables presented in \texttt{assoc}
+Explore the phenotypic variables present in \texttt{assoc}
 \Question How many affected and unaffected are present in the data set?
 \Question What is the proportion of affected?
 \Question What is the distribution of snp4 (how many different genotype 
@@ -1114,17 +1114,17 @@
 \begin{Answer}
 The number of affected (coded with '1') and unaffected ('0') is
 <<>>=
-	table(aff)
+table(aff)
 @
 The proportion of unaffected and affected is
 <<>>=
-	prop.table(table(aff))
+prop.table(table(aff))
 @
 Distribution of the 'snp4' is
 <<>>=
-	t <- table(snp4)
-	t
-	prop.table(t)
+t <- table(snp4)
+t
+prop.table(t)
 @
 \end{Answer}
 
@@ -1134,7 +1134,7 @@
 explore frequency distributions. For example, if you want cross-tabulate 
 sex and affection status in the data frame \texttt{assoc}, you can use
 <<>>=
-table(sex,aff)
+table(sex, aff)
 @
 Here, the first variable (sex) is presented in rows and the second (affection 
 status) in columns.
@@ -1142,7 +1142,7 @@
 As is usually the case with R, the output may be saved as a new object
 (of class 'table', which is a variety of a matrix):
 <<>>=
-a<-table(sex,aff)
+a <- table(sex, aff)
 class(a)
 a
 @
@@ -1150,7 +1150,7 @@
 
 For example, we can easily get the number of affected male with 
 <<>>=
-a[2,2]
+a[2, 2]
 @
 
 Alternatively, we can analyse the resulting contingency table \texttt{a} 
@@ -1161,48 +1161,53 @@
 @
 Needless to say, this is equivalent to 
 <<>>=
-prop.table(table(assoc$sex,assoc$aff))
+prop.table(table(assoc$sex, assoc$aff))
 @
 
 In the above table, we see what proportion of people belong to four 
 different classes (affected male, affected female, unaffected male 
-and unaffected female). We may be though interested in the proportion 
+and unaffected female). We may also be interested in the proportion 
 of males in affected and unaffected. This may be achieved by 
 <<>>=
-prop.table(a,2)
+prop.table(a, 2)
 @
-saying us that \Sexpr{round(100*prop.table(a,2)[2,2],dig=1)}\% of affected are male.
+telling us that \Sexpr{round(100*prop.table(a, 2)[2, 2], dig=1)}\% of 
+affected individuals are male.
 
-Alternatively, we may be interested in proportion of affected among males/females.
-To answer this question, run
+Alternatively, we may be interested in the proportion of affected among 
+males/females. To answer this question, run
 <<>>=
-prop.table(a,1)
+prop.table(a, 1)
 @
-saying us that \Sexpr{round(100*prop.table(a,2)[2,2],dig=1)}\% of male are affected.
+telling us that \Sexpr{round(100*prop.table(a,2)[2,2],dig=1)}\% of male 
+are affected.
 
-Other useful contingency table analysis function is \texttt{fisher.test}, 
+Another useful contingency table analysis function is \texttt{fisher.test}, 
 which implements the Fisher Exact Test of independence: 
 <<>>=
 fisher.test(a)
 @
 
-Exploration of genetic data within base R, though possible, may be a bit of a pain. 
-For example, we can easily generate contingency table of SNP5 vs affected status:
+Exploration of genetic data within base R, though possible, may be a bit of 
+a pain. For example, we can easily generate contingency table of SNP5 
+vs.~affected status:
 <<>>=
-a <- table(aff,snp5)
+a <- table(aff, snp5)
 a
 @
-We can also look up what is the proportion of affected among different genotypic groups
+We can also look up the proportion of affected among different genotypic groups
 <<>>=
-prop.table(a,2)
+prop.table(a, 2)
 @
-showing that proportion of cases is similar in 'A/A' and 'A/B' genotypic groups 
-and somewhat decreased in 'B/B'. 
-It is easy to test if this affection is statistically independent of genotype by
+showing that proportion of cases is similar in the 'A/A' and 'A/B' genotypic 
+groups and somewhat decreased in 'B/B'. 
+It is easy to test if this affection is statistically independent of genotype 
+by doing a $\chi^2$ test
 <<>>=
 chisq.test(a)
 @
-which gives (insignificant) genotypic association test on two degrees of freedom. 
+which gives a (insignificant) genotypic association test with two degrees of 
+freedom. 
 
 However, testing Hardy-Weinberg equilibrium, testing allelic effects, and even computation 
 of allelic frequency is not so straightforward. Such specific genetic tests are implemented 
@@ -1218,148 +1223,150 @@
 At this moment we will switch to exploratory analysis 
 of quantitative traits. 
 We will make use of the \texttt{srdta} 
-data supplied with \GA{}. As you can remember from an exercise,
-the library is loaded with \texttt{library(GenABEL)} and the 
-data are loaded with \texttt{data(srdta)}:
+data supplied with the \GA{}. As you can remember from an earlier exercise,
+that library is loaded with \texttt{library(GenABEL)} and the 
+data are loaded with \texttt{data(srdta)}.
 <<echo=false,hide=true>>=
 library(GenABEL)
 data(srdta)
 @
-Then the 
-phenotypic data frame may be accessed through \texttt{phdata(srdta)}.
+Then the phenotypic data frame may be accessed through \texttt{phdata(srdta)}.
 
 \begin{Exercise}[title=Explore phenotypes in \texttt{srdta}]
-Explore phenotypic data content of \texttt{srdta} object 
+Explore the phenotypic data content of \texttt{srdta} object 
 (\texttt{phdata(srdta)}). 
-\Question How many observations and variables are presented in the data 
+\Question How many observations and variables are present in the data 
 frame?
 \Question What are the classes of these variables?
 \end{Exercise}
 \begin{Answer}
 Number of people:
 <<>>=
-	nids(srdta)
+nids(srdta)
 @
 Number of variables:
 <<>>=
-	length(names(phdata(srdta)))
+length( names( phdata(srdta) ) )
 @
 The same -- dimensions of phenotypic data frame:
 <<>>=
-	dim(phdata(srdta))
+dim(phdata(srdta))
 @
-Class of variables in phenotypic data frame:
+The class of variables in phenotypic data frame:
 <<>>=
-	for (i in names(phdata(srdta))) {
-		cat("class of variable '",i,"' is '",class(phdata(srdta)[,i]),"'\n",sep="")
-		}
+for (i in names(phdata(srdta))) {
+  cat("The class of variable '", i ,"' is '", 
+      class(phdata(srdta)[, i]), "'\n", sep="")
+}
 @
 \end{Answer}
 
-As it was mentioned before, the function \texttt{summary()} generates a summary 
+As was mentioned before, the function \texttt{summary()} generates a summary 
 statistics for an object. For example, 
-to see summary for trait \texttt{qt1}, we can use
+to see the summary for trait \texttt{qt1}, we can use
 <<>>=
-summary(phdata(srdta)$qt1)
+summary( phdata(srdta)$qt1 )
 @
 
 \begin{tip}
-\texttt{summary} is quite useful function which may operate 
+\texttt{summary} is quite a useful function which may operate 
 in different ways for objects of different classes. 
 Try \texttt{summary(phdata(srdta))}.
 \end{tip}
 
 With R, it is also easy to explore the data graphically. 
-For example, the histogram for \texttt{qt1} may be generated by
+For example, a histogram for \texttt{qt1} may be generated by
 <<>>=
-hist(phdata(srdta)$qt1)
+hist( phdata(srdta)$qt1 )
 @
-(resulting histogram is shown at figure \ref{fig:histqt1})
+The resulting histogram is shown in Figure~\ref{fig:histqt1}.
 \begin{figure}
 \centering
 <<fig=true,echo=false>>=
 hist(phdata(srdta)$qt1)
 @
-\caption{Histogram of qt1}
+\caption{Histogram of \texttt{qt1}.}
 \label{fig:histqt1}
 \end{figure}
 
-In similar manner, scatter-plots may be generated. To see relation 
+In a similar manner scatter-plots may be generated. To see the relation 
 between \texttt{qt1} and \texttt{qt3}, you can run
 <<>>=
-plot(phdata(srdta)$qt1,phdata(srdta)$qt3)
+plot( phdata(srdta)$qt1, phdata(srdta)$qt3 )
 @
-(resulting plot is shown at figure \ref{fig:qt1qt3})
+The resulting plot is shown in Figure~\ref{fig:qt1qt3}.
 \begin{figure}
 \centering
 <<fig=true,echo=false>>=
-plot(phdata(srdta)$qt1,phdata(srdta)$qt3)
+plot( phdata(srdta)$qt1, phdata(srdta)$qt3 )
 @
-\caption{Scatter-plot of qt1 against qt3}
+\caption{Scatter-plot of \texttt{qt1} against \texttt{qt3}.}
 \label{fig:qt1qt3}
 \end{figure}
 
-The mean, median, minimum and maximum of the distribution of the 
-trait may be found out using functions \texttt{mean}, \texttt{median}, 
+The mean, median, minimum and maximum of the distribution of a
+trait may be found using the functions \texttt{mean}, \texttt{median}, 
 \texttt{min} and \texttt{max}, respectively. The variance and standard 
 deviation can be computed with \texttt{var} and \texttt{sd}.
 
-To compute correlation between two variables (or all variables in a 
+To compute the correlation between two variables (or all variables in a 
 matrix/data frame), use \texttt{cor}. 
 
 %\section{Function \texttt{check.trait}}
 In \GA{}, there is a special function designed to facilitate 
 phenotypic quality control. 
-This function takes names of variables and a data frame as an input, 
+This function takes the names of variables and a data frame as an input, 
 and returns summary statistics, list of outliers (using False Discovery 
 Rate) and graphs. 
 
-For example, to do QC of sex, age and qt3, try
+For example, to do QC of \texttt{sex}, \texttt{age} and \texttt{qt3}, try
 <<>>=
-check.trait(c("sex","age","qt3"),phdata(srdta))
+check.trait( c("sex", "age", "qt3"), phdata(srdta) )
 @
-The corresponding graph is depicted at figure \ref{qcgra}.
+The corresponding graph is shown in figure \ref{qcgra}.
 \begin{figure}[t]
 \centering
 <<fig=true,echo=false>>=
-par(mfcol=c(2,2))
-plot(phdata(srdta)$age,phdata(srdta)$age)
-plot(phdata(srdta)$age,phdata(srdta)$qt3)
-plot(phdata(srdta)$qt3,phdata(srdta)$age)
-plot(phdata(srdta)$qt3,phdata(srdta)$qt3)
+par(mfcol=c(2, 2))
+plot(phdata(srdta)$age, phdata(srdta)$age)
+plot(phdata(srdta)$age, phdata(srdta)$qt3)
+plot(phdata(srdta)$qt3, phdata(srdta)$age)
+plot(phdata(srdta)$qt3, phdata(srdta)$qt3)
 @
-\caption{Quality control graph for \texttt{sex, age, qt3}}
+\caption{Quality control graphs for \texttt{sex, age, qt3}}
 \label{fig:qcgra}
 \end{figure}
 
 \begin{tip}
 Before you start with the exercise: 
 if a function returns unexpected results, and you are confident that 
-syntax was right, checking help page is always a good idea!
+syntax was right, checking the help page is always a good idea!
 \end{tip}
 
-\begin{Exercise}[title=Exploring phenotypic part of \texttt{srdta}]%\label{ex5}
-Explore \texttt{phdata} part of \texttt{srdta} object
-\Question How many people has age over 65 years?
-\Question What is the sex distribution (proportion of males) in the people over 65 years old?
-\Question What is the mean, median, minimum and maximum age in the sample?
-\Question Compare the distribution of qt3 in people younger and older 
-than 65 years. Use function \texttt{sd(A)} to get standard deviation 
-of A
-\Question Produce distributions of different traits. Do you see something special?
-\Question What is correlation between qt3 and age?
+\begin{Exercise}[title=Exploring the phenotypic part of \texttt{srdta}]
+%\label{ex5}
+Explore the \texttt{phdata} part of the \texttt{srdta} object
+\Question How many people have age over 65 years?
+\Question What is the sex distribution (proportion of males) in the people 
+over 65 years old?
+\Question What are the mean, median, minimum and maximum ages in the sample?
+\Question Compare the distribution of \texttt{qt3} in people younger and older 
+than 65 years. Use the function \texttt{sd(A)} to get standard deviation 
+of A.
+\Question Produce distributions of different traits. Do you see something special? 
+\Question What is the correlation between \texttt{qt3} and \texttt{age}?
 \end{Exercise}
 \begin{Answer}
-To obtain the number of people with age >65 y.o., you can use 
+To obtain the number of people with age $> 65$ y.o., you can use 
 any of the following
 <<>>=
-sum(phdata(srdta)$age>65)
-vec <- which(phdata(srdta)$age>65)
+sum( phdata(srdta)$age > 65 )
+vec <- which( phdata(srdta)$age > 65 )
 length(vec)
 @
 To get sex of these people use any of:
 <<>>=
-sx65 <- phdata(srdta)$sex[phdata(srdta)$age>65]
+sx65 <- phdata(srdta)$sex[ phdata(srdta)$age > 65 ]
 sx65
 sx65 <- phdata(srdta)$sex[vec]
 sx65
@@ -1368,23 +1375,24 @@
 <<>>=
 sum(sx65)
 @
-To conclude, the proportion of male is 
+To conclude, the proportion of males is 
 \Sexpr{sum(phdata(srdta)$sex[which(phdata(srdta)$age>65)])/length(which(phdata(srdta)$age>65))}.
 
-Distribution of \texttt{qt3} in people younger and older than 65 are:
+The distributions of \texttt{qt3} in people younger and older than 65 are:
 <<>>=
 summary(phdata(srdta)$qt3[vec])
-sd(phdata(srdta)$qt3[vec],na.rm=TRUE)
+sd( phdata(srdta)$qt3[vec], na.rm=TRUE )
 young <- which(phdata(srdta)$age<65)
-summary(phdata(srdta)$qt3[young])
-sd(phdata(srdta)$qt3[young],na.rm=TRUE)
+summary( phdata(srdta)$qt3[young] )
+sd( phdata(srdta)$qt3[young], na.rm=TRUE )
 @
-Mean, median, min and max of age:
+The Mean, median, min and max of age:
 <<>>=
-summary(phdata(srdta)$age)
+summary( phdata(srdta)$age )
 @
-The histogram for qt2 looks strange (you can generate that 
-using \texttt{hist(phdata(srdta)\$qt2)}): it seems there are few very strong outliers. 
+The histogram for \texttt{qt2} looks strange (you can generate it 
+using \texttt{hist(phdata(srdta)\$qt2)}): it seems there are a few very 
+strong outliers. 
 %(figure \ref{fig:histqt2})
 %\begin{figure}
 %\centering
@@ -1396,7 +1404,7 @@
 %\end{figure}
 You can also see that with \texttt{summary}:
 <<>>=
-summary(phdata(srdta)$qt2)
+summary( phdata(srdta)$qt2 )
 @
 \end{Answer}
 
@@ -1407,30 +1415,27 @@
 
 While contingency tables, bi-plots and correlation 
 are powerful tools to analyse relations between 
-pairs of variable, a more general framework 
-allowing investigation of relation of an outcome to 
+pairs of variables, a more general framework 
+allowing investigation of the relation of an outcome to 
 multiple predictors is regression. 
-In R, function \texttt{lm} implements linear 
-regression modelling, and function \texttt{glm} 
+In R, the function \texttt{lm} implements linear 
+regression modelling, and the function \texttt{glm} 
 implements generalised linear regression.
 In this section, we will use these two 
-functions to analyse quantitative an 
+functions to analyse quantitative and
 binary outcomes.
 
 You can do linear regression to check if trait \texttt{qt2}
-has relation with sex and age by
-
+has a relation with sex and age by
 <<>>=
 a <- lm(phdata(srdta)$qt2 ~ phdata(srdta)$age + phdata(srdta)$sex)
 @
-
-The results of analysis are stored in object 'a', which 
-has class 'lm' and contains may sub-objects:
+The results of this analysis are stored in object 'a', which 
+has class 'lm' and contains many sub-objects:
 <<>>=
 class(a)
 names(a)
 @
-
 At this moment you do not need to understand 
 all these sub-objects; the meaningful summary of analysis 
 is produced with 
@@ -1442,14 +1447,12 @@
 As before, to make easy access to your data (basically, 
 to avoid typing \texttt{phdata(srdta)}
 before every trait name, you may attach the data to the search path:
-
 <<>>=
 attach(phdata(srdta))
 @
-
-Then,the above expression to run linear regression analysis simplifies to:
+Then, the above expression to run linear regression analysis simplifies to:
 <<>>=
-summary(lm(qt2 ~ age + sex))
+summary( lm(qt2 ~ age + sex) )
 @
 with the same results. 
 
@@ -1462,11 +1465,10 @@
 a <- glm(bt ~ age + sex, family="binomial")
 summary(a)
 @
-
 There is strong association between \texttt{bt} and sex and age.
 If you want to characterise the strength of association to a binary trait 
-with Odds Ratios, take the exponents of the regression coefficient. For example, 
-the odds ratio associated with male is 
+with Odds Ratios, take the exponents of the regression coefficient. For 
+example, the odds ratio associated with male is 
 <<>>=
 exp(0.3796)
 @