[Genabel-commits] r1193 - tutorials/GenABEL_general

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Wed Apr 17 01:29:19 CEST 2013


Author: lckarssen
Date: 2013-04-17 01:29:18 +0200 (Wed, 17 Apr 2013)
New Revision: 1193

Modified:
   tutorials/GenABEL_general/workgwaaclass.Rnw
Log:
Fix several layout and spelling bugs in the 'intro to the GenABEL package' chapter of the GenABEL tutorial.

Modified: tutorials/GenABEL_general/workgwaaclass.Rnw
===================================================================
--- tutorials/GenABEL_general/workgwaaclass.Rnw	2013-04-16 19:07:09 UTC (rev 1192)
+++ tutorials/GenABEL_general/workgwaaclass.Rnw	2013-04-16 23:29:18 UTC (rev 1193)
@@ -1,22 +1,19 @@
-\chapter{Introduction to \GA{}}
+\chapter{Introduction to the \GA{}}
 \label{sec:workgwaaclass}
 
 In this section, you will become familiar with the 
-\GA{} library, designed for GWA analysis. Compared to 
+\GA{} library, designed for GWA analysis. Compared to the
 \texttt{genetics} package, it provides specific 
 facilities for storage and manipulation of large amounts
 of data, very fast tests for GWA analysis, and special 
 functions to analyse and graphically present the 
 results of GWA analysis (thus "analysis of analysis").
 
-Start R and 
-load \GA{} library using command
-
+Start R and load the \GA{} library using the command
 <<results=hide>>=
 library(GenABEL)
 @
-
-After that, load example data set using the command
+After that, load the example data set using the command
 <<>>=
 data(srdta)
 @
@@ -25,51 +22,49 @@
 \label{subs:gwaaclass}
 
 The object you have loaded, \texttt{srdta}, belongs to the 
-\texttt{gwaa.data} class. This is a special class developed to facilitate
+\texttt{gwaa.data} class. This is a special R class developed to facilitate
 GWA analysis. 
 
 In GWA analysis, different types of data are used. These 
 include the phenotypic and genotypic data on the study 
 participants and chromosome and location of every SNP. 
 For every SNP, it is desirable to know the details of 
-coding (what are alleles? -- A, T, G, C? -- and what is 
+coding (how are the alleles coded? -- A, T, G, C? -- as well as
 the strand -- '+' or '-', 'top' or 'bot'? -- this coding is for).
 
 One could attempt to store all phenotypes and genotypes 
-together in a single table, using, e.g. one row per study subject; 
+together in a single table, using, \eg one row per study subject; 
 than the columns will correspond to study phenotypes and SNPs. 
 For a typical GWA data set, this would lead to a table of few thousands
 rows and few hundreds of thousands to millions of columns. Such a format is generated 
 when one downloads HapMap data for a region. 
 To store GWA data in such tables internally, within R, 
-proves to be inefficient. In \GA{}, special data class, 
-\texttt{gwaa.data-class} is used 
-to store GWA data.
+proves to be inefficient. In \GA{}, a special data class, 
+\texttt{gwaa.data-class} is used to store GWA data.
 
 You may consider an object of \texttt{gwaa.data-class} as 
 a 'black box' from which you can get specific data using 
-specific functions. If you are interested in internal 
+specific functions. If you are interested in the internal 
 structure of the \texttt{gwaa.data-class}, you can find 
 the description in section \ref{subs:gwaaclass_internal}
 (\nameref{subs:gwaaclass_internal}). 
 
 The data frame, which contains all phenotypic data in the 
 study may be accessed using the \texttt{phdata} function. 
-Let us have a look at few first rows of the phenotypic data 
+Let us have a look at the first few rows of the phenotypic data 
 frame of \texttt{srdta}:
 <<>>=
 phdata(srdta)[1:5,]
 @
-  
-The rows of this data frame correspond to study subjects, and 
+The rows of this data frame correspond to th estudy subjects, and 
 the columns correspond to the variables. There are two default 
 variables, which are always present in \texttt{phdata}. 
-The first of these is ''id'', which contains study subject 
+The first of these is ``id'', which contains study subject 
 identification code. This identification code can be arbitrary 
 character, numer, or alphanumeric combination, 
 but every person must be coded with an unique ID.
-The second default variable is ''sex'', where males are coded 
-with ones (''1'') and females are coded with zero (''0'').
+The second default variable is ``sex'', where males are coded 
+with ones (``1'') and females are coded with zero (``0'').
 
 It is important to understand that this data frame is not 
 supposed to be directly modified by the user, as its 
@@ -98,19 +93,19 @@
 all GWA genetic information in an object of class 
 \texttt{snp.data} class. It is not supposed to be 
 modified directly by user. The genotypic data can 
-be accessed through \texttt{gtdata} function, e.g. 
+be accessed through the \texttt{gtdata} function, \eg 
 <<>>=
-gtdata(srdta[1:10,1:10])
+gtdata(srdta[1:10, 1:10])
 @
 As you can see, these data are of little direct use 
 as these are stored in an internal format -- 
-you need to coerce that to other data type if you 
-want to manipulate / analyse these data using 
+you need to coerce that to another data type if you 
+want to manipulate/analyse these data using 
 non-\GA{} functions (see section \ref{subs:gwaa}).
 
 The number of individuals described in an 
 object of \texttt{gwaa.data-class} can be 
-accessed through \texttt{nids} function, e.g.
+accessed through \texttt{nids} function, \eg
 <<>>=
 nids(srdta)
 @
@@ -120,9 +115,9 @@
 nsnps(srdta)
 @
 
-The IDs of individuals included in the study can be 
-accessd via \texttt{idnames} function, for example 
-IDs of the first 7 individuals in the study are
+The IDs of the individuals included in the study can be 
+accessed via the \texttt{idnames} function, for example 
+the IDs of the first 7 individuals in the study are
 <<>>=
 idnames(srdta)[1:7]
 @
@@ -132,18 +127,18 @@
 <<>>=
 male(srdta)[1:7]
 @
-where males (heterogametic sex) are assigned with '1' and 
-a homogametic sex (females) are assigned value '0'.
+where males (heterogametic sex) are assigned with ``1'' and 
+a homogametic sex (females) are assigned the value ``0''.
 
 Names of SNPs can be 
-accessed using \texttt{snpnames} function; for 
+accessed using the \texttt{snpnames} function; for 
 example the names of the first 10 SNPs in the 
 \texttt{srdta} are
 <<>>=
 snpnames(srdta)[1:10]
 @
 
-SNP annotation include (presented for the first 10 SNPs only):
+SNP annotation includes (shown for the first 10 SNPs only):
 \begin{itemize} 
 \item Chromosome:
 <<>>=
@@ -153,14 +148,14 @@
 <<>>=
 map(srdta)[1:10]
 @
-\item Coding (where the second allele is the ''effect'' or ''coded'' one):
+\item Coding (where the second allele is the ``effect'' or ``coded'' one):
 <<>>=
 coding(srdta)[1:10]
 @
-For every SNP, coding is presented with a pair of characters, 
-for example ''AG''. For 'AG' polymorphism, you may expect 
-''AA'', ''AG'' and ''GG'' genotypes to be found in population. 
-The order (that is ''AG'' vs ''GA'') is important -- 
+For every SNP, the coding is represented with a pair of characters, 
+for example ``AG''. For an ``AG'' polymorphism, you may expect 
+``AA'', ``AG'' and ``GG'' genotypes to be found in your population. 
+The order (that is ``AG'' vs.~``GA'') is important -- 
 the first allele reported is the one 
 which will be used as a reference in association analysis, and 
 thus the effects are reported for the second allele. You can 
@@ -173,16 +168,16 @@
 effallele(srdta)[1:10]
 @
 
-\item Strand on which the coding is reported ('+', '-' or missing, 'u'):
+\item The strand on which the coding is reported ('+', '-' or missing, 'u'):
 <<>>=
 strand(srdta)[1:10]
 @
 \end{itemize}
 
 \begin{summary}
-\item \GA{} uses special data class, \texttt{gwaa.data-class}, to
+\item \GA{} uses a special data class, \texttt{gwaa.data-class}, to
 	store GWA data.
-\item To access the content of an object of  \texttt{gwaa.data-class}, 
+\item To access the content of an object of \texttt{gwaa.data-class}, 
 	a number of functions is used
 \end{summary}
 
@@ -210,18 +205,18 @@
 <<>>=
 nids(srdta) - sum(male(srdta))
 @
-... or you could get the same answer like this\footnote{
+\ldots you could get the same answer like this\footnote{
 This is something covered later in the section \ref{subs:gwaa_subs} 
 ("\nameref{subs:gwaa_subs}")
 }:
 <<>>=
-sum(male(srdta)==0)
+sum(male(srdta) == 0)
 @
 The proportion of males can be computed using above results
 <<>>=
-sum(male(srdta))/nids(srdta)
+sum(male(srdta)) / nids(srdta)
 @
-or by using \texttt{mean()} function:
+or by using the \texttt{mean()} function:
 <<>>=
 mean(male(srdta))
 @
@@ -258,22 +253,22 @@
 %\end{ex}
 
 \begin{Exercise}[title=Exploring SNPs in \texttt{srdta}]%\label{ex2.1}
-Explore SNPs contained in \texttt{srdta} using functions to 
+Explore the SNPs contained in \texttt{srdta} using the functions to 
 access SNP names (\texttt{snpnames}) and map (\texttt{map}) 
 location
 \Question What are names of markers located after 2,490,000 b.p.?
 \Question Between 1,100,000 and 1,105,000 b.p.?
 \end{Exercise}
 \begin{Answer}
-The names of markers located after 2,490,000 b.p. are
+The names of markers located after 2,490,000 b.p.~are
 <<>>=
-vec <- (map(srdta)>2490000)
+vec <- (map(srdta) > 2490000)
 snpnames(srdta)[vec]
 @
 The names of markers located 
-between 1,100,000 and 1,105,000 b.p. are:
+between 1,100,000 and 1,105,000 b.p.~are:
 <<>>=
-vec <- (map(srdta)>1100000 & map(srdta)<1105000)
+vec <- (map(srdta) > 1100000 & map(srdta) < 1105000)
 snpnames(srdta)[vec]
 @
 \end{Answer}
@@ -284,19 +279,19 @@
 \section{Accessing and modifying phenotypic data}
 \label{subsec:modify_phdata}
 
-As it was already mentioned, the object returned by \texttt{phdata} 
+As was already mentioned, the object returned by \texttt{phdata} 
 contains phenotypic data and is 
-an conventional data frame, wich obligatory includes 
+a conventional data frame, wich obligatory includes 
 'id' and 'sex' variables, and ordered an a way that it 
 couples to the genotypic data. 
 
 Being a data frame, \texttt{phdata} may be accessed using 
-corresponding methods:
+the corresponding methods:
 <<>>=
-phdata(srdta)[1:5,]
+phdata(srdta)[1:5, ]
 class(phdata(srdta))
-phdata(srdta)[1:5,2]
-phdata(srdta)[1:5,"sex"]
+phdata(srdta)[1:5, 2]
+phdata(srdta)[1:5, "sex"]
 phdata(srdta)$sex[1:5]
 @
 
@@ -309,36 +304,37 @@
 from \texttt{phdata} part of an object of 
 \texttt{gwaa.data-class}. 
 
-For example, if you want to add a variable (say, square of age)
-computed from the 'age' variable of \texttt{srdta}
+For example, if you want to add a variable (say, the square of age)
+computed from the ``age'' variable of \texttt{srdta}
 <<>>=
-phdata(srdta)[1:5,]
+phdata(srdta)[1:5, ]
 age2 <- phdata(srdta)$age^2
 @
-you need to use \texttt{add.phdata} function:
+you need to use the \texttt{add.phdata} function:
 <<>>=
-srdta <- add.phdata(srdta,newph=age2,name="age_squared")
-phdata(srdta)[1:5,]
+srdta <- add.phdata(srdta, newph=age2, name="age_squared")
+phdata(srdta)[1:5, ]
 @
 
-You can add more then one variable at once using the 
-same function, however, in this case the second ('newph') 
+You can add more than one variable at once using the 
+same function, however, in this case the second (``newph'') 
 argument of the function should be a data frame, 
-which contains 'id' variable specifing the IDs of 
+which contains an 'id' variable specifing the IDs of the
 individuals. Imagine we have the data for individuals 
 'p1', 'p2' and 'p7' (we will generate random data 
 for them; pay attention only to the result):
 <<>>=
-newvalues <- matrix(rnorm(3*5),3,5)
-newdata <- data.frame(id=c("p1","p2","p7"),ph1=1,ph2=1,ph3=1,ph4=1,ph5=1)
-newdata[,c(2:6)] <- newvalues
+newvalues <- matrix( rnorm(3*5), 3, 5 )
+newdata   <- data.frame(id=c("p1", "p2", "p7"), 
+                        ph1=1, ph2=1, ph3=1, ph4=1, ph5=1)
+newdata[, c(2:6)] <- newvalues
 newdata
 @
 
-These data can be added to phenotypic data with
+These data can be added to the phenotypic data with
 <<>>=
 srdta <- add.phdata(srdta,newdata)
-phdata(srdta)[1:10,]
+phdata(srdta)[1:10, ]
 @
 
 Finally, if you need, you can delete some phenotypes 
@@ -346,15 +342,16 @@
 function. Let us delete the phenotypes we have just 
 added:
 <<>>=
-srdta <- del.phdata(srdta,c("age_squared","ph1","ph2","ph3","ph4","ph5"))
-phdata(srdta)[1:10,]
+srdta <- del.phdata(srdta, 
+                    c("age_squared", "ph1", "ph2", "ph3", "ph4", "ph5"))
+phdata(srdta)[1:10, ]
 @
 
 \begin{summary}
 \item Phenotypic data contained in an object of \texttt{gwaa.data-class} 
-	can be accessed using \texttt{phdata} functions
-\item You can add phenotypes using \texttt{add.phdata} function 
-\item You can delete phenotypes using \texttt{del.phdata} function 
+	can be accessed using the \texttt{phdata} functions
+\item You can add phenotypes using the \texttt{add.phdata} function 
+\item You can delete phenotypes using the \texttt{del.phdata} function 
 \end{summary}
 
 
@@ -368,12 +365,12 @@
 you may think of 
 an object of class \texttt{gwaa.data} as a matrix whose rows 
 correspond to study subjects and columns correspond to SNPs studied 
-(though the actual object is a way more complicated). For example, 
-if we would like to investigate what is the 
+(though the actual object is more complicated). For example, 
+if we would like to investigate the 
 content of \texttt{srdta} for the first 5 people and 3 SNPs, we can 
 run
 <<>>=
-ssubs <- srdta[1:5,1:3]
+ssubs <- srdta[1:5, 1:3]
 class(ssubs)
 @
 
@@ -384,42 +381,41 @@
 <<>>=
 phdata(ssubs)
 @
-and genotypc data, whcih can be accessed via \texttt{gtdata} 
+and genotypc data, which can be accessed via the \texttt{gtdata} 
 function:
 <<>>=
 gtdata(ssubs)
 @
 whose content is not quite straightforward to read.
 
-To get human-readable information, genotypic object  
+To get human-readable information, a genotypic object  
 should be coerced to a regular \texttt{R} data type, 
-e.g. character, using 
+\eg character, using the 
 \texttt{as.character()} function:
 <<>>=
 as.character(gtdata(ssubs))
 @
 
-Other useful coercion is to "numeric":
+Another useful coercion is to "numeric":
 <<>>=
 as.numeric(gtdata(ssubs))
 @
-
 Note that conversion to numeric happened according to the 
 underlying genotype and the rules specified by SNP coding:
 <<>>=
 coding(ssubs)
 @
 -- the genotype, which is made of the 'first' 
-allele of the 'code' is converted to 0, 
-the heterozygote to '1' and a himozygote 
-for the second allele is converted to '2'.
+allele of the 'code' is converted to ``0'', 
+the heterozygote to ``1'' and a homozygote 
+for the second allele is converted to ``2''.
 
-For example, when coding is ''GA'', as is for the \texttt{rs18} 
+For example, when the coding is ``GA'', as it is for the \texttt{rs18} 
 (the second SNP), homozygotes for the first allele, as
-specified by coding (''G'')
-are converted to zeros (''0''), heterozygotes are converted to 
-ones (''1''), and homozygotes for the second allele (''A'') are 
-converted to twos (''2''). Clearly, when numerically converted 
+specified by coding (``G'')
+are converted to zeros (``0''), heterozygotes are converted to 
+ones (``1''), and homozygotes for the second allele (``A'') are 
+converted to twos (``2''). Clearly, when numerically converted 
 data are used for association analysis, the effects will be 
 estimated for the second allele, while first will be used as 
 a reference.
@@ -430,11 +426,11 @@
 Several useful genetic analysis libraries were developed for R. 
 These include \texttt{genetics} (analysis of linkage disequilibrium and 
 many other useful functions) and \texttt{haplo.stats} (analysis of 
-association between traits and haplotypes). These use there own 
+association between traits and haplotypes). These use their own 
 genetic data formats. 
 
 One can translate \GA{} genetic data 
-to the format used by "genetics" library by \texttt{as.genotype()}:
+to the format used by "genetics" library using \texttt{as.genotype()}:
 <<>>=
 as.genotype(gtdata(ssubs))
 @
@@ -450,127 +446,127 @@
 as \texttt{scan.haplo()} and \texttt{scan.haplo.2D()}).
 
 It is possible to select sub-sets of \texttt{gwaa.data-class} based not 
-only on index (e.g. first 10 people and SNP number 33), but also based 
+only on index (\eg first 10 people and SNP number 33), but also based 
 on names. 
 
 For example, if we would like to retrieve phenotypic data on people 
 with IDs "p141", "p147" and "p2000", we can use
 <<>>=
-phdata(srdta[c("p141","p147","p2000"),])
+phdata( srdta[c("p141", "p147", "p2000"), ] )
 @
 here, the first part of expression sub-sets \texttt{srdta} on
 selected IDs, and the second tells which part of the retrieved 
 sub-set we want to see. You can try 
-\texttt{srdta[c("p141","p147","p2000"),]}, but be prepared to 
+\texttt{srdta[c("p141", "p147", "p2000"),]}, but be prepared to 
 see long output, as all information will be reported. 
 
-In similar manner, we can also select on SNP name. For example, 
+In a similar manner, we can also select on SNP name. For example, 
 if we are interested to see information on SNPs "rs10" and "rs29"
 for above people, we can run
 <<>>=
-phdata(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
-gtdata(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
+phdata(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
+gtdata(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
 @
 
 To see the actual genotypes for the above three people and 
 two SNPs, use
 <<>>=
-as.character(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
+as.character(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
 @
 or 
 <<>>=
-as.numeric(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
+as.numeric(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
 @
 
 \begin{Exercise}[title=Exploring rs114]%\label{allelecount}
 Explore genotypes for SNP "rs114"
 \Question What is the coding and which allele is the reference one?
-\Question What is the frequency of \textbf{non-reference} (''effective'') allele in total sample?
-\Question What is the frequency of effective allele in male?
-\Question What is the frequency of effective allele in female?
+\Question What is the frequency of \textbf{non-reference} (``effect'') allele in total sample?
+\Question What is the frequency of the effect allele in males?
+\Question What is the frequency of the effect allele in females?
 \Question What is the frequency of the \textbf{reference} allele 
-in total sample, males and females?
+in the total sample, males and females?
 \end{Exercise}
 \begin{Answer}
-To learn what allele of ''rs114'' is the reference 
+To learn what allele of ``rs114'' is the reference 
 you need to run
 <<>>=
 coding(srdta)["rs114"]
 @
-Here, the first (''\Sexpr{strsplit(coding(srdta)["rs114"],"")[[1]][1]}'') 
+Here, the first (``\Sexpr{strsplit(coding(srdta)["rs114"], "")[[1]][1]}'') 
 allele is the reference and thus 
-the second (''\Sexpr{strsplit(coding(srdta)["rs114"],"")[[1]][2]}'') 
-is the effective one. Remember that when using \texttt{as.numeric} 
+the second (``\Sexpr{strsplit(coding(srdta)["rs114"], "")[[1]][2]}'') 
+is the effect allele. Remember that when using the \texttt{as.numeric} 
 function to convert the genotypes to human-readable and \texttt{R}-operatable 
-format, the homozygotes for reference will be coded as ''0'', heterozygotes 
-as ''1'' and the non-reference (''effective'') homozygotes will be coded as ''2'':
+format, the homozygotes for reference will be coded as ``0'', heterozygotes 
+as ``1'' and the non-reference (``effect'') homozygotes will be coded as ``2'':
 <<>>=
-table(as.character(gtdata(srdta[,"rs114"])),as.numeric(gtdata(srdta[,"rs114"])))
+table(as.character( gtdata(srdta[, "rs114"] )), 
+      as.numeric(   gtdata(srdta[,"rs114"] )))
 @
 
-To compute frequency of the effective allele of SNP ''rs114'' in total 
+To compute the frequency of the effect allele of SNP ``rs114'' in the total 
 sample, you can go two ways. First, we can try to take a sum 
-of all ''rs114'' genotypes and divide it by twice the number of 
+of all ``rs114'' genotypes and divide it by twice the number of 
 people:
 <<>>=
-a <- as.numeric(gtdata(srdta[,"rs114"]))
+a <- as.numeric(gtdata(srdta[, "rs114"]))
 sum(a)
 @
-This, however, returns NA, because some of the genotypes are 
+This, however, returns \texttt{NA}, because some of the genotypes are 
 missing. We can deal with this problem by running \texttt{sum()}
 with the option \texttt{na.rm=TRUE}:
 <<>>=
-sum(a,na.rm=T)
+sum(a, na.rm=TRUE)
 @
-so the number of 'effect' alleles is \Sexpr{sum(a,na.rm=T)}.
+so the number of ``effect'' alleles is \Sexpr{sum(a, na.rm=TRUE)}.
 
 However, now we do not know 
 what was the number of people for whom the genotype was 
 measured! -- \texttt{nids} would return the \emph{total}
-number of people, but not the number of ones measured for 
-''rs114''. 
+number of people, but not the number of those measured for 
+``rs114''. 
 
-This problem can be dealt with through using \texttt{is.na(A)} 
+This problem can be dealt with through using the \texttt{is.na(A)} 
 function which returns true when some element of \texttt{A} is not measured.
-Thus, the number of people with measured genotype for ''rs114'' is
+Thus, the number of people with measured genotype for ``rs114'' is
 <<>>=
 nids(srdta)
 nmeasured <- sum(!is.na(a))
 nmeasured
 @
-(note the ''!'' before \texttt{is.na}, which means NOT, so we get 
-these elements which are not NA). 
-The frequency of the 'effect' allele thus is
+(note the ``!'' before \texttt{is.na}, which means NOT, so we get 
+these elements which are not \texttt{NA}). 
+Consequently, the frequency of the ``effect'' allele is
 <<>>=
-sum(a,na.rm=T)/(2*nmeasured)
+sum(a, na.rm=TRUE) / (2 * nmeasured)
 @
 
-
-An easier way would be to compute mean value of ''rs114'' with the 
-\texttt{mean( ... ,na.rm=TRUE)} function and divide it by 2:
+An easier way would be to compute mean value of ``rs114'' with the 
+\texttt{mean( \ldots , na.rm=TRUE)} function and divide it by 2:
 <<>>=
-mean(a,na.rm=T)/2
+mean(a, na.rm=TRUE)/2
 @
 
-To compute frequency of the effective allele of ''rs114'' in males, 
+To compute the frequency of the effect allele of ``rs114'' in males, 
 you can use
 <<>>=
-amale <- as.numeric(gtdata(srdta[male(srdta)==1,"rs114"]))
-mean(amale,na.rm=T)/2
+amale <- as.numeric( gtdata(srdta[male(srdta)==1, "rs114"]) )
+mean(amale, na.rm=TRUE)/2
 @
 
-To compute frequency of the effective allele in females, you can use
+To compute the frequency of the effect allele in females, you can use
 <<>>=
-afemale <- as.numeric(gtdata(srdta[male(srdta)==0,"rs114"]))
-mean(afemale,na.rm=T)/2
+afemale <- as.numeric( gtdata(srdta[male(srdta)==0, "rs114"]) )
+mean(afemale, na.rm=TRUE)/2
 @
 
 The frequencies of the reference allele are computed very simply as 
 one minus the frequency of the effective allele:
 <<>>=
-1 - mean(a,na.rm=T)/2
-1 - mean(amale,na.rm=T)/2
-1 - mean(afemale,na.rm=T)/2
+1 - mean(a,       na.rm=TRUE)/2
+1 - mean(amale,   na.rm=TRUE)/2
+1 - mean(afemale, na.rm=TRUE)/2
 @
 
 \end{Answer}
@@ -579,33 +575,33 @@
 \begin{summary}
 \item It is possible to obtain subsets of objects of 
 \texttt{gwaa.data-class} and \texttt{snp.data-class}
-using standard 2D sub-setting model \texttt{[i,j]}, 
+using the standard 2D sub-setting model \texttt{[i, j]}, 
 where \texttt{i} corresponds to study subjects and 
 \texttt{j} corresponds to SNPs.
 \item It is possible to provide ID and SNP names instead 
 of indexes when sub-setting an object of class 
 \texttt{gwaa.data-class}.
-\item Function \texttt{as.numeric()} converts 
+\item The function \texttt{as.numeric()} converts 
 genotypic data from \texttt{snp.data-class} to
 regular integer numbers, which can be used in analysis 
 with R.
-\item Function \texttt{as.character()} converts 
+\item The function \texttt{as.character()} converts 
 genotypic data from \texttt{snp.data-class} to
 character format.
-\item Function \texttt{as.genotype()} converts 
+\item The function \texttt{as.genotype()} converts 
 genotypic data from \texttt{snp.data-class} to
-the format used by library \texttt{genetics}.
-\item Function \texttt{as.hsgeno()} converts 
+the format used by the \texttt{genetics} library.
+\item The function \texttt{as.hsgeno()} converts 
 genotypic data from \texttt{snp.data-class} to
-the format used by library \texttt{haplo.stats}.
+the format used by the \texttt{haplo.stats} library.
 \end{summary}
 
 
 \section{Exploring genetic data}
 \label{subs:expgendat}
 
-Implementation of function \texttt{summary()} to
-summarize genotypic part of \texttt{gwaa.data-class}
+Implementation of the function \texttt{summary()} to
+summarize the genotypic part of a \texttt{gwaa.data-class} object
 is very useful in genetic data 
 exploration and quality control (QC). 
 Let us try application of this function to the 
@@ -615,44 +611,44 @@
 a
 @
 
-In the first section, the summary is generated for phenotypic data.
-In the second section, summary is generated for genotypic data. 
+In the first section, a summary is generated for the phenotypic data.
+In the second section, a summary is generated for the genotypic data. 
 In this section, \texttt{NoMeasured} refers to the number 
 of genotypes scores, \texttt{CallRate} to the proportion of these,
 \texttt{Q.2} is the frequency of the 'B' allele. The counts in 
 three genotypic classes are provided next. 
-\texttt{Pexact} refers to exact P-value for the test of Hardy-Weinberg
+\texttt{Pexact} refers to exact $P$-value for the test of Hardy-Weinberg
 equilibrium.
 
-As you've seen above, an object of the class \texttt{gwaa.data-class} 
-is sub-settable in 
-standard manner: \texttt{[i,j]}, where \texttt{i} is an index of a 
-study subject and \texttt{j} is an index of a SNP. Importantly, 
+As you have seen above, an object of the class \texttt{gwaa.data-class} 
+is sub-settable in the
+standard manner: \texttt{[i, j]}, where \texttt{i} is the index of a 
+study subject and \texttt{j} is the index of a SNP. Importantly, 
 \texttt{i} could be a list of indexes:
 <<>>=
-vec <- which(phdata(srdta)$age>=65)
+vec <- which(phdata(srdta)$age >= 65)
 vec
-summary(gtdata(srdta[vec,1:3]))
+summary( gtdata(srdta[vec, 1:3]) )
 @
 This shows summary of first three genotypes for people with age 
 greater then or equal to 65 y.o. The same result may be achieved 
 by sub-setting using a vector of logical values: 
 <<>>=
-vec <- (phdata(srdta)$age>=65)
+vec <- (phdata(srdta)$age >= 65)
 table(vec)
-summary(gtdata(srdta[vec,1:3]))
+summary( gtdata(srdta[vec, 1:3]) )
 @
 or a list with IDs of study subjects:
 <<>>=
 vec1 <- idnames(srdta)[vec]
 vec1
-summary(gtdata(srdta[vec1,1:3]))
+summary( gtdata(srdta[vec1, 1:3]) )
 @
 
-Let us explore the object returned by \texttt{summary} function 
-when applied to \texttt{snp.data} class in more details:
+Let us explore the object returned by the \texttt{summary} function 
+when applied to \texttt{snp.data} class in more detail:
 <<>>=
-a <- summary(gtdata(srdta[vec1,1:3]))
+a <- summary( gtdata(srdta[vec1, 1:3]) )
 class(a)
 @
 Thus, the object returned is a \texttt{data.frame}. Therefore 
@@ -674,18 +670,18 @@
 \begin{Answer}
 To test for HWE in first 10 SNPs in total sample
 <<>>=
-summary(gtdata(srdta[,1:10]))
+summary( gtdata(srdta[, 1:10]) )
 @ 
 To test it in cases
 <<>>=
-summary(gtdata(srdta[phdata(srdta)$bt==1,1:10]))
+summary( gtdata(srdta[phdata(srdta)$bt==1, 1:10]) )
 @ 
 in controls
 <<>>=
-summary(gtdata(srdta[phdata(srdta)$bt==0,1:10]))
+summary( gtdata(srdta[phdata(srdta)$bt==0, 1:10]) )
 @ 
 SNPs 'rs73' and 'rs128' are out of HWE (at $p \le 0.05$) 
-in total sample, and also in cases and controls.
+in the total sample, and also in cases and controls.
 \end{Answer}
 
 %Several pieces of information are critical in accessing quality of 
@@ -698,70 +694,63 @@
 
 Let us analyse the distribution of call rate in the whole study. For this, 
 we first need to obtain the vector of call rates:
-
 <<>>=
 sumgt <- summary(gtdata(srdta))
-crate <- sumgt[,"CallRate"]
+crate <- sumgt[, "CallRate"]
 @
 
-This vector may be presented by a histogram
+This vector may be depicted by a histogram
 <<>>=
 hist(crate)
 @
 which shows that most SNPs have call rate between 93 and 97\%
-(figure \ref{fig:histCR}).
+(Figure~\ref{fig:histCR}).
 \begin{figure}
 \centering
 <<fig=true,echo=false>>=
 hist(crate)
 @
-\caption{Histogram of the call rate}
+\caption{Histogram of the call rate.}
 \label{fig:histCR}
 \end{figure}
 
-As next step, you would like to produce a summary table, 
+As a next step, you would like to produce a summary table, 
 showing how many markers had call rate lower than, say, 
 93\%, between 93 and 95\%, between 95 and 99\% and more than 99\%. 
-You can use \texttt{catable()} command for that:
-
+You can use the \texttt{catable()} command for that:
 <<>>=
-catable(crate,c(.93,.95,.99))
+catable(crate, c(.93, .95, .99))
 @
-
-Similar procedure may be applied to see deviation from HWE:
-
+A similar procedure may be applied to see deviation from HWE:
 <<>>=
-hwp <- sumgt[,"Pexact"]
-catable(hwp,c(0.05/nsnps(srdta),0.01,0.05,0.1))
+hwp <- sumgt[, "Pexact"]
+catable(hwp, c(0.05/nsnps(srdta), 0.01, 0.05, 0.1))
 @
-
 The first cut-off category will detect SNPs which are 
-deviating from HWE at the Bonferroni-corrected P-level.
+deviating from HWE at the Bonferroni-corrected $P$-level.
 
-However, for these data it will make more sense to table cumulative 
+However, for these data it will make more sense to table the cumulative 
 distribution:
-
 <<>>=
-catable(hwp,c(0.05/nsnps(srdta),0.01,0.05,0.1),cum=T)
+catable(hwp, c(0.05/nsnps(srdta), 0.01, 0.05, 0.1), cum=TRUE)
 @
 
 If you would like to investigate the minor allele frequency 
 (MAF) distribution, the same logic would apply. First, derive 
-MAF with 
+the MAF with 
 <<>>=
-afr <- sumgt[,"Q.2"]
-maf <- pmin(afr,(1.-afr))
+afr <- sumgt[, "Q.2"]
+maf <- pmin(afr, (1. - afr))
 @
-
 Next, generate histograms for frequency and MAF:
 \begin{figure}
 \centering
 <<fig=true,echo=false>>=
-par(mfcol=c(1,2))
+par(mfcol=c(1, 2))
 hist(afr)
 hist(maf)
 @
-\caption{Histogram of the call rate}
+\caption{Histogram of the call rate.}
 \label{fig:histMAF}
 \end{figure}
 <<>>=
@@ -769,19 +758,19 @@
 hist(afr)
 hist(maf)
 @
-(shown at the figure \ref{fig:histMAF}) and then generate table describing frequency distribution:
+(shown in Figure~\ref{fig:histMAF}) and then generate a table describing the 
+frequency distribution:
 <<>>=
-catable(afr,c(0.01,0.05,0.1,0.2,0.5,0.8,0.9,0.95,0.99))
-catable(maf,c(0,0.01,0.05,0.1,0.2),cum=T)
+catable(afr, c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.99))
+catable(maf, c(0, 0.01, 0.05, 0.1, 0.2), cum=TRUE)
 @
+Note that we used ``0'' as the first category -- this will give you the number of 
+monomorphic SNPs which we recommend to exclude from analysis.
 
-Note that we used "0" as the first category -- this will give you the number of 
-monomorhic SNPs which we recommend to exclude from analysis.
-
-Other function, \texttt{perid.summary}, produces summary SNP statistics 
+Another function, \texttt{perid.summary}, produces summary SNP statistics 
 per person. Let us try producing this summary for the first 10 people:
 <<>>=
-perid.summary(srdta[1:10,])
+perid.summary(srdta[1:10, ])
 @
 This table lists the number of genotypes scored for the person, 
 call rate, and heterozygosity. 
@@ -793,56 +782,56 @@
 <<>>=
 het <- perid.summary(srdta)$Het
 mean(het)
-catable(het,c(0.1,0.25,0.3,0.35,0.5))
+catable(het, c(0.1, 0.25, 0.3, 0.35, 0.5))
 hist(het)
 @
-The resulting histogram is presented in figure \ref{fig:histHet}.
-It is easy to see that few people have very low heterozygosity, 
+The resulting histogram is shown in Figure~\ref{fig:histHet}.
+It is easy to see that a few people have very low heterozygosity, 
 but there are no outliers with extremely high values. 
 \begin{figure}
 \centering
 <<fig=true,echo=false>>=
 hist(het)
 @
-\caption{Histogram of heterozygosity}
+\caption{Histogram of heterozygosity.}
 \label{fig:histHet}
 \end{figure}
 
-In this section, we covered low-level functions \texttt{summary} 
-and \texttt{perid.summary}. Based on these, an upper-level 
+In this section, we covered the low-level functions \texttt{summary} 
+and \texttt{perid.summary}. On these functions a higher level 
 genetic data quality control function, \texttt{check.marker}, is 
 based. That function will be covered in the next section.
 
 \begin{summary}
-\item When \texttt{summary()} function is applied to an 
-\texttt{gtdata} subset of \texttt{gwaa.data-class}, it return summary statistics for SNPs, 
-including exact test for Hardy-Weinberg equilibrium.
-\item When \texttt{perid.summary()} function is applied to an 
-object of \texttt{gwaa.data-class} (or \texttt{gtdata} part of it), 
-it return per-person summary statistics, 
-including the call rate within this person and its' heterozygosity.
+\item When the \texttt{summary()} function is applied to a
+\texttt{gtdata} subset of \texttt{gwaa.data-class}, it returns summary statistics for the SNPs, 
+including an exact test for Hardy-Weinberg equilibrium.
+\item When the \texttt{perid.summary()} function is applied to an 
+object of \texttt{gwaa.data-class} (or the \texttt{gtdata} part of it), 
+it returns per-person summary statistics, 
+including the call rate within this person and its heterozygosity.
 \end{summary}
 
 \begin{Exercise}[title=Characterizing call rate]
 Characterise the distribution of call rates within study subjects 
-and produce a histogram. How many people have 
+and produce a histogram. How many people have a
 call rate below 93\%?
 \end{Exercise}
 \begin{Answer}
 To characterize ID call rate, you can run the following commands:
 <<>>=
   idsummary <- perid.summary(srdta)
-  idsummary[1:5,]
+  idsummary[1:5, ]
   idcall <- idsummary$Call
   idcall[1:5]
-  catable(idcall,c(0.9,0.93,0.95,0.98,0.99))
-  table(idcall<0.93)
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/genabel -r 1193


More information about the Genabel-commits mailing list