[Genabel-commits] r1193 - tutorials/GenABEL_general
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Wed Apr 17 01:29:19 CEST 2013
Author: lckarssen
Date: 2013-04-17 01:29:18 +0200 (Wed, 17 Apr 2013)
New Revision: 1193
Modified:
tutorials/GenABEL_general/workgwaaclass.Rnw
Log:
Fix several layout and spelling bugs in the 'intro to the GenABEL package' chapter of the GenABEL tutorial.
Modified: tutorials/GenABEL_general/workgwaaclass.Rnw
===================================================================
--- tutorials/GenABEL_general/workgwaaclass.Rnw 2013-04-16 19:07:09 UTC (rev 1192)
+++ tutorials/GenABEL_general/workgwaaclass.Rnw 2013-04-16 23:29:18 UTC (rev 1193)
@@ -1,22 +1,19 @@
-\chapter{Introduction to \GA{}}
+\chapter{Introduction to the \GA{}}
\label{sec:workgwaaclass}
In this section, you will become familiar with the
-\GA{} library, designed for GWA analysis. Compared to
+\GA{} library, designed for GWA analysis. Compared to the
\texttt{genetics} package, it provides specific
facilities for storage and manipulation of large amounts
of data, very fast tests for GWA analysis, and special
functions to analyse and graphically present the
results of GWA analysis (thus "analysis of analysis").
-Start R and
-load \GA{} library using command
-
+Start R and load the \GA{} library using the command
<<results=hide>>=
library(GenABEL)
@
-
-After that, load example data set using the command
+After that, load the example data set using the command
<<>>=
data(srdta)
@
@@ -25,51 +22,49 @@
\label{subs:gwaaclass}
The object you have loaded, \texttt{srdta}, belongs to the
-\texttt{gwaa.data} class. This is a special class developed to facilitate
+\texttt{gwaa.data} class. This is a special R class developed to facilitate
GWA analysis.
In GWA analysis, different types of data are used. These
include the phenotypic and genotypic data on the study
participants and chromosome and location of every SNP.
For every SNP, it is desirable to know the details of
-coding (what are alleles? -- A, T, G, C? -- and what is
+coding (how are the alleles coded? -- A, T, G, C? -- as well as
the strand -- '+' or '-', 'top' or 'bot'? -- this coding is for).
One could attempt to store all phenotypes and genotypes
-together in a single table, using, e.g. one row per study subject;
+together in a single table, using, \eg one row per study subject;
than the columns will correspond to study phenotypes and SNPs.
For a typical GWA data set, this would lead to a table of few thousands
rows and few hundreds of thousands to millions of columns. Such a format is generated
when one downloads HapMap data for a region.
To store GWA data in such tables internally, within R,
-proves to be inefficient. In \GA{}, special data class,
-\texttt{gwaa.data-class} is used
-to store GWA data.
+proves to be inefficient. In \GA{}, a special data class,
+\texttt{gwaa.data-class} is used to store GWA data.
You may consider an object of \texttt{gwaa.data-class} as
a 'black box' from which you can get specific data using
-specific functions. If you are interested in internal
+specific functions. If you are interested in the internal
structure of the \texttt{gwaa.data-class}, you can find
the description in section \ref{subs:gwaaclass_internal}
(\nameref{subs:gwaaclass_internal}).
The data frame, which contains all phenotypic data in the
study may be accessed using the \texttt{phdata} function.
-Let us have a look at few first rows of the phenotypic data
+Let us have a look at the first few rows of the phenotypic data
frame of \texttt{srdta}:
<<>>=
phdata(srdta)[1:5,]
@
-
-The rows of this data frame correspond to study subjects, and
+The rows of this data frame correspond to th estudy subjects, and
the columns correspond to the variables. There are two default
variables, which are always present in \texttt{phdata}.
-The first of these is ''id'', which contains study subject
+The first of these is ``id'', which contains study subject
identification code. This identification code can be arbitrary
character, numer, or alphanumeric combination,
but every person must be coded with an unique ID.
-The second default variable is ''sex'', where males are coded
-with ones (''1'') and females are coded with zero (''0'').
+The second default variable is ``sex'', where males are coded
+with ones (``1'') and females are coded with zero (``0'').
It is important to understand that this data frame is not
supposed to be directly modified by the user, as its
@@ -98,19 +93,19 @@
all GWA genetic information in an object of class
\texttt{snp.data} class. It is not supposed to be
modified directly by user. The genotypic data can
-be accessed through \texttt{gtdata} function, e.g.
+be accessed through the \texttt{gtdata} function, \eg
<<>>=
-gtdata(srdta[1:10,1:10])
+gtdata(srdta[1:10, 1:10])
@
As you can see, these data are of little direct use
as these are stored in an internal format --
-you need to coerce that to other data type if you
-want to manipulate / analyse these data using
+you need to coerce that to another data type if you
+want to manipulate/analyse these data using
non-\GA{} functions (see section \ref{subs:gwaa}).
The number of individuals described in an
object of \texttt{gwaa.data-class} can be
-accessed through \texttt{nids} function, e.g.
+accessed through \texttt{nids} function, \eg
<<>>=
nids(srdta)
@
@@ -120,9 +115,9 @@
nsnps(srdta)
@
-The IDs of individuals included in the study can be
-accessd via \texttt{idnames} function, for example
-IDs of the first 7 individuals in the study are
+The IDs of the individuals included in the study can be
+accessed via the \texttt{idnames} function, for example
+the IDs of the first 7 individuals in the study are
<<>>=
idnames(srdta)[1:7]
@
@@ -132,18 +127,18 @@
<<>>=
male(srdta)[1:7]
@
-where males (heterogametic sex) are assigned with '1' and
-a homogametic sex (females) are assigned value '0'.
+where males (heterogametic sex) are assigned with ``1'' and
+a homogametic sex (females) are assigned the value ``0''.
Names of SNPs can be
-accessed using \texttt{snpnames} function; for
+accessed using the \texttt{snpnames} function; for
example the names of the first 10 SNPs in the
\texttt{srdta} are
<<>>=
snpnames(srdta)[1:10]
@
-SNP annotation include (presented for the first 10 SNPs only):
+SNP annotation includes (shown for the first 10 SNPs only):
\begin{itemize}
\item Chromosome:
<<>>=
@@ -153,14 +148,14 @@
<<>>=
map(srdta)[1:10]
@
-\item Coding (where the second allele is the ''effect'' or ''coded'' one):
+\item Coding (where the second allele is the ``effect'' or ``coded'' one):
<<>>=
coding(srdta)[1:10]
@
-For every SNP, coding is presented with a pair of characters,
-for example ''AG''. For 'AG' polymorphism, you may expect
-''AA'', ''AG'' and ''GG'' genotypes to be found in population.
-The order (that is ''AG'' vs ''GA'') is important --
+For every SNP, the coding is represented with a pair of characters,
+for example ``AG''. For an ``AG'' polymorphism, you may expect
+``AA'', ``AG'' and ``GG'' genotypes to be found in your population.
+The order (that is ``AG'' vs.~``GA'') is important --
the first allele reported is the one
which will be used as a reference in association analysis, and
thus the effects are reported for the second allele. You can
@@ -173,16 +168,16 @@
effallele(srdta)[1:10]
@
-\item Strand on which the coding is reported ('+', '-' or missing, 'u'):
+\item The strand on which the coding is reported ('+', '-' or missing, 'u'):
<<>>=
strand(srdta)[1:10]
@
\end{itemize}
\begin{summary}
-\item \GA{} uses special data class, \texttt{gwaa.data-class}, to
+\item \GA{} uses a special data class, \texttt{gwaa.data-class}, to
store GWA data.
-\item To access the content of an object of \texttt{gwaa.data-class},
+\item To access the content of an object of \texttt{gwaa.data-class},
a number of functions is used
\end{summary}
@@ -210,18 +205,18 @@
<<>>=
nids(srdta) - sum(male(srdta))
@
-... or you could get the same answer like this\footnote{
+\ldots you could get the same answer like this\footnote{
This is something covered later in the section \ref{subs:gwaa_subs}
("\nameref{subs:gwaa_subs}")
}:
<<>>=
-sum(male(srdta)==0)
+sum(male(srdta) == 0)
@
The proportion of males can be computed using above results
<<>>=
-sum(male(srdta))/nids(srdta)
+sum(male(srdta)) / nids(srdta)
@
-or by using \texttt{mean()} function:
+or by using the \texttt{mean()} function:
<<>>=
mean(male(srdta))
@
@@ -258,22 +253,22 @@
%\end{ex}
\begin{Exercise}[title=Exploring SNPs in \texttt{srdta}]%\label{ex2.1}
-Explore SNPs contained in \texttt{srdta} using functions to
+Explore the SNPs contained in \texttt{srdta} using the functions to
access SNP names (\texttt{snpnames}) and map (\texttt{map})
location
\Question What are names of markers located after 2,490,000 b.p.?
\Question Between 1,100,000 and 1,105,000 b.p.?
\end{Exercise}
\begin{Answer}
-The names of markers located after 2,490,000 b.p. are
+The names of markers located after 2,490,000 b.p.~are
<<>>=
-vec <- (map(srdta)>2490000)
+vec <- (map(srdta) > 2490000)
snpnames(srdta)[vec]
@
The names of markers located
-between 1,100,000 and 1,105,000 b.p. are:
+between 1,100,000 and 1,105,000 b.p.~are:
<<>>=
-vec <- (map(srdta)>1100000 & map(srdta)<1105000)
+vec <- (map(srdta) > 1100000 & map(srdta) < 1105000)
snpnames(srdta)[vec]
@
\end{Answer}
@@ -284,19 +279,19 @@
\section{Accessing and modifying phenotypic data}
\label{subsec:modify_phdata}
-As it was already mentioned, the object returned by \texttt{phdata}
+As was already mentioned, the object returned by \texttt{phdata}
contains phenotypic data and is
-an conventional data frame, wich obligatory includes
+a conventional data frame, wich obligatory includes
'id' and 'sex' variables, and ordered an a way that it
couples to the genotypic data.
Being a data frame, \texttt{phdata} may be accessed using
-corresponding methods:
+the corresponding methods:
<<>>=
-phdata(srdta)[1:5,]
+phdata(srdta)[1:5, ]
class(phdata(srdta))
-phdata(srdta)[1:5,2]
-phdata(srdta)[1:5,"sex"]
+phdata(srdta)[1:5, 2]
+phdata(srdta)[1:5, "sex"]
phdata(srdta)$sex[1:5]
@
@@ -309,36 +304,37 @@
from \texttt{phdata} part of an object of
\texttt{gwaa.data-class}.
-For example, if you want to add a variable (say, square of age)
-computed from the 'age' variable of \texttt{srdta}
+For example, if you want to add a variable (say, the square of age)
+computed from the ``age'' variable of \texttt{srdta}
<<>>=
-phdata(srdta)[1:5,]
+phdata(srdta)[1:5, ]
age2 <- phdata(srdta)$age^2
@
-you need to use \texttt{add.phdata} function:
+you need to use the \texttt{add.phdata} function:
<<>>=
-srdta <- add.phdata(srdta,newph=age2,name="age_squared")
-phdata(srdta)[1:5,]
+srdta <- add.phdata(srdta, newph=age2, name="age_squared")
+phdata(srdta)[1:5, ]
@
-You can add more then one variable at once using the
-same function, however, in this case the second ('newph')
+You can add more than one variable at once using the
+same function, however, in this case the second (``newph'')
argument of the function should be a data frame,
-which contains 'id' variable specifing the IDs of
+which contains an 'id' variable specifing the IDs of the
individuals. Imagine we have the data for individuals
'p1', 'p2' and 'p7' (we will generate random data
for them; pay attention only to the result):
<<>>=
-newvalues <- matrix(rnorm(3*5),3,5)
-newdata <- data.frame(id=c("p1","p2","p7"),ph1=1,ph2=1,ph3=1,ph4=1,ph5=1)
-newdata[,c(2:6)] <- newvalues
+newvalues <- matrix( rnorm(3*5), 3, 5 )
+newdata <- data.frame(id=c("p1", "p2", "p7"),
+ ph1=1, ph2=1, ph3=1, ph4=1, ph5=1)
+newdata[, c(2:6)] <- newvalues
newdata
@
-These data can be added to phenotypic data with
+These data can be added to the phenotypic data with
<<>>=
srdta <- add.phdata(srdta,newdata)
-phdata(srdta)[1:10,]
+phdata(srdta)[1:10, ]
@
Finally, if you need, you can delete some phenotypes
@@ -346,15 +342,16 @@
function. Let us delete the phenotypes we have just
added:
<<>>=
-srdta <- del.phdata(srdta,c("age_squared","ph1","ph2","ph3","ph4","ph5"))
-phdata(srdta)[1:10,]
+srdta <- del.phdata(srdta,
+ c("age_squared", "ph1", "ph2", "ph3", "ph4", "ph5"))
+phdata(srdta)[1:10, ]
@
\begin{summary}
\item Phenotypic data contained in an object of \texttt{gwaa.data-class}
- can be accessed using \texttt{phdata} functions
-\item You can add phenotypes using \texttt{add.phdata} function
-\item You can delete phenotypes using \texttt{del.phdata} function
+ can be accessed using the \texttt{phdata} functions
+\item You can add phenotypes using the \texttt{add.phdata} function
+\item You can delete phenotypes using the \texttt{del.phdata} function
\end{summary}
@@ -368,12 +365,12 @@
you may think of
an object of class \texttt{gwaa.data} as a matrix whose rows
correspond to study subjects and columns correspond to SNPs studied
-(though the actual object is a way more complicated). For example,
-if we would like to investigate what is the
+(though the actual object is more complicated). For example,
+if we would like to investigate the
content of \texttt{srdta} for the first 5 people and 3 SNPs, we can
run
<<>>=
-ssubs <- srdta[1:5,1:3]
+ssubs <- srdta[1:5, 1:3]
class(ssubs)
@
@@ -384,42 +381,41 @@
<<>>=
phdata(ssubs)
@
-and genotypc data, whcih can be accessed via \texttt{gtdata}
+and genotypc data, which can be accessed via the \texttt{gtdata}
function:
<<>>=
gtdata(ssubs)
@
whose content is not quite straightforward to read.
-To get human-readable information, genotypic object
+To get human-readable information, a genotypic object
should be coerced to a regular \texttt{R} data type,
-e.g. character, using
+\eg character, using the
\texttt{as.character()} function:
<<>>=
as.character(gtdata(ssubs))
@
-Other useful coercion is to "numeric":
+Another useful coercion is to "numeric":
<<>>=
as.numeric(gtdata(ssubs))
@
-
Note that conversion to numeric happened according to the
underlying genotype and the rules specified by SNP coding:
<<>>=
coding(ssubs)
@
-- the genotype, which is made of the 'first'
-allele of the 'code' is converted to 0,
-the heterozygote to '1' and a himozygote
-for the second allele is converted to '2'.
+allele of the 'code' is converted to ``0'',
+the heterozygote to ``1'' and a homozygote
+for the second allele is converted to ``2''.
-For example, when coding is ''GA'', as is for the \texttt{rs18}
+For example, when the coding is ``GA'', as it is for the \texttt{rs18}
(the second SNP), homozygotes for the first allele, as
-specified by coding (''G'')
-are converted to zeros (''0''), heterozygotes are converted to
-ones (''1''), and homozygotes for the second allele (''A'') are
-converted to twos (''2''). Clearly, when numerically converted
+specified by coding (``G'')
+are converted to zeros (``0''), heterozygotes are converted to
+ones (``1''), and homozygotes for the second allele (``A'') are
+converted to twos (``2''). Clearly, when numerically converted
data are used for association analysis, the effects will be
estimated for the second allele, while first will be used as
a reference.
@@ -430,11 +426,11 @@
Several useful genetic analysis libraries were developed for R.
These include \texttt{genetics} (analysis of linkage disequilibrium and
many other useful functions) and \texttt{haplo.stats} (analysis of
-association between traits and haplotypes). These use there own
+association between traits and haplotypes). These use their own
genetic data formats.
One can translate \GA{} genetic data
-to the format used by "genetics" library by \texttt{as.genotype()}:
+to the format used by "genetics" library using \texttt{as.genotype()}:
<<>>=
as.genotype(gtdata(ssubs))
@
@@ -450,127 +446,127 @@
as \texttt{scan.haplo()} and \texttt{scan.haplo.2D()}).
It is possible to select sub-sets of \texttt{gwaa.data-class} based not
-only on index (e.g. first 10 people and SNP number 33), but also based
+only on index (\eg first 10 people and SNP number 33), but also based
on names.
For example, if we would like to retrieve phenotypic data on people
with IDs "p141", "p147" and "p2000", we can use
<<>>=
-phdata(srdta[c("p141","p147","p2000"),])
+phdata( srdta[c("p141", "p147", "p2000"), ] )
@
here, the first part of expression sub-sets \texttt{srdta} on
selected IDs, and the second tells which part of the retrieved
sub-set we want to see. You can try
-\texttt{srdta[c("p141","p147","p2000"),]}, but be prepared to
+\texttt{srdta[c("p141", "p147", "p2000"),]}, but be prepared to
see long output, as all information will be reported.
-In similar manner, we can also select on SNP name. For example,
+In a similar manner, we can also select on SNP name. For example,
if we are interested to see information on SNPs "rs10" and "rs29"
for above people, we can run
<<>>=
-phdata(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
-gtdata(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
+phdata(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
+gtdata(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
@
To see the actual genotypes for the above three people and
two SNPs, use
<<>>=
-as.character(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
+as.character(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
@
or
<<>>=
-as.numeric(srdta[c("p141","p147","p2000"),c("rs10","rs29")])
+as.numeric(srdta[c("p141", "p147", "p2000"), c("rs10", "rs29")])
@
\begin{Exercise}[title=Exploring rs114]%\label{allelecount}
Explore genotypes for SNP "rs114"
\Question What is the coding and which allele is the reference one?
-\Question What is the frequency of \textbf{non-reference} (''effective'') allele in total sample?
-\Question What is the frequency of effective allele in male?
-\Question What is the frequency of effective allele in female?
+\Question What is the frequency of \textbf{non-reference} (``effect'') allele in total sample?
+\Question What is the frequency of the effect allele in males?
+\Question What is the frequency of the effect allele in females?
\Question What is the frequency of the \textbf{reference} allele
-in total sample, males and females?
+in the total sample, males and females?
\end{Exercise}
\begin{Answer}
-To learn what allele of ''rs114'' is the reference
+To learn what allele of ``rs114'' is the reference
you need to run
<<>>=
coding(srdta)["rs114"]
@
-Here, the first (''\Sexpr{strsplit(coding(srdta)["rs114"],"")[[1]][1]}'')
+Here, the first (``\Sexpr{strsplit(coding(srdta)["rs114"], "")[[1]][1]}'')
allele is the reference and thus
-the second (''\Sexpr{strsplit(coding(srdta)["rs114"],"")[[1]][2]}'')
-is the effective one. Remember that when using \texttt{as.numeric}
+the second (``\Sexpr{strsplit(coding(srdta)["rs114"], "")[[1]][2]}'')
+is the effect allele. Remember that when using the \texttt{as.numeric}
function to convert the genotypes to human-readable and \texttt{R}-operatable
-format, the homozygotes for reference will be coded as ''0'', heterozygotes
-as ''1'' and the non-reference (''effective'') homozygotes will be coded as ''2'':
+format, the homozygotes for reference will be coded as ``0'', heterozygotes
+as ``1'' and the non-reference (``effect'') homozygotes will be coded as ``2'':
<<>>=
-table(as.character(gtdata(srdta[,"rs114"])),as.numeric(gtdata(srdta[,"rs114"])))
+table(as.character( gtdata(srdta[, "rs114"] )),
+ as.numeric( gtdata(srdta[,"rs114"] )))
@
-To compute frequency of the effective allele of SNP ''rs114'' in total
+To compute the frequency of the effect allele of SNP ``rs114'' in the total
sample, you can go two ways. First, we can try to take a sum
-of all ''rs114'' genotypes and divide it by twice the number of
+of all ``rs114'' genotypes and divide it by twice the number of
people:
<<>>=
-a <- as.numeric(gtdata(srdta[,"rs114"]))
+a <- as.numeric(gtdata(srdta[, "rs114"]))
sum(a)
@
-This, however, returns NA, because some of the genotypes are
+This, however, returns \texttt{NA}, because some of the genotypes are
missing. We can deal with this problem by running \texttt{sum()}
with the option \texttt{na.rm=TRUE}:
<<>>=
-sum(a,na.rm=T)
+sum(a, na.rm=TRUE)
@
-so the number of 'effect' alleles is \Sexpr{sum(a,na.rm=T)}.
+so the number of ``effect'' alleles is \Sexpr{sum(a, na.rm=TRUE)}.
However, now we do not know
what was the number of people for whom the genotype was
measured! -- \texttt{nids} would return the \emph{total}
-number of people, but not the number of ones measured for
-''rs114''.
+number of people, but not the number of those measured for
+``rs114''.
-This problem can be dealt with through using \texttt{is.na(A)}
+This problem can be dealt with through using the \texttt{is.na(A)}
function which returns true when some element of \texttt{A} is not measured.
-Thus, the number of people with measured genotype for ''rs114'' is
+Thus, the number of people with measured genotype for ``rs114'' is
<<>>=
nids(srdta)
nmeasured <- sum(!is.na(a))
nmeasured
@
-(note the ''!'' before \texttt{is.na}, which means NOT, so we get
-these elements which are not NA).
-The frequency of the 'effect' allele thus is
+(note the ``!'' before \texttt{is.na}, which means NOT, so we get
+these elements which are not \texttt{NA}).
+Consequently, the frequency of the ``effect'' allele is
<<>>=
-sum(a,na.rm=T)/(2*nmeasured)
+sum(a, na.rm=TRUE) / (2 * nmeasured)
@
-
-An easier way would be to compute mean value of ''rs114'' with the
-\texttt{mean( ... ,na.rm=TRUE)} function and divide it by 2:
+An easier way would be to compute mean value of ``rs114'' with the
+\texttt{mean( \ldots , na.rm=TRUE)} function and divide it by 2:
<<>>=
-mean(a,na.rm=T)/2
+mean(a, na.rm=TRUE)/2
@
-To compute frequency of the effective allele of ''rs114'' in males,
+To compute the frequency of the effect allele of ``rs114'' in males,
you can use
<<>>=
-amale <- as.numeric(gtdata(srdta[male(srdta)==1,"rs114"]))
-mean(amale,na.rm=T)/2
+amale <- as.numeric( gtdata(srdta[male(srdta)==1, "rs114"]) )
+mean(amale, na.rm=TRUE)/2
@
-To compute frequency of the effective allele in females, you can use
+To compute the frequency of the effect allele in females, you can use
<<>>=
-afemale <- as.numeric(gtdata(srdta[male(srdta)==0,"rs114"]))
-mean(afemale,na.rm=T)/2
+afemale <- as.numeric( gtdata(srdta[male(srdta)==0, "rs114"]) )
+mean(afemale, na.rm=TRUE)/2
@
The frequencies of the reference allele are computed very simply as
one minus the frequency of the effective allele:
<<>>=
-1 - mean(a,na.rm=T)/2
-1 - mean(amale,na.rm=T)/2
-1 - mean(afemale,na.rm=T)/2
+1 - mean(a, na.rm=TRUE)/2
+1 - mean(amale, na.rm=TRUE)/2
+1 - mean(afemale, na.rm=TRUE)/2
@
\end{Answer}
@@ -579,33 +575,33 @@
\begin{summary}
\item It is possible to obtain subsets of objects of
\texttt{gwaa.data-class} and \texttt{snp.data-class}
-using standard 2D sub-setting model \texttt{[i,j]},
+using the standard 2D sub-setting model \texttt{[i, j]},
where \texttt{i} corresponds to study subjects and
\texttt{j} corresponds to SNPs.
\item It is possible to provide ID and SNP names instead
of indexes when sub-setting an object of class
\texttt{gwaa.data-class}.
-\item Function \texttt{as.numeric()} converts
+\item The function \texttt{as.numeric()} converts
genotypic data from \texttt{snp.data-class} to
regular integer numbers, which can be used in analysis
with R.
-\item Function \texttt{as.character()} converts
+\item The function \texttt{as.character()} converts
genotypic data from \texttt{snp.data-class} to
character format.
-\item Function \texttt{as.genotype()} converts
+\item The function \texttt{as.genotype()} converts
genotypic data from \texttt{snp.data-class} to
-the format used by library \texttt{genetics}.
-\item Function \texttt{as.hsgeno()} converts
+the format used by the \texttt{genetics} library.
+\item The function \texttt{as.hsgeno()} converts
genotypic data from \texttt{snp.data-class} to
-the format used by library \texttt{haplo.stats}.
+the format used by the \texttt{haplo.stats} library.
\end{summary}
\section{Exploring genetic data}
\label{subs:expgendat}
-Implementation of function \texttt{summary()} to
-summarize genotypic part of \texttt{gwaa.data-class}
+Implementation of the function \texttt{summary()} to
+summarize the genotypic part of a \texttt{gwaa.data-class} object
is very useful in genetic data
exploration and quality control (QC).
Let us try application of this function to the
@@ -615,44 +611,44 @@
a
@
-In the first section, the summary is generated for phenotypic data.
-In the second section, summary is generated for genotypic data.
+In the first section, a summary is generated for the phenotypic data.
+In the second section, a summary is generated for the genotypic data.
In this section, \texttt{NoMeasured} refers to the number
of genotypes scores, \texttt{CallRate} to the proportion of these,
\texttt{Q.2} is the frequency of the 'B' allele. The counts in
three genotypic classes are provided next.
-\texttt{Pexact} refers to exact P-value for the test of Hardy-Weinberg
+\texttt{Pexact} refers to exact $P$-value for the test of Hardy-Weinberg
equilibrium.
-As you've seen above, an object of the class \texttt{gwaa.data-class}
-is sub-settable in
-standard manner: \texttt{[i,j]}, where \texttt{i} is an index of a
-study subject and \texttt{j} is an index of a SNP. Importantly,
+As you have seen above, an object of the class \texttt{gwaa.data-class}
+is sub-settable in the
+standard manner: \texttt{[i, j]}, where \texttt{i} is the index of a
+study subject and \texttt{j} is the index of a SNP. Importantly,
\texttt{i} could be a list of indexes:
<<>>=
-vec <- which(phdata(srdta)$age>=65)
+vec <- which(phdata(srdta)$age >= 65)
vec
-summary(gtdata(srdta[vec,1:3]))
+summary( gtdata(srdta[vec, 1:3]) )
@
This shows summary of first three genotypes for people with age
greater then or equal to 65 y.o. The same result may be achieved
by sub-setting using a vector of logical values:
<<>>=
-vec <- (phdata(srdta)$age>=65)
+vec <- (phdata(srdta)$age >= 65)
table(vec)
-summary(gtdata(srdta[vec,1:3]))
+summary( gtdata(srdta[vec, 1:3]) )
@
or a list with IDs of study subjects:
<<>>=
vec1 <- idnames(srdta)[vec]
vec1
-summary(gtdata(srdta[vec1,1:3]))
+summary( gtdata(srdta[vec1, 1:3]) )
@
-Let us explore the object returned by \texttt{summary} function
-when applied to \texttt{snp.data} class in more details:
+Let us explore the object returned by the \texttt{summary} function
+when applied to \texttt{snp.data} class in more detail:
<<>>=
-a <- summary(gtdata(srdta[vec1,1:3]))
+a <- summary( gtdata(srdta[vec1, 1:3]) )
class(a)
@
Thus, the object returned is a \texttt{data.frame}. Therefore
@@ -674,18 +670,18 @@
\begin{Answer}
To test for HWE in first 10 SNPs in total sample
<<>>=
-summary(gtdata(srdta[,1:10]))
+summary( gtdata(srdta[, 1:10]) )
@
To test it in cases
<<>>=
-summary(gtdata(srdta[phdata(srdta)$bt==1,1:10]))
+summary( gtdata(srdta[phdata(srdta)$bt==1, 1:10]) )
@
in controls
<<>>=
-summary(gtdata(srdta[phdata(srdta)$bt==0,1:10]))
+summary( gtdata(srdta[phdata(srdta)$bt==0, 1:10]) )
@
SNPs 'rs73' and 'rs128' are out of HWE (at $p \le 0.05$)
-in total sample, and also in cases and controls.
+in the total sample, and also in cases and controls.
\end{Answer}
%Several pieces of information are critical in accessing quality of
@@ -698,70 +694,63 @@
Let us analyse the distribution of call rate in the whole study. For this,
we first need to obtain the vector of call rates:
-
<<>>=
sumgt <- summary(gtdata(srdta))
-crate <- sumgt[,"CallRate"]
+crate <- sumgt[, "CallRate"]
@
-This vector may be presented by a histogram
+This vector may be depicted by a histogram
<<>>=
hist(crate)
@
which shows that most SNPs have call rate between 93 and 97\%
-(figure \ref{fig:histCR}).
+(Figure~\ref{fig:histCR}).
\begin{figure}
\centering
<<fig=true,echo=false>>=
hist(crate)
@
-\caption{Histogram of the call rate}
+\caption{Histogram of the call rate.}
\label{fig:histCR}
\end{figure}
-As next step, you would like to produce a summary table,
+As a next step, you would like to produce a summary table,
showing how many markers had call rate lower than, say,
93\%, between 93 and 95\%, between 95 and 99\% and more than 99\%.
-You can use \texttt{catable()} command for that:
-
+You can use the \texttt{catable()} command for that:
<<>>=
-catable(crate,c(.93,.95,.99))
+catable(crate, c(.93, .95, .99))
@
-
-Similar procedure may be applied to see deviation from HWE:
-
+A similar procedure may be applied to see deviation from HWE:
<<>>=
-hwp <- sumgt[,"Pexact"]
-catable(hwp,c(0.05/nsnps(srdta),0.01,0.05,0.1))
+hwp <- sumgt[, "Pexact"]
+catable(hwp, c(0.05/nsnps(srdta), 0.01, 0.05, 0.1))
@
-
The first cut-off category will detect SNPs which are
-deviating from HWE at the Bonferroni-corrected P-level.
+deviating from HWE at the Bonferroni-corrected $P$-level.
-However, for these data it will make more sense to table cumulative
+However, for these data it will make more sense to table the cumulative
distribution:
-
<<>>=
-catable(hwp,c(0.05/nsnps(srdta),0.01,0.05,0.1),cum=T)
+catable(hwp, c(0.05/nsnps(srdta), 0.01, 0.05, 0.1), cum=TRUE)
@
If you would like to investigate the minor allele frequency
(MAF) distribution, the same logic would apply. First, derive
-MAF with
+the MAF with
<<>>=
-afr <- sumgt[,"Q.2"]
-maf <- pmin(afr,(1.-afr))
+afr <- sumgt[, "Q.2"]
+maf <- pmin(afr, (1. - afr))
@
-
Next, generate histograms for frequency and MAF:
\begin{figure}
\centering
<<fig=true,echo=false>>=
-par(mfcol=c(1,2))
+par(mfcol=c(1, 2))
hist(afr)
hist(maf)
@
-\caption{Histogram of the call rate}
+\caption{Histogram of the call rate.}
\label{fig:histMAF}
\end{figure}
<<>>=
@@ -769,19 +758,19 @@
hist(afr)
hist(maf)
@
-(shown at the figure \ref{fig:histMAF}) and then generate table describing frequency distribution:
+(shown in Figure~\ref{fig:histMAF}) and then generate a table describing the
+frequency distribution:
<<>>=
-catable(afr,c(0.01,0.05,0.1,0.2,0.5,0.8,0.9,0.95,0.99))
-catable(maf,c(0,0.01,0.05,0.1,0.2),cum=T)
+catable(afr, c(0.01, 0.05, 0.1, 0.2, 0.5, 0.8, 0.9, 0.95, 0.99))
+catable(maf, c(0, 0.01, 0.05, 0.1, 0.2), cum=TRUE)
@
+Note that we used ``0'' as the first category -- this will give you the number of
+monomorphic SNPs which we recommend to exclude from analysis.
-Note that we used "0" as the first category -- this will give you the number of
-monomorhic SNPs which we recommend to exclude from analysis.
-
-Other function, \texttt{perid.summary}, produces summary SNP statistics
+Another function, \texttt{perid.summary}, produces summary SNP statistics
per person. Let us try producing this summary for the first 10 people:
<<>>=
-perid.summary(srdta[1:10,])
+perid.summary(srdta[1:10, ])
@
This table lists the number of genotypes scored for the person,
call rate, and heterozygosity.
@@ -793,56 +782,56 @@
<<>>=
het <- perid.summary(srdta)$Het
mean(het)
-catable(het,c(0.1,0.25,0.3,0.35,0.5))
+catable(het, c(0.1, 0.25, 0.3, 0.35, 0.5))
hist(het)
@
-The resulting histogram is presented in figure \ref{fig:histHet}.
-It is easy to see that few people have very low heterozygosity,
+The resulting histogram is shown in Figure~\ref{fig:histHet}.
+It is easy to see that a few people have very low heterozygosity,
but there are no outliers with extremely high values.
\begin{figure}
\centering
<<fig=true,echo=false>>=
hist(het)
@
-\caption{Histogram of heterozygosity}
+\caption{Histogram of heterozygosity.}
\label{fig:histHet}
\end{figure}
-In this section, we covered low-level functions \texttt{summary}
-and \texttt{perid.summary}. Based on these, an upper-level
+In this section, we covered the low-level functions \texttt{summary}
+and \texttt{perid.summary}. On these functions a higher level
genetic data quality control function, \texttt{check.marker}, is
based. That function will be covered in the next section.
\begin{summary}
-\item When \texttt{summary()} function is applied to an
-\texttt{gtdata} subset of \texttt{gwaa.data-class}, it return summary statistics for SNPs,
-including exact test for Hardy-Weinberg equilibrium.
-\item When \texttt{perid.summary()} function is applied to an
-object of \texttt{gwaa.data-class} (or \texttt{gtdata} part of it),
-it return per-person summary statistics,
-including the call rate within this person and its' heterozygosity.
+\item When the \texttt{summary()} function is applied to a
+\texttt{gtdata} subset of \texttt{gwaa.data-class}, it returns summary statistics for the SNPs,
+including an exact test for Hardy-Weinberg equilibrium.
+\item When the \texttt{perid.summary()} function is applied to an
+object of \texttt{gwaa.data-class} (or the \texttt{gtdata} part of it),
+it returns per-person summary statistics,
+including the call rate within this person and its heterozygosity.
\end{summary}
\begin{Exercise}[title=Characterizing call rate]
Characterise the distribution of call rates within study subjects
-and produce a histogram. How many people have
+and produce a histogram. How many people have a
call rate below 93\%?
\end{Exercise}
\begin{Answer}
To characterize ID call rate, you can run the following commands:
<<>>=
idsummary <- perid.summary(srdta)
- idsummary[1:5,]
+ idsummary[1:5, ]
idcall <- idsummary$Call
idcall[1:5]
- catable(idcall,c(0.9,0.93,0.95,0.98,0.99))
- table(idcall<0.93)
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/genabel -r 1193
More information about the Genabel-commits
mailing list