[Genabel-commits] r1191 - tutorials/GenABEL_general

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Tue Apr 16 18:30:30 CEST 2013


Author: lckarssen
Date: 2013-04-16 18:30:29 +0200 (Tue, 16 Apr 2013)
New Revision: 1191

Modified:
   tutorials/GenABEL_general/GWA.Rnw
Log:
Fix several layout and spelling bugs in the GWA chapter of the GenABEL tutorial.

Modified: tutorials/GenABEL_general/GWA.Rnw
===================================================================
--- tutorials/GenABEL_general/GWA.Rnw	2013-04-11 06:44:54 UTC (rev 1190)
+++ tutorials/GenABEL_general/GWA.Rnw	2013-04-16 16:30:29 UTC (rev 1191)
@@ -17,15 +17,15 @@
 will use is approaches the amount to be expected 
 in a real experiment. However, because the regions 
 are small, and the LD between SNPs is high, 
-some specific features (e.g. relatively high 
+some specific features (\eg relatively high 
 residual inflation, which occurs because 
 large proportion of SNPs are in LD with 
-the reuly associated ones)
+the really associated ones)
 are specific 
 features of this data set, which are not observed in 
 true GWA studies. 
 
-Start R and load \GA{} library 
+Start R and load the \GA{} library 
 by typing 
 <<results=hide>>=
 library(GenABEL)
@@ -35,7 +35,7 @@
 data(ge03d2ex)
 @
 
-Investigate the objects loaded by command
+Investigate the objects loaded by the command
 <<>>=
 ls()
 @
@@ -44,28 +44,25 @@
 <<>>=
 class(ge03d2ex)
 @
-
 To check what are the names of variables in the phenotypic data frame, use 
 <<>>=
 names(phdata(ge03d2ex))
 @
-
-We can attach this data frame
-to the \texttt{R} search path by
+We can attach this data frame to the \texttt{R} search path by
 <<>>=
 attach(phdata(ge03d2ex))
 @
 
 \section{Data descriptives and first round of GWA analysis}
 
-Let us investigate what are the traits presented in the data frame loaded 
-and what are the characteristics of the distribution by using specific 
+Let us investigate which traits are present in the loaded data frame 
+and what are the characteristics of the distribution by using the 
 \GA{} function \texttt{descriptive.trait}:
 <<>>=
 descriptives.trait(ge03d2ex)
 @
 
-You can see that phenotypic frame contains the data on 136 people; the data 
+You can see that the phenotypic frame contains data on 136 people; the data 
 on sex, age, height, weight, diet and body mass index (BMI) are available. 
 Our trait of interest is \texttt{dm2} (type 2 diabetes). Note that every 
 single piece of information in this data set is simulated; however, we 
@@ -74,7 +71,7 @@
 You can produce a summary for cases and controls separately and 
 compare distributions of the traits by 
 <<>>=
-descriptives.trait(ge03d2ex,by=dm2)
+descriptives.trait(ge03d2ex, by=dm2)
 @
 Here, the \texttt{by} argument specifies the grouping variable.
 You can see that cases and controls are different in weight, which 
@@ -86,62 +83,59 @@
 descriptives.marker(ge03d2ex)
 @
 It is of note that we can see inflation of the proportion of the 
-tests for HWE at particular threshold, as compared to the expected. 
+tests for HWE at a particular threshold, as compared to the expected. 
 This may indicate poor genotyping quality and/or genetic stratification.
 
 We can test the GW marker characteristics in controls by
 <<>>=
-descriptives.marker(ge03d2ex,ids=(dm2==0))
+descriptives.marker(ge03d2ex, ids=(dm2==0))
 @
 Apparently, HWE distribution holds better in controls than in 
 the total sample. 
 
 Let us check whether there are indications that deviation from HWE 
-is due to cases. At this stage we are only interested in HWE distribution 
+is due to cases. At this stage we are only interested in the HWE distribution 
 table, and therefore will ask to report the distrbution 
-for cases (\texttt{dm2==1}) and report only the table two:
+for cases (\texttt{dm2==1}) and report only table two:
 <<>>=
-descriptives.marker(ge03d2ex,ids=(dm2==1))[2]
+descriptives.marker(ge03d2ex, ids=(dm2==1))[2]
 @
 and for the controls
 <<>>=
-descriptives.marker(ge03d2ex,ids=(dm2==0))[2]
+descriptives.marker(ge03d2ex, ids=(dm2==0))[2]
 @
-
 It seems that indeed excessive number of markers are out of HWE in cases. 
-If no laboratory procedure (e.g. DNA extraction, genotyping, calling) were 
+If no laboratory procedure (\eg DNA extraction, genotyping, calling) were 
 done for cases and controls separately, this may indicate possible 
 genetic heterogeneity specific for cases. 
 
-In essence, '\texttt{descriptives.marker}' function 
-uses '\texttt{summary}' function to generate HW P-values 
+In essence, the '\texttt{descriptives.marker}' function 
+uses the '\texttt{summary}' function to generate the HW $P$-values 
 distribution. It may be interesting to generate this 
 distribution using the '\texttt{summary}' function 
 You do not need to do so, but this example shows how 
 you can generate summaries from underlying SNP-tables. 
-First, we need to 
-compute summary SNP statistics by
-
+First, we need to compute summary SNP statistics by
 <<>>=
-s <- summary(gtdata(ge03d2ex[(dm2==1),]))
+s <- summary( gtdata( ge03d2ex[(dm2==1), ] ) )
 @
 
 Note the you have produced the summary for the \texttt{gtdata} 
 slot of \texttt{ge03d2ex}; this is the slot which actually 
 contain all genetic data in special compressed format. 
 
-You can see first 5 rows of this very long summary table by
+You can see the first 5 rows of this very long summary table by
 <<>>=
-s[1:5,]
+s[1:5, ]
 @
 Note that the column 'Pexact' provides exact HWE test $P$-values we need. 
 We can extract these to a separate vector by
 <<>>=
-pexcas <- s[,"Pexact"]
+pexcas <- s[, "Pexact"]
 @
 and do characterization of the cummulative distribution by 
 <<>>=
-catable(pexcas,c(0.001,0.01,0.05,0.1,0.5),cumulative=TRUE)
+catable(pexcas, c(0.001,0.01,0.05,0.1,0.5), cumulative=TRUE)
 @
 
 You can generate the distribution for controls in similar manner.
@@ -172,13 +166,13 @@
 %%
 %%You can repeat this test for the controls, if time permits.
 
-Let us first try do GWA scan using the raw (before quality control) data.
+Let us first try do a GWA scan using the raw (before quality control) data.
 We will use the score test, as implemented in the 
 \texttt{qtscore()}\footnote{consider '\texttt{mlreg}' function 
-if you want to run true linear regression}
+if you want to run a true linear regression}
 function of \GA{} for testing:
 <<>>=
-an0 <- qtscore(dm2,ge03d2ex,trait="binomial")
+an0 <- qtscore(dm2, ge03d2ex, trait="binomial")
 @
 The first argument used describes the model; here it is rather 
 simple --- the affection status, \texttt{dm2}, is supposed 
@@ -187,41 +181,41 @@
 You can see what information is computed by this function by using
 <<>>=
 an0
-@
-Here, let us look at the 'Results table'. 
-\texttt{P1df}, \texttt{P2df} and \texttt{Pc1df} are most 
-interesting; the first two are vectors of 1 and 2 d.f. $P$-values 
-obtained in the GWA analysis, the last one is 1 d.f. 
-P-value corrected for inflation factor $\lambda$ (which is 
-presented in \texttt{lambda} object). \texttt{effB} corresponds to 
-(approximate) Odds Ratios estimate for the SNP. 
+@ 
+Here, let us look at the 'Results table'.  \texttt{P1df},
+\texttt{P2df} and \texttt{Pc1df} are most interesting; the first two
+are vectors of 1 and 2 d.f.~$P$-values obtained in the GWA analysis,
+the last one is 1 d.f.~$P$-value corrected for inflation factor
+$\lambda$ (which is in the \texttt{lambda}
+object). \texttt{effB} corresponds to the (approximate) Odds Ratio
+estimate for the SNP.
 
 Let us see if there is evidence for the inflation of the test 
 statistics; for that let us obtain $\lambda$ with
 <<>>=
 lambda(an0)
 @
-The estimate of $\lambda$ is \Sexpr{round(lambda(an0)$est,dig=2)},
+The estimate of $\lambda$ is \Sexpr{round(lambda(an0)$est, dig=2)},
 suggesting inflation of the test and some degree of stratification.
-Though the value abtained seems to be small, it should be noted 
-that $\lambda$ grows linearly with smaple size, so for this small 
+Though the value obtained seems to be small, it should be noted 
+that $\lambda$ grows linearly with sample size, so for this small 
 number of cases and controls the value is worrisome. 
 
 The $\lambda$ is computed by regression in a Q-Q plot. Both estimation 
- of $\lambda$ and production of the $\chi^2-\chi^2$ plot can be done 
+of $\lambda$ and production of the $\chi^2-\chi^2$ plot can be done 
 using the \texttt{estlambda} function; this was already done automatically 
 when running \texttt{qtscore} function, but let us repeat this manually:
 
 <<>>=
-estlambda(an0[,"P1df"],plot=TRUE)
+estlambda(an0[, "P1df"], plot=TRUE)
 @
-The corresponding $\chi^2-\chi^2$ plot is presented in Figure 
-\ref{fig:chi2chi2_P1df}.
+The corresponding $\chi^2-\chi^2$ plot is shown in 
+Figure~\ref{fig:chi2chi2_P1df}.
 
 \begin{figure}[t]
 \centering
 <<fig=true,echo=false,results=hide>>=
-estlambda(an0[,"P1df"],plot=TRUE)
+estlambda(an0[, "P1df"], plot=TRUE)
 @
 \caption{$\chi^2-\chi^2$ plot for a GWA scan. 
 Black line of slope 1: expected under no inflation; Red line: 
@@ -233,72 +227,72 @@
 \begin{tip}
 The 'se' produced by \texttt{estlambda} \emph{can not} 
 be used to test if inflation is significant and make 
-conclusions about presence of significant or insignificant 
+conclusions about the presence of significant or insignificant 
 stratification.
 \end{tip}
 
 We can also present the obtained results using the 
-''Manhatten plot'', where a SNP genomic position is on 
-the X-axes and $-log_{10}$ of the $P$-value is 
-shown on Y-axes: 
+''Manhatten plot'', where the SNP genomic position is on 
+the horizontal axis and $-\log_{10}$ of the $P$-value is 
+shown on the vertical axis: 
 <<>>=
 plot(an0)
 @
-The resulting plot is presented in the figure \ref{fig:scan_initial}. 
-By default, $-log_{10}(P-value)$ of not corrected 1 d.f. test are 
-presented; see help to figure out how this behaviour can be changed. 
+The resulting plot is show in Figure~\ref{fig:scan_initial}. 
+By default, $-\log_{10}(P\text{-value})$ of the uncorrected 1 d.f.~test are 
+shown; see thehelp to figure out how this behaviour can be changed. 
 
-We can also add the corrected P-values to the plot with
+We can also add the corrected $P$-values to the plot with
 <<>>=
-add.plot(an0,df="Pc1df",col=c("lightblue","lightgreen"))
+add.plot(an0, df="Pc1df", col=c("lightblue", "lightgreen"))
 @
 \begin{figure}[t]
 \centering
 <<fig=true,echo=false>>=
-plot(an0,col=c("blue","green"))
-add.plot(an0,df="Pc1df",col=c("lightblue","lightgreen"))
+plot(an0, col=c("blue", "green"))
+add.plot(an0, df="Pc1df", col=c("lightblue", "lightgreen"))
 @
-\caption{$-log_{10}(P-value)$ from the genome scan before QC procedure.
+\caption{$-\log_{10}(P\text{-value})$ from the genome scan before QC procedure.
 Raw analysis: darker circles; corrected analysis: lighter circles}
 \label{fig:scan_initial}
 \end{figure}
-You can see that $P$-values corrected by genoic control 
+You can see that the $P$-values corrected by genomic control 
 are uniformly lower than the $P$-values from 'raw' analysis. 
 This is to be expected as genomic control simply divides the 
 'raw' $\chi^2$ statistics by a constant $\lambda$ for all 
 SNPs. 
 
 You can also generate a descriptive table for the "top" (as ranked by 
-P-value) results by
+$P$-value) results by
 <<>>=
 descriptives.scan(an0)
 @
 or, equivalently, by '\texttt{summary(an0)}' 
 
-Here you see top 10 results, sorted by P-value with 1 d.f. If you want 
-to sort by the corrected P-value, you can use 
-\texttt{descriptives.scan(an0,sort="Pc1df")}; to see more then 10 
-(e.g. 25) top results, use \texttt{descriptives.scan(an0,top=25)}. 
+Here you see top 10 results, sorted by $P$-value with 1 d.f. If you want 
+to sort by the corrected $P$-value, you can use 
+\texttt{descriptives.scan(an0, sort="Pc1df")}; to see more than 10 
+(\eg 25) top results, use \texttt{descriptives.scan(an0, top=25)}. 
 You can combine all these options. Large part of results reports NA 
-as effect estimates and 9.99 as \textit{P-value} for 2 d.f. test -- 
-for these markers only two out of three possible 
-genotypes were observed, and consequently 2 d.f. test could not be 
+as effect estimates and 9.99 as $P$-value for 2 d.f.~test -- 
+for these markers only two out of the three possible 
+genotypes were observed, and consequently the 2 d.f.~test could not be 
 performed. 
 
-Now let us apply \texttt{qtscore()} function with \texttt{times}
+Now let us apply the \texttt{qtscore()} function with \texttt{times}
 argument, which tells it to compute empirical GW 
 (or experiment-wise) significance
 <<>>=
-an0.e <- qtscore(dm2,ge03d2ex,times=200,quiet=TRUE)
+an0.e <- qtscore(dm2, ge03d2ex, times=200, quiet=TRUE)
 @
 (you may skip the 'quiet=TRUE' argument, then you will see progress)
 
 Now let us generate the summary of the results
 <<>>=
-descriptives.scan(an0.e,sort="Pc1df")
+descriptives.scan(an0.e, sort="Pc1df")
 @
 
-Experimental-wise significance is computed by the empirical 
+Experimental-wise significance is computed by an empirical 
 procedure, thus we consider $P$-values $\le 0.05$ to be GW-significant.
 However, none of the SNPs hits GW significance. 
 If, actually, any did pass the threshold, we 
@@ -308,36 +302,38 @@
 Therefore before association analysis we need to do 
 rigorous Quality Control (QC).
 
-Note that at certain SNP, corrected \textit{P-values} become equal 
-to 1 -- at this point order is arbitrary because sorting could not be done. 
+Note that at a certain SNP, the corrected $P$-values become equal to 1
+-- at this point the order in the list is arbitrary because sorting
+could not be done.
 
 \begin{summary}
 \item The \texttt{descriptives} family of functions was developed 
-to facilitate production of tables which can be directly used 
+to facilitate the production of tables which can be directly used 
 in a manuscript --- it is possible to save the output as a file, 
-which can be open by Excel or Word. See e.g. 
+which can be open by Excel or Word. See \eg 
 \texttt{help(descriptives.trait)} for details. 
 \item The inflation of test statistics compared to null (1 d.f.) 
 may be estimated with \texttt{estlambda} function.
 \end{summary}
 
+
 \section{Genetic data QC}
 
 The major genetic data QC function of \GA{} is \texttt{check.marker()}.
 We will now use that to perform our data QC; the output is rather self-explaining.
 Because of possible genetic heterogeneity of the study data it 
-is good idea to skip Hardy-Weinberg checks in the first round of QC. 
-This can be achieved by setting HWE P-value 
+is a good idea to skip Hardy-Weinberg checks in the first round of QC. 
+This can be achieved by setting HWE $P$-value 
 selection threshold to zero (\texttt{p.level=0}):
 <<>>=
-qc1 <- check.marker(ge03d2ex,p.level=0)
+qc1 <- check.marker(ge03d2ex, p.level=0)
 @
 Note that normally you will \emph{NEVER} run 
 this simple form of the QC function -- you should always 
 provide a number of thresholds specific to the 
 platform you used for genotyping. See help to 
 \texttt{check.marker()} for detailed list of arguments. 
-Default values used by the function are rather 
+The default values used by the function are rather 
 relaxed compared to the thresholds routinely used 
 nowadays with most of the platforms. 
 
@@ -350,7 +346,7 @@
 subjects. This is not 
 a problem with the small number of people we use for 
 this example or when modern computers are used. However,
-the computers in the Nihes computer room are very old. 
+the computers in the computer room are very old. 
 Therefore be prepared to wait for long time when you 
 will do a self-exercise with 1,000 people.
 \end{tip}
@@ -361,30 +357,29 @@
 Next, X-chromosomal errors are identified. The function 
 finds out that all errors (heterozygous male X-genotypes) are due to two people with 
 wrong sex assigned and one marker, which looks like an autosomal one. 
-This actually could be a marker from pseudoautosomal region, which 
+This actually could be a marker from the pseudoautosomal region, which 
 should have been arranged as a separate "autosome". 
 Nine people are found to have intermediate inbreeding at the 
-X-chromosome and also excluded from analysis.  
+X-chromosome and are also excluded from analysis.  
 
 Then, the procedure finds the markers with low call rate ($\le 0.95$ 
 by default) 
 across people, markers with low MAF (by default, low MAF is defined as 
-less than few copies of the rare allele, see help for details); 
+less than a few copies of the rare allele, see help for details); 
 people with low call rate (default value: $\le 0.95$) across SNPs, people with extreme 
-heterozygosity (at FDR 0.01) and these who have GW IBS $\ge 0.95$. 
-These default parameters may be changed if you wish (consult help).
+heterozygosity (at FDR 0.01) and those who have GW IBS $\ge 0.95$. 
+These default parameters may be changed if you wish (consult the help).
 
 Because some of the people fail to pass the tests, the data set is not 
-guaranteed to be really "clean" after single iteration, e.g. some marker 
+guaranteed to be really "clean" after single iteration, \eg some marker 
 may not pass the call threshold after we exclude few 
 informative (but apparently having low quality) samples. Therefore the 
 QC is repeated iteratively until no further errors are found.
 
-You can generate short summary of QC by marker and by person through
+You can generate a short summary of the QC by marker and by person through
 <<>>=
 summary(qc1)
 @
-
 Note that the original data, \texttt{ge03d2ex}, are not modified during the 
 procedure; rather, \texttt{check.markers()} generate a list of markers and 
 people which pass or do not pass certain QC criteria. The objects returned 
@@ -398,7 +393,7 @@
 You can easily generate a new data set, which will consist only of these 
 people and markers by
 <<>>=
-data1 <- ge03d2ex[qc1$idok,qc1$snpok]
+data1 <- ge03d2ex[qc1$idok, qc1$snpok]
 @
 
 If there are any residual sporadic X-errors (male heterozygosity), these 
@@ -430,10 +425,9 @@
 by genotyping errors which we hopefully (at least partly!) eliminated:
 <<>>=
 descriptives.marker(data1)[2]
-descriptives.marker(data1[dm2==1,])[2]
-descriptives.marker(data1[dm2==0,])[2]
+descriptives.marker(data1[dm2==1, ])[2]
+descriptives.marker(data1[dm2==0, ])[2]
 @
-
 You can see that the fit to HWE improved, but 
 cases are still have excess number of markers 
 out of HWE. This may be due to genetic 
@@ -462,12 +456,12 @@
 technique, which resembles the one suggested by Price at al. 
 
 As the first step, we will compute a matrix of genomic kinship between all 
-pairs of people, using only autosomal\footnote{the list of 
-autosomal markers contained in \texttt{data} is returned by 
-\texttt{autosomla(data)} function} 
+pairs of individuals, using only autosomal\footnote{the list of 
+autosomal markers contained in \texttt{data} is returned by the
+\texttt{autosomal(data)} function} 
 markers by
 <<>>=
-data1.gkin <- ibs(data1[,autosomal(data1)],weight="freq")
+data1.gkin <- ibs(data1[, autosomal(data1)], weight="freq")
 @
 
 \begin{tip}
@@ -475,9 +469,9 @@
 when using old computers!
 \end{tip}
 
-You can see the 5x5 upper left sub-matrix by
+You can see the $5 \times 5$ upper left sub-matrix by
 <<>>=
-data1.gkin[1:5,1:5]
+data1.gkin[1:5, 1:5]
 @
 
 The numbers below the diagonal show the genomic estimate 
@@ -498,7 +492,7 @@
 <<>>=
 data1.mds <- cmdscale(data1.dist)
 @
-by default, the first two principal components are computed and 
+By default, the first two principal components are computed and 
 returned.
 
 \begin{tip}
@@ -510,32 +504,31 @@
 <<>>=
 plot(data1.mds)
 @
-
-The resulting plot is presented in figure \ref{fig:mds}.
+The resulting plot is shown in Figure~\ref{fig:mds}.
 Each point on the plot corresponds to a person, and the 2D 
 distances between points were fitted to be as close as possible 
-to these presented in the original IBS matrix. You can see that 
+to those presented in the original IBS matrix. You can see that 
 study subjects clearly cluster in two groups. 
 
 You can identify the points belonging to clusters by
 <<>>=
-km <- kmeans(data1.mds,centers=2,nstart=1000)
+km  <- kmeans(data1.mds, centers=2, nstart=1000)
 cl1 <- names(which(km$cluster==1))
 cl2 <- names(which(km$cluster==2))
-if (length(cl1)>length(cl2)) {x<-cl2;cl2<-cl1;cl1<-x}
+if (length(cl1) > length(cl2)) {x<-cl2; cl2<-cl1; cl1<-x}
 cl1
 cl2
 @
-Four outliers are presented in the smaller cluster.
+Four outliers are shown in the smaller cluster.
 \begin{figure}[t]
 \centering
 <<fig=true,echo=false>>=
 plot(data1.mds)
-points(data1.mds[cl1,],pch=19,col="red")
+points(data1.mds[cl1,], pch=19, col="red")
 @
-\caption{Mapping samples on the space of the first two Principle 
+\caption{Mapping samples to the space of the first two Principle 
 Components resulting from analysis of genomic kinship. Red dots 
-identify genetic outliers}
+identify genetic outliers.}
 \label{fig:mds}
 \end{figure}
 
@@ -549,7 +542,7 @@
 We can form a data set which is free from outliers by
 using only people from the bigger cluster:
 <<>>=
-data2 <- data1[cl2,]
+data2 <- data1[cl2, ]
 @
 
 After we dropped the outliers, we need to repeat QC using
@@ -557,21 +550,21 @@
 allow for HWE checks (we will use only controls and 
 exclude markers with FDR $\le 0.2$): 
 <<>>=
-qc2 <- check.marker(data2,hweids=(phdata(data2)$dm2==0),fdr=0.2)
+qc2 <- check.marker(data2, hweids=(phdata(data2)$dm2==0), fdr=0.2)
 summary(qc2)
 @
 \begin{tip}
-If the procedure did not run, check previous Note.
+If the procedure did not run, check the previous Note.
 \end{tip}
 
-Indeed, in the updated data set few markers do not pass our 
+Indeed, in the updated data set several markers do not pass our 
 QC criteria and we need to drop a few markers. This is done by
 <<>>=
-data2 <- data2[qc2$idok,qc2$snpok]
+data2 <- data2[qc2$idok, qc2$snpok]
 @
 
 This is going to be our final analysis data set, therefore 
-let us attach the phenotypic data to the search path, then we do not 
+let us attach the phenotypic data to the search path, so we do not 
 need to type \texttt{phdata(data2)\$...} to access \texttt{dm2} status or other 
 variables:
 <<>>=
@@ -579,7 +572,7 @@
 attach(phdata(data2))
 ####
 #ge03d2ex.clean <- data2
-#save(ge03d2ex.clean,file="ge03d2ex.clean.RData")
+#save(ge03d2ex.clean, file="ge03d2ex.clean.RData")
 ####
 @
 
@@ -587,142 +580,145 @@
 of genetic data to HWE:
 <<>>=
 descriptives.marker(data2)[2]
-descriptives.marker(data2[phdata(data2)$dm2==1,])[2]
-descriptives.marker(data2[phdata(data2)$dm2==0,])[2]
+descriptives.marker(data2[phdata(data2)$dm2==1, ])[2]
+descriptives.marker(data2[phdata(data2)$dm2==0, ])[2]
 @
-
 You can see that now there is no excessive number of SNPs 
 out of HWE in the sample (total, or cases, or controls)
 
+
 \section{GWA association analysis}
 
 Let us start again with descriptives of the phenotypic 
 and marker data
 <<>>=
-descriptives.trait(data2,by=dm2)
+descriptives.trait(data2, by=dm2)
 @
 
-You can see that relation to weight is maintained in this smaller, 
-but hopefully cleaner, data set; moreover, relation to age becomes 
-boundary significant. 
+You can see that the relation to weight is maintained in this smaller, 
+but hopefully cleaner, data set; moreover, the relation to age becomes 
+borderline significant. 
 
 If you check descriptives of markers (only HWE part shown)
 <<>>=
 descriptives.marker(data2)[2]
 @
 you can see that the problems with HWE are apparently fixed; 
-we may guess that these were caused by the Wahlund's effect.
+we may guess that these were caused by Wahlund's effect.
 
 Run the score test on the cleaned data by
-
 <<>>=
-data2.qt <- qtscore(dm2,data2,trait="binomial")
+data2.qt <- qtscore(dm2, data2, trait="binomial")
 @
 and check lambda
 <<>>=
 lambda(data2.qt)
 @
 there is still some inflation, which is explained by the fact 
-that we investigate only few short chromosomes with high LD and few 
+that we investigate only a few short chromosomes with high LD and few 
 causative variants. 
 
 Produce the association analysis plot by
 <<>>=
-plot(data2.qt,df="Pc1df")
+plot(data2.qt, df="Pc1df")
 @
-(figure \ref{fig:scan_QC}).
+(Figure~\ref{fig:scan_QC}).
 
 \begin{figure}[t]
 \centering
 <<fig=true,echo=false>>=
-plot(data2.qt,df="Pc1df")
+plot(data2.qt, df="Pc1df")
 @
-\caption{$-log_{10}(Corrected P-value)$ from the genome scan after the QC procedure}
+\caption{$-\log_{10}(\text{Corrected }P\text{-value})$ from the genome scan 
+  after the QC procedure.}
 \label{fig:scan_QC}
 \end{figure}
 
 Produce the scan summary by
 <<>>=
-descriptives.scan(data2.qt,sort="Pc1df")
+descriptives.scan(data2.qt, sort="Pc1df")
 @
 
 Comparison with the top 10 from the scan before QC shows that results changed 
 substantially with only few markers overlapping. 
 
-You can see similar results when accessing empirical GW 
+You can see similar results when assessing empirical GW 
 significance:
 <<>>=
-data2.qte <- qtscore(dm2,data2,times=200,quiet=TRUE,trait="binomial")
-descriptives.scan(data2.qte,sort="Pc1df")
+data2.qte <- qtscore(dm2, 
+                     data2, times=200, quiet=TRUE, trait="binomial")
+descriptives.scan(data2.qte, sort="Pc1df")
 @
 
 Again, none of the SNPs hits GW 5\% significance. Still,
-you can see that after QC top markers achieve somewhat "better" 
+you can see that after QC the top markers achieve somewhat ``better''
 significance.
 
 In the last part, we will do several adjusted and stratified analyses.
-Only empirical P-values will be estimated to make the story shorter.
+Only empirical $P$-values will be estimated to make the story shorter.
 To adjust for sex and age, we can 
 <<>>=
-data2.qtae <- qtscore(dm2~sex+age,data2,times=200,quiet=T,trait="binomial")
+data2.qtae <- qtscore(dm2~sex+age, 
+                      data2, times=200, quiet=TRUE, trait="binomial")
 descriptives.scan(data2.qtae)
 @
-
 You can see that there is little difference between adjusted and 
 unadjusted analysis, but this is not always the case; adjustment 
 may make your study much more powerful when covariates explain a 
 large proportion of environmental trait variation.
 
 Finally, let us do stratified (by BMI) analysis. We will contracts 
-obese ($BMI \ge 30$) cases to all controls. 
+obese ($\text{BMI} \ge 30$) cases to all controls. 
 <<>>=
-data2.qtse <- qtscore(dm2~sex+age,data2,ids=((bmi>30 & dm2==1) | dm2==0),times=200,quiet=TRUE,trait="binomial")
-descriptives.scan(data2.qtse,sort="Pc1df")
+data2.qtse <- qtscore(dm2~sex+age,
+                      data2, ids=((bmi>30 & dm2==1) | dm2==0),
+                      times=200, quiet=TRUE, trait="binomial")
+descriptives.scan(data2.qtse, sort="Pc1df")
 @
-
-Again, noting interesting at GW significance level. If we would have had found 
-something, naturally, we would not known if we mapped a T2D or obesity gene
+Again, nothing interesting at the GW significance level. If we would have had found 
+something, naturally, we would not have known if we mapped a T2D or obesity gene
 (or a gene for obesity in presence of T2D, or the one for T2D in presence 
 of obesity).
 
-Let us save 'data1', 'data1.gkin', 'data2.qt' and 'cl1' objects now 
+Let us save the 'data1', 'data1.gkin', 'data2.qt' and 'cl1' objects now 
 <<>>=
-save(data1,cl1,data1.gkin,data2.qt,file="data1.RData")
+save(data1, cl1, data1.gkin, data2.qt, file="data1.RData")
 @
 These data will be used later 
 in section \ref{sec:strat}, "\nameref{sec:strat}", in which 
 we will perform GWA using different methods 
-to account for strtatification, and will compare the results 
+to account for stratification, and will compare the results 
 with these obtaining by removing outliers ('data2.qt').
 
 Let us also save the cleaned 'data2' object, which will be later 
 used in section \ref{sec:meta} ("\nameref{sec:meta}"):
 <<>>=
-save(data2,file="data2.RData")
+save(data2, file="data2.RData")
 @
 
 At this point, you acquired the knowledge necessary for the self-exercise. 
 Please close \texttt{R} by \texttt{q()} command and proceed to the next 
 section.
 
+
 \section{Genome-wide association analysis exercise}
 
 \begin{ExerciseList}
 During the exercise, you will work with a larger data set 
 (approximately 1,000 people and 7,000+ SNPs). You are to 
-do complete three-round QC; perform GWA analysis with 
+do the complete three-round QC; perform GWA analysis with 
 \texttt{dm2} as the outcome of interest and identify 
 10 SNPs which you would like to take to the stage 2 (replication) 
 scan. You will do replication analysis using a 
 confirmatory data set. If you did everything right, the 
 SNPs which you identified as significant or replicated 
-will be located in know T2D genes. 
+will be located in known T2D genes. 
 
 Please keep in mind that the data are simulated, and do not 
 take your findings too seriously!
 
 Start \texttt{R} by going to "Start -> Programs -> R -> R-?.?.?". 
-Load \GA{} library by
+Load the \GA{} library by
 %<<echo=false,results=hide>>=
 %rm(list=ls())
 %@
@@ -731,7 +727,7 @@
 @
 
 The two data sets we will use in this exercise are part of the \GA{} 
-distribution. The first one ("discovery" set) can be loaded by 
+distribution. The first one (the "discovery" set) can be loaded by 
 <<>>=
 data(ge03d2)
 @
@@ -739,7 +735,7 @@
 Please move along the lines detailed in the guided exercise and 
 try to answer following questions:
 
-\Exercise How many cases and controls are presented 
+\Exercise How many cases and controls are present
 in the original data set?
 \Answer :
 <<>>=
@@ -747,7 +743,7 @@
 @
 
 
-\Exercise How many markers are presented 
+\Exercise How many markers are present
 in the original data set?
 \Answer :
 <<>>=
@@ -756,7 +752,7 @@
 
 
 \Exercise Is there evidence for inflation of the 
-HWE test staistics?
+HWE test statistics?
 \Answer Yes:
 <<>>=
 descriptives.marker(ge03d2)[2]
@@ -764,15 +760,16 @@
 
 
 \Exercise Analyse empirical GW significance. How many SNPs 
-pass genome-wide significance threshold, after correction for 
+pass the genome-wide significance threshold, after correction for 
 the inflation factor? Write down the names of these SNPs for 
 further comparison.
 \Answer :
 <<>>=
-res0 <- qtscore(dm2,data=ge03d2,times=200,quiet=TRUE,trait="binomial")
+res0 <- qtscore(dm2, data=ge03d2, times=200, 
+                quiet=TRUE, trait="binomial")
 ### something funny is going on here
 ### disabeling next line
-# lambda(res0)
+#lambda(res0)
 ds <- descriptives.scan(res0)
 ds
 @
@@ -782,17 +779,18 @@
 snps0 <- rownames(ds)
 snps0
 @
-(note that if empirical P is 1, the rank is assigned quite arbitrarily)
+(note that if the empirical $P = 1$, the rank is assigned quite arbitrarily)
 
 
 \Exercise Perform first steps of the genetic data QC.
 \Answer First step of QC
 <<>>=
-qc1 <- check.marker(ge03d2,call=0.95,perid.call=0.95,p.level=0,ibs.exclude="both")
+qc1 <- check.marker(ge03d2, call=0.95, perid.call=0.95, 
+                    p.level=0, ibs.exclude="both")
 summary(qc1)
-data1 <- ge03d2[qc1$idok,qc1$snpok]
+data1 <- ge03d2[qc1$idok, qc1$snpok]
 data1 <- Xfix(data1)
-qc2 <- check.marker(data1,call=0.95,perid.call=0.95,p.level=0)
+qc2 <- check.marker(data1, call=0.95, perid.call=0.95, p.level=0)
 summary(qc2)
 @
 
@@ -814,7 +812,7 @@
  
 \Exercise How many sporadic X errors do you still 
 observe even when the female male and non-X X-markers are removed?
-(do not forget to \texttt{Xfix()} these
+(do not forget to \texttt{Xfix(s)} these
 \Answer Eight 'sporadic' X-errors are left after removing people 
 with likely sex code errors (seven in the data set after first step 
 of QC)
@@ -828,22 +826,22 @@
 \Exercise Perform second step of QC.
 \Answer The second step of QC:
 <<>>=
-data1.gkin <- ibs(data1[,autosomal(data1)],weight="freq")
-data1.dist <- as.dist(0.5-data1.gkin)
+data1.gkin <- ibs(data1[, autosomal(data1)], weight="freq")
+data1.dist <- as.dist(0.5 - data1.gkin)
 data1.mds <- cmdscale(data1.dist)
-km <- kmeans(data1.mds,centers=2,nstart=1000)
-cl1 <- names(which(km$cluster==1))
-cl2 <- names(which(km$cluster==2))
-if (length(cl1)>length(cl2)) {x<-cl2;cl2<-cl1;cl1<-x}
+km <- kmeans(data1.mds, centers=2, nstart=1000)
+cl1 <- names( which(km$cluster==1) )
+cl2 <- names( which(km$cluster==2) )
+if (length(cl1) > length(cl2)) {x<-cl2; cl2<-cl1; cl1<-x}
 cl1
 data2 <- data1[cl2,]
-qc2 <- check.marker(data2,hweids=(phdata(data2)$dm2==0),fdr=0.2)
+qc2 <- check.marker(data2, hweids=(phdata(data2)$dm2==0), fdr=0.2)
 summary(qc2)
-data2 <- data2[qc2$idok,qc2$snpok]
+data2 <- data2[qc2$idok, qc2$snpok]
 data2 <- Xfix(data2)
 ####
 #ge03d2.clean <- data2
-#save(ge03d2.clean,file="ge03d2.clean.RData")
+#save(ge03d2.clean, file="ge03d2.clean.RData")
 ####
 @
 
@@ -882,10 +880,10 @@
 
 \Exercise Perform GWA analysis of the cleaned data, 
 using asimptotic test and plot the results.
-What is the estimate of $\lambda$ for the 1 d.f. test?
+What is the estimate of $\lambda$ for the 1 d.f.~test?
 \Answer :
 <<>>=
-#qts <- qtscore(dm2,data2,trait="binomial")
+#qts <- qtscore(dm2, data2, trait="binomial")
 #lambda(qts)
 @
 
@@ -895,7 +893,7 @@
 the inflation factor?
 \Answer : 
 <<>>=
-#res1 <- qtscore(dm2,data=data2,times=200,quiet=TRUE,trait="binomial")
+#res1 <- qtscore(dm2, data=data2, times=200, quiet=TRUE, trait="binomial")
 #ds1 <- descriptives.scan(res1)
 #ds1
 @
@@ -923,7 +921,8 @@
 rs2456488, rs1292700, and rs8183220.
 Make a vector of these SNPs with
 <<>>=
-#vec12<-c("rs1646456","rs7950586","rs4785242","rs4435802","rs2847446","rs946364","rs299251","rs2456488","rs1292700","rs8183220")
+#vec12<-c("rs1646456", "rs7950586", "rs4785242", "rs4435802", "rs2847446",
+#         "rs946364", "rs299251", "rs2456488", "rs1292700", "rs8183220")
 @ 
 Load the stage 2 (replicaton) data set by 
 <<>>=
@@ -931,7 +930,7 @@
 @
 and select the subset of SNPs you need by
 <<>>=
-#confdat <- ge03d2c[,vec12]
+#confdat <- ge03d2c[, vec12]
 @
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/genabel -r 1191


More information about the Genabel-commits mailing list