[Genabel-commits] r1137 - tutorials/GenABEL_general

Wed Mar 13 18:35:20 CET 2013

Author: yurii
Date: 2013-03-13 18:35:20 +0100 (Wed, 13 Mar 2013)
New Revision: 1137

Added:
   tutorials/GenABEL_general/GWA.Rnw
   tutorials/GenABEL_general/GWAprotocol.Rnw
   tutorials/GenABEL_general/GWAsimple.Rnw
   tutorials/GenABEL_general/GenABEL-tutorial.Rnw
   tutorials/GenABEL_general/ImputedDataAnalysis.Rnw
   tutorials/GenABEL_general/Makefile
   tutorials/GenABEL_general/QC.Rnw
   tutorials/GenABEL_general/Sweave.sty
   tutorials/GenABEL_general/answers.Rnw
   tutorials/GenABEL_general/answers2.Rnw
   tutorials/GenABEL_general/assoc.Rnw
   tutorials/GenABEL_general/bibliography.bib
   tutorials/GenABEL_general/dataimport.Rnw
   tutorials/GenABEL_general/dbasing.Rnw
   tutorials/GenABEL_general/fetchData.Rnw
   tutorials/GenABEL_general/genetics.bst
   tutorials/GenABEL_general/impute.Rnw
   tutorials/GenABEL_general/internals.Rnw
   tutorials/GenABEL_general/intro.Rnw
   tutorials/GenABEL_general/introR.Rnw
   tutorials/GenABEL_general/meta.Rnw
   tutorials/GenABEL_general/myxhtml.cfg
   tutorials/GenABEL_general/preface.Rnw
   tutorials/GenABEL_general/probabel.Rnw
   tutorials/GenABEL_general/reg.Rnw
   tutorials/GenABEL_general/siman.Rnw
   tutorials/GenABEL_general/strat.Rnw
   tutorials/GenABEL_general/strat0.Rnw
   tutorials/GenABEL_general/workgwaaclass.Rnw
Log:
first version of GenABEL-general tutorial

Added: tutorials/GenABEL_general/GWA.Rnw
===================================================================

--- tutorials/GenABEL_general/GWA.Rnw	                        (rev 0)
+++ tutorials/GenABEL_general/GWA.Rnw	2013-03-13 17:35:20 UTC (rev 1137)
@@ -0,0 +1,1038 @@
+\chapter{Genome-wide association analysis}
+\label{sec:GWA}
+
+In the first parts of this section you will be guided 
+through a GWA analysis of a small data set. In the 
+last part you will investigate a larger data set by 
+yourself, do a verification study and will answer the 
+questions.
+All data sets used assume a study in a relatively 
+homogeneous population. 
+Try to finish the first part in the morning and the 
+second part in the afternoon.
+
+Though only few thousands of markers located at four 
+small chromosomes are used in the scan, we still going to 
+call it Genome-Wide (GW), as the amount of data we 
+will use is approaches the amount to be expected 
+in a real experiment. However, because the regions 
+are small, and the LD between SNPs is high, 
+some specific features (e.g. relatively high 
+residual inflation, which occurs because 
+large proportion of SNPs are in LD with 
+the reuly associated ones)
+are specific 
+features of this data set, which are not observed in 
+true GWA studies. 
+
+Start R and load \GA{} library 
+by typing 
+<<results=hide>>=
+library(GenABEL)
+@
+and load the data which we will use in this section by 
+<<>>=
+data(ge03d2ex)
+@
+
+Investigate the objects loaded by command
+<<>>=
+ls()
+@
+
+The \texttt{ge03d2ex} is an object of the class \texttt{gwaa.data}:
+<<>>=
+class(ge03d2ex)
+@
+
+To check what are the names of variables in the phenotypic data frame, use 
+<<>>=
+names(phdata(ge03d2ex))
+@
+
+We can attach this data frame
+to the \texttt{R} search path by
+<<>>=
+attach(phdata(ge03d2ex))
+@
+
+\section{Data descriptives and first round of GWA analysis}
+
+Let us investigate what are the traits presented in the data frame loaded 
+and what are the characteristics of the distribution by using specific 
+\GA{} function \texttt{descriptive.trait}:
+<<>>=
+descriptives.trait(ge03d2ex)
+@
+
+You can see that phenotypic frame contains the data on 136 people; the data 
+on sex, age, height, weight, diet and body mass index (BMI) are available. 
+Our trait of interest is \texttt{dm2} (type 2 diabetes). Note that every 
+single piece of information in this data set is simulated; however, we 
+tried to keep our simulations in a way we think the control of T2D may work. 
+
+You can produce a summary for cases and controls separately and 
+compare distributions of the traits by 
+<<>>=
+descriptives.trait(ge03d2ex,by=dm2)
+@
+Here, the \texttt{by} argument specifies the grouping variable.
+You can see that cases and controls are different in weight, which 
+is expected, as T2D is associated with obesity. 
+
+Similarly, you can produce grand GW descriptives of the marker data 
+by using 
+<<>>=
+descriptives.marker(ge03d2ex)
+@
+It is of note that we can see inflation of the proportion of the 
+tests for HWE at particular threshold, as compared to the expected. 
+This may indicate poor genotyping quality and/or genetic stratification.
+
+We can test the GW marker characteristics in controls by
+<<>>=
+descriptives.marker(ge03d2ex,ids=(dm2==0))
+@
+Apparently, HWE distribution holds better in controls than in 
+the total sample. 
+
+Let us check whether there are indications that deviation from HWE 
+is due to cases. At this stage we are only interested in HWE distribution 
+table, and therefore will ask to report the distrbution 
+for cases (\texttt{dm2==1}) and report only the table two:
+<<>>=
+descriptives.marker(ge03d2ex,ids=(dm2==1))[2]
+@
+and for the controls
+<<>>=
+descriptives.marker(ge03d2ex,ids=(dm2==0))[2]
+@
+
+It seems that indeed excessive number of markers are out of HWE in cases. 
+If no laboratory procedure (e.g. DNA extraction, genotyping, calling) were 
+done for cases and controls separately, this may indicate possible 
+genetic heterogeneity specific for cases. 
+
+In essence, '\texttt{descriptives.marker}' function 
+uses '\texttt{summary}' function to generate HW P-values 
+distribution. It may be interesting to generate this 
+distribution using the '\texttt{summary}' function 
+You do not need to do so, but this example shows how 
+you can generate summaries from underlying SNP-tables. 
+First, we need to 
+compute summary SNP statistics by
+
+<<>>=
+s <- summary(gtdata(ge03d2ex[(dm2==1),]))
+@
+
+Note the you have produced the summary for the \texttt{gtdata} 
+slot of \texttt{ge03d2ex}; this is the slot which actually 
+contain all genetic data in special compressed format. 
+
+You can see first 5 rows of this very long summary table by
+<<>>=
+s[1:5,]
+@
+Note that the column 'Pexact' provides exact HWE test $P$-values we need. 
+We can extract these to a separate vector by
+<<>>=
+pexcas <- s[,"Pexact"]
+@
+and do characterization of the cummulative distribution by 
+<<>>=
+catable(pexcas,c(0.001,0.01,0.05,0.1,0.5),cumulative=TRUE)
+@
+
+You can generate the distribution for controls in similar manner.
+
+%%and produce the $\chi^2-\chi^2$ plot and estimate inflation 
+%%factor by command \texttt{estlambda()}, which operates with a 
+%%vector of P-values or $\chi^2$s:
+%%<<>>=
+%%estlambda(pexcas)
+%%@
+%%
+%%By default, this function also produces a $\chi^2-\chi^2$ plot, at 
+%%which you can see some extreme deviation of observed from expected. 
+%%The resulting plot
+%%(figure \ref{fig:chi2chi2_Pexcas}) shows extreme deviation for  
+%%high values of the test.
+%%Looking at the $\lambda$ estimate, we indeed see inflation of the test statistics. 
+%%\begin{figure}[t]
+%%\centering
+%%<<fig=true,echo=false,results=hide>>=
+%%estlambda(s[,"Pexact"])
+%%@
+%%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE. 
+%%Black line of slope 1: expected under no inflation; Red line: 
+%%fitted slope.}
+%%\label{fig:chi2chi2_Pexcas}
+%%\end{figure}
+%%
+%%You can repeat this test for the controls, if time permits.
+
+Let us first try do GWA scan using the raw (before quality control) data.
+We will use the score test, as implemented in the 
+\texttt{qtscore()}\footnote{consider '\texttt{mlreg}' function 
+if you want to run true linear regression}
+function of \GA{} for testing:
+<<>>=
+an0 <- qtscore(dm2,ge03d2ex,trait="binomial")
+@
+The first argument used describes the model; here it is rather 
+simple --- the affection status, \texttt{dm2}, is supposed 
+to depend on SNP genotype only.
+
+You can see what information is computed by this function by using
+<<>>=
+an0
+@
+Here, let us look at the 'Results table'. 
+\texttt{P1df}, \texttt{P2df} and \texttt{Pc1df} are most 
+interesting; the first two are vectors of 1 and 2 d.f. $P$-values 
+obtained in the GWA analysis, the last one is 1 d.f. 
+P-value corrected for inflation factor $\lambda$ (which is 
+presented in \texttt{lambda} object). \texttt{effB} corresponds to 
+(approximate) Odds Ratios estimate for the SNP. 
+
+Let us see if there is evidence for the inflation of the test 
+statistics; for that let us obtain $\lambda$ with
+<<>>=
+lambda(an0)
+@
+The estimate of $\lambda$ is \Sexpr{round(lambda(an0)$est,dig=2)},
+suggesting inflation of the test and some degree of stratification.
+Though the value abtained seems to be small, it should be noted 
+that $\lambda$ grows linearly with smaple size, so for this small 
+number of cases and controls the value is worrisome. 
+
+The $\lambda$ is computed by regression in a Q-Q plot. Both estimation 
+ of $\lambda$ and production of the $\chi^2-\chi^2$ plot can be done 
+using the \texttt{estlambda} function; this was already done automatically 
+when running \texttt{qtscore} function, but let us repeat this manually:
+
+<<>>=
+estlambda(an0[,"P1df"],plot=TRUE)
+@
+The corresponding $\chi^2-\chi^2$ plot is presented in Figure 
+\ref{fig:chi2chi2_P1df}.
+
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false,results=hide>>=
+estlambda(an0[,"P1df"],plot=TRUE)
+@
+\caption{$\chi^2-\chi^2$ plot for a GWA scan. 
+Black line of slope 1: expected under no inflation; Red line: 
+fitted slope.}
+\label{fig:chi2chi2_P1df}
+\end{figure}
+
+
+\begin{tip}
+The 'se' produced by \texttt{estlambda} \emph{can not} 
+be used to test if inflation is significant and make 
+conclusions about presence of significant or insignificant 
+stratification.
+\end{tip}
+
+We can also present the obtained results using the 
+''Manhatten plot'', where a SNP genomic position is on 
+the X-axes and $-log_{10}$ of the $P$-value is 
+shown on Y-axes: 
+<<>>=
+plot(an0)
+@
+The resulting plot is presented in the figure \ref{fig:scan_initial}. 
+By default, $-log_{10}(P-value)$ of not corrected 1 d.f. test are 
+presented; see help to figure out how this behaviour can be changed. 
+
+We can also add the corrected P-values to the plot with
+<<>>=
+add.plot(an0,df="Pc1df",col=c("lightblue","lightgreen"))
+@
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false>>=
+plot(an0,col=c("blue","green"))
+add.plot(an0,df="Pc1df",col=c("lightblue","lightgreen"))
+@
+\caption{$-log_{10}(P-value)$ from the genome scan before QC procedure.
+Raw analysis: darker circles; corrected analysis: lighter circles}
+\label{fig:scan_initial}
+\end{figure}
+You can see that $P$-values corrected by genoic control 
+are uniformly lower than the $P$-values from 'raw' analysis. 
+This is to be expected as genomic control simply divides the 
+'raw' $\chi^2$ statistics by a constant $\lambda$ for all 
+SNPs. 
+
+You can also generate a descriptive table for the "top" (as ranked by 
+P-value) results by
+<<>>=
+descriptives.scan(an0)
+@
+or, equivalently, by '\texttt{summary(an0)}' 
+
+Here you see top 10 results, sorted by P-value with 1 d.f. If you want 
+to sort by the corrected P-value, you can use 
+\texttt{descriptives.scan(an0,sort="Pc1df")}; to see more then 10 
+(e.g. 25) top results, use \texttt{descriptives.scan(an0,top=25)}. 
+You can combine all these options. Large part of results reports NA 
+as effect estimates and 9.99 as \textit{P-value} for 2 d.f. test -- 
+for these markers only two out of three possible 
+genotypes were observed, and consequently 2 d.f. test could not be 
+performed. 
+
+Now let us apply \texttt{qtscore()} function with \texttt{times}
+argument, which tells it to compute empirical GW 
+(or experiment-wise) significance
+<<>>=
+an0.e <- qtscore(dm2,ge03d2ex,times=200,quiet=TRUE)
+@
+(you may skip the 'quiet=TRUE' argument, then you will see progress)
+
+Now let us generate the summary of the results
+<<>>=
+descriptives.scan(an0.e,sort="Pc1df")
+@
+
+Experimental-wise significance is computed by the empirical 
+procedure, thus we consider $P$-values $\le 0.05$ to be GW-significant.
+However, none of the SNPs hits GW significance. 
+If, actually, any did pass the threshold, we 
+could not trust the results, because the distribution of the HWE test 
+and presence of inflation factor for the association test statistics 
+suggest that the data may contain multiple errors (indeed they do).
+Therefore before association analysis we need to do 
+rigorous Quality Control (QC).
+
+Note that at certain SNP, corrected \textit{P-values} become equal 
+to 1 -- at this point order is arbitrary because sorting could not be done. 
+
+\begin{summary}
+\item The \texttt{descriptives} family of functions was developed 
+to facilitate production of tables which can be directly used 
+in a manuscript --- it is possible to save the output as a file, 
+which can be open by Excel or Word. See e.g. 
+\texttt{help(descriptives.trait)} for details. 
+\item The inflation of test statistics compared to null (1 d.f.) 
+may be estimated with \texttt{estlambda} function.
+\end{summary}
+
+\section{Genetic data QC}
+
+The major genetic data QC function of \GA{} is \texttt{check.marker()}.
+We will now use that to perform our data QC; the output is rather self-explaining.
+Because of possible genetic heterogeneity of the study data it 
+is good idea to skip Hardy-Weinberg checks in the first round of QC. 
+This can be achieved by setting HWE P-value 
+selection threshold to zero (\texttt{p.level=0}):
+<<>>=
+qc1 <- check.marker(ge03d2ex,p.level=0)
+@
+Note that normally you will \emph{NEVER} run 
+this simple form of the QC function -- you should always 
+provide a number of thresholds specific to the 
+platform you used for genotyping. See help to 
+\texttt{check.marker()} for detailed list of arguments. 
+Default values used by the function are rather 
+relaxed compared to the thresholds routinely used 
+nowadays with most of the platforms. 
+
+\begin{tip}
+The computation of all pairwise 
+proportion of alleles identical-by-state (IBS) 
+by \texttt{ibs()} function, which is also called by 
+\texttt{check.marker()} may take quite some time, 
+which is proportional to the square of the number of 
+subjects. This is not 
+a problem with the small number of people we use for 
+this example or when modern computers are used. However,
+the computers in the Nihes computer room are very old. 
+Therefore be prepared to wait for long time when you 
+will do a self-exercise with 1,000 people.
+\end{tip}
+
+From the output you can see that QC starts with checking the data for 
+SNPs and people with extremely low call rate. Six markers are 
+excluded from further analysis due to very low call rate. 
+Next, X-chromosomal errors are identified. The function 
+finds out that all errors (heterozygous male X-genotypes) are due to two people with 
+wrong sex assigned and one marker, which looks like an autosomal one. 
+This actually could be a marker from pseudoautosomal region, which 
+should have been arranged as a separate "autosome". 
+Nine people are found to have intermediate inbreeding at the 
+X-chromosome and also excluded from analysis.  
+
+Then, the procedure finds the markers with low call rate ($\le 0.95$ 
+by default) 
+across people, markers with low MAF (by default, low MAF is defined as 
+less than few copies of the rare allele, see help for details); 
+people with low call rate (default value: $\le 0.95$) across SNPs, people with extreme 
+heterozygosity (at FDR 0.01) and these who have GW IBS $\ge 0.95$. 
+These default parameters may be changed if you wish (consult help).
+
+Because some of the people fail to pass the tests, the data set is not 
+guaranteed to be really "clean" after single iteration, e.g. some marker 
+may not pass the call threshold after we exclude few 
+informative (but apparently having low quality) samples. Therefore the 
+QC is repeated iteratively until no further errors are found.
+
+You can generate short summary of QC by marker and by person through
+<<>>=
+summary(qc1)
+@
+
+Note that the original data, \texttt{ge03d2ex}, are not modified during the 
+procedure; rather, \texttt{check.markers()} generate a list of markers and 
+people which pass or do not pass certain QC criteria. The objects returned 
+by \texttt{check.markers()} are:
+<<>>=
+names(qc1)
+@
+
+The element \texttt{idok} provides the list of people who passed all QC criteria, 
+and \texttt{snpok} provides the list of SNPs which passed all criteria.
+You can easily generate a new data set, which will consist only of these 
+people and markers by
+<<>>=
+data1 <- ge03d2ex[qc1$idok,qc1$snpok]
+@
+
+If there are any residual sporadic X-errors (male heterozygosity), these 
+can (and should!) be fixed (set to NA) by
+<<>>=
+data1 <- Xfix(data1)
+@
+Applying this function does not make any difference for the example data set, 
+but you will need to use it for the bigger data set.
+
+At this point, we are ready to work with the new, cleaned, data set 
+\texttt{data1}. However, if we try
+<<>>=
+length(dm2)
+@
+we can see that the original phenotypic data are attached to the search 
+path (there are only \Sexpr{nids(data1)} people left in the 'clean' data set). 
+Therefore we need to detach the data by
+<<>>=
+detach(phdata(ge03d2ex))
+@
+and attach new data by
+<<>>=
+attach(phdata(data1))
+@
+
+At this stage, let us check if the first round of QC improves the fit 
+of genetic data to HWE, which may have been violated due to 
+by genotyping errors which we hopefully (at least partly!) eliminated:
+<<>>=
+descriptives.marker(data1)[2]
+descriptives.marker(data1[dm2==1,])[2]
+descriptives.marker(data1[dm2==0,])[2]
+@
+
+You can see that the fit to HWE improved, but 
+cases are still have excess number of markers 
+out of HWE. This may be due to genetic 
+sub-structure.  
+%%estlambda(summary(data1 at gtdata[dm2==1,])[,"Pexact"])
+
+%Apparently, the distribution (figure \ref{fig:chi2chi2_Pexcas2}) looks 
+%better (note the scale difference between the graphs), but the test 
+%statistics is still inflated.
+%\begin{figure}[t]
+%\centering
+%<<fig=true,echo=false,results=hide>>=
+%estlambda(summary(data1 at gtdata[data1 at phdata$dm2==1,])[,"Pexact"])
+%@
+%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE. 
+%Black line of slope 1: expected under no inflation; Red line: 
+%fitted slope.}
+%\label{fig:chi2chi2_Pexcas2}
+%\end{figure}
+
+\section{Finding genetic sub-structure}
+
+Now, we are ready for the second round of QC -- detection of 
+genetic outliers which may contaminate our results.
+We will detect genetic outliers using a  
+technique, which resembles the one suggested by Price at al. 
+
+As the first step, we will compute a matrix of genomic kinship between all 
+pairs of people, using only autosomal\footnote{the list of 
+autosomal markers contained in \texttt{data} is returned by 
+\texttt{autosomla(data)} function} 
+markers by
+<<>>=
+data1.gkin <- ibs(data1[,autosomal(data1)],weight="freq")
+@
+
+\begin{tip}
+This step may take few minutes on large data sets or 
+when using old computers!
+\end{tip}
+
+You can see the 5x5 upper left sub-matrix by
+<<>>=
+data1.gkin[1:5,1:5]
+@
+
+The numbers below the diagonal show the genomic estimate 
+of kinship (aka 'genomic kinship' or 'genome-wide IBD'), 
+the numbers on the diagonal correspond to 0.5 plus the genomic 
+homozigosity, and 
+the numbers above the diagonal tell how many SNPs were typed 
+successfully for both subjects
+(thus the IBD estimate is derived using this number of SNPs).
+
+Second, we transform this matrix to a distance 
+matrix using standard \texttt{R} command
+<<>>=
+data1.dist <- as.dist(0.5-data1.gkin)
+@
+
+Finally, we perform Classical Multidimensional Scaling by
+<<>>=
+data1.mds <- cmdscale(data1.dist)
+@
+by default, the first two principal components are computed and 
+returned.
+
+\begin{tip}
+This may take few minutes on large data sets or 
+when using old computers!
+\end{tip}
+
+We can present the results graphically by 
+<<>>=
+plot(data1.mds)
+@
+
+The resulting plot is presented in figure \ref{fig:mds}.
+Each point on the plot corresponds to a person, and the 2D 
+distances between points were fitted to be as close as possible 
+to these presented in the original IBS matrix. You can see that 
+study subjects clearly cluster in two groups. 
+
+You can identify the points belonging to clusters by
+<<>>=
+km <- kmeans(data1.mds,centers=2,nstart=1000)
+cl1 <- names(which(km$cluster==1))
+cl2 <- names(which(km$cluster==2))
+if (length(cl1)>length(cl2)) {x<-cl2;cl2<-cl1;cl1<-x}
+cl1
+cl2
+@
+Four outliers are presented in the smaller cluster.
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false>>=
+plot(data1.mds)
+points(data1.mds[cl1,],pch=19,col="red")
+@
+\caption{Mapping samples on the space of the first two Principle 
+Components resulting from analysis of genomic kinship. Red dots 
+identify genetic outliers}
+\label{fig:mds}
+\end{figure}
+
+\begin{tip}
+Now you will need to use the BIGGER cluster for 
+to select study subjects. Whether this 
+will be \texttt{cl1} or \texttt{cl2} in you case, is totally 
+random. 
+\end{tip}
+
+We can form a data set which is free from outliers by
+using only people from the bigger cluster:
+<<>>=
+data2 <- data1[cl2,]
+@
+
+After we dropped the outliers, we need to repeat QC using
+\texttt{check.markers()}. At this stage, we want to 
+allow for HWE checks (we will use only controls and 
+exclude markers with FDR $\le 0.2$): 
+<<>>=
+qc2 <- check.marker(data2,hweids=(phdata(data2)$dm2==0),fdr=0.2)
+summary(qc2)
+@
+\begin{tip}
+If the procedure did not run, check previous Note.
+\end{tip}
+
+Indeed, in the updated data set few markers do not pass our 
+QC criteria and we need to drop a few markers. This is done by
+<<>>=
+data2 <- data2[qc2$idok,qc2$snpok]
+@
+
+This is going to be our final analysis data set, therefore 
+let us attach the phenotypic data to the search path, then we do not 
+need to type \texttt{phdata(data2)\$...} to access \texttt{dm2} status or other 
+variables:
+<<>>=
+detach(phdata(data1))
+attach(phdata(data2))
+####
+#ge03d2ex.clean <- data2
+#save(ge03d2ex.clean,file="ge03d2ex.clean.RData")
+####
+@
+
+Before proceeding to GWA, let us check if complete QC improved the fit 
+of genetic data to HWE:
+<<>>=
+descriptives.marker(data2)[2]
+descriptives.marker(data2[phdata(data2)$dm2==1,])[2]
+descriptives.marker(data2[phdata(data2)$dm2==0,])[2]
+@
+
+You can see that now there is no excessive number of SNPs 
+out of HWE in the sample (total, or cases, or controls)
+
+\section{GWA association analysis}
+
+Let us start again with descriptives of the phenotypic 
+and marker data
+<<>>=
+descriptives.trait(data2,by=dm2)
+@
+
+You can see that relation to weight is maintained in this smaller, 
+but hopefully cleaner, data set; moreover, relation to age becomes 
+boundary significant. 
+
+If you check descriptives of markers (only HWE part shown)
+<<>>=
+descriptives.marker(data2)[2]
+@
+you can see that the problems with HWE are apparently fixed; 
+we may guess that these were caused by the Wahlund's effect.
+
+Run the score test on the cleaned data by
+
+<<>>=
+data2.qt <- qtscore(dm2,data2,trait="binomial")
+@
+and check lambda
+<<>>=
+lambda(data2.qt)
+@
+there is still some inflation, which is explained by the fact 
+that we investigate only few short chromosomes with high LD and few 
+causative variants. 
+
+Produce the association analysis plot by
+<<>>=
+plot(data2.qt,df="Pc1df")
+@
+(figure \ref{fig:scan_QC}).
+
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false>>=
+plot(data2.qt,df="Pc1df")
+@
+\caption{$-log_{10}(Corrected P-value)$ from the genome scan after the QC procedure}
+\label{fig:scan_QC}
+\end{figure}
+
+Produce the scan summary by
+<<>>=
+descriptives.scan(data2.qt,sort="Pc1df")
+@
+
+Comparison with the top 10 from the scan before QC shows that results changed 
+substantially with only few markers overlapping. 
+
+You can see similar results when accessing empirical GW 
+significance:
+<<>>=
+data2.qte <- qtscore(dm2,data2,times=200,quiet=TRUE,trait="binomial")
+descriptives.scan(data2.qte,sort="Pc1df")
+@
+
+Again, none of the SNPs hits GW 5\% significance. Still,
+you can see that after QC top markers achieve somewhat "better" 
+significance.
+
+In the last part, we will do several adjusted and stratified analyses.
+Only empirical P-values will be estimated to make the story shorter.
+To adjust for sex and age, we can 
+<<>>=
+data2.qtae <- qtscore(dm2~sex+age,data2,times=200,quiet=T,trait="binomial")
+descriptives.scan(data2.qtae)
+@
+
+You can see that there is little difference between adjusted and 
+unadjusted analysis, but this is not always the case; adjustment 
+may make your study much more powerful when covariates explain a 
+large proportion of environmental trait variation.
+
+Finally, let us do stratified (by BMI) analysis. We will contracts 
+obese ($BMI \ge 30$) cases to all controls. 
+<<>>=
+data2.qtse <- qtscore(dm2~sex+age,data2,ids=((bmi>30 & dm2==1) | dm2==0),times=200,quiet=TRUE,trait="binomial")
+descriptives.scan(data2.qtse,sort="Pc1df")
+@
+
+Again, noting interesting at GW significance level. If we would have had found 
+something, naturally, we would not known if we mapped a T2D or obesity gene
+(or a gene for obesity in presence of T2D, or the one for T2D in presence 
+of obesity).
+
+Let us save 'data1', 'data1.gkin', 'data2.qt' and 'cl1' objects now 
+<<>>=
+save(data1,cl1,data1.gkin,data2.qt,file="data1.RData")
+@
+These data will be used later 
+in section \ref{sec:strat}, "\nameref{sec:strat}", in which 
+we will perform GWA using different methods 
+to account for strtatification, and will compare the results 
+with these obtaining by removing outliers ('data2.qt').
+
+Let us also save the cleaned 'data2' object, which will be later 
+used in section \ref{sec:meta} ("\nameref{sec:meta}"):
+<<>>=
+save(data2,file="data2.RData")
+@
+
+At this point, you acquired the knowledge necessary for the self-exercise. 
+Please close \texttt{R} by \texttt{q()} command and proceed to the next 
+section.
+
+\section{Genome-wide association analysis exercise}
+
+\begin{ExerciseList}
+During the exercise, you will work with a larger data set 
+(approximately 1,000 people and 7,000+ SNPs). You are to 
+do complete three-round QC; perform GWA analysis with 
+\texttt{dm2} as the outcome of interest and identify 
+10 SNPs which you would like to take to the stage 2 (replication) 
+scan. You will do replication analysis using a 
+confirmatory data set. If you did everything right, the 
+SNPs which you identified as significant or replicated 
+will be located in know T2D genes. 
+
+Please keep in mind that the data are simulated, and do not 
+take your findings too seriously!
+
+Start \texttt{R} by going to "Start -> Programs -> R -> R-?.?.?". 
+Load \GA{} library by
+%<<echo=false,results=hide>>=
+%rm(list=ls())
+%@
+<<>>=
+library(GenABEL)
+@
+
+The two data sets we will use in this exercise are part of the \GA{} 
+distribution. The first one ("discovery" set) can be loaded by 
+<<>>=
+data(ge03d2)
+@
+
+Please move along the lines detailed in the guided exercise and 
+try to answer following questions:
+
+\Exercise How many cases and controls are presented 
+in the original data set?
+\Answer :
+<<>>=
+table(phdata(ge03d2)$dm2)
+@
+
+
+\Exercise How many markers are presented 
+in the original data set?
+\Answer :
+<<>>=
+nsnps(ge03d2)
+@
+
+
+\Exercise Is there evidence for inflation of the 
+HWE test staistics?
+\Answer Yes:
+<<>>=
+descriptives.marker(ge03d2)[2]
+@
+
+
+\Exercise Analyse empirical GW significance. How many SNPs 
+pass genome-wide significance threshold, after correction for 
+the inflation factor? Write down the names of these SNPs for 
+further comparison.
+\Answer :
+<<>>=
+res0 <- qtscore(dm2,data=ge03d2,times=200,quiet=TRUE,trait="binomial")
+lambda(res0)
+ds <- descriptives.scan(res0)
+ds
+@
+Thus, there are no genome-wide empirically significant results.
+The 'top' 10 SNPs are
+<<>>=
+snps0 <- rownames(ds)
+snps0
+@
+(note that if empirical P is 1, the rank is assigned quite arbitrarily)
+
+
+\Exercise Perform first steps of the genetic data QC.
+\Answer First step of QC
+<<>>=
+qc1 <- check.marker(ge03d2,call=0.95,perid.call=0.95,p.level=0,ibs.exclude="both")
+summary(qc1)
+data1 <- ge03d2[qc1$idok,qc1$snpok]
+data1 <- Xfix(data1)
+qc2 <- check.marker(data1,call=0.95,perid.call=0.95,p.level=0)
+summary(qc2)
+@
+
+
+\Exercise How many males are 'genetically' females?
+\Answer The list of genetic females who are coded as males is 
+<<>>=
+qc1$isfemale
+@
+ 
+\Exercise How many females are 'genetically' males?
+\Answer The list of genetic males who are coded as females is 
+<<>>=
+qc1$ismale
+@
+ 
+\Exercise How many people are quessed to have 'XXY' genotype?
+\Answer The number of 'XXY' people is \Sexpr{length(qc1$isXXY)}
+ 
+\Exercise How many sporadic X errors do you still 
+observe even when the female male and non-X X-markers are removed?
+(do not forget to \texttt{Xfix()} these
+\Answer Eight 'sporadic' X-errors are left after removing people 
+with likely sex code errors (seven in the data set after first step 
+of QC)
+
+\Exercise How many "twin" DNAs did you discover?
+\Answer The list of IDs failing IBS checks ('twin' DNAs) is 
+<<>>=
+qc1$ibsfail
+@
+
+\Exercise Perform second step of QC.
+\Answer The second step of QC:
+<<>>=
+data1.gkin <- ibs(data1[,autosomal(data1)],weight="freq")
+data1.dist <- as.dist(0.5-data1.gkin)
+data1.mds <- cmdscale(data1.dist)
+km <- kmeans(data1.mds,centers=2,nstart=1000)
+cl1 <- names(which(km$cluster==1))
+cl2 <- names(which(km$cluster==2))
+if (length(cl1)>length(cl2)) {x<-cl2;cl2<-cl1;cl1<-x}
+cl1
+data2 <- data1[cl2,]
+qc2 <- check.marker(data2,hweids=(phdata(data2)$dm2==0),fdr=0.2)
+summary(qc2)
+data2 <- data2[qc2$idok,qc2$snpok]
+data2 <- Xfix(data2)
+####
+#ge03d2.clean <- data2
+#save(ge03d2.clean,file="ge03d2.clean.RData")
+####
+@
+
+\Exercise How many genetic outliers did you discover?
+\Answer The list of genetic outliers is 
+<<>>=
+cl1
+@
+
+\Exercise 
+How many cases and controls are presented 
+in the data after QC?
+\Answer :
+<<>>=
+table(phdata(data2)$dm2)
+@
+
+
+\Exercise How many markers are presented 
+in the data after QC?
+\Answer :
+<<>>=
+nsnps(data2)
+@
+
+
+\Exercise Is there evidence for inflation of the 
+HWE test staistics?
+\Answer No:
+<<>>=
+descriptives.marker(data2)[2]
+@
+
+FROM THIS POINT ON THE RUNNING OF THE TUTORIAL FAILS (2013.02.20,
+USING GENABEL DEV VERSION 1.7-4). I COMMENTED OUT THE R CODE BELOW. 
+
+\Exercise Perform GWA analysis of the cleaned data, 
+using asimptotic test and plot the results.
+What is the estimate of $\lambda$ for the 1 d.f. test?
+\Answer :
+<<>>=
+#qts <- qtscore(dm2,data2,trait="binomial")
+#lambda(qts)
+@
+
+
+\Exercise Analyse empirical GW significance. How many SNPs 
+pass genome-wide significance threshold, after correction for 
+the inflation factor?
+\Answer : 
+<<>>=
+#res1 <- qtscore(dm2,data=data2,times=200,quiet=TRUE,trait="binomial")
+#ds1 <- descriptives.scan(res1)
+#ds1
+@
+There are SNPs which are empirically genome-wide significant
+in the data.
+To get the list of 'top' 10 SNPs:
+<<>>=
+#snps1 <- rownames(ds1)
+#snps1
+@
+
+
+\Exercise Do these SNPs overlap much with the ones 
+ranked at the top before the QC? If not, what could be the reason?
+\Answer There is little overlap between SNPs before 
+and after QC:
+<<>>=
+#snps0
+#snps1
+@
+
+
+\Exercise Select 10 SNPs which you would like to follow-up. Say, you've selected
+rs1646456, rs7950586, rs4785242, rs4435802, rs2847446, rs946364, rs299251, 
+rs2456488, rs1292700, and rs8183220.
+Make a vector of these SNPs with
+<<>>=
+#vec12<-c("rs1646456","rs7950586","rs4785242","rs4435802","rs2847446","rs946364","rs299251","rs2456488","rs1292700","rs8183220")
+@ 
+Load the stage 2 (replicaton) data set by 
+<<>>=
+#data(ge03d2c)
+@
+and select the subset of SNPs you need by
+<<>>=
+#confdat <- ge03d2c[,vec12]
+@
+Analyse the \texttt{confdat} for association with \texttt{dm2}. 
+
+\Answer :
+<<>>=
+#data(ge03d2c)
+#snps1
+#confdat <- ge03d2c[,snps1]
+#rep <- qtscore(dm2,confdat,times=10000,quiet=TRUE)
+#descriptives.scan(rep)
+@
+
+
+\Exercise Given the two-stage design, and applying the 
+puristic criteria specified in the lecture, for how many SNPs 
+you can claim a significant finding?
+\Answer Two-stage P-value is
+<<>>=
+#    snps1
+#    finres <- matrix(NA,10,3)
+#    colnames(finres) <- c("Stage 1","Replication","Combined")
+#    rownames(finres) <- snps1
+#    for (i in 1:10) {
+#        finres[i,1] <- res1[which(snpnames(data2)==snps1[i]),"Pc1df"]
+#        finres[i,2] <- rep[which(snpnames(confdat)==snps1[i]),"P1df"]
+#        finres[i,3] <- finres[i,1]*finres[i,2]
+#    }
+#    finres
+#    for (i in 1:10) {
+#        if (finres[i,3] <= 0.05) {
+#            print(c("---------",rownames(finres)[i],"-------"))
+#            
+#            print(c(rownames(finres)[i],"stage 1:"))
+#            ph <- phdata(data2)$dm2
+#            gt <- as.numeric(data2[,rownames(finres)[i]])
+#            print(summary(glm(ph~gt,family=binomial))$coef)
+#            
+#            print(c(rownames(finres)[i],"stage 2:"))
+#            ph <- phdata(confdat)$dm2
+#            gt <- as.numeric(confdat[,rownames(finres)[i]])
+#            print(summary(glm(ph~gt,family=binomial))$coef)
+#
+#            print(c(rownames(finres)[i],"Joint:"))
+#            ph <- c(phdata(data2)$dm2,phdata(confdat)$dm2)
+#            gt <- c(as.numeric(data2[,rownames(finres)[i]]),as.numeric(confdat[,rownames(finres)[i]]))
+#            print(summary(glm(ph~gt,family=binomial))$coef)
+#        }
+#    }
+#replicatedsnps <- rownames(finres)[finres[,"Stage 1"]<=0.05 & 
+#				finres[,"Replication"]<=0.05 & finres[,"Combined"]<=0.05]
+#replicatedsnps
+#sigsnps <- rownames(finres)[finres[,"Combined"]<=0.05]
+#sigsnps
+@
+At the first glance, NWSexpr{length(replicatedsnps)} SNPs  
+may be claimed as replicated 
+because both first stage and replication $P$-values are
+$\le 0.05$ and effects are consistent, and additionally   
+NWSexpr{length(sigsnps) - length(replicatedsnps)} may 
+be claimed as 'significant' because 
+joint $p$-values are $\le 0.05$ and the effects are consistent. 
+Generally, a more thorough simulation experiment should be performed.
+
+
+\Exercise Using the same criteria, for how many SNPs 
+you can claim a replicated finding?
+\Answer SNP ''rs7903146'' had empirical $p$-value $\le 0.05$ 
+at both stages, and very strong 
+joint significance. It can be claimed as replicated.
+
+You can check if any of the SNPs you have 
+identified as significant or replicated are the ones which were 
+simulated to be associated with \texttt{dm2} by using the command 
+\texttt{show.ncbi(c("snpname1","snpname2","snpname3"))}
+where \texttt{snpnameX} stands for the name of your identified SNP. 
+The "true" SNPs can be found on NCBI and some are located in known T2D 
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/genabel -r 1137