[Genabel-commits] r2091 - tutorials/GenABEL_general

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Tue Apr 6 00:06:09 CEST 2021


Author: lckarssen
Date: 2021-04-06 00:06:09 +0200 (Tue, 06 Apr 2021)
New Revision: 2091

Added:
   tutorials/GenABEL_general/README
Removed:
   tutorials/GenABEL_general/GWA.Rnw
   tutorials/GenABEL_general/GWAprotocol.Rnw
   tutorials/GenABEL_general/GWAsimple.Rnw
   tutorials/GenABEL_general/GenABEL-tutorial.Rnw
   tutorials/GenABEL_general/ImputedDataAnalysis.Rnw
   tutorials/GenABEL_general/Makefile
   tutorials/GenABEL_general/QC.Rnw
   tutorials/GenABEL_general/Sweave.sty
   tutorials/GenABEL_general/answers.Rnw
   tutorials/GenABEL_general/answers2.Rnw
   tutorials/GenABEL_general/assoc.Rnw
   tutorials/GenABEL_general/bibliography.bib
   tutorials/GenABEL_general/dataimport.Rnw
   tutorials/GenABEL_general/dbasing.Rnw
   tutorials/GenABEL_general/fetchData.Rnw
   tutorials/GenABEL_general/genetics.bst
   tutorials/GenABEL_general/impute.Rnw
   tutorials/GenABEL_general/internals.Rnw
   tutorials/GenABEL_general/intro.Rnw
   tutorials/GenABEL_general/introR.Rnw
   tutorials/GenABEL_general/meta.Rnw
   tutorials/GenABEL_general/myxhtml.cfg
   tutorials/GenABEL_general/preface.Rnw
   tutorials/GenABEL_general/probabel.Rnw
   tutorials/GenABEL_general/reg.Rnw
   tutorials/GenABEL_general/siman.Rnw
   tutorials/GenABEL_general/strat.Rnw
   tutorials/GenABEL_general/strat0.Rnw
   tutorials/GenABEL_general/workgwaaclass.Rnw
Log:
Remove the code of the GenABEL tutorial after moving to Github

Development will continue at https://github.com/GenABEL-Project/GenABEL_tutorial.
The remaining README file points to this location.



Deleted: tutorials/GenABEL_general/GWA.Rnw
===================================================================
--- tutorials/GenABEL_general/GWA.Rnw	2019-06-25 11:35:07 UTC (rev 2090)
+++ tutorials/GenABEL_general/GWA.Rnw	2021-04-05 22:06:09 UTC (rev 2091)
@@ -1,1038 +0,0 @@
-\chapter{Genome-wide association analysis}
-\label{sec:GWA}
-
-In the first parts of this section you will be guided 
-through a GWA analysis of a small data set. In the 
-last part you will investigate a larger data set by 
-yourself, do a verification study and will answer the 
-questions.
-All data sets used assume a study in a relatively 
-homogeneous population. 
-Try to finish the first part in the morning and the 
-second part in the afternoon.
-
-Though only few thousands of markers located at four 
-small chromosomes are used in the scan, we still going to 
-call it Genome-Wide (GW), as the amount of data we 
-will use is approaches the amount to be expected 
-in a real experiment. However, because the regions 
-are small, and the LD between SNPs is high, 
-some specific features (\eg relatively high 
-residual inflation, which occurs because 
-large proportion of SNPs are in LD with 
-the really associated ones)
-are specific 
-features of this data set, which are not observed in 
-true GWA studies. 
-
-Start R and load the \GA{} library 
-by typing 
-<<results=hide>>=
-library(GenABEL)
-@
-and load the data which we will use in this section by 
-<<>>=
-data(ge03d2ex)
-@
-
-Investigate the objects loaded by the command
-<<>>=
-ls()
-@
-
-The \texttt{ge03d2ex} is an object of the class \texttt{gwaa.data}:
-<<>>=
-class(ge03d2ex)
-@
-To check what are the names of variables in the phenotypic data frame, use 
-<<>>=
-names(phdata(ge03d2ex))
-@
-We can attach this data frame to the \texttt{R} search path by
-<<>>=
-attach(phdata(ge03d2ex))
-@
-
-\section{Data descriptives and first round of GWA analysis}
-
-Let us investigate which traits are present in the loaded data frame 
-and what are the characteristics of the distribution by using the 
-\GA{} function \texttt{descriptive.trait}:
-<<>>=
-descriptives.trait(ge03d2ex)
-@
-
-You can see that the phenotypic frame contains data on 136 people; the data 
-on sex, age, height, weight, diet and body mass index (BMI) are available. 
-Our trait of interest is \texttt{dm2} (type 2 diabetes). Note that every 
-single piece of information in this data set is simulated; however, we 
-tried to keep our simulations in a way we think the control of T2D may work. 
-
-You can produce a summary for cases and controls separately and 
-compare distributions of the traits by 
-<<>>=
-descriptives.trait(ge03d2ex, by=dm2)
-@
-Here, the \texttt{by} argument specifies the grouping variable.
-You can see that cases and controls are different in weight, which 
-is expected, as T2D is associated with obesity. 
-
-Similarly, you can produce grand GW descriptives of the marker data 
-by using 
-<<>>=
-descriptives.marker(ge03d2ex)
-@
-It is of note that we can see inflation of the proportion of the 
-tests for HWE at a particular threshold, as compared to the expected. 
-This may indicate poor genotyping quality and/or genetic stratification.
-
-We can test the GW marker characteristics in controls by
-<<>>=
-descriptives.marker(ge03d2ex, ids=(dm2==0))
-@
-Apparently, HWE distribution holds better in controls than in 
-the total sample. 
-
-Let us check whether there are indications that deviation from HWE 
-is due to cases. At this stage we are only interested in the HWE distribution 
-table, and therefore will ask to report the distrbution 
-for cases (\texttt{dm2==1}) and report only table two:
-<<>>=
-descriptives.marker(ge03d2ex, ids=(dm2==1))[2]
-@
-and for the controls
-<<>>=
-descriptives.marker(ge03d2ex, ids=(dm2==0))[2]
-@
-It seems that indeed excessive number of markers are out of HWE in cases. 
-If no laboratory procedure (\eg DNA extraction, genotyping, calling) were 
-done for cases and controls separately, this may indicate possible 
-genetic heterogeneity specific for cases. 
-
-In essence, the '\texttt{descriptives.marker}' function 
-uses the '\texttt{summary}' function to generate the HW $P$-values 
-distribution. It may be interesting to generate this 
-distribution using the '\texttt{summary}' function 
-You do not need to do so, but this example shows how 
-you can generate summaries from underlying SNP-tables. 
-First, we need to compute summary SNP statistics by
-<<>>=
-s <- summary( gtdata( ge03d2ex[(dm2==1), ] ) )
-@
-
-Note the you have produced the summary for the \texttt{gtdata} 
-slot of \texttt{ge03d2ex}; this is the slot which actually 
-contain all genetic data in special compressed format. 
-
-You can see the first 5 rows of this very long summary table by
-<<>>=
-s[1:5, ]
-@
-Note that the column 'Pexact' provides exact HWE test $P$-values we need. 
-We can extract these to a separate vector by
-<<>>=
-pexcas <- s[, "Pexact"]
-@
-and do characterization of the cummulative distribution by 
-<<>>=
-catable(pexcas, c(0.001,0.01,0.05,0.1,0.5), cumulative=TRUE)
-@
-
-You can generate the distribution for controls in similar manner.
-
-%%and produce the $\chi^2-\chi^2$ plot and estimate inflation 
-%%factor by command \texttt{estlambda()}, which operates with a 
-%%vector of P-values or $\chi^2$s:
-%%<<>>=
-%%estlambda(pexcas)
-%%@
-%%
-%%By default, this function also produces a $\chi^2-\chi^2$ plot, at 
-%%which you can see some extreme deviation of observed from expected. 
-%%The resulting plot
-%%(figure \ref{fig:chi2chi2_Pexcas}) shows extreme deviation for  
-%%high values of the test.
-%%Looking at the $\lambda$ estimate, we indeed see inflation of the test statistics. 
-%%\begin{figure}[t]
-%%\centering
-%%<<fig=true,echo=false,results=hide>>=
-%%estlambda(s[,"Pexact"])
-%%@
-%%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE. 
-%%Black line of slope 1: expected under no inflation; Red line: 
-%%fitted slope.}
-%%\label{fig:chi2chi2_Pexcas}
-%%\end{figure}
-%%
-%%You can repeat this test for the controls, if time permits.
-
-Let us first try do a GWA scan using the raw (before quality control) data.
-We will use the score test, as implemented in the 
-\texttt{qtscore()}\footnote{consider '\texttt{mlreg}' function 
-if you want to run a true linear regression}
-function of \GA{} for testing:
-<<>>=
-an0 <- qtscore(dm2, ge03d2ex, trait="binomial")
-@
-The first argument used describes the model; here it is rather 
-simple --- the affection status, \texttt{dm2}, is supposed 
-to depend on SNP genotype only.
-
-You can see what information is computed by this function by using
-<<>>=
-an0
-@ 
-Here, let us look at the 'Results table'.  \texttt{P1df},
-\texttt{P2df} and \texttt{Pc1df} are most interesting; the first two
-are vectors of 1 and 2 d.f.~$P$-values obtained in the GWA analysis,
-the last one is 1 d.f.~$P$-value corrected for inflation factor
-$\lambda$ (which is in the \texttt{lambda}
-object). \texttt{effB} corresponds to the (approximate) Odds Ratio
-estimate for the SNP.
-
-Let us see if there is evidence for the inflation of the test 
-statistics; for that let us obtain $\lambda$ with
-<<>>=
-lambda(an0)
-@
-The estimate of $\lambda$ is \Sexpr{round(lambda(an0)$est, dig=2)},
-suggesting inflation of the test and some degree of stratification.
-Though the value obtained seems to be small, it should be noted 
-that $\lambda$ grows linearly with sample size, so for this small 
-number of cases and controls the value is worrisome. 
-
-The $\lambda$ is computed by regression in a Q-Q plot. Both estimation 
-of $\lambda$ and production of the $\chi^2-\chi^2$ plot can be done 
-using the \texttt{estlambda} function; this was already done automatically 
-when running \texttt{qtscore} function, but let us repeat this manually:
-
-<<>>=
-estlambda(an0[, "P1df"], plot=TRUE)
-@
-The corresponding $\chi^2-\chi^2$ plot is shown in 
-Figure~\ref{fig:chi2chi2_P1df}.
-
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false,results=hide>>=
-estlambda(an0[, "P1df"], plot=TRUE)
-@
-\caption{$\chi^2-\chi^2$ plot for a GWA scan. 
-Black line of slope 1: expected under no inflation; Red line: 
-fitted slope.}
-\label{fig:chi2chi2_P1df}
-\end{figure}
-
-
-\begin{tip}
-The 'se' produced by \texttt{estlambda} \emph{can not} 
-be used to test if inflation is significant and make 
-conclusions about the presence of significant or insignificant 
-stratification.
-\end{tip}
-
-We can also present the obtained results using the 
-''Manhatten plot'', where the SNP genomic position is on 
-the horizontal axis and $-\log_{10}$ of the $P$-value is 
-shown on the vertical axis: 
-<<>>=
-plot(an0)
-@
-The resulting plot is show in Figure~\ref{fig:scan_initial}. 
-By default, $-\log_{10}(P\text{-value})$ of the uncorrected 1 d.f.~test are 
-shown; see thehelp to figure out how this behaviour can be changed. 
-
-We can also add the corrected $P$-values to the plot with
-<<>>=
-add.plot(an0, df="Pc1df", col=c("lightblue", "lightgreen"))
-@
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false>>=
-par(mfcol=c(2,1))
-plot(an0, col=c("blue", "green"))
-plot(an0, df="Pc1df", col=c("blue", "green"))
-@
-\caption{$-\log_{10}(P\text{-value})$ from the genome scan before QC procedure.
-Raw analysis: darker circles; corrected analysis: lighter circles}
-\label{fig:scan_initial}
-\end{figure}
-You can see that the $P$-values corrected by genomic control 
-are uniformly lower than the $P$-values from 'raw' analysis. 
-This is to be expected as genomic control simply divides the 
-'raw' $\chi^2$ statistics by a constant $\lambda$ for all 
-SNPs. 
-
-You can also generate a descriptive table for the "top" (as ranked by 
-$P$-value) results by
-<<>>=
-descriptives.scan(an0)
-@
-or, equivalently, by '\texttt{summary(an0)}' 
-
-Here you see top 10 results, sorted by $P$-value with 1 d.f. If you want 
-to sort by the corrected $P$-value, you can use 
-\texttt{descriptives.scan(an0, sort="Pc1df")}; to see more than 10 
-(\eg 25) top results, use \texttt{descriptives.scan(an0, top=25)}. 
-You can combine all these options. Large part of results reports NA 
-as effect estimates and 9.99 as $P$-value for 2 d.f.~test -- 
-for these markers only two out of the three possible 
-genotypes were observed, and consequently the 2 d.f.~test could not be 
-performed. 
-
-Now let us apply the \texttt{qtscore()} function with \texttt{times}
-argument, which tells it to compute empirical GW 
-(or experiment-wise) significance
-<<>>=
-an0.e <- qtscore(dm2, ge03d2ex, times=200, quiet=TRUE)
-@
-(you may skip the 'quiet=TRUE' argument, then you will see progress)
-
-Now let us generate the summary of the results
-<<>>=
-descriptives.scan(an0.e, sort="Pc1df")
-@
-
-Experimental-wise significance is computed by an empirical 
-procedure, thus we consider $P$-values $\le 0.05$ to be GW-significant.
-However, none of the SNPs hits GW significance. 
-If, actually, any did pass the threshold, we 
-could not trust the results, because the distribution of the HWE test 
-and presence of inflation factor for the association test statistics 
-suggest that the data may contain multiple errors (indeed they do).
-Therefore before association analysis we need to do 
-rigorous Quality Control (QC).
-
-Note that at a certain SNP, the corrected $P$-values become equal to 1
--- at this point the order in the list is arbitrary because sorting
-could not be done.
-
-\begin{summary}
-\item The \texttt{descriptives} family of functions was developed 
-to facilitate the production of tables which can be directly used 
-in a manuscript --- it is possible to save the output as a file, 
-which can be open by Excel or Word. See \eg 
-\texttt{help(descriptives.trait)} for details. 
-\item The inflation of test statistics compared to null (1 d.f.) 
-may be estimated with \texttt{estlambda} function.
-\end{summary}
-
-
-\section{Genetic data QC}
-
-The major genetic data QC function of \GA{} is \texttt{check.marker()}.
-We will now use that to perform our data QC; the output is rather self-explaining.
-Because of possible genetic heterogeneity of the study data it 
-is a good idea to skip Hardy-Weinberg checks in the first round of QC. 
-This can be achieved by setting HWE $P$-value 
-selection threshold to zero (\texttt{p.level=0}):
-<<>>=
-qc1 <- check.marker(ge03d2ex, p.level=0)
-@
-Note that normally you will \emph{NEVER} run 
-this simple form of the QC function -- you should always 
-provide a number of thresholds specific to the 
-platform you used for genotyping. See help to 
-\texttt{check.marker()} for detailed list of arguments. 
-The default values used by the function are rather 
-relaxed compared to the thresholds routinely used 
-nowadays with most of the platforms. 
-
-\begin{tip}
-The computation of all pairwise 
-proportion of alleles identical-by-state (IBS) 
-by \texttt{ibs()} function, which is also called by 
-\texttt{check.marker()} may take quite some time, 
-which is proportional to the square of the number of 
-subjects. This is not 
-a problem with the small number of people we use for 
-this example or when modern computers are used. However,
-the computers in the computer room are very old. 
-Therefore be prepared to wait for long time when you 
-will do a self-exercise with 1,000 people.
-\end{tip}
-
-From the output you can see that QC starts with checking the data for 
-SNPs and people with extremely low call rate. Six markers are 
-excluded from further analysis due to very low call rate. 
-Next, X-chromosomal errors are identified. The function 
-finds out that all errors (heterozygous male X-genotypes) are due to two people with 
-wrong sex assigned and one marker, which looks like an autosomal one. 
-This actually could be a marker from the pseudoautosomal region, which 
-should have been arranged as a separate "autosome". 
-Nine people are found to have intermediate inbreeding at the 
-X-chromosome and are also excluded from analysis.  
-
-Then, the procedure finds the markers with low call rate ($\le 0.95$ 
-by default) 
-across people, markers with low MAF (by default, low MAF is defined as 
-less than a few copies of the rare allele, see help for details); 
-people with low call rate (default value: $\le 0.95$) across SNPs, people with extreme 
-heterozygosity (at FDR 0.01) and those who have GW IBS $\ge 0.95$. 
-These default parameters may be changed if you wish (consult the help).
-
-Because some of the people fail to pass the tests, the data set is not 
-guaranteed to be really "clean" after single iteration, \eg some marker 
-may not pass the call threshold after we exclude few 
-informative (but apparently having low quality) samples. Therefore the 
-QC is repeated iteratively until no further errors are found.
-
-You can generate a short summary of the QC by marker and by person through
-<<>>=
-summary(qc1)
-@
-Note that the original data, \texttt{ge03d2ex}, are not modified during the 
-procedure; rather, \texttt{check.markers()} generate a list of markers and 
-people which pass or do not pass certain QC criteria. The objects returned 
-by \texttt{check.markers()} are:
-<<>>=
-names(qc1)
-@
-
-The element \texttt{idok} provides the list of people who passed all QC criteria, 
-and \texttt{snpok} provides the list of SNPs which passed all criteria.
-You can easily generate a new data set, which will consist only of these 
-people and markers by
-<<>>=
-data1 <- ge03d2ex[qc1$idok, qc1$snpok]
-@
-
-If there are any residual sporadic X-errors (male heterozygosity), these 
-can (and should!) be fixed (set to NA) by
-<<>>=
-data1 <- Xfix(data1)
-@
-Applying this function does not make any difference for the example data set, 
-but you will need to use it for the bigger data set.
-
-At this point, we are ready to work with the new, cleaned, data set 
-\texttt{data1}. However, if we try
-<<>>=
-length(dm2)
-@
-we can see that the original phenotypic data are attached to the search 
-path (there are only \Sexpr{nids(data1)} people left in the 'clean' data set). 
-Therefore we need to detach the data by
-<<>>=
-detach(phdata(ge03d2ex))
-@
-and attach new data by
-<<>>=
-attach(phdata(data1))
-@
-
-At this stage, let us check if the first round of QC improves the fit 
-of genetic data to HWE, which may have been violated due to 
-by genotyping errors which we hopefully (at least partly!) eliminated:
-<<>>=
-descriptives.marker(data1)[2]
-descriptives.marker(data1[dm2==1, ])[2]
-descriptives.marker(data1[dm2==0, ])[2]
-@
-You can see that the fit to HWE improved, but 
-cases are still have excess number of markers 
-out of HWE. This may be due to genetic 
-sub-structure.  
-%%estlambda(summary(data1 at gtdata[dm2==1,])[,"Pexact"])
-
-%Apparently, the distribution (figure \ref{fig:chi2chi2_Pexcas2}) looks 
-%better (note the scale difference between the graphs), but the test 
-%statistics is still inflated.
-%\begin{figure}[t]
-%\centering
-%<<fig=true,echo=false,results=hide>>=
-%estlambda(summary(data1 at gtdata[data1 at phdata$dm2==1,])[,"Pexact"])
-%@
-%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE. 
-%Black line of slope 1: expected under no inflation; Red line: 
-%fitted slope.}
-%\label{fig:chi2chi2_Pexcas2}
-%\end{figure}
-
-\section{Finding genetic sub-structure}
-
-Now, we are ready for the second round of QC -- detection of 
-genetic outliers which may contaminate our results.
-We will detect genetic outliers using a  
-technique, which resembles the one suggested by Price at al. 
-
-As the first step, we will compute a matrix of genomic kinship between all 
-pairs of individuals, using only autosomal\footnote{the list of 
-autosomal markers contained in \texttt{data} is returned by the
-\texttt{autosomal(data)} function} 
-markers by
-<<>>=
-data1.gkin <- ibs(data1[, autosomal(data1)], weight="freq")
-@
-
-\begin{tip}
-This step may take few minutes on large data sets or 
-when using old computers!
-\end{tip}
-
-You can see the $5 \times 5$ upper left sub-matrix by
-<<>>=
-data1.gkin[1:5, 1:5]
-@
-
-The numbers below the diagonal show the genomic estimate 
-of kinship (aka 'genomic kinship' or 'genome-wide IBD'), 
-the numbers on the diagonal correspond to 0.5 plus the genomic 
-homozigosity, and 
-the numbers above the diagonal tell how many SNPs were typed 
-successfully for both subjects
-(thus the IBD estimate is derived using this number of SNPs).
-
-Second, we transform this matrix to a distance 
-matrix using standard \texttt{R} command
-<<>>=
-data1.dist <- as.dist(0.5-data1.gkin)
-@
-
-Finally, we perform Classical Multidimensional Scaling by
-<<>>=
-data1.mds <- cmdscale(data1.dist)
-@
-By default, the first two principal components are computed and 
-returned.
-
-\begin{tip}
-This may take few minutes on large data sets or 
-when using old computers!
-\end{tip}
-
-We can present the results graphically by 
-<<>>=
-plot(data1.mds)
-@
-The resulting plot is shown in Figure~\ref{fig:mds}.
-Each point on the plot corresponds to a person, and the 2D 
-distances between points were fitted to be as close as possible 
-to those presented in the original IBS matrix. You can see that 
-study subjects clearly cluster in two groups. 
-
-You can identify the points belonging to clusters by
-<<>>=
-km  <- kmeans(data1.mds, centers=2, nstart=1000)
-cl1 <- names(which(km$cluster==1))
-cl2 <- names(which(km$cluster==2))
-if (length(cl1) > length(cl2)) {x<-cl2; cl2<-cl1; cl1<-x}
-cl1
-cl2
-@
-Four outliers are shown in the smaller cluster.
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false>>=
-plot(data1.mds)
-points(data1.mds[cl1,], pch=19, col="red")
-@
-\caption{Mapping samples to the space of the first two Principle 
-Components resulting from analysis of genomic kinship. Red dots 
-identify genetic outliers.}
-\label{fig:mds}
-\end{figure}
-
-\begin{tip}
-Now you will need to use the BIGGER cluster for 
-to select study subjects. Whether this 
-will be \texttt{cl1} or \texttt{cl2} in you case, is totally 
-random. 
-\end{tip}
-
-We can form a data set which is free from outliers by
-using only people from the bigger cluster:
-<<>>=
-data2 <- data1[cl2, ]
-@
-
-After we dropped the outliers, we need to repeat QC using
-\texttt{check.markers()}. At this stage, we want to 
-allow for HWE checks (we will use only controls and 
-exclude markers with FDR $\le 0.2$): 
-<<>>=
-qc2 <- check.marker(data2, hweids=(phdata(data2)$dm2==0), fdr=0.2)
-summary(qc2)
-@
-\begin{tip}
-If the procedure did not run, check the previous Note.
-\end{tip}
-
-Indeed, in the updated data set several markers do not pass our 
-QC criteria and we need to drop a few markers. This is done by
-<<>>=
-data2 <- data2[qc2$idok, qc2$snpok]
-@
-
-This is going to be our final analysis data set, therefore 
-let us attach the phenotypic data to the search path, so we do not 
-need to type \texttt{phdata(data2)\$...} to access \texttt{dm2} status or other 
-variables:
-<<>>=
-detach(phdata(data1))
-attach(phdata(data2))
-####
-#ge03d2ex.clean <- data2
-#save(ge03d2ex.clean, file="ge03d2ex.clean.RData")
-####
-@
-
-Before proceeding to GWA, let us check if complete QC improved the fit 
-of genetic data to HWE:
-<<>>=
-descriptives.marker(data2)[2]
-descriptives.marker(data2[phdata(data2)$dm2==1, ])[2]
-descriptives.marker(data2[phdata(data2)$dm2==0, ])[2]
-@
-You can see that now there is no excessive number of SNPs 
-out of HWE in the sample (total, or cases, or controls)
-
-
-\section{GWA association analysis}
-
-Let us start again with descriptives of the phenotypic 
-and marker data
-<<>>=
-descriptives.trait(data2, by=dm2)
-@
-
-You can see that the relation to weight is maintained in this smaller, 
-but hopefully cleaner, data set; moreover, the relation to age becomes 
-borderline significant. 
-
-If you check descriptives of markers (only HWE part shown)
-<<>>=
-descriptives.marker(data2)[2]
-@
-you can see that the problems with HWE are apparently fixed; 
-we may guess that these were caused by Wahlund's effect.
-
-Run the score test on the cleaned data by
-<<>>=
-data2.qt <- qtscore(dm2, data2, trait="binomial")
-@
-and check lambda
-<<>>=
-lambda(data2.qt)
-@
-there is still some inflation, which is explained by the fact 
-that we investigate only a few short chromosomes with high LD and few 
-causative variants. 
-
-Produce the association analysis plot by
-<<>>=
-plot(data2.qt, df="Pc1df")
-@
-(Figure~\ref{fig:scan_QC}).
-
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false>>=
-plot(data2.qt, df="Pc1df")
-@
-\caption{$-\log_{10}(\text{Corrected }P\text{-value})$ from the genome scan 
-  after the QC procedure.}
-\label{fig:scan_QC}
-\end{figure}
-
-Produce the scan summary by
-<<>>=
-descriptives.scan(data2.qt, sort="Pc1df")
-@
-
-Comparison with the top 10 from the scan before QC shows that results changed 
-substantially with only few markers overlapping. 
-
-You can see similar results when assessing empirical GW 
-significance:
-<<>>=
-data2.qte <- qtscore(dm2, 
-                     data2, times=200, quiet=TRUE, trait="binomial")
-descriptives.scan(data2.qte, sort="Pc1df")
-@
-
-Again, none of the SNPs hits GW 5\% significance. Still,
-you can see that after QC the top markers achieve somewhat ``better''
-significance.
-
-In the last part, we will do several adjusted and stratified analyses.
-Only empirical $P$-values will be estimated to make the story shorter.
-To adjust for sex and age, we can 
-<<>>=
-data2.qtae <- qtscore(dm2~sex+age, 
-                      data2, times=200, quiet=TRUE, trait="binomial")
-descriptives.scan(data2.qtae)
-@
-You can see that there is little difference between adjusted and 
-unadjusted analysis, but this is not always the case; adjustment 
-may make your study much more powerful when covariates explain a 
-large proportion of environmental trait variation.
-
-Finally, let us do stratified (by BMI) analysis. We will contracts 
-obese ($\text{BMI} \ge 30$) cases to all controls. 
-<<>>=
-data2.qtse <- qtscore(dm2~sex+age,
-                      data2, ids=((bmi>30 & dm2==1) | dm2==0),
-                      times=200, quiet=TRUE, trait="binomial")
-descriptives.scan(data2.qtse, sort="Pc1df")
-@
-Again, nothing interesting at the GW significance level. If we would have had found 
-something, naturally, we would not have known if we mapped a T2D or obesity gene
-(or a gene for obesity in presence of T2D, or the one for T2D in presence 
-of obesity).
-
-Let us save the 'data1', 'data1.gkin', 'data2.qt' and 'cl1' objects now 
-<<>>=
-save(data1, cl1, data1.gkin, data2.qt, file="data1.RData")
-@
-These data will be used later 
-in section \ref{sec:strat}, "\nameref{sec:strat}", in which 
-we will perform GWA using different methods 
-to account for stratification, and will compare the results 
-with these obtaining by removing outliers ('data2.qt').
-
-Let us also save the cleaned 'data2' object, which will be later 
-used in section \ref{sec:meta} ("\nameref{sec:meta}"):
-<<>>=
-save(data2, file="data2.RData")
-@
-
-At this point, you acquired the knowledge necessary for the self-exercise. 
-Please close \texttt{R} by \texttt{q()} command and proceed to the next 
-section.
-
-
-\section{Genome-wide association analysis exercise}
-
-\begin{ExerciseList}
-During the exercise, you will work with a larger data set 
-(approximately 1,000 people and 7,000+ SNPs). You are to 
-do the complete three-round QC; perform GWA analysis with 
-\texttt{dm2} as the outcome of interest and identify 
-10 SNPs which you would like to take to the stage 2 (replication) 
-scan. You will do replication analysis using a 
-confirmatory data set. If you did everything right, the 
-SNPs which you identified as significant or replicated 
-will be located in known T2D genes. 
-
-Please keep in mind that the data are simulated, and do not 
-take your findings too seriously!
-
-Start \texttt{R} by going to "Start -> Programs -> R -> R-?.?.?". 
-Load the \GA{} library by
-%<<echo=false,results=hide>>=
-%rm(list=ls())
-%@
-<<>>=
-library(GenABEL)
-@
-
-The two data sets we will use in this exercise are part of the \GA{} 
-distribution. The first one (the "discovery" set) can be loaded by 
-<<>>=
-data(ge03d2)
-@
-
-Please move along the lines detailed in the guided exercise and 
-try to answer following questions:
-
-\Exercise How many cases and controls are present
-in the original data set?
-\Answer :
-<<>>=
-table(phdata(ge03d2)$dm2)
-@
-
-
-\Exercise How many markers are present
-in the original data set?
-\Answer :
-<<>>=
-nsnps(ge03d2)
-@
-
-
-\Exercise Is there evidence for inflation of the 
-HWE test statistics?
-\Answer Yes:
-<<>>=
-descriptives.marker(ge03d2)[2]
-@
-
-
-\Exercise Analyse empirical GW significance. How many SNPs 
-pass the genome-wide significance threshold, after correction for 
-the inflation factor? Write down the names of these SNPs for 
-further comparison.
-\Answer :
-<<>>=
-res0 <- qtscore(dm2, data=ge03d2, times=200, 
-                quiet=TRUE, trait="binomial")
-### something funny is going on here
-### disabeling next line
-#lambda(res0)
-ds <- descriptives.scan(res0)
-ds
-@
-Thus, there are no genome-wide empirically significant results.
-The 'top' 10 SNPs are
-<<>>=
-snps0 <- rownames(ds)
-snps0
-@
-(note that if the empirical $P = 1$, the rank is assigned quite arbitrarily)
-
-
-\Exercise Perform first steps of the genetic data QC.
-\Answer First step of QC
-<<>>=
-qc1 <- check.marker(ge03d2, call=0.95, perid.call=0.95, 
-                    p.level=0, ibs.exclude="both")
-summary(qc1)
-data1 <- ge03d2[qc1$idok, qc1$snpok]
-data1 <- Xfix(data1)
-qc2 <- check.marker(data1, call=0.95, perid.call=0.95, p.level=0)
-summary(qc2)
-@
-
-
-\Exercise How many males are 'genetically' females?
-\Answer The list of genetic females who are coded as males is 
-<<>>=
-qc1$isfemale
-@
- 
-\Exercise How many females are 'genetically' males?
-\Answer The list of genetic males who are coded as females is 
-<<>>=
-qc1$ismale
-@
- 
-\Exercise How many people are quessed to have 'XXY' genotype?
-\Answer The number of 'XXY' people is \Sexpr{length(qc1$isXXY)}
- 
-\Exercise How many sporadic X errors do you still 
-observe even when the female male and non-X X-markers are removed?
-(do not forget to \texttt{Xfix(s)} these
-\Answer Eight 'sporadic' X-errors are left after removing people 
-with likely sex code errors (seven in the data set after first step 
-of QC)
-
-\Exercise How many "twin" DNAs did you discover?
-\Answer The list of IDs failing IBS checks ('twin' DNAs) is 
-<<>>=
-qc1$ibsfail
-@
-
-\Exercise Perform second step of QC.
-\Answer The second step of QC:
-<<>>=
-data1.gkin <- ibs(data1[, autosomal(data1)], weight="freq")
-data1.dist <- as.dist(0.5 - data1.gkin)
-data1.mds <- cmdscale(data1.dist)
-km <- kmeans(data1.mds, centers=2, nstart=1000)
-cl1 <- names( which(km$cluster==1) )
-cl2 <- names( which(km$cluster==2) )
-if (length(cl1) > length(cl2)) {x<-cl2; cl2<-cl1; cl1<-x}
-cl1
-data2 <- data1[cl2,]
-qc2 <- check.marker(data2, hweids=(phdata(data2)$dm2==0), fdr=0.2)
-summary(qc2)
-data2 <- data2[qc2$idok, qc2$snpok]
-data2 <- Xfix(data2)
-####
-#ge03d2.clean <- data2
-#save(ge03d2.clean, file="ge03d2.clean.RData")
-####
-@
-
-\Exercise How many genetic outliers did you discover?
-\Answer The list of genetic outliers is 
-<<>>=
-cl1
-@
-
-\Exercise 
-How many cases and controls are presented 
-in the data after QC?
-\Answer :
-<<>>=
-table(phdata(data2)$dm2)
-@
-
-
-\Exercise How many markers are presented 
-in the data after QC?
-\Answer :
-<<>>=
-nsnps(data2)
-@
-
-
-\Exercise Is there evidence for inflation of the 
-HWE test staistics?
-\Answer No:
-<<>>=
-descriptives.marker(data2)[2]
-@
-
-FROM THIS POINT ON THE RUNNING OF THE TUTORIAL FAILS (2013.02.20,
-USING GENABEL DEV VERSION 1.7-4). I COMMENTED OUT THE R CODE BELOW. 
-
-\Exercise Perform GWA analysis of the cleaned data, 
-using asimptotic test and plot the results.
-What is the estimate of $\lambda$ for the 1 d.f.~test?
-\Answer :
-<<>>=
-#qts <- qtscore(dm2, data2, trait="binomial")
-#lambda(qts)
-@
-
-
-\Exercise Analyse empirical GW significance. How many SNPs 
-pass genome-wide significance threshold, after correction for 
-the inflation factor?
-\Answer : 
-<<>>=
-#res1 <- qtscore(dm2, data=data2, times=200, quiet=TRUE, trait="binomial")
-#ds1 <- descriptives.scan(res1)
-#ds1
-@
-There are SNPs which are empirically genome-wide significant
-in the data.
-To get the list of 'top' 10 SNPs:
-<<>>=
-#snps1 <- rownames(ds1)
-#snps1
-@
-
-
-\Exercise Do these SNPs overlap much with the ones 
-ranked at the top before the QC? If not, what could be the reason?
-\Answer There is little overlap between SNPs before 
-and after QC:
-<<>>=
-#snps0
-#snps1
-@
-
-
-\Exercise Select 10 SNPs which you would like to follow-up. Say, you've selected
-rs1646456, rs7950586, rs4785242, rs4435802, rs2847446, rs946364, rs299251, 
-rs2456488, rs1292700, and rs8183220.
-Make a vector of these SNPs with
-<<>>=
-#vec12<-c("rs1646456", "rs7950586", "rs4785242", "rs4435802", "rs2847446",
-#         "rs946364", "rs299251", "rs2456488", "rs1292700", "rs8183220")
-@ 
-Load the stage 2 (replicaton) data set by 
-<<>>=
-#data(ge03d2c)
-@
-and select the subset of SNPs you need by
-<<>>=
-#confdat <- ge03d2c[, vec12]
-@
-Analyse the \texttt{confdat} for association with \texttt{dm2}. 
-
-\Answer :
-<<>>=
-#data(ge03d2c)
-#snps1
-#confdat <- ge03d2c[, snps1]
-#rep <- qtscore(dm2, confdat, times=10000, quiet=TRUE)
-#descriptives.scan(rep)
-@
-
-
-\Exercise Given the two-stage design, and applying the 
-puristic criteria specified in the lecture, for how many SNPs 
-you can claim a significant finding?
-\Answer Two-stage $P$-value is
-<<>>=
-#    snps1
-#    finres <- matrix(NA, 10, 3)
-#    colnames(finres) <- c("Stage 1", "Replication", "Combined")
-#    rownames(finres) <- snps1
-#    for (i in 1:10) {
-#        finres[i, 1] <- res1[which(snpnames(data2)==snps1[i]), "Pc1df"]
-#        finres[i, 2] <- rep[which(snpnames(confdat)==snps1[i]), "P1df"]
-#        finres[i, 3] <- finres[i, 1]*finres[i, 2]
-#    }
-#    finres
-#    for (i in 1:10) {
-#        if (finres[i, 3] <= 0.05) {
-#            print(c("---------", rownames(finres)[i], "-------"))
-#            
-#            print(c(rownames(finres)[i], "stage 1:"))
-#            ph <- phdata(data2)$dm2
-#            gt <- as.numeric(data2[, rownames(finres)[i]])
-#            print(summary( glm(ph~gt, family=binomial) )$coef)
-#            
-#            print(c(rownames(finres)[i], "stage 2:"))
-#            ph <- phdata(confdat)$dm2
-#            gt <- as.numeric(confdat[, rownames(finres)[i]])
-#            print(summary(glm(ph~gt, family=binomial))$coef)
-#
-#            print(c(rownames(finres)[i], "Joint:"))
-#            ph <- c(phdata(data2)$dm2, phdata(confdat)$dm2)
-#            gt <- c(as.numeric(data2[, rownames(finres)[i]]), 
-#                     as.numeric(confdat[, rownames(finres)[i]]))
-#            print(summary( glm(ph~gt, family=binomial) )$coef)
-#        }
-#    }
-#replicatedsnps <- rownames(finres)[finres[, "Stage 1"] <= 0.05 & 
-#                                   finres[, "Replication"] <= 0.05 & 
-#                                   finres[, "Combined"] <= 0.05]
-#replicatedsnps
-#sigsnps <- rownames(finres)[finres[, "Combined"] <= 0.05]
-#sigsnps
-@
-At the first glance, NWSexpr{length(replicatedsnps)} SNPs  
-may be claimed as replicated 
-because both first stage and replication $P$-values are
-$\le 0.05$ and effects are consistent, and additionally   
-NWSexpr{length(sigsnps) - length(replicatedsnps)} may 
-be claimed as 'significant' because 
-joint $P$-values are $\le 0.05$ and the effects are consistent. 
-Generally, a more thorough simulation experiment should be performed.
-
-
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/genabel -r 2091


More information about the Genabel-commits mailing list