[Genabel-commits] r2091 - tutorials/GenABEL_general
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Tue Apr 6 00:06:09 CEST 2021
Author: lckarssen
Date: 2021-04-06 00:06:09 +0200 (Tue, 06 Apr 2021)
New Revision: 2091
Added:
tutorials/GenABEL_general/README
Removed:
tutorials/GenABEL_general/GWA.Rnw
tutorials/GenABEL_general/GWAprotocol.Rnw
tutorials/GenABEL_general/GWAsimple.Rnw
tutorials/GenABEL_general/GenABEL-tutorial.Rnw
tutorials/GenABEL_general/ImputedDataAnalysis.Rnw
tutorials/GenABEL_general/Makefile
tutorials/GenABEL_general/QC.Rnw
tutorials/GenABEL_general/Sweave.sty
tutorials/GenABEL_general/answers.Rnw
tutorials/GenABEL_general/answers2.Rnw
tutorials/GenABEL_general/assoc.Rnw
tutorials/GenABEL_general/bibliography.bib
tutorials/GenABEL_general/dataimport.Rnw
tutorials/GenABEL_general/dbasing.Rnw
tutorials/GenABEL_general/fetchData.Rnw
tutorials/GenABEL_general/genetics.bst
tutorials/GenABEL_general/impute.Rnw
tutorials/GenABEL_general/internals.Rnw
tutorials/GenABEL_general/intro.Rnw
tutorials/GenABEL_general/introR.Rnw
tutorials/GenABEL_general/meta.Rnw
tutorials/GenABEL_general/myxhtml.cfg
tutorials/GenABEL_general/preface.Rnw
tutorials/GenABEL_general/probabel.Rnw
tutorials/GenABEL_general/reg.Rnw
tutorials/GenABEL_general/siman.Rnw
tutorials/GenABEL_general/strat.Rnw
tutorials/GenABEL_general/strat0.Rnw
tutorials/GenABEL_general/workgwaaclass.Rnw
Log:
Remove the code of the GenABEL tutorial after moving to Github
Development will continue at https://github.com/GenABEL-Project/GenABEL_tutorial.
The remaining README file points to this location.
Deleted: tutorials/GenABEL_general/GWA.Rnw
===================================================================
--- tutorials/GenABEL_general/GWA.Rnw 2019-06-25 11:35:07 UTC (rev 2090)
+++ tutorials/GenABEL_general/GWA.Rnw 2021-04-05 22:06:09 UTC (rev 2091)
@@ -1,1038 +0,0 @@
-\chapter{Genome-wide association analysis}
-\label{sec:GWA}
-
-In the first parts of this section you will be guided
-through a GWA analysis of a small data set. In the
-last part you will investigate a larger data set by
-yourself, do a verification study and will answer the
-questions.
-All data sets used assume a study in a relatively
-homogeneous population.
-Try to finish the first part in the morning and the
-second part in the afternoon.
-
-Though only few thousands of markers located at four
-small chromosomes are used in the scan, we still going to
-call it Genome-Wide (GW), as the amount of data we
-will use is approaches the amount to be expected
-in a real experiment. However, because the regions
-are small, and the LD between SNPs is high,
-some specific features (\eg relatively high
-residual inflation, which occurs because
-large proportion of SNPs are in LD with
-the really associated ones)
-are specific
-features of this data set, which are not observed in
-true GWA studies.
-
-Start R and load the \GA{} library
-by typing
-<<results=hide>>=
-library(GenABEL)
-@
-and load the data which we will use in this section by
-<<>>=
-data(ge03d2ex)
-@
-
-Investigate the objects loaded by the command
-<<>>=
-ls()
-@
-
-The \texttt{ge03d2ex} is an object of the class \texttt{gwaa.data}:
-<<>>=
-class(ge03d2ex)
-@
-To check what are the names of variables in the phenotypic data frame, use
-<<>>=
-names(phdata(ge03d2ex))
-@
-We can attach this data frame to the \texttt{R} search path by
-<<>>=
-attach(phdata(ge03d2ex))
-@
-
-\section{Data descriptives and first round of GWA analysis}
-
-Let us investigate which traits are present in the loaded data frame
-and what are the characteristics of the distribution by using the
-\GA{} function \texttt{descriptive.trait}:
-<<>>=
-descriptives.trait(ge03d2ex)
-@
-
-You can see that the phenotypic frame contains data on 136 people; the data
-on sex, age, height, weight, diet and body mass index (BMI) are available.
-Our trait of interest is \texttt{dm2} (type 2 diabetes). Note that every
-single piece of information in this data set is simulated; however, we
-tried to keep our simulations in a way we think the control of T2D may work.
-
-You can produce a summary for cases and controls separately and
-compare distributions of the traits by
-<<>>=
-descriptives.trait(ge03d2ex, by=dm2)
-@
-Here, the \texttt{by} argument specifies the grouping variable.
-You can see that cases and controls are different in weight, which
-is expected, as T2D is associated with obesity.
-
-Similarly, you can produce grand GW descriptives of the marker data
-by using
-<<>>=
-descriptives.marker(ge03d2ex)
-@
-It is of note that we can see inflation of the proportion of the
-tests for HWE at a particular threshold, as compared to the expected.
-This may indicate poor genotyping quality and/or genetic stratification.
-
-We can test the GW marker characteristics in controls by
-<<>>=
-descriptives.marker(ge03d2ex, ids=(dm2==0))
-@
-Apparently, HWE distribution holds better in controls than in
-the total sample.
-
-Let us check whether there are indications that deviation from HWE
-is due to cases. At this stage we are only interested in the HWE distribution
-table, and therefore will ask to report the distrbution
-for cases (\texttt{dm2==1}) and report only table two:
-<<>>=
-descriptives.marker(ge03d2ex, ids=(dm2==1))[2]
-@
-and for the controls
-<<>>=
-descriptives.marker(ge03d2ex, ids=(dm2==0))[2]
-@
-It seems that indeed excessive number of markers are out of HWE in cases.
-If no laboratory procedure (\eg DNA extraction, genotyping, calling) were
-done for cases and controls separately, this may indicate possible
-genetic heterogeneity specific for cases.
-
-In essence, the '\texttt{descriptives.marker}' function
-uses the '\texttt{summary}' function to generate the HW $P$-values
-distribution. It may be interesting to generate this
-distribution using the '\texttt{summary}' function
-You do not need to do so, but this example shows how
-you can generate summaries from underlying SNP-tables.
-First, we need to compute summary SNP statistics by
-<<>>=
-s <- summary( gtdata( ge03d2ex[(dm2==1), ] ) )
-@
-
-Note the you have produced the summary for the \texttt{gtdata}
-slot of \texttt{ge03d2ex}; this is the slot which actually
-contain all genetic data in special compressed format.
-
-You can see the first 5 rows of this very long summary table by
-<<>>=
-s[1:5, ]
-@
-Note that the column 'Pexact' provides exact HWE test $P$-values we need.
-We can extract these to a separate vector by
-<<>>=
-pexcas <- s[, "Pexact"]
-@
-and do characterization of the cummulative distribution by
-<<>>=
-catable(pexcas, c(0.001,0.01,0.05,0.1,0.5), cumulative=TRUE)
-@
-
-You can generate the distribution for controls in similar manner.
-
-%%and produce the $\chi^2-\chi^2$ plot and estimate inflation
-%%factor by command \texttt{estlambda()}, which operates with a
-%%vector of P-values or $\chi^2$s:
-%%<<>>=
-%%estlambda(pexcas)
-%%@
-%%
-%%By default, this function also produces a $\chi^2-\chi^2$ plot, at
-%%which you can see some extreme deviation of observed from expected.
-%%The resulting plot
-%%(figure \ref{fig:chi2chi2_Pexcas}) shows extreme deviation for
-%%high values of the test.
-%%Looking at the $\lambda$ estimate, we indeed see inflation of the test statistics.
-%%\begin{figure}[t]
-%%\centering
-%%<<fig=true,echo=false,results=hide>>=
-%%estlambda(s[,"Pexact"])
-%%@
-%%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE.
-%%Black line of slope 1: expected under no inflation; Red line:
-%%fitted slope.}
-%%\label{fig:chi2chi2_Pexcas}
-%%\end{figure}
-%%
-%%You can repeat this test for the controls, if time permits.
-
-Let us first try do a GWA scan using the raw (before quality control) data.
-We will use the score test, as implemented in the
-\texttt{qtscore()}\footnote{consider '\texttt{mlreg}' function
-if you want to run a true linear regression}
-function of \GA{} for testing:
-<<>>=
-an0 <- qtscore(dm2, ge03d2ex, trait="binomial")
-@
-The first argument used describes the model; here it is rather
-simple --- the affection status, \texttt{dm2}, is supposed
-to depend on SNP genotype only.
-
-You can see what information is computed by this function by using
-<<>>=
-an0
-@
-Here, let us look at the 'Results table'. \texttt{P1df},
-\texttt{P2df} and \texttt{Pc1df} are most interesting; the first two
-are vectors of 1 and 2 d.f.~$P$-values obtained in the GWA analysis,
-the last one is 1 d.f.~$P$-value corrected for inflation factor
-$\lambda$ (which is in the \texttt{lambda}
-object). \texttt{effB} corresponds to the (approximate) Odds Ratio
-estimate for the SNP.
-
-Let us see if there is evidence for the inflation of the test
-statistics; for that let us obtain $\lambda$ with
-<<>>=
-lambda(an0)
-@
-The estimate of $\lambda$ is \Sexpr{round(lambda(an0)$est, dig=2)},
-suggesting inflation of the test and some degree of stratification.
-Though the value obtained seems to be small, it should be noted
-that $\lambda$ grows linearly with sample size, so for this small
-number of cases and controls the value is worrisome.
-
-The $\lambda$ is computed by regression in a Q-Q plot. Both estimation
-of $\lambda$ and production of the $\chi^2-\chi^2$ plot can be done
-using the \texttt{estlambda} function; this was already done automatically
-when running \texttt{qtscore} function, but let us repeat this manually:
-
-<<>>=
-estlambda(an0[, "P1df"], plot=TRUE)
-@
-The corresponding $\chi^2-\chi^2$ plot is shown in
-Figure~\ref{fig:chi2chi2_P1df}.
-
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false,results=hide>>=
-estlambda(an0[, "P1df"], plot=TRUE)
-@
-\caption{$\chi^2-\chi^2$ plot for a GWA scan.
-Black line of slope 1: expected under no inflation; Red line:
-fitted slope.}
-\label{fig:chi2chi2_P1df}
-\end{figure}
-
-
-\begin{tip}
-The 'se' produced by \texttt{estlambda} \emph{can not}
-be used to test if inflation is significant and make
-conclusions about the presence of significant or insignificant
-stratification.
-\end{tip}
-
-We can also present the obtained results using the
-''Manhatten plot'', where the SNP genomic position is on
-the horizontal axis and $-\log_{10}$ of the $P$-value is
-shown on the vertical axis:
-<<>>=
-plot(an0)
-@
-The resulting plot is show in Figure~\ref{fig:scan_initial}.
-By default, $-\log_{10}(P\text{-value})$ of the uncorrected 1 d.f.~test are
-shown; see thehelp to figure out how this behaviour can be changed.
-
-We can also add the corrected $P$-values to the plot with
-<<>>=
-add.plot(an0, df="Pc1df", col=c("lightblue", "lightgreen"))
-@
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false>>=
-par(mfcol=c(2,1))
-plot(an0, col=c("blue", "green"))
-plot(an0, df="Pc1df", col=c("blue", "green"))
-@
-\caption{$-\log_{10}(P\text{-value})$ from the genome scan before QC procedure.
-Raw analysis: darker circles; corrected analysis: lighter circles}
-\label{fig:scan_initial}
-\end{figure}
-You can see that the $P$-values corrected by genomic control
-are uniformly lower than the $P$-values from 'raw' analysis.
-This is to be expected as genomic control simply divides the
-'raw' $\chi^2$ statistics by a constant $\lambda$ for all
-SNPs.
-
-You can also generate a descriptive table for the "top" (as ranked by
-$P$-value) results by
-<<>>=
-descriptives.scan(an0)
-@
-or, equivalently, by '\texttt{summary(an0)}'
-
-Here you see top 10 results, sorted by $P$-value with 1 d.f. If you want
-to sort by the corrected $P$-value, you can use
-\texttt{descriptives.scan(an0, sort="Pc1df")}; to see more than 10
-(\eg 25) top results, use \texttt{descriptives.scan(an0, top=25)}.
-You can combine all these options. Large part of results reports NA
-as effect estimates and 9.99 as $P$-value for 2 d.f.~test --
-for these markers only two out of the three possible
-genotypes were observed, and consequently the 2 d.f.~test could not be
-performed.
-
-Now let us apply the \texttt{qtscore()} function with \texttt{times}
-argument, which tells it to compute empirical GW
-(or experiment-wise) significance
-<<>>=
-an0.e <- qtscore(dm2, ge03d2ex, times=200, quiet=TRUE)
-@
-(you may skip the 'quiet=TRUE' argument, then you will see progress)
-
-Now let us generate the summary of the results
-<<>>=
-descriptives.scan(an0.e, sort="Pc1df")
-@
-
-Experimental-wise significance is computed by an empirical
-procedure, thus we consider $P$-values $\le 0.05$ to be GW-significant.
-However, none of the SNPs hits GW significance.
-If, actually, any did pass the threshold, we
-could not trust the results, because the distribution of the HWE test
-and presence of inflation factor for the association test statistics
-suggest that the data may contain multiple errors (indeed they do).
-Therefore before association analysis we need to do
-rigorous Quality Control (QC).
-
-Note that at a certain SNP, the corrected $P$-values become equal to 1
--- at this point the order in the list is arbitrary because sorting
-could not be done.
-
-\begin{summary}
-\item The \texttt{descriptives} family of functions was developed
-to facilitate the production of tables which can be directly used
-in a manuscript --- it is possible to save the output as a file,
-which can be open by Excel or Word. See \eg
-\texttt{help(descriptives.trait)} for details.
-\item The inflation of test statistics compared to null (1 d.f.)
-may be estimated with \texttt{estlambda} function.
-\end{summary}
-
-
-\section{Genetic data QC}
-
-The major genetic data QC function of \GA{} is \texttt{check.marker()}.
-We will now use that to perform our data QC; the output is rather self-explaining.
-Because of possible genetic heterogeneity of the study data it
-is a good idea to skip Hardy-Weinberg checks in the first round of QC.
-This can be achieved by setting HWE $P$-value
-selection threshold to zero (\texttt{p.level=0}):
-<<>>=
-qc1 <- check.marker(ge03d2ex, p.level=0)
-@
-Note that normally you will \emph{NEVER} run
-this simple form of the QC function -- you should always
-provide a number of thresholds specific to the
-platform you used for genotyping. See help to
-\texttt{check.marker()} for detailed list of arguments.
-The default values used by the function are rather
-relaxed compared to the thresholds routinely used
-nowadays with most of the platforms.
-
-\begin{tip}
-The computation of all pairwise
-proportion of alleles identical-by-state (IBS)
-by \texttt{ibs()} function, which is also called by
-\texttt{check.marker()} may take quite some time,
-which is proportional to the square of the number of
-subjects. This is not
-a problem with the small number of people we use for
-this example or when modern computers are used. However,
-the computers in the computer room are very old.
-Therefore be prepared to wait for long time when you
-will do a self-exercise with 1,000 people.
-\end{tip}
-
-From the output you can see that QC starts with checking the data for
-SNPs and people with extremely low call rate. Six markers are
-excluded from further analysis due to very low call rate.
-Next, X-chromosomal errors are identified. The function
-finds out that all errors (heterozygous male X-genotypes) are due to two people with
-wrong sex assigned and one marker, which looks like an autosomal one.
-This actually could be a marker from the pseudoautosomal region, which
-should have been arranged as a separate "autosome".
-Nine people are found to have intermediate inbreeding at the
-X-chromosome and are also excluded from analysis.
-
-Then, the procedure finds the markers with low call rate ($\le 0.95$
-by default)
-across people, markers with low MAF (by default, low MAF is defined as
-less than a few copies of the rare allele, see help for details);
-people with low call rate (default value: $\le 0.95$) across SNPs, people with extreme
-heterozygosity (at FDR 0.01) and those who have GW IBS $\ge 0.95$.
-These default parameters may be changed if you wish (consult the help).
-
-Because some of the people fail to pass the tests, the data set is not
-guaranteed to be really "clean" after single iteration, \eg some marker
-may not pass the call threshold after we exclude few
-informative (but apparently having low quality) samples. Therefore the
-QC is repeated iteratively until no further errors are found.
-
-You can generate a short summary of the QC by marker and by person through
-<<>>=
-summary(qc1)
-@
-Note that the original data, \texttt{ge03d2ex}, are not modified during the
-procedure; rather, \texttt{check.markers()} generate a list of markers and
-people which pass or do not pass certain QC criteria. The objects returned
-by \texttt{check.markers()} are:
-<<>>=
-names(qc1)
-@
-
-The element \texttt{idok} provides the list of people who passed all QC criteria,
-and \texttt{snpok} provides the list of SNPs which passed all criteria.
-You can easily generate a new data set, which will consist only of these
-people and markers by
-<<>>=
-data1 <- ge03d2ex[qc1$idok, qc1$snpok]
-@
-
-If there are any residual sporadic X-errors (male heterozygosity), these
-can (and should!) be fixed (set to NA) by
-<<>>=
-data1 <- Xfix(data1)
-@
-Applying this function does not make any difference for the example data set,
-but you will need to use it for the bigger data set.
-
-At this point, we are ready to work with the new, cleaned, data set
-\texttt{data1}. However, if we try
-<<>>=
-length(dm2)
-@
-we can see that the original phenotypic data are attached to the search
-path (there are only \Sexpr{nids(data1)} people left in the 'clean' data set).
-Therefore we need to detach the data by
-<<>>=
-detach(phdata(ge03d2ex))
-@
-and attach new data by
-<<>>=
-attach(phdata(data1))
-@
-
-At this stage, let us check if the first round of QC improves the fit
-of genetic data to HWE, which may have been violated due to
-by genotyping errors which we hopefully (at least partly!) eliminated:
-<<>>=
-descriptives.marker(data1)[2]
-descriptives.marker(data1[dm2==1, ])[2]
-descriptives.marker(data1[dm2==0, ])[2]
-@
-You can see that the fit to HWE improved, but
-cases are still have excess number of markers
-out of HWE. This may be due to genetic
-sub-structure.
-%%estlambda(summary(data1 at gtdata[dm2==1,])[,"Pexact"])
-
-%Apparently, the distribution (figure \ref{fig:chi2chi2_Pexcas2}) looks
-%better (note the scale difference between the graphs), but the test
-%statistics is still inflated.
-%\begin{figure}[t]
-%\centering
-%<<fig=true,echo=false,results=hide>>=
-%estlambda(summary(data1 at gtdata[data1 at phdata$dm2==1,])[,"Pexact"])
-%@
-%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE.
-%Black line of slope 1: expected under no inflation; Red line:
-%fitted slope.}
-%\label{fig:chi2chi2_Pexcas2}
-%\end{figure}
-
-\section{Finding genetic sub-structure}
-
-Now, we are ready for the second round of QC -- detection of
-genetic outliers which may contaminate our results.
-We will detect genetic outliers using a
-technique, which resembles the one suggested by Price at al.
-
-As the first step, we will compute a matrix of genomic kinship between all
-pairs of individuals, using only autosomal\footnote{the list of
-autosomal markers contained in \texttt{data} is returned by the
-\texttt{autosomal(data)} function}
-markers by
-<<>>=
-data1.gkin <- ibs(data1[, autosomal(data1)], weight="freq")
-@
-
-\begin{tip}
-This step may take few minutes on large data sets or
-when using old computers!
-\end{tip}
-
-You can see the $5 \times 5$ upper left sub-matrix by
-<<>>=
-data1.gkin[1:5, 1:5]
-@
-
-The numbers below the diagonal show the genomic estimate
-of kinship (aka 'genomic kinship' or 'genome-wide IBD'),
-the numbers on the diagonal correspond to 0.5 plus the genomic
-homozigosity, and
-the numbers above the diagonal tell how many SNPs were typed
-successfully for both subjects
-(thus the IBD estimate is derived using this number of SNPs).
-
-Second, we transform this matrix to a distance
-matrix using standard \texttt{R} command
-<<>>=
-data1.dist <- as.dist(0.5-data1.gkin)
-@
-
-Finally, we perform Classical Multidimensional Scaling by
-<<>>=
-data1.mds <- cmdscale(data1.dist)
-@
-By default, the first two principal components are computed and
-returned.
-
-\begin{tip}
-This may take few minutes on large data sets or
-when using old computers!
-\end{tip}
-
-We can present the results graphically by
-<<>>=
-plot(data1.mds)
-@
-The resulting plot is shown in Figure~\ref{fig:mds}.
-Each point on the plot corresponds to a person, and the 2D
-distances between points were fitted to be as close as possible
-to those presented in the original IBS matrix. You can see that
-study subjects clearly cluster in two groups.
-
-You can identify the points belonging to clusters by
-<<>>=
-km <- kmeans(data1.mds, centers=2, nstart=1000)
-cl1 <- names(which(km$cluster==1))
-cl2 <- names(which(km$cluster==2))
-if (length(cl1) > length(cl2)) {x<-cl2; cl2<-cl1; cl1<-x}
-cl1
-cl2
-@
-Four outliers are shown in the smaller cluster.
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false>>=
-plot(data1.mds)
-points(data1.mds[cl1,], pch=19, col="red")
-@
-\caption{Mapping samples to the space of the first two Principle
-Components resulting from analysis of genomic kinship. Red dots
-identify genetic outliers.}
-\label{fig:mds}
-\end{figure}
-
-\begin{tip}
-Now you will need to use the BIGGER cluster for
-to select study subjects. Whether this
-will be \texttt{cl1} or \texttt{cl2} in you case, is totally
-random.
-\end{tip}
-
-We can form a data set which is free from outliers by
-using only people from the bigger cluster:
-<<>>=
-data2 <- data1[cl2, ]
-@
-
-After we dropped the outliers, we need to repeat QC using
-\texttt{check.markers()}. At this stage, we want to
-allow for HWE checks (we will use only controls and
-exclude markers with FDR $\le 0.2$):
-<<>>=
-qc2 <- check.marker(data2, hweids=(phdata(data2)$dm2==0), fdr=0.2)
-summary(qc2)
-@
-\begin{tip}
-If the procedure did not run, check the previous Note.
-\end{tip}
-
-Indeed, in the updated data set several markers do not pass our
-QC criteria and we need to drop a few markers. This is done by
-<<>>=
-data2 <- data2[qc2$idok, qc2$snpok]
-@
-
-This is going to be our final analysis data set, therefore
-let us attach the phenotypic data to the search path, so we do not
-need to type \texttt{phdata(data2)\$...} to access \texttt{dm2} status or other
-variables:
-<<>>=
-detach(phdata(data1))
-attach(phdata(data2))
-####
-#ge03d2ex.clean <- data2
-#save(ge03d2ex.clean, file="ge03d2ex.clean.RData")
-####
-@
-
-Before proceeding to GWA, let us check if complete QC improved the fit
-of genetic data to HWE:
-<<>>=
-descriptives.marker(data2)[2]
-descriptives.marker(data2[phdata(data2)$dm2==1, ])[2]
-descriptives.marker(data2[phdata(data2)$dm2==0, ])[2]
-@
-You can see that now there is no excessive number of SNPs
-out of HWE in the sample (total, or cases, or controls)
-
-
-\section{GWA association analysis}
-
-Let us start again with descriptives of the phenotypic
-and marker data
-<<>>=
-descriptives.trait(data2, by=dm2)
-@
-
-You can see that the relation to weight is maintained in this smaller,
-but hopefully cleaner, data set; moreover, the relation to age becomes
-borderline significant.
-
-If you check descriptives of markers (only HWE part shown)
-<<>>=
-descriptives.marker(data2)[2]
-@
-you can see that the problems with HWE are apparently fixed;
-we may guess that these were caused by Wahlund's effect.
-
-Run the score test on the cleaned data by
-<<>>=
-data2.qt <- qtscore(dm2, data2, trait="binomial")
-@
-and check lambda
-<<>>=
-lambda(data2.qt)
-@
-there is still some inflation, which is explained by the fact
-that we investigate only a few short chromosomes with high LD and few
-causative variants.
-
-Produce the association analysis plot by
-<<>>=
-plot(data2.qt, df="Pc1df")
-@
-(Figure~\ref{fig:scan_QC}).
-
-\begin{figure}[t]
-\centering
-<<fig=true,echo=false>>=
-plot(data2.qt, df="Pc1df")
-@
-\caption{$-\log_{10}(\text{Corrected }P\text{-value})$ from the genome scan
- after the QC procedure.}
-\label{fig:scan_QC}
-\end{figure}
-
-Produce the scan summary by
-<<>>=
-descriptives.scan(data2.qt, sort="Pc1df")
-@
-
-Comparison with the top 10 from the scan before QC shows that results changed
-substantially with only few markers overlapping.
-
-You can see similar results when assessing empirical GW
-significance:
-<<>>=
-data2.qte <- qtscore(dm2,
- data2, times=200, quiet=TRUE, trait="binomial")
-descriptives.scan(data2.qte, sort="Pc1df")
-@
-
-Again, none of the SNPs hits GW 5\% significance. Still,
-you can see that after QC the top markers achieve somewhat ``better''
-significance.
-
-In the last part, we will do several adjusted and stratified analyses.
-Only empirical $P$-values will be estimated to make the story shorter.
-To adjust for sex and age, we can
-<<>>=
-data2.qtae <- qtscore(dm2~sex+age,
- data2, times=200, quiet=TRUE, trait="binomial")
-descriptives.scan(data2.qtae)
-@
-You can see that there is little difference between adjusted and
-unadjusted analysis, but this is not always the case; adjustment
-may make your study much more powerful when covariates explain a
-large proportion of environmental trait variation.
-
-Finally, let us do stratified (by BMI) analysis. We will contracts
-obese ($\text{BMI} \ge 30$) cases to all controls.
-<<>>=
-data2.qtse <- qtscore(dm2~sex+age,
- data2, ids=((bmi>30 & dm2==1) | dm2==0),
- times=200, quiet=TRUE, trait="binomial")
-descriptives.scan(data2.qtse, sort="Pc1df")
-@
-Again, nothing interesting at the GW significance level. If we would have had found
-something, naturally, we would not have known if we mapped a T2D or obesity gene
-(or a gene for obesity in presence of T2D, or the one for T2D in presence
-of obesity).
-
-Let us save the 'data1', 'data1.gkin', 'data2.qt' and 'cl1' objects now
-<<>>=
-save(data1, cl1, data1.gkin, data2.qt, file="data1.RData")
-@
-These data will be used later
-in section \ref{sec:strat}, "\nameref{sec:strat}", in which
-we will perform GWA using different methods
-to account for stratification, and will compare the results
-with these obtaining by removing outliers ('data2.qt').
-
-Let us also save the cleaned 'data2' object, which will be later
-used in section \ref{sec:meta} ("\nameref{sec:meta}"):
-<<>>=
-save(data2, file="data2.RData")
-@
-
-At this point, you acquired the knowledge necessary for the self-exercise.
-Please close \texttt{R} by \texttt{q()} command and proceed to the next
-section.
-
-
-\section{Genome-wide association analysis exercise}
-
-\begin{ExerciseList}
-During the exercise, you will work with a larger data set
-(approximately 1,000 people and 7,000+ SNPs). You are to
-do the complete three-round QC; perform GWA analysis with
-\texttt{dm2} as the outcome of interest and identify
-10 SNPs which you would like to take to the stage 2 (replication)
-scan. You will do replication analysis using a
-confirmatory data set. If you did everything right, the
-SNPs which you identified as significant or replicated
-will be located in known T2D genes.
-
-Please keep in mind that the data are simulated, and do not
-take your findings too seriously!
-
-Start \texttt{R} by going to "Start -> Programs -> R -> R-?.?.?".
-Load the \GA{} library by
-%<<echo=false,results=hide>>=
-%rm(list=ls())
-%@
-<<>>=
-library(GenABEL)
-@
-
-The two data sets we will use in this exercise are part of the \GA{}
-distribution. The first one (the "discovery" set) can be loaded by
-<<>>=
-data(ge03d2)
-@
-
-Please move along the lines detailed in the guided exercise and
-try to answer following questions:
-
-\Exercise How many cases and controls are present
-in the original data set?
-\Answer :
-<<>>=
-table(phdata(ge03d2)$dm2)
-@
-
-
-\Exercise How many markers are present
-in the original data set?
-\Answer :
-<<>>=
-nsnps(ge03d2)
-@
-
-
-\Exercise Is there evidence for inflation of the
-HWE test statistics?
-\Answer Yes:
-<<>>=
-descriptives.marker(ge03d2)[2]
-@
-
-
-\Exercise Analyse empirical GW significance. How many SNPs
-pass the genome-wide significance threshold, after correction for
-the inflation factor? Write down the names of these SNPs for
-further comparison.
-\Answer :
-<<>>=
-res0 <- qtscore(dm2, data=ge03d2, times=200,
- quiet=TRUE, trait="binomial")
-### something funny is going on here
-### disabeling next line
-#lambda(res0)
-ds <- descriptives.scan(res0)
-ds
-@
-Thus, there are no genome-wide empirically significant results.
-The 'top' 10 SNPs are
-<<>>=
-snps0 <- rownames(ds)
-snps0
-@
-(note that if the empirical $P = 1$, the rank is assigned quite arbitrarily)
-
-
-\Exercise Perform first steps of the genetic data QC.
-\Answer First step of QC
-<<>>=
-qc1 <- check.marker(ge03d2, call=0.95, perid.call=0.95,
- p.level=0, ibs.exclude="both")
-summary(qc1)
-data1 <- ge03d2[qc1$idok, qc1$snpok]
-data1 <- Xfix(data1)
-qc2 <- check.marker(data1, call=0.95, perid.call=0.95, p.level=0)
-summary(qc2)
-@
-
-
-\Exercise How many males are 'genetically' females?
-\Answer The list of genetic females who are coded as males is
-<<>>=
-qc1$isfemale
-@
-
-\Exercise How many females are 'genetically' males?
-\Answer The list of genetic males who are coded as females is
-<<>>=
-qc1$ismale
-@
-
-\Exercise How many people are quessed to have 'XXY' genotype?
-\Answer The number of 'XXY' people is \Sexpr{length(qc1$isXXY)}
-
-\Exercise How many sporadic X errors do you still
-observe even when the female male and non-X X-markers are removed?
-(do not forget to \texttt{Xfix(s)} these
-\Answer Eight 'sporadic' X-errors are left after removing people
-with likely sex code errors (seven in the data set after first step
-of QC)
-
-\Exercise How many "twin" DNAs did you discover?
-\Answer The list of IDs failing IBS checks ('twin' DNAs) is
-<<>>=
-qc1$ibsfail
-@
-
-\Exercise Perform second step of QC.
-\Answer The second step of QC:
-<<>>=
-data1.gkin <- ibs(data1[, autosomal(data1)], weight="freq")
-data1.dist <- as.dist(0.5 - data1.gkin)
-data1.mds <- cmdscale(data1.dist)
-km <- kmeans(data1.mds, centers=2, nstart=1000)
-cl1 <- names( which(km$cluster==1) )
-cl2 <- names( which(km$cluster==2) )
-if (length(cl1) > length(cl2)) {x<-cl2; cl2<-cl1; cl1<-x}
-cl1
-data2 <- data1[cl2,]
-qc2 <- check.marker(data2, hweids=(phdata(data2)$dm2==0), fdr=0.2)
-summary(qc2)
-data2 <- data2[qc2$idok, qc2$snpok]
-data2 <- Xfix(data2)
-####
-#ge03d2.clean <- data2
-#save(ge03d2.clean, file="ge03d2.clean.RData")
-####
-@
-
-\Exercise How many genetic outliers did you discover?
-\Answer The list of genetic outliers is
-<<>>=
-cl1
-@
-
-\Exercise
-How many cases and controls are presented
-in the data after QC?
-\Answer :
-<<>>=
-table(phdata(data2)$dm2)
-@
-
-
-\Exercise How many markers are presented
-in the data after QC?
-\Answer :
-<<>>=
-nsnps(data2)
-@
-
-
-\Exercise Is there evidence for inflation of the
-HWE test staistics?
-\Answer No:
-<<>>=
-descriptives.marker(data2)[2]
-@
-
-FROM THIS POINT ON THE RUNNING OF THE TUTORIAL FAILS (2013.02.20,
-USING GENABEL DEV VERSION 1.7-4). I COMMENTED OUT THE R CODE BELOW.
-
-\Exercise Perform GWA analysis of the cleaned data,
-using asimptotic test and plot the results.
-What is the estimate of $\lambda$ for the 1 d.f.~test?
-\Answer :
-<<>>=
-#qts <- qtscore(dm2, data2, trait="binomial")
-#lambda(qts)
-@
-
-
-\Exercise Analyse empirical GW significance. How many SNPs
-pass genome-wide significance threshold, after correction for
-the inflation factor?
-\Answer :
-<<>>=
-#res1 <- qtscore(dm2, data=data2, times=200, quiet=TRUE, trait="binomial")
-#ds1 <- descriptives.scan(res1)
-#ds1
-@
-There are SNPs which are empirically genome-wide significant
-in the data.
-To get the list of 'top' 10 SNPs:
-<<>>=
-#snps1 <- rownames(ds1)
-#snps1
-@
-
-
-\Exercise Do these SNPs overlap much with the ones
-ranked at the top before the QC? If not, what could be the reason?
-\Answer There is little overlap between SNPs before
-and after QC:
-<<>>=
-#snps0
-#snps1
-@
-
-
-\Exercise Select 10 SNPs which you would like to follow-up. Say, you've selected
-rs1646456, rs7950586, rs4785242, rs4435802, rs2847446, rs946364, rs299251,
-rs2456488, rs1292700, and rs8183220.
-Make a vector of these SNPs with
-<<>>=
-#vec12<-c("rs1646456", "rs7950586", "rs4785242", "rs4435802", "rs2847446",
-# "rs946364", "rs299251", "rs2456488", "rs1292700", "rs8183220")
-@
-Load the stage 2 (replicaton) data set by
-<<>>=
-#data(ge03d2c)
-@
-and select the subset of SNPs you need by
-<<>>=
-#confdat <- ge03d2c[, vec12]
-@
-Analyse the \texttt{confdat} for association with \texttt{dm2}.
-
-\Answer :
-<<>>=
-#data(ge03d2c)
-#snps1
-#confdat <- ge03d2c[, snps1]
-#rep <- qtscore(dm2, confdat, times=10000, quiet=TRUE)
-#descriptives.scan(rep)
-@
-
-
-\Exercise Given the two-stage design, and applying the
-puristic criteria specified in the lecture, for how many SNPs
-you can claim a significant finding?
-\Answer Two-stage $P$-value is
-<<>>=
-# snps1
-# finres <- matrix(NA, 10, 3)
-# colnames(finres) <- c("Stage 1", "Replication", "Combined")
-# rownames(finres) <- snps1
-# for (i in 1:10) {
-# finres[i, 1] <- res1[which(snpnames(data2)==snps1[i]), "Pc1df"]
-# finres[i, 2] <- rep[which(snpnames(confdat)==snps1[i]), "P1df"]
-# finres[i, 3] <- finres[i, 1]*finres[i, 2]
-# }
-# finres
-# for (i in 1:10) {
-# if (finres[i, 3] <= 0.05) {
-# print(c("---------", rownames(finres)[i], "-------"))
-#
-# print(c(rownames(finres)[i], "stage 1:"))
-# ph <- phdata(data2)$dm2
-# gt <- as.numeric(data2[, rownames(finres)[i]])
-# print(summary( glm(ph~gt, family=binomial) )$coef)
-#
-# print(c(rownames(finres)[i], "stage 2:"))
-# ph <- phdata(confdat)$dm2
-# gt <- as.numeric(confdat[, rownames(finres)[i]])
-# print(summary(glm(ph~gt, family=binomial))$coef)
-#
-# print(c(rownames(finres)[i], "Joint:"))
-# ph <- c(phdata(data2)$dm2, phdata(confdat)$dm2)
-# gt <- c(as.numeric(data2[, rownames(finres)[i]]),
-# as.numeric(confdat[, rownames(finres)[i]]))
-# print(summary( glm(ph~gt, family=binomial) )$coef)
-# }
-# }
-#replicatedsnps <- rownames(finres)[finres[, "Stage 1"] <= 0.05 &
-# finres[, "Replication"] <= 0.05 &
-# finres[, "Combined"] <= 0.05]
-#replicatedsnps
-#sigsnps <- rownames(finres)[finres[, "Combined"] <= 0.05]
-#sigsnps
-@
-At the first glance, NWSexpr{length(replicatedsnps)} SNPs
-may be claimed as replicated
-because both first stage and replication $P$-values are
-$\le 0.05$ and effects are consistent, and additionally
-NWSexpr{length(sigsnps) - length(replicatedsnps)} may
-be claimed as 'significant' because
-joint $P$-values are $\le 0.05$ and the effects are consistent.
-Generally, a more thorough simulation experiment should be performed.
-
-
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/genabel -r 2091
More information about the Genabel-commits
mailing list