[Genabel-commits] r1137 - tutorials/GenABEL_general
noreply at r-forge.r-project.org
noreply at r-forge.r-project.org
Wed Mar 13 18:35:20 CET 2013
Author: yurii
Date: 2013-03-13 18:35:20 +0100 (Wed, 13 Mar 2013)
New Revision: 1137
Added:
tutorials/GenABEL_general/GWA.Rnw
tutorials/GenABEL_general/GWAprotocol.Rnw
tutorials/GenABEL_general/GWAsimple.Rnw
tutorials/GenABEL_general/GenABEL-tutorial.Rnw
tutorials/GenABEL_general/ImputedDataAnalysis.Rnw
tutorials/GenABEL_general/Makefile
tutorials/GenABEL_general/QC.Rnw
tutorials/GenABEL_general/Sweave.sty
tutorials/GenABEL_general/answers.Rnw
tutorials/GenABEL_general/answers2.Rnw
tutorials/GenABEL_general/assoc.Rnw
tutorials/GenABEL_general/bibliography.bib
tutorials/GenABEL_general/dataimport.Rnw
tutorials/GenABEL_general/dbasing.Rnw
tutorials/GenABEL_general/fetchData.Rnw
tutorials/GenABEL_general/genetics.bst
tutorials/GenABEL_general/impute.Rnw
tutorials/GenABEL_general/internals.Rnw
tutorials/GenABEL_general/intro.Rnw
tutorials/GenABEL_general/introR.Rnw
tutorials/GenABEL_general/meta.Rnw
tutorials/GenABEL_general/myxhtml.cfg
tutorials/GenABEL_general/preface.Rnw
tutorials/GenABEL_general/probabel.Rnw
tutorials/GenABEL_general/reg.Rnw
tutorials/GenABEL_general/siman.Rnw
tutorials/GenABEL_general/strat.Rnw
tutorials/GenABEL_general/strat0.Rnw
tutorials/GenABEL_general/workgwaaclass.Rnw
Log:
first version of GenABEL-general tutorial
Added: tutorials/GenABEL_general/GWA.Rnw
===================================================================
--- tutorials/GenABEL_general/GWA.Rnw (rev 0)
+++ tutorials/GenABEL_general/GWA.Rnw 2013-03-13 17:35:20 UTC (rev 1137)
@@ -0,0 +1,1038 @@
+\chapter{Genome-wide association analysis}
+\label{sec:GWA}
+
+In the first parts of this section you will be guided
+through a GWA analysis of a small data set. In the
+last part you will investigate a larger data set by
+yourself, do a verification study and will answer the
+questions.
+All data sets used assume a study in a relatively
+homogeneous population.
+Try to finish the first part in the morning and the
+second part in the afternoon.
+
+Though only few thousands of markers located at four
+small chromosomes are used in the scan, we still going to
+call it Genome-Wide (GW), as the amount of data we
+will use is approaches the amount to be expected
+in a real experiment. However, because the regions
+are small, and the LD between SNPs is high,
+some specific features (e.g. relatively high
+residual inflation, which occurs because
+large proportion of SNPs are in LD with
+the reuly associated ones)
+are specific
+features of this data set, which are not observed in
+true GWA studies.
+
+Start R and load \GA{} library
+by typing
+<<results=hide>>=
+library(GenABEL)
+@
+and load the data which we will use in this section by
+<<>>=
+data(ge03d2ex)
+@
+
+Investigate the objects loaded by command
+<<>>=
+ls()
+@
+
+The \texttt{ge03d2ex} is an object of the class \texttt{gwaa.data}:
+<<>>=
+class(ge03d2ex)
+@
+
+To check what are the names of variables in the phenotypic data frame, use
+<<>>=
+names(phdata(ge03d2ex))
+@
+
+We can attach this data frame
+to the \texttt{R} search path by
+<<>>=
+attach(phdata(ge03d2ex))
+@
+
+\section{Data descriptives and first round of GWA analysis}
+
+Let us investigate what are the traits presented in the data frame loaded
+and what are the characteristics of the distribution by using specific
+\GA{} function \texttt{descriptive.trait}:
+<<>>=
+descriptives.trait(ge03d2ex)
+@
+
+You can see that phenotypic frame contains the data on 136 people; the data
+on sex, age, height, weight, diet and body mass index (BMI) are available.
+Our trait of interest is \texttt{dm2} (type 2 diabetes). Note that every
+single piece of information in this data set is simulated; however, we
+tried to keep our simulations in a way we think the control of T2D may work.
+
+You can produce a summary for cases and controls separately and
+compare distributions of the traits by
+<<>>=
+descriptives.trait(ge03d2ex,by=dm2)
+@
+Here, the \texttt{by} argument specifies the grouping variable.
+You can see that cases and controls are different in weight, which
+is expected, as T2D is associated with obesity.
+
+Similarly, you can produce grand GW descriptives of the marker data
+by using
+<<>>=
+descriptives.marker(ge03d2ex)
+@
+It is of note that we can see inflation of the proportion of the
+tests for HWE at particular threshold, as compared to the expected.
+This may indicate poor genotyping quality and/or genetic stratification.
+
+We can test the GW marker characteristics in controls by
+<<>>=
+descriptives.marker(ge03d2ex,ids=(dm2==0))
+@
+Apparently, HWE distribution holds better in controls than in
+the total sample.
+
+Let us check whether there are indications that deviation from HWE
+is due to cases. At this stage we are only interested in HWE distribution
+table, and therefore will ask to report the distrbution
+for cases (\texttt{dm2==1}) and report only the table two:
+<<>>=
+descriptives.marker(ge03d2ex,ids=(dm2==1))[2]
+@
+and for the controls
+<<>>=
+descriptives.marker(ge03d2ex,ids=(dm2==0))[2]
+@
+
+It seems that indeed excessive number of markers are out of HWE in cases.
+If no laboratory procedure (e.g. DNA extraction, genotyping, calling) were
+done for cases and controls separately, this may indicate possible
+genetic heterogeneity specific for cases.
+
+In essence, '\texttt{descriptives.marker}' function
+uses '\texttt{summary}' function to generate HW P-values
+distribution. It may be interesting to generate this
+distribution using the '\texttt{summary}' function
+You do not need to do so, but this example shows how
+you can generate summaries from underlying SNP-tables.
+First, we need to
+compute summary SNP statistics by
+
+<<>>=
+s <- summary(gtdata(ge03d2ex[(dm2==1),]))
+@
+
+Note the you have produced the summary for the \texttt{gtdata}
+slot of \texttt{ge03d2ex}; this is the slot which actually
+contain all genetic data in special compressed format.
+
+You can see first 5 rows of this very long summary table by
+<<>>=
+s[1:5,]
+@
+Note that the column 'Pexact' provides exact HWE test $P$-values we need.
+We can extract these to a separate vector by
+<<>>=
+pexcas <- s[,"Pexact"]
+@
+and do characterization of the cummulative distribution by
+<<>>=
+catable(pexcas,c(0.001,0.01,0.05,0.1,0.5),cumulative=TRUE)
+@
+
+You can generate the distribution for controls in similar manner.
+
+%%and produce the $\chi^2-\chi^2$ plot and estimate inflation
+%%factor by command \texttt{estlambda()}, which operates with a
+%%vector of P-values or $\chi^2$s:
+%%<<>>=
+%%estlambda(pexcas)
+%%@
+%%
+%%By default, this function also produces a $\chi^2-\chi^2$ plot, at
+%%which you can see some extreme deviation of observed from expected.
+%%The resulting plot
+%%(figure \ref{fig:chi2chi2_Pexcas}) shows extreme deviation for
+%%high values of the test.
+%%Looking at the $\lambda$ estimate, we indeed see inflation of the test statistics.
+%%\begin{figure}[t]
+%%\centering
+%%<<fig=true,echo=false,results=hide>>=
+%%estlambda(s[,"Pexact"])
+%%@
+%%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE.
+%%Black line of slope 1: expected under no inflation; Red line:
+%%fitted slope.}
+%%\label{fig:chi2chi2_Pexcas}
+%%\end{figure}
+%%
+%%You can repeat this test for the controls, if time permits.
+
+Let us first try do GWA scan using the raw (before quality control) data.
+We will use the score test, as implemented in the
+\texttt{qtscore()}\footnote{consider '\texttt{mlreg}' function
+if you want to run true linear regression}
+function of \GA{} for testing:
+<<>>=
+an0 <- qtscore(dm2,ge03d2ex,trait="binomial")
+@
+The first argument used describes the model; here it is rather
+simple --- the affection status, \texttt{dm2}, is supposed
+to depend on SNP genotype only.
+
+You can see what information is computed by this function by using
+<<>>=
+an0
+@
+Here, let us look at the 'Results table'.
+\texttt{P1df}, \texttt{P2df} and \texttt{Pc1df} are most
+interesting; the first two are vectors of 1 and 2 d.f. $P$-values
+obtained in the GWA analysis, the last one is 1 d.f.
+P-value corrected for inflation factor $\lambda$ (which is
+presented in \texttt{lambda} object). \texttt{effB} corresponds to
+(approximate) Odds Ratios estimate for the SNP.
+
+Let us see if there is evidence for the inflation of the test
+statistics; for that let us obtain $\lambda$ with
+<<>>=
+lambda(an0)
+@
+The estimate of $\lambda$ is \Sexpr{round(lambda(an0)$est,dig=2)},
+suggesting inflation of the test and some degree of stratification.
+Though the value abtained seems to be small, it should be noted
+that $\lambda$ grows linearly with smaple size, so for this small
+number of cases and controls the value is worrisome.
+
+The $\lambda$ is computed by regression in a Q-Q plot. Both estimation
+ of $\lambda$ and production of the $\chi^2-\chi^2$ plot can be done
+using the \texttt{estlambda} function; this was already done automatically
+when running \texttt{qtscore} function, but let us repeat this manually:
+
+<<>>=
+estlambda(an0[,"P1df"],plot=TRUE)
+@
+The corresponding $\chi^2-\chi^2$ plot is presented in Figure
+\ref{fig:chi2chi2_P1df}.
+
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false,results=hide>>=
+estlambda(an0[,"P1df"],plot=TRUE)
+@
+\caption{$\chi^2-\chi^2$ plot for a GWA scan.
+Black line of slope 1: expected under no inflation; Red line:
+fitted slope.}
+\label{fig:chi2chi2_P1df}
+\end{figure}
+
+
+\begin{tip}
+The 'se' produced by \texttt{estlambda} \emph{can not}
+be used to test if inflation is significant and make
+conclusions about presence of significant or insignificant
+stratification.
+\end{tip}
+
+We can also present the obtained results using the
+''Manhatten plot'', where a SNP genomic position is on
+the X-axes and $-log_{10}$ of the $P$-value is
+shown on Y-axes:
+<<>>=
+plot(an0)
+@
+The resulting plot is presented in the figure \ref{fig:scan_initial}.
+By default, $-log_{10}(P-value)$ of not corrected 1 d.f. test are
+presented; see help to figure out how this behaviour can be changed.
+
+We can also add the corrected P-values to the plot with
+<<>>=
+add.plot(an0,df="Pc1df",col=c("lightblue","lightgreen"))
+@
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false>>=
+plot(an0,col=c("blue","green"))
+add.plot(an0,df="Pc1df",col=c("lightblue","lightgreen"))
+@
+\caption{$-log_{10}(P-value)$ from the genome scan before QC procedure.
+Raw analysis: darker circles; corrected analysis: lighter circles}
+\label{fig:scan_initial}
+\end{figure}
+You can see that $P$-values corrected by genoic control
+are uniformly lower than the $P$-values from 'raw' analysis.
+This is to be expected as genomic control simply divides the
+'raw' $\chi^2$ statistics by a constant $\lambda$ for all
+SNPs.
+
+You can also generate a descriptive table for the "top" (as ranked by
+P-value) results by
+<<>>=
+descriptives.scan(an0)
+@
+or, equivalently, by '\texttt{summary(an0)}'
+
+Here you see top 10 results, sorted by P-value with 1 d.f. If you want
+to sort by the corrected P-value, you can use
+\texttt{descriptives.scan(an0,sort="Pc1df")}; to see more then 10
+(e.g. 25) top results, use \texttt{descriptives.scan(an0,top=25)}.
+You can combine all these options. Large part of results reports NA
+as effect estimates and 9.99 as \textit{P-value} for 2 d.f. test --
+for these markers only two out of three possible
+genotypes were observed, and consequently 2 d.f. test could not be
+performed.
+
+Now let us apply \texttt{qtscore()} function with \texttt{times}
+argument, which tells it to compute empirical GW
+(or experiment-wise) significance
+<<>>=
+an0.e <- qtscore(dm2,ge03d2ex,times=200,quiet=TRUE)
+@
+(you may skip the 'quiet=TRUE' argument, then you will see progress)
+
+Now let us generate the summary of the results
+<<>>=
+descriptives.scan(an0.e,sort="Pc1df")
+@
+
+Experimental-wise significance is computed by the empirical
+procedure, thus we consider $P$-values $\le 0.05$ to be GW-significant.
+However, none of the SNPs hits GW significance.
+If, actually, any did pass the threshold, we
+could not trust the results, because the distribution of the HWE test
+and presence of inflation factor for the association test statistics
+suggest that the data may contain multiple errors (indeed they do).
+Therefore before association analysis we need to do
+rigorous Quality Control (QC).
+
+Note that at certain SNP, corrected \textit{P-values} become equal
+to 1 -- at this point order is arbitrary because sorting could not be done.
+
+\begin{summary}
+\item The \texttt{descriptives} family of functions was developed
+to facilitate production of tables which can be directly used
+in a manuscript --- it is possible to save the output as a file,
+which can be open by Excel or Word. See e.g.
+\texttt{help(descriptives.trait)} for details.
+\item The inflation of test statistics compared to null (1 d.f.)
+may be estimated with \texttt{estlambda} function.
+\end{summary}
+
+\section{Genetic data QC}
+
+The major genetic data QC function of \GA{} is \texttt{check.marker()}.
+We will now use that to perform our data QC; the output is rather self-explaining.
+Because of possible genetic heterogeneity of the study data it
+is good idea to skip Hardy-Weinberg checks in the first round of QC.
+This can be achieved by setting HWE P-value
+selection threshold to zero (\texttt{p.level=0}):
+<<>>=
+qc1 <- check.marker(ge03d2ex,p.level=0)
+@
+Note that normally you will \emph{NEVER} run
+this simple form of the QC function -- you should always
+provide a number of thresholds specific to the
+platform you used for genotyping. See help to
+\texttt{check.marker()} for detailed list of arguments.
+Default values used by the function are rather
+relaxed compared to the thresholds routinely used
+nowadays with most of the platforms.
+
+\begin{tip}
+The computation of all pairwise
+proportion of alleles identical-by-state (IBS)
+by \texttt{ibs()} function, which is also called by
+\texttt{check.marker()} may take quite some time,
+which is proportional to the square of the number of
+subjects. This is not
+a problem with the small number of people we use for
+this example or when modern computers are used. However,
+the computers in the Nihes computer room are very old.
+Therefore be prepared to wait for long time when you
+will do a self-exercise with 1,000 people.
+\end{tip}
+
+From the output you can see that QC starts with checking the data for
+SNPs and people with extremely low call rate. Six markers are
+excluded from further analysis due to very low call rate.
+Next, X-chromosomal errors are identified. The function
+finds out that all errors (heterozygous male X-genotypes) are due to two people with
+wrong sex assigned and one marker, which looks like an autosomal one.
+This actually could be a marker from pseudoautosomal region, which
+should have been arranged as a separate "autosome".
+Nine people are found to have intermediate inbreeding at the
+X-chromosome and also excluded from analysis.
+
+Then, the procedure finds the markers with low call rate ($\le 0.95$
+by default)
+across people, markers with low MAF (by default, low MAF is defined as
+less than few copies of the rare allele, see help for details);
+people with low call rate (default value: $\le 0.95$) across SNPs, people with extreme
+heterozygosity (at FDR 0.01) and these who have GW IBS $\ge 0.95$.
+These default parameters may be changed if you wish (consult help).
+
+Because some of the people fail to pass the tests, the data set is not
+guaranteed to be really "clean" after single iteration, e.g. some marker
+may not pass the call threshold after we exclude few
+informative (but apparently having low quality) samples. Therefore the
+QC is repeated iteratively until no further errors are found.
+
+You can generate short summary of QC by marker and by person through
+<<>>=
+summary(qc1)
+@
+
+Note that the original data, \texttt{ge03d2ex}, are not modified during the
+procedure; rather, \texttt{check.markers()} generate a list of markers and
+people which pass or do not pass certain QC criteria. The objects returned
+by \texttt{check.markers()} are:
+<<>>=
+names(qc1)
+@
+
+The element \texttt{idok} provides the list of people who passed all QC criteria,
+and \texttt{snpok} provides the list of SNPs which passed all criteria.
+You can easily generate a new data set, which will consist only of these
+people and markers by
+<<>>=
+data1 <- ge03d2ex[qc1$idok,qc1$snpok]
+@
+
+If there are any residual sporadic X-errors (male heterozygosity), these
+can (and should!) be fixed (set to NA) by
+<<>>=
+data1 <- Xfix(data1)
+@
+Applying this function does not make any difference for the example data set,
+but you will need to use it for the bigger data set.
+
+At this point, we are ready to work with the new, cleaned, data set
+\texttt{data1}. However, if we try
+<<>>=
+length(dm2)
+@
+we can see that the original phenotypic data are attached to the search
+path (there are only \Sexpr{nids(data1)} people left in the 'clean' data set).
+Therefore we need to detach the data by
+<<>>=
+detach(phdata(ge03d2ex))
+@
+and attach new data by
+<<>>=
+attach(phdata(data1))
+@
+
+At this stage, let us check if the first round of QC improves the fit
+of genetic data to HWE, which may have been violated due to
+by genotyping errors which we hopefully (at least partly!) eliminated:
+<<>>=
+descriptives.marker(data1)[2]
+descriptives.marker(data1[dm2==1,])[2]
+descriptives.marker(data1[dm2==0,])[2]
+@
+
+You can see that the fit to HWE improved, but
+cases are still have excess number of markers
+out of HWE. This may be due to genetic
+sub-structure.
+%%estlambda(summary(data1 at gtdata[dm2==1,])[,"Pexact"])
+
+%Apparently, the distribution (figure \ref{fig:chi2chi2_Pexcas2}) looks
+%better (note the scale difference between the graphs), but the test
+%statistics is still inflated.
+%\begin{figure}[t]
+%\centering
+%<<fig=true,echo=false,results=hide>>=
+%estlambda(summary(data1 at gtdata[data1 at phdata$dm2==1,])[,"Pexact"])
+%@
+%\caption{$\chi^2-\chi^2$ plot for the exact test for HWE.
+%Black line of slope 1: expected under no inflation; Red line:
+%fitted slope.}
+%\label{fig:chi2chi2_Pexcas2}
+%\end{figure}
+
+\section{Finding genetic sub-structure}
+
+Now, we are ready for the second round of QC -- detection of
+genetic outliers which may contaminate our results.
+We will detect genetic outliers using a
+technique, which resembles the one suggested by Price at al.
+
+As the first step, we will compute a matrix of genomic kinship between all
+pairs of people, using only autosomal\footnote{the list of
+autosomal markers contained in \texttt{data} is returned by
+\texttt{autosomla(data)} function}
+markers by
+<<>>=
+data1.gkin <- ibs(data1[,autosomal(data1)],weight="freq")
+@
+
+\begin{tip}
+This step may take few minutes on large data sets or
+when using old computers!
+\end{tip}
+
+You can see the 5x5 upper left sub-matrix by
+<<>>=
+data1.gkin[1:5,1:5]
+@
+
+The numbers below the diagonal show the genomic estimate
+of kinship (aka 'genomic kinship' or 'genome-wide IBD'),
+the numbers on the diagonal correspond to 0.5 plus the genomic
+homozigosity, and
+the numbers above the diagonal tell how many SNPs were typed
+successfully for both subjects
+(thus the IBD estimate is derived using this number of SNPs).
+
+Second, we transform this matrix to a distance
+matrix using standard \texttt{R} command
+<<>>=
+data1.dist <- as.dist(0.5-data1.gkin)
+@
+
+Finally, we perform Classical Multidimensional Scaling by
+<<>>=
+data1.mds <- cmdscale(data1.dist)
+@
+by default, the first two principal components are computed and
+returned.
+
+\begin{tip}
+This may take few minutes on large data sets or
+when using old computers!
+\end{tip}
+
+We can present the results graphically by
+<<>>=
+plot(data1.mds)
+@
+
+The resulting plot is presented in figure \ref{fig:mds}.
+Each point on the plot corresponds to a person, and the 2D
+distances between points were fitted to be as close as possible
+to these presented in the original IBS matrix. You can see that
+study subjects clearly cluster in two groups.
+
+You can identify the points belonging to clusters by
+<<>>=
+km <- kmeans(data1.mds,centers=2,nstart=1000)
+cl1 <- names(which(km$cluster==1))
+cl2 <- names(which(km$cluster==2))
+if (length(cl1)>length(cl2)) {x<-cl2;cl2<-cl1;cl1<-x}
+cl1
+cl2
+@
+Four outliers are presented in the smaller cluster.
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false>>=
+plot(data1.mds)
+points(data1.mds[cl1,],pch=19,col="red")
+@
+\caption{Mapping samples on the space of the first two Principle
+Components resulting from analysis of genomic kinship. Red dots
+identify genetic outliers}
+\label{fig:mds}
+\end{figure}
+
+\begin{tip}
+Now you will need to use the BIGGER cluster for
+to select study subjects. Whether this
+will be \texttt{cl1} or \texttt{cl2} in you case, is totally
+random.
+\end{tip}
+
+We can form a data set which is free from outliers by
+using only people from the bigger cluster:
+<<>>=
+data2 <- data1[cl2,]
+@
+
+After we dropped the outliers, we need to repeat QC using
+\texttt{check.markers()}. At this stage, we want to
+allow for HWE checks (we will use only controls and
+exclude markers with FDR $\le 0.2$):
+<<>>=
+qc2 <- check.marker(data2,hweids=(phdata(data2)$dm2==0),fdr=0.2)
+summary(qc2)
+@
+\begin{tip}
+If the procedure did not run, check previous Note.
+\end{tip}
+
+Indeed, in the updated data set few markers do not pass our
+QC criteria and we need to drop a few markers. This is done by
+<<>>=
+data2 <- data2[qc2$idok,qc2$snpok]
+@
+
+This is going to be our final analysis data set, therefore
+let us attach the phenotypic data to the search path, then we do not
+need to type \texttt{phdata(data2)\$...} to access \texttt{dm2} status or other
+variables:
+<<>>=
+detach(phdata(data1))
+attach(phdata(data2))
+####
+#ge03d2ex.clean <- data2
+#save(ge03d2ex.clean,file="ge03d2ex.clean.RData")
+####
+@
+
+Before proceeding to GWA, let us check if complete QC improved the fit
+of genetic data to HWE:
+<<>>=
+descriptives.marker(data2)[2]
+descriptives.marker(data2[phdata(data2)$dm2==1,])[2]
+descriptives.marker(data2[phdata(data2)$dm2==0,])[2]
+@
+
+You can see that now there is no excessive number of SNPs
+out of HWE in the sample (total, or cases, or controls)
+
+\section{GWA association analysis}
+
+Let us start again with descriptives of the phenotypic
+and marker data
+<<>>=
+descriptives.trait(data2,by=dm2)
+@
+
+You can see that relation to weight is maintained in this smaller,
+but hopefully cleaner, data set; moreover, relation to age becomes
+boundary significant.
+
+If you check descriptives of markers (only HWE part shown)
+<<>>=
+descriptives.marker(data2)[2]
+@
+you can see that the problems with HWE are apparently fixed;
+we may guess that these were caused by the Wahlund's effect.
+
+Run the score test on the cleaned data by
+
+<<>>=
+data2.qt <- qtscore(dm2,data2,trait="binomial")
+@
+and check lambda
+<<>>=
+lambda(data2.qt)
+@
+there is still some inflation, which is explained by the fact
+that we investigate only few short chromosomes with high LD and few
+causative variants.
+
+Produce the association analysis plot by
+<<>>=
+plot(data2.qt,df="Pc1df")
+@
+(figure \ref{fig:scan_QC}).
+
+\begin{figure}[t]
+\centering
+<<fig=true,echo=false>>=
+plot(data2.qt,df="Pc1df")
+@
+\caption{$-log_{10}(Corrected P-value)$ from the genome scan after the QC procedure}
+\label{fig:scan_QC}
+\end{figure}
+
+Produce the scan summary by
+<<>>=
+descriptives.scan(data2.qt,sort="Pc1df")
+@
+
+Comparison with the top 10 from the scan before QC shows that results changed
+substantially with only few markers overlapping.
+
+You can see similar results when accessing empirical GW
+significance:
+<<>>=
+data2.qte <- qtscore(dm2,data2,times=200,quiet=TRUE,trait="binomial")
+descriptives.scan(data2.qte,sort="Pc1df")
+@
+
+Again, none of the SNPs hits GW 5\% significance. Still,
+you can see that after QC top markers achieve somewhat "better"
+significance.
+
+In the last part, we will do several adjusted and stratified analyses.
+Only empirical P-values will be estimated to make the story shorter.
+To adjust for sex and age, we can
+<<>>=
+data2.qtae <- qtscore(dm2~sex+age,data2,times=200,quiet=T,trait="binomial")
+descriptives.scan(data2.qtae)
+@
+
+You can see that there is little difference between adjusted and
+unadjusted analysis, but this is not always the case; adjustment
+may make your study much more powerful when covariates explain a
+large proportion of environmental trait variation.
+
+Finally, let us do stratified (by BMI) analysis. We will contracts
+obese ($BMI \ge 30$) cases to all controls.
+<<>>=
+data2.qtse <- qtscore(dm2~sex+age,data2,ids=((bmi>30 & dm2==1) | dm2==0),times=200,quiet=TRUE,trait="binomial")
+descriptives.scan(data2.qtse,sort="Pc1df")
+@
+
+Again, noting interesting at GW significance level. If we would have had found
+something, naturally, we would not known if we mapped a T2D or obesity gene
+(or a gene for obesity in presence of T2D, or the one for T2D in presence
+of obesity).
+
+Let us save 'data1', 'data1.gkin', 'data2.qt' and 'cl1' objects now
+<<>>=
+save(data1,cl1,data1.gkin,data2.qt,file="data1.RData")
+@
+These data will be used later
+in section \ref{sec:strat}, "\nameref{sec:strat}", in which
+we will perform GWA using different methods
+to account for strtatification, and will compare the results
+with these obtaining by removing outliers ('data2.qt').
+
+Let us also save the cleaned 'data2' object, which will be later
+used in section \ref{sec:meta} ("\nameref{sec:meta}"):
+<<>>=
+save(data2,file="data2.RData")
+@
+
+At this point, you acquired the knowledge necessary for the self-exercise.
+Please close \texttt{R} by \texttt{q()} command and proceed to the next
+section.
+
+\section{Genome-wide association analysis exercise}
+
+\begin{ExerciseList}
+During the exercise, you will work with a larger data set
+(approximately 1,000 people and 7,000+ SNPs). You are to
+do complete three-round QC; perform GWA analysis with
+\texttt{dm2} as the outcome of interest and identify
+10 SNPs which you would like to take to the stage 2 (replication)
+scan. You will do replication analysis using a
+confirmatory data set. If you did everything right, the
+SNPs which you identified as significant or replicated
+will be located in know T2D genes.
+
+Please keep in mind that the data are simulated, and do not
+take your findings too seriously!
+
+Start \texttt{R} by going to "Start -> Programs -> R -> R-?.?.?".
+Load \GA{} library by
+%<<echo=false,results=hide>>=
+%rm(list=ls())
+%@
+<<>>=
+library(GenABEL)
+@
+
+The two data sets we will use in this exercise are part of the \GA{}
+distribution. The first one ("discovery" set) can be loaded by
+<<>>=
+data(ge03d2)
+@
+
+Please move along the lines detailed in the guided exercise and
+try to answer following questions:
+
+\Exercise How many cases and controls are presented
+in the original data set?
+\Answer :
+<<>>=
+table(phdata(ge03d2)$dm2)
+@
+
+
+\Exercise How many markers are presented
+in the original data set?
+\Answer :
+<<>>=
+nsnps(ge03d2)
+@
+
+
+\Exercise Is there evidence for inflation of the
+HWE test staistics?
+\Answer Yes:
+<<>>=
+descriptives.marker(ge03d2)[2]
+@
+
+
+\Exercise Analyse empirical GW significance. How many SNPs
+pass genome-wide significance threshold, after correction for
+the inflation factor? Write down the names of these SNPs for
+further comparison.
+\Answer :
+<<>>=
+res0 <- qtscore(dm2,data=ge03d2,times=200,quiet=TRUE,trait="binomial")
+lambda(res0)
+ds <- descriptives.scan(res0)
+ds
+@
+Thus, there are no genome-wide empirically significant results.
+The 'top' 10 SNPs are
+<<>>=
+snps0 <- rownames(ds)
+snps0
+@
+(note that if empirical P is 1, the rank is assigned quite arbitrarily)
+
+
+\Exercise Perform first steps of the genetic data QC.
+\Answer First step of QC
+<<>>=
+qc1 <- check.marker(ge03d2,call=0.95,perid.call=0.95,p.level=0,ibs.exclude="both")
+summary(qc1)
+data1 <- ge03d2[qc1$idok,qc1$snpok]
+data1 <- Xfix(data1)
+qc2 <- check.marker(data1,call=0.95,perid.call=0.95,p.level=0)
+summary(qc2)
+@
+
+
+\Exercise How many males are 'genetically' females?
+\Answer The list of genetic females who are coded as males is
+<<>>=
+qc1$isfemale
+@
+
+\Exercise How many females are 'genetically' males?
+\Answer The list of genetic males who are coded as females is
+<<>>=
+qc1$ismale
+@
+
+\Exercise How many people are quessed to have 'XXY' genotype?
+\Answer The number of 'XXY' people is \Sexpr{length(qc1$isXXY)}
+
+\Exercise How many sporadic X errors do you still
+observe even when the female male and non-X X-markers are removed?
+(do not forget to \texttt{Xfix()} these
+\Answer Eight 'sporadic' X-errors are left after removing people
+with likely sex code errors (seven in the data set after first step
+of QC)
+
+\Exercise How many "twin" DNAs did you discover?
+\Answer The list of IDs failing IBS checks ('twin' DNAs) is
+<<>>=
+qc1$ibsfail
+@
+
+\Exercise Perform second step of QC.
+\Answer The second step of QC:
+<<>>=
+data1.gkin <- ibs(data1[,autosomal(data1)],weight="freq")
+data1.dist <- as.dist(0.5-data1.gkin)
+data1.mds <- cmdscale(data1.dist)
+km <- kmeans(data1.mds,centers=2,nstart=1000)
+cl1 <- names(which(km$cluster==1))
+cl2 <- names(which(km$cluster==2))
+if (length(cl1)>length(cl2)) {x<-cl2;cl2<-cl1;cl1<-x}
+cl1
+data2 <- data1[cl2,]
+qc2 <- check.marker(data2,hweids=(phdata(data2)$dm2==0),fdr=0.2)
+summary(qc2)
+data2 <- data2[qc2$idok,qc2$snpok]
+data2 <- Xfix(data2)
+####
+#ge03d2.clean <- data2
+#save(ge03d2.clean,file="ge03d2.clean.RData")
+####
+@
+
+\Exercise How many genetic outliers did you discover?
+\Answer The list of genetic outliers is
+<<>>=
+cl1
+@
+
+\Exercise
+How many cases and controls are presented
+in the data after QC?
+\Answer :
+<<>>=
+table(phdata(data2)$dm2)
+@
+
+
+\Exercise How many markers are presented
+in the data after QC?
+\Answer :
+<<>>=
+nsnps(data2)
+@
+
+
+\Exercise Is there evidence for inflation of the
+HWE test staistics?
+\Answer No:
+<<>>=
+descriptives.marker(data2)[2]
+@
+
+FROM THIS POINT ON THE RUNNING OF THE TUTORIAL FAILS (2013.02.20,
+USING GENABEL DEV VERSION 1.7-4). I COMMENTED OUT THE R CODE BELOW.
+
+\Exercise Perform GWA analysis of the cleaned data,
+using asimptotic test and plot the results.
+What is the estimate of $\lambda$ for the 1 d.f. test?
+\Answer :
+<<>>=
+#qts <- qtscore(dm2,data2,trait="binomial")
+#lambda(qts)
+@
+
+
+\Exercise Analyse empirical GW significance. How many SNPs
+pass genome-wide significance threshold, after correction for
+the inflation factor?
+\Answer :
+<<>>=
+#res1 <- qtscore(dm2,data=data2,times=200,quiet=TRUE,trait="binomial")
+#ds1 <- descriptives.scan(res1)
+#ds1
+@
+There are SNPs which are empirically genome-wide significant
+in the data.
+To get the list of 'top' 10 SNPs:
+<<>>=
+#snps1 <- rownames(ds1)
+#snps1
+@
+
+
+\Exercise Do these SNPs overlap much with the ones
+ranked at the top before the QC? If not, what could be the reason?
+\Answer There is little overlap between SNPs before
+and after QC:
+<<>>=
+#snps0
+#snps1
+@
+
+
+\Exercise Select 10 SNPs which you would like to follow-up. Say, you've selected
+rs1646456, rs7950586, rs4785242, rs4435802, rs2847446, rs946364, rs299251,
+rs2456488, rs1292700, and rs8183220.
+Make a vector of these SNPs with
+<<>>=
+#vec12<-c("rs1646456","rs7950586","rs4785242","rs4435802","rs2847446","rs946364","rs299251","rs2456488","rs1292700","rs8183220")
+@
+Load the stage 2 (replicaton) data set by
+<<>>=
+#data(ge03d2c)
+@
+and select the subset of SNPs you need by
+<<>>=
+#confdat <- ge03d2c[,vec12]
+@
+Analyse the \texttt{confdat} for association with \texttt{dm2}.
+
+\Answer :
+<<>>=
+#data(ge03d2c)
+#snps1
+#confdat <- ge03d2c[,snps1]
+#rep <- qtscore(dm2,confdat,times=10000,quiet=TRUE)
+#descriptives.scan(rep)
+@
+
+
+\Exercise Given the two-stage design, and applying the
+puristic criteria specified in the lecture, for how many SNPs
+you can claim a significant finding?
+\Answer Two-stage P-value is
+<<>>=
+# snps1
+# finres <- matrix(NA,10,3)
+# colnames(finres) <- c("Stage 1","Replication","Combined")
+# rownames(finres) <- snps1
+# for (i in 1:10) {
+# finres[i,1] <- res1[which(snpnames(data2)==snps1[i]),"Pc1df"]
+# finres[i,2] <- rep[which(snpnames(confdat)==snps1[i]),"P1df"]
+# finres[i,3] <- finres[i,1]*finres[i,2]
+# }
+# finres
+# for (i in 1:10) {
+# if (finres[i,3] <= 0.05) {
+# print(c("---------",rownames(finres)[i],"-------"))
+#
+# print(c(rownames(finres)[i],"stage 1:"))
+# ph <- phdata(data2)$dm2
+# gt <- as.numeric(data2[,rownames(finres)[i]])
+# print(summary(glm(ph~gt,family=binomial))$coef)
+#
+# print(c(rownames(finres)[i],"stage 2:"))
+# ph <- phdata(confdat)$dm2
+# gt <- as.numeric(confdat[,rownames(finres)[i]])
+# print(summary(glm(ph~gt,family=binomial))$coef)
+#
+# print(c(rownames(finres)[i],"Joint:"))
+# ph <- c(phdata(data2)$dm2,phdata(confdat)$dm2)
+# gt <- c(as.numeric(data2[,rownames(finres)[i]]),as.numeric(confdat[,rownames(finres)[i]]))
+# print(summary(glm(ph~gt,family=binomial))$coef)
+# }
+# }
+#replicatedsnps <- rownames(finres)[finres[,"Stage 1"]<=0.05 &
+# finres[,"Replication"]<=0.05 & finres[,"Combined"]<=0.05]
+#replicatedsnps
+#sigsnps <- rownames(finres)[finres[,"Combined"]<=0.05]
+#sigsnps
+@
+At the first glance, NWSexpr{length(replicatedsnps)} SNPs
+may be claimed as replicated
+because both first stage and replication $P$-values are
+$\le 0.05$ and effects are consistent, and additionally
+NWSexpr{length(sigsnps) - length(replicatedsnps)} may
+be claimed as 'significant' because
+joint $p$-values are $\le 0.05$ and the effects are consistent.
+Generally, a more thorough simulation experiment should be performed.
+
+
+\Exercise Using the same criteria, for how many SNPs
+you can claim a replicated finding?
+\Answer SNP ''rs7903146'' had empirical $p$-value $\le 0.05$
+at both stages, and very strong
+joint significance. It can be claimed as replicated.
+
+You can check if any of the SNPs you have
+identified as significant or replicated are the ones which were
+simulated to be associated with \texttt{dm2} by using the command
+\texttt{show.ncbi(c("snpname1","snpname2","snpname3"))}
+where \texttt{snpnameX} stands for the name of your identified SNP.
+The "true" SNPs can be found on NCBI and some are located in known T2D
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/genabel -r 1137
More information about the Genabel-commits
mailing list