[Genabel-commits] r1250 - tutorials/GenABEL_general
noreply at r-forge.r-project.org
Wed Jun 19 14:38:59 CEST 2013
Author: yurii
Date: 2013-06-19 14:38:58 +0200 (Wed, 19 Jun 2013)
New Revision: 1250
Modified:
tutorials/GenABEL_general/Makefile
tutorials/GenABEL_general/fetchData.Rnw
tutorials/GenABEL_general/strat0.Rnw
Log:
putting the chapter on theory of stratification back
Modified: tutorials/GenABEL_general/Makefile
===================================================================
--- tutorials/GenABEL_general/Makefile 2013-06-14 18:30:39 UTC (rev 1249)
+++ tutorials/GenABEL_general/Makefile 2013-06-19 12:38:58 UTC (rev 1250)
@@ -46,7 +46,7 @@
rm -fv Rplots.pdf *.RData mach1* *txt *PHE
rm -fv *idx *ilg *ind *pdf verbinp
rm -fv *.4?? *.css *.idv *.lg *.tmp *.xref
- rm -fv *.png *.html figures/*.png
+ rm -fv *.png *.html
rm -rf GenABEL_tutorial_html
- rm -rf RData
+ rm -rf RData figures
rm -fv *.tar.gz
Modified: tutorials/GenABEL_general/fetchData.Rnw
===================================================================
--- tutorials/GenABEL_general/fetchData.Rnw 2013-06-14 18:30:39 UTC (rev 1249)
+++ tutorials/GenABEL_general/fetchData.Rnw 2013-06-19 12:38:58 UTC (rev 1250)
@@ -17,8 +17,11 @@
dir.create("RData")
\end{verbatim}
<<echo=FALSE>>=
+# operations with 'figures' hidden from user, for build
unlink("RData",recursive=TRUE,force=TRUE)
+unlink("figures",recursive=TRUE,force=TRUE)
dir.create("RData")
+dir.create("figures")
@
Now, fetch the necessary data from the server. First, define the download
@@ -98,7 +101,14 @@
#Special case of figure(s), not shown to user
baseLocal <- ""
figuresFiles <- c(
- "gwaa-data-class.pdf"
+ "gwaa-data-class.pdf",
+ "allelic_freq.pdf",
+ "HWE_under_inbreeding.pdf",
+ "inbred_family.pdf",
+ "inflation_on_freq.pdf",
+ "journal_pone_0005472.pdf",
+ "what_method.pdf",
+ "samples.pdf"
)
myDownloads(baseUrl,baseLocal,figuresFiles)
@
Modified: tutorials/GenABEL_general/strat0.Rnw
===================================================================
--- tutorials/GenABEL_general/strat0.Rnw 2013-06-14 18:30:39 UTC (rev 1249)
+++ tutorials/GenABEL_general/strat0.Rnw 2013-06-19 12:38:58 UTC (rev 1250)
@@ -56,6 +56,1958 @@
specific association tests which take possible genetic
structure into account (section \ref{sec:tests_in_structured_populations}).
+{\bf The text of this chapter is in large part based on a chapter of a
+book published by Elsevier. We thank Elsevier for the permission to reproduce this
+material. COPYRIGHT NOTICE: Reprinted from "Analysis of Complex Disease
+Association Studies: A Practical Guide", Yurii Aulchenko,
+chapter "Effects of Population Structure in Genome-wide Association Studies",
+123-156, Copyright 2011, with permission from Elsevier}
-{\bf The rest of this chapter is ?temporarily? deleted due to potential copyright issues}
+\section{Genetic structure of populations}
+\label{sec:genstruct}
+A major unit of genetic structure is a
+genetic population. Different definitions
+of genetic population are available,
+for example
+\href{http://en.wikipedia.org/wiki/Population}{Wikipedia
+defines population (biol.)}
+as ''the collection of inter-breeding organisms of a
+particular species''. The genetics of populations is
+\href{http://en.wikipedia.org/wiki/Population_genetics}{
+''the study of the allele frequency distribution and change
+under the influence of \ldots evolutionary processes''}.
+\index{population genetics}
+In the framework of population genetics, the main
+characteristics of interest of a group of
+individuals are their genotypes, frequencies of alleles
+in this group, and the dynamics of these distributions
+in time.
+While the units of interest of population genetics
+are alleles, the units that evolutionary processes
+act upon are organisms.
+Therefore a definition of a genetic population should
+be based on the chance that different alleles, present
+in the individuals in question, can mix together;
+if this chance is zero,
+we may consider such groups as different populations,
+each described by its own genotypic and allelic
+frequencies and their dynamics.
+Based on these considerations, a genetic
+population may be defined
+in the following way:
+
+\emph{
+Two individuals, $I_1$ and $I_2$, belong to the same
+population if (a) the probability that they would
+have an offspring in common is greater than zero and
+(b) this probability is much higher than the probability
+of $I_1$ and $I_2$ having an offspring in common with
+some individual $I_3$, which is said to belong to another
+genetic population.}
+\index{genetic population!prospective definition}
+
+Here, to have an offspring in common
+does not imply having a direct offspring, but rather a
+common descendant in a number of generations.
+
+However, in gene discovery in general and GWA studies
+in particular we are usually not interested
+in future dynamics of allele and genotype distributions.
+What is of concern in genetic association
+studies is potential common
+ancestry -- that is, that individuals
+may share common ancestors and thus share
+alleles which are exact copies of the same ancestral
+allele. Such alleles are called ''identical-by-descent'',
+or IBD for short.\index{identity by descent}\index{IBD}
+If the chance of IBD is high, this reflects a high degree
+of genetic relationship.
+As a rule, relatives
+share many features, both environmental and genetic,
+which may lead to confounding.
+
+Genetic relationship between a pair of individuals
+is quantified using the ''coefficient of kinship'',
+which measures the chance that gametes, sampled
+at random from these individuals, are IBD.\index{coefficient!of
+kinship}\index{kinship!coefficient}\label{def:kinship}
+
+Thus for the purposes of gene discovery
+we can define a genetic population
+using retrospective terms, based on the
+concept of IBD:
+
+\emph{
+Two individuals, $I_1$ and $I_2$, belong to the same
+genetic population if (a) their genetic relationship, measured
+with the coefficient of kinship,
+is greater than zero and (b) their kinship is much higher
+than kinship between them and some individual $I_3$, which is
+said to belong to another genetic population.}
+\index{genetic population!retrospective definition}
+\label{def:population}
+
+One can see that this definition is quantitative and
+rather flexible (if not to say arbitrary): what we call
+a ''population'' depends on the choice of the threshold
+for the ''much-higher'' probability. Actually, what
+you define as ''the same'' genetic population depends
+in large part on the scope and aims of your study.
+In human genetics literature you may find references to
+a particular genetically isolated population, population of some
+country (e.g. ''German population'', ''population of United Kingdom''),
+European, Caucasoid or even general human population. Defining a
+population is about deciding on some probability threshold.
+
+In genetic association studies, it is frequently assumed that
+study participants are ''unrelated'' and ''come from the same
+genetic population''. Here, ''unrelated'' means, that while
+study participants come from the same population (so, there is
+non-zero kinship between them!), this kinship is so low that it
+has very little effect on the statistical testing procedures
+used to study association between genes and phenotypes.
+
+In the following sections we will consider the effects of population
+structure on the distribution of genotypes in a study population.
+We will start with the assumption of zero kinship between study
+participants, which will allow us to formulate the Hardy-Weinberg principle
+(section \ref{subsec:HWE}).
+In reality, there is no such thing as zero kinship between
+any two organisms; however, when kinship is very low, the effects
+of kinship on genotypic distribution are minimal, as we will see in
+section \ref{subsec:inbreeding}. The effects of substructure --
+that is, when the study sample consists of several genetic populations --
+on genotypic distribution will be considered in section \ref{subsec:wahlund}.
+Finally, we will generalize the obtained results to the
+case of arbitrary structures and will see what the effects
+of kinship are on the joint distribution of genotypes and phenotypes
+in section \ref{subsec:phenocorr}.
+
+\subsection{Hardy-Weinberg equilibrium}
+\label{subsec:HWE}
+To describe the genetic structure of populations
+we will use a rather simplistic model
+approximating genetic processes in natural populations. Firstly, we will
+assume that the population under consideration has infinitely
+large size, which implies that we can work in terms of probabilities,
+and no random processes take place.
+Secondly, we accept the non-overlapping
+
+$$\textrm{generation} \Rightarrow \textrm{gametic pool} \Rightarrow \textrm{generation}$$
+\index{generation -- gametic pool -- generation model}
+\label{ggpg_model}
+
+\noindent model. This model assumes that a set of individuals
+contributes gametes to the gametic pool, and dies out. The gametes
+are sampled randomly from this pool in pairs to form individuals
+of the next generation. Selection acts on individuals, while
+mutation occurs when the gametic pool is formed. The key point
+of this model is the abstraction of the gametic pool: if you use it,
+you do not need to consider all pair-wise matings between male and
+female individuals; you rather consider some abstract infinitely
+large pool, to which gametes are contributed with a frequency
+proportional to that in the previous generation. Interestingly, this
+rather artificial construct has a great potential to describe
+the phenomena we indeed observe in nature.
+
+In this section, we will derive the Hardy-Weinberg law (the analog
+of Mendel's laws for populations). The question to be
+answered is, if some alleles at some locus segregate
+according to Mendel's laws and aggregate totally at random, what
+would be the genotypic distribution in a population?
+
+Let us consider two alleles, wild type normal allele ($N$) and
+a mutant ($D$), segregating at some locus in the population
+and apply the ''generation $\Rightarrow$ gametic pool $\Rightarrow$
+generation'' model.
+Let us denote the frequency of the $D$ allele in the gametic
+pool as $q$, and the frequency of the other allele, $N$, as
+$p=1-q$.
+Gametes containing alleles $N$ and $D$ are sampled at random to
+form diploid individuals of the next generation.
+The probability to sample a ''$N$'' gamete is $p$, and the
+probability that the second sampled gamete is also ''$N$'' is
+also $p$. According to the rule, which states that joint probability
+of two independent events is a product of their probabilities,
+the probability to sample ''$N$'' and ''$N$'' is
+$p \cdot p = p^2$. In the same manner, the probability to
+sample ''$D$'' and then ''$D$'' is $q \cdot q = q^2$. The
+probability to sample first the mutant and then the normal allele
+is $q \cdot p$, and the probability to
+sample ''$N$'' first and ''$D$'' second is the same, $p \cdot q$. In most situations, we
+do not (and cannot) distinguish the heterozygous genotypes $DN$
+and $ND$ and refer to both of them as ''$ND$''. In this
+notation, the frequency of $ND$ will be
+$q \cdot p + p \cdot q = 2 \cdot p \cdot q $.
+Thus, we have computed the genotypic distribution for a population
+formed from a gametic pool in which the frequency of $D$ allele
+was $q$.
+
+To obtain the next generation, the next gametic pool is generated.
+The frequency of $D$ in the next gametic pool is
+$q^2 + \frac{1}{2}\cdot 2 \cdot p \cdot q$.
+Here, $q^2$ is the probability that a gamete-contributing
+individual has genotype $DD$; $2\cdot p \cdot q$ is the probability that
+a gamete-contributing individual is $ND$, and $\frac{1}{2}$ is
+the probability that $ND$ individual contributes $D$ allele
+(only half of the gametes contributed by individuals with $ND$
+genotype are $D$); see Figure \ref{fig:allelic_freq}.
+Thus the frequency of $D$ in the gametic pool is
+$q^2 + \frac{1}{2}\cdot 2 \cdot p \cdot q = q \cdot (q + p) = q$
+-- exactly the same as it was in the previous gametic pool.
+
+\begin{figure}
+\center
+\includegraphics[width=0.80\textwidth]{allelic_freq}
+\caption{
+Genotypic and allelic frequency distribution in a
+population; $q=P(D)=P(DD)+\frac{1}{2}\cdot P(DN)$.
+}
+\label{fig:allelic_freq}
+\end{figure}
+
+Thus, if the assumptions of random segregation
+and aggregation hold, the expected frequencies of the $NN$, $ND$
+and $DD$ genotypes are stable over generations and
+can be related to the allelic frequencies using the
+following relation
+
+\begin{equation}
+\label{eq:HWE2}
+\begin{array}{lll}
+P(NN) &= (1-q) \cdot (1-q) &= p^2, \\
+P(ND) &= q\cdot (1-q) + (1-q) \cdot q & = 2 \cdot p \cdot q, \\
+P(DD) &= q \cdot q &= q^2
+\end{array}
+\end{equation}
+which is known as the Hardy-Weinberg equilibrium (HWE) point.
+\label{Hardy-Weinberg equilibrium}
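+
+As a quick numerical check of equation \ref{eq:HWE2}, take an illustrative
+value $q=0.1$ (chosen here only as an example), so that $p=0.9$:
+$$
+\begin{array}{lll}
+P(NN) &= 0.9^2 &= 0.81, \\
+P(ND) &= 2 \cdot 0.9 \cdot 0.1 &= 0.18, \\
+P(DD) &= 0.1^2 &= 0.01
+\end{array}
+$$
+The three proportions sum to one, and the frequency of $D$ in the next
+gametic pool is $0.01 + \frac{1}{2}\cdot 0.18 = 0.1$, that is, unchanged,
+in agreement with the derivation above.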
+
+
+There are many reasons why random segregation and
+aggregation, and, consequently, Hardy-Weinberg equilibrium,
+may be violated. It is very important to
+realize that, especially if the study participants are believed
+to come from the same genetic population, most of the time when
+deviation from HWE is detected, this
+deviation is due to technical reasons, i.e. genotyping
+error. Therefore testing for HWE is a part of the
+genotypic quality control procedure in most studies.
+Only when the possibility of technical errors is
+eliminated may other possible explanations be
+considered.
+In a case when deviation from HWE cannot be explained
+by technical reasons, the most frequent explanation would
+be that the sample tested is composed of representatives
+of different genetic populations, or has more subtle
+genetic structure. However, unless study participants
+represent a mixture of very distinct genetic
+populations -- the chances of which going unnoticed
+are low -- the effects of genetic structure on HWE
+are difficult to detect, at least for any single marker,
+as you will see in the next sections.
+\index{deviation from Hardy-Weinberg equilibrium}
+\index{Hardy-Weinberg equilibrium!deviation from}
+
+\subsection{Inbreeding}
+\label{subsec:inbreeding}
+
+Inbreeding is preferential breeding between (close) relatives.\index{inbreeding}
+An extreme example of inbreeding is selfing, a breeding system
+observed in some plants. Inbreeding is not uncommon in animal
+and human populations. Here, the main reasons
+for inbreeding are usually geographical (e.g. mice live in
+very small interbred colonies -- demes -- which are usually
+established by a few mice and are quite separated
+from other demes) or cultural (e.g. noble families
+of Europe).
+
+Clearly, such preferential breeding between relatives
+violates the assumption of random aggregation underlying the
+Hardy-Weinberg principle. Relatives are likely to share the
+same alleles, inherited from common ancestors. Therefore
+their progeny has an increased chance of being
+\emph{autozygous}\index{autozygosity} -- that is to
+inherit a copy of exactly the same ancestral allele
+from both parents. An autozygous genotype is always
+homozygous, therefore inbreeding should increase the
+frequency of homozygous, and decrease the frequency of
+heterozygous, genotypes.
+
+Inbreeding is quantified by the \emph{coefficient of
+inbreeding},\index{coefficient!of inbreeding}\index{inbreeding!coefficient of}
+which is defined as the probability of autozygosity.
+This coefficient may characterize an individual, or
+a population in general, in which case it is the probability
+that a random individual from the population is
+autozygous at a random locus. The coefficient of
+inbreeding is closely related to the coefficient of
+kinship, defined earlier for a pair of individuals as
+the probability that two alleles, sampled
+at random from these individuals, are IBD. It is easy to see
+that the coefficient of inbreeding for a person is
+the same as the kinship between that person's parents.
+\index{coefficient!of inbreeding, relation to kinship}
+\index{coefficient!of kinship, relation to inbreeding}
+
+\begin{figure}
+\center
+\includegraphics[width=1.00\textwidth]{inbred_family}
+\caption{Inbred family structure (A) and probability of
+individual ''G'' being autozygous for the ''Red'' ancestral
+allele
+}
+\label{fig:inbred_family}
+\end{figure}
+
+Let us compute the inbreeding coefficient for the person {\bf J}
+depicted in figure \ref{fig:inbred_family}. {\bf J} is a child
+of {\bf G} and {\bf H}, who are cousins. {\bf J} could be autozygous
+for, for example, the ''red'' allele of the founder great-grandparent {\bf A},
+which could have been transmitted through the meioses
+{\bf A $\Rightarrow$ D}, {\bf D $\Rightarrow$ G}, and
+{\bf G $\Rightarrow$ J}, and also through the path
+{\bf A $\Rightarrow$ E}, {\bf E $\Rightarrow$ H}, and
+{\bf H $\Rightarrow$ J} (Figure \ref{fig:inbred_family} {\bf B}).
+What is the chance for {\bf J} to be autozygous for the
+''red'' allele? The probability that this particular founder
+allele is transmitted to {\bf D} is $1/2$, the same is the probability
+that the allele is transmitted from {\bf D} to {\bf G}, and
+the probability that the allele is transmitted from
+{\bf G} to {\bf J}. Thus the probability that the ''red'' allele
+is transmitted from {\bf A} to {\bf J} is $1/2 \cdot 1/2 \cdot 1/2 = 1/2^3 = 1/8$.
+The same is the chance that the allele is transmitted from
+{\bf A} to {\bf E} to {\bf H} to {\bf J}, therefore the probability
+that {\bf J} would be autozygous for the red allele is
+$1/2^3 \cdot 1/2^3 = 1/2^6 = 1/64$. However, we are interested in
+autozygosity for any founder allele; and there are four such
+alleles (''red'', ''green'', ''yellow'' and ''blue'', figure
+\ref{fig:inbred_family} {\bf B}). For any of these the probability
+of autozygosity is the same, thus the total probability of
+autozygosity for {\bf J} is $4\cdot 1/64 = 1/2^4 = 1/16$.
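+
+The same value can be cross-checked with the standard path-counting
+formula, which is not part of the derivation above and is given here only
+as a sketch: for each common ancestor of the parents, a path with $n_1$
+meioses to one parent and $n_2$ meioses to the other contributes
+$(1/2)^{n_1+n_2+1}$, assuming that the common ancestor is not inbred.
+Assuming, as the four founder alleles suggest, that {\bf G} and {\bf H}
+share two grandparents, each reached through $n_1=n_2=2$ meioses:
+$$F_J = 2 \cdot \left(\frac{1}{2}\right)^{2+2+1} = \frac{2}{32} = \frac{1}{16}$$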
+
+Now we shall estimate the expected genotypic probability
+distribution for a person characterized with some
+arbitrary coefficient of inbreeding, $F$ -- or for a population
+in which average inbreeding is $F$. Consider a locus with two
+alleles, $A$ and $B$, with frequency of $B$ denoted as $q$, and
+frequency of $A$ as $p=1-q$. If the person is autozygous
+for some founder allele, the founder allele may be either
+$A$, leading to autozygous genotype $AA$, or the founder
+allele may be $B$, leading to genotype $BB$. The chance that
+the founder allele is $A$ is $p$, and the chance that the
+founder allele is $B$ is $q$. If the person
+is not autozygous, then the expected genotypic frequencies
+follow HWE. Thus, the probability of genotype
+$AA$ is $(1-F)\cdot p^2 + F\cdot p$, where the first term corresponds
+to probability that the person is $AA$ given it is not inbred ($p^2$),
+multiplied by the probability that it is not inbred ($1-F$), and
+the second term corresponds to probability that a person is
+$AA$ given it is inbred ($p$), multiplied by the probability that the
+person is inbred ($F$). These computations can easily be done for all
+genotypic classes, leading to the expression for HWE under inbreeding.
+
+\begin{equation}
+\label{eq:HWE_inbreeding}
+\begin{array}{lll}
+P(AA) &=(1-F)\cdot p^2 + F \cdot p &=p^2+p\cdot q\cdot F \\
+P(AB) &=(1-F)\cdot 2\cdot p\cdot q + F \cdot 0 &=2\cdot p\cdot q\cdot (1-F) \\
+P(BB) &=(1-F)\cdot q^2 + F\cdot q &=q^2+p\cdot q\cdot F \\
+\end{array}
+\end{equation}
+\index{Hardy-Weinberg equilibrium!under inbreeding}
+
+How much is inbreeding expected to modify genotypic distribution
+in human populations? The levels of inbreeding observed in
+human genetically isolated populations typically
+vary from $0.001$ (low inbreeding) to $0.05$ (relatively high),
+see \cite{rudan2003,pardo2005}. The genotypic distribution
+for $q=0.5$ and different values of the inbreeding coefficient is
+shown in figure \ref{fig:HWE_under_inbreeding}.
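+
+To give a feeling for the magnitudes involved, equation
+\ref{eq:HWE_inbreeding} can be evaluated directly for $q=0.5$ and the
+upper value $F=0.05$ (a worked illustration, not additional data):
+$$
+\begin{array}{lll}
+P(AA) &= 0.25 + 0.25 \cdot 0.05 &= 0.2625, \\
+P(AB) &= 0.5 \cdot (1 - 0.05) &= 0.4750, \\
+P(BB) &= 0.25 + 0.25 \cdot 0.05 &= 0.2625
+\end{array}
+$$
+Compared to the HWE proportions $0.25$, $0.5$ and $0.25$, each genotypic
+class changes by at most $0.025$; for $F=0.001$ the heterozygote
+proportion is $0.5 \cdot 0.999 = 0.4995$.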
+
+\begin{figure}
+\center
+\includegraphics[width=1.00\textwidth]{HWE_under_inbreeding}
+\caption{
+Genotypic probability distribution for a locus with 50\% frequency of
+the $B$ allele; black bar, no inbreeding; red, $F=0.001$; green, $F=0.01$;
+blue, $F=0.05$
+}
+\label{fig:HWE_under_inbreeding}
+\end{figure}
+
+What is the power to detect deviation from HWE due to inbreeding?
+For that, we need to estimate the expectation of
+the $\chi^2$ statistic (the non-centrality parameter, NCP) used
+to test for HWE. The test for HWE is performed using the standard formula
+
+\begin{equation}
+\label{eq:chi2}
+T^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}
+\end{equation}
+\index{Hardy-Weinberg equilibrium!$\chi^2$ test}
+\index{test!for Hardy-Weinberg equilibrium}
+where summation is performed over all classes (genotypes); $O_i$ is
+the count observed in $i$-th class, and $E_i$ is the count expected
+under the null hypothesis (HWE). Under the null hypothesis, this
+test statistic is distributed as $\chi^2$ with number of degrees of
+freedom equal to the number of genotypes minus the number of alleles.
+
+Thus the expectation of this test statistic for some $q$, $F$, and $N$
+(sample size) is
+
+\begin{equation}
+\label{eq:exp_chi2_HWE_F}
+\begin{array}{ll}
+E[T^2] &= \frac{(N(q^2+p q F)-N q^2)^2}{N q^2}
++ \frac{(N2pq(1-F)-N2pq)^2}{N2pq}
++ \frac{(N(p^2+p q F)-N p^2)^2}{N p^2} \\
+ &= \frac{(NpqF)^2}{N q^2} + \frac{(-2NpqF)^2}{N2pq} + \frac{(NpqF)^2}{N p^2} \\
+ & = Np^2F^2+2NpqF^2+Nq^2F^2 \\
+ & = NF^2(p^2+2pq+q^2) \\
+ & = N\cdot F^2
+\end{array}
+\end{equation}
+
+Interestingly, the non-centrality parameter does not depend on the
+allelic frequency. Given the non-centrality parameter, it is easy
+to compute the power to detect deviation from HWE for any given $F$.
+For example, to achieve the power of $>0.8$ at $\alpha=0.05$, for a test
+with one degree of freedom the non-centrality parameter should
+be $>7.85$. Thus, if $F=0.05$, to have 80\% power,
+$N\cdot F^2 > 7.85$, that is the required sample size should be
+$N > \frac{7.85}{F^2} = \frac{7.85}{0.0025} = 3140$ people.
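+
+As a rough sketch of where the value $7.85$ comes from, one may use the
+normal approximation (an approximation, not an exact computation with the
+non-central $\chi^2$ distribution), with $z_{0.975} \approx 1.960$ and
+$z_{0.80} \approx 0.842$:
+$$(z_{0.975} + z_{0.80})^2 \approx (1.960 + 0.842)^2 \approx 7.85$$
+Applying the same bound $N > 7.85/F^2$ to smaller coefficients of
+inbreeding gives
+$$
+\begin{array}{ll}
+F = 0.01: & N > 7.85/0.0001 = 78500, \\
+F = 0.001: & N > 7.85/0.000001 = 7850000
+\end{array}
+$$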
+
+Thus, even in populations with strong inbreeding, rather
+large sample sizes are required to detect the effects
+of inbreeding on HWE at a particular locus, even at the relatively
+lenient significance level of 5\%.
+
+While the chance that deviation from HWE due to inbreeding
+will be statistically significant at any single marker is relatively small,
+inbreeding may have clear effects on the results of HWE
+testing in a GWA study. Basically, if testing is performed
+at a threshold corresponding to nominal significance $\alpha$,
+the proportion of markers which show significant deviation
+will be larger than $\alpha$. Clearly, how large this proportion
+will be depends on the inbreeding and on the size of the study --
+the expectation of $T^2$ is a function of both $N$ and $F$.
+The proportion of markers showing significant deviation from
+HWE at different values of inbreeding, sample size, and
+nominal significance threshold is shown in table
+\ref{tab:t1e_hwe_underF}.
+While the deviation of this proportion from the nominal one is
+minimal at large $\alpha$'s and small sample sizes
+and coefficients of inbreeding, it may be 10-fold and
+even 100-fold higher than the nominal level at reasonable
+values of $N$ and $F$ for smaller thresholds.
+
+\begin{table} %\renewcommand{\arraystretch}{2}\addtolength{\tabcolsep}{-1pt}
+\centering
+\caption{Expected proportion of markers deviating from HWE in a sample
+of $N$ people coming from a population with average
+inbreeding $F$. Proportion of markers is shown for
+particular test statistic threshold, corresponding to
+nominal significance $\alpha$.}
+\label{tab:t1e_hwe_underF}
+{\setlength{\tabcolsep}{3mm}
+\begin{tabular}{lcccc}
+\hline
+ & & & $\alpha$ &\\
+\cline{3-5}
+$N$ & $F$ & 0.05 & $10^{-4}$ & $5\cdot 10^{-8}$ \\
+\hline
+ & 0.001 & 0.0501 & $1.008\cdot 10^{-4}$ & $5.077\cdot 10^{-8}$ \\
+1,000 & 0.005 & 0.0529 & $1.205\cdot 10^{-4}$ & $7.025\cdot 10^{-8}$ \\
+ & 0.010 & 0.0615 & $1.885\cdot 10^{-4}$ & $14.503\cdot 10^{-8}$ \\
+\hline
+ & 0.001 & 0.0511 & $1.081\cdot 10^{-4}$ & $5.784\cdot 10^{-8}$ \\
+10,000 & 0.005 & 0.0790 & $3.544\cdot 10^{-4}$ & $36.991\cdot 10^{-8}$ \\
+ & 0.010 & 0.1701 & $19.231\cdot 10^{-4}$ & $426.745\cdot 10^{-8}$ \\
+\hline
+\end{tabular}
+}
+\end{table}
+
+
+\subsection{Mixture of genetic populations: Wahlund's effect}
+\label{subsec:wahlund}
+
+Consider the following artificial example. Imagine that
+recruitment of study participants occurs at a hospital,
+which serves two equally sized villages ($V_1$ and $V_2$);
+however, the villages are very distinct because of cultural
+reasons, and most marriages occur within a village. Thus
+these two villages represent two genetically distinct
+populations. Let us consider a locus with two alleles,
+$A$ and $B$. The frequency of $A$ is $0.9$ in $V_1$ and
+it is $0.2$ in $V_2$. In each population, marriages
+occur at random, and HWE holds for the locus. What
+genotypic distribution is expected in a sample
+ascertained in the hospital, which represents a $1:1$
+mixture of the two populations?
+
+The expected genotypic proportions are presented in
+table \ref{tab:mixpop}. First, assuming that HWE holds
+for each of the populations, we can compute genotypic
+proportions within these (rows 1 and 2 of table
+\ref{tab:mixpop}). If our sample represents a
+$1:1$ mixture of these populations, then the frequency
+of some genotype is also $1:1$ mixture of the respective
+frequencies. For example, frequency of $AA$ genotype
+would be $\frac{0.81}{2} + \frac{0.04}{2} = 0.425$,
+and so on. The frequency of the $A$ allele in the pooled
+sample will be $0.425 + \frac{0.25}{2} = 0.55$. Based
+on this frequency we would expect a genotypic frequency
+distribution of approximately $0.3$, $0.5$ and $0.2$ for $AA$, $AB$, and
+$BB$, respectively. As you can see, the observed distribution
+has much higher frequencies of homozygous genotypes -- an excess
+of homozygotes.
+
+It is notable that the differences between the observed
+homozygote frequencies and those expected under HWE
+are both 0.125, and, consequently, the observed heterozygosity
+is less than that expected by $0.125\cdot 2 = 0.25$.
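+
+The figures quoted for this example are rounded. As a worked check, the
+unrounded HWE expectations for the pooled allele frequency $0.55$ are
+$$
+\begin{array}{lll}
+P(AA) &= 0.55^2 &= 0.3025, \\
+P(AB) &= 2 \cdot 0.55 \cdot 0.45 &= 0.4950, \\
+P(BB) &= 0.45^2 &= 0.2025
+\end{array}
+$$
+so that the unrounded differences between observed and expected proportions
+are $0.425-0.3025=0.1225$ for $AA$, $0.25-0.495=-0.245$ for $AB$, and
+$0.325-0.2025=0.1225$ for $BB$. The two homozygote excesses are equal,
+and the heterozygote deficit is exactly twice as large.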
+
+The phenomenon of deviation from HWE due to the fact that
+the considered population consists of two sub-populations
+is known as
+\href{http://en.wikipedia.org/wiki/Wahlund_effect}{''Wahlund's effect''}\index{Wahlund's effect},
+after the scientist who first considered and quantified
+the genotypic distribution under such a model \cite{wahlund1928}.
+
+\begin{table}
+\centering
+\caption{Genotypic proportions in a mixed population}
+\label{tab:mixpop}
+\begin{tabular}{lccccc}
+\hline
+Village & \%Sample & $p(A)$ & $P(AA)$ & $P(AB)$ & $P(BB)$ \\
+\hline
+$V_1$ & 50 & 0.9 & 0.81 & 0.18 & 0.01 \\
+$V_2$ & 50 & 0.2 & 0.04 & 0.32 & 0.64 \\
+ & & & & Observed & \\
+\hline
+Pooled & 100 & 0.55 & 0.425 & 0.25 & 0.325 \\
+ & & & & Expected & \\
+\cline{4-6}
+ & & & 0.30 & 0.50 & 0.20 \\
+ & & & & Difference & \\
+\cline{4-6}
+ & & & 0.125 & $-0.250$ & 0.125 \\
+\hline
+\end{tabular}
+\end{table}
+
+Such marked differences between the observed proportions and those expected under HWE are
+very easily detected; for the above example, a sample of $\approx 35$ people
+is enough to reject the hypothesis of HWE (power $>80\%$ at
+$\alpha = 0.05$).
+
+However, the differences we can see in real life are not so marked.
+For example, the common Pro allele at position 12 of the peroxisome
+proliferator-activated receptor gamma is associated with increased
+risk for type 2 diabetes. The frequency of the Pro allele is about
+85\% in European populations and Caucasian-Americans, about 97\% in
+Japan and 99\% in African-Americans (see table $1$ from \cite{Ruiz2005}).
+Table \ref{tab:mixpop_pparg} shows hypothetical observed and expected
+genotypic proportions in a sample composed of 50\% Caucasians and
+50\% African-Americans.
+
+\begin{table}
+\centering
+\caption{Genotypic proportions of $PPAR\gamma$ $Pro12Ala$ genotype in a mixed population}
+\label{tab:mixpop_pparg}
+\begin{tabular}{lccccc}
+\hline
+Ethnicity & \%Sample & $p(Pro)$ & $P(Pro/Pro)$ & $P(Pro/Ala)$ & $P(Ala/Ala)$ \\
+\hline
+Caucasian & 50 & 0.85 & 0.7225 & 0.2550 & 0.0225 \\
+African-American & 50 & 0.99 & 0.9801 & 0.0198 & 0.0001 \\
+ & & & & Observed & \\
+\hline
+Pooled & 100 & 0.92 & 0.8513 & 0.1374 & 0.0113 \\
+ & & & & Expected & \\
+\cline{4-6}
+ & & & 0.8464 & 0.1472 & 0.0064 \\
+ & & & & Difference & \\
+\cline{4-6}
+ & & & $0.0049$ & $-0.0098$ & $0.0049$ \\
+\hline
+\end{tabular}
+\end{table}
+
+You can see that the observed distribution and the one expected
+under HWE are very similar; only a sample as large as 1,800
+people would allow detection of the deviation from HWE (power
+$>80\%$ at $\alpha = 0.05$). The situation is similar for
+most genes observed in real life -- while the frequencies
+may (or may not) be very different for populations which
+diverged a long time ago, for relatively close populations
+the expected frequency differences are small and large sample
+sizes are required to detect deviation from HWE due to
+Wahlund's effect at a particular fixed locus.
+
+Let us summarize what genotypic proportions are expected
+in a sample which is a mixture of two populations. Let
+each population be in HWE, and let the frequency of the $B$
+allele be $q_1$ in population one and $q_2$ in population
+two. Let the proportion of individuals coming from
+population one be $m$ in the mixed population, and consequently
+the proportion of individuals from population two be $(1-m)$.
+The allelic frequencies and genotypic distributions
+in the original and mixed populations are presented in
+table \ref{tab:mixpop_general}.
+
+\begin{table}
+\centering
+\caption{Expected genotypic proportions in a mixed population; $F_{st}$ is
+defined by equation \ref{eq:wahlundsD}}
+\label{tab:mixpop_general}
+\begin{tabular}{lccccc}
+\hline
+Population & Prop. & $p(B)$ & $P(AA)$ & $P(AB)$ & $P(BB)$ \\
+\hline
+$P_1$ & $m$ & $q_1$ & $p_1^2$ & $2 p_1 q_1$ & $q_1^2$ \\
+$P_2$ & $(1-m)$ & $q_2$ & $p_2^2$ & $2 p_2 q_2$ & $q_2^2$ \\
+ & & & & Observed & \\
+\hline
+Pooled & 1.0 & $\overline{q}=m q_1 $ & $m p_1^2 $ & $2 m p_1 q_1$ & $m q_1^2 $ \\
+& & $+ (1-m) q_2$; & $+ (1-m) p_2^2$; & $+2 (1-m) p_2 q_2$; & $ + (1-m) q_2^2$ \\
+ & & & & Expected & \\
+\cline{4-6}
+ & & & $\overline{p}^2$ & $2 \overline{p} \overline{q}$ & $\overline{q}^2$\\
+ & & & & Difference & \\
+\cline{4-6}
+ & & & $\overline{p} \overline{q} F_{st}$ & $-2 \overline{p} \overline{q} F_{st}$ & $\overline{p} \overline{q} F_{st}$ \\
+\hline
+\end{tabular}
+\end{table}
+
+The frequency of the $B$ allele in the mixed population is just the weighted
+average of the allelic frequencies in the two populations,
+$\overline{q}=m\cdot q_1+(1-m)\cdot q_2$. Let us denote the frequency
+of the $A$ allele as $\overline{p}=1-\overline{q}$.
+It can be demonstrated that the genotypic frequency distribution in the
+mixed sample is a function of the frequency of allele $B$ in the sample,
+$\overline{q}$, and the ''disequilibrium'' parameter $F_{st}$:
+
+\begin{equation}
+\label{eq:HWE_wahlund}
+\begin{array}{ll}
+P(AA) &= \overline{p}^2+\overline{p}\cdot \overline{q}\cdot F_{st} \\
+P(AB) &= 2\cdot \overline{p}\cdot \overline{q}\cdot (1-F_{st}) \\
+P(BB) &= \overline{q}^2+\overline{p}\cdot \overline{q}\cdot F_{st} \\
+\end{array}
+\end{equation}
+\index{Hardy-Weinberg equilibrium!under Wahlund's effect}
+where
+\begin{equation}
+\label{eq:wahlundsD}
+F_{st} = \frac{m\cdot (1-m) \cdot (q_1-q_2)^2}{\overline{p}\cdot \overline{q}}
+\end{equation}
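+
+The algebra behind these expressions can be sketched for the $BB$
+genotype, using only the quantities defined in table
+\ref{tab:mixpop_general}; the other two genotypes follow in the same way.
+The excess of the observed $BB$ proportion over $\overline{q}^2$ is
+$$
+\begin{array}{ll}
+P(BB) - \overline{q}^2 &= m q_1^2 + (1-m) q_2^2 - (m q_1 + (1-m) q_2)^2 \\
+ &= m (1-m) q_1^2 - 2 m (1-m) q_1 q_2 + m (1-m) q_2^2 \\
+ &= m (1-m) (q_1-q_2)^2 \\
+ &= \overline{p} \cdot \overline{q} \cdot F_{st}
+\end{array}
+$$
+where the second line uses $m - m^2 = m(1-m)$ and
+$(1-m) - (1-m)^2 = m(1-m)$, and the last line is the definition of $F_{st}$
+in equation \ref{eq:wahlundsD}.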
+
+You can see that equation \ref{eq:HWE_wahlund}, expressing the genotypic
+frequency distribution under Wahlund's effect, is remarkably similar
+(actually, it is specifically re-written in a similar form) to
+equation \ref{eq:HWE_inbreeding}, expressing the genotypic proportions
+under the effects of inbreeding. The reason is that $F_{st}$
+(as well as $F$ of equation \ref{eq:HWE_inbreeding}) is easily estimated
+from the data as the ratio between the observed and expected variances
+of the genotypic distributions. The expected non-centrality
+parameter for the test of HWE is then simply $N\cdot F_{st}^2$, where
+$N$ is the sample size. Therefore our results concerning the
+proportion of tests expected to pass a particular significance threshold
+when genome-wide data are analyzed (table \ref{tab:t1e_hwe_underF}) hold,
+with replacement of $F$ with $F_{st}$.
+
+We can compute that the values of $F_{st}$, corresponding to the
+population mixtures presented in tables \ref{tab:mixpop} and
+\ref{tab:mixpop_pparg} are 0.49 and 0.067, respectively, which
+gives us a shortcut to estimate the sample size required to detect
+deviation from HWE due to Wahlund's effect (at $\alpha=0.05$ and
+power 80\%): $N > 7.85/0.49^2 \approx 32$ and $N > 7.85/0.067^2 \approx 1771$.
+
+A typical value of $F_{st}$ for European populations is about 0.002
+(up to 0.023 \cite{nelis2009}); very large sample sizes are required to
+detect deviation from HWE at any given locus at such small $F_{st}$'s.
+However, the proportion of markers failing to
+pass the HWE test in a GWA study may be visibly inflated (table \ref{tab:t1e_hwe_underF}).
+
+
+
+
+
+
+
+
+
+
[TRUNCATED]
To get the complete diff run:
svnlook diff /svnroot/genabel -r 1250