[Genabel-commits] r1250 - tutorials/GenABEL_general

noreply at r-forge.r-project.org noreply at r-forge.r-project.org
Wed Jun 19 14:38:59 CEST 2013


Author: yurii
Date: 2013-06-19 14:38:58 +0200 (Wed, 19 Jun 2013)
New Revision: 1250

Modified:
   tutorials/GenABEL_general/Makefile
   tutorials/GenABEL_general/fetchData.Rnw
   tutorials/GenABEL_general/strat0.Rnw
Log:
putting the chapter on theory of stratification back

Modified: tutorials/GenABEL_general/Makefile
===================================================================
--- tutorials/GenABEL_general/Makefile	2013-06-14 18:30:39 UTC (rev 1249)
+++ tutorials/GenABEL_general/Makefile	2013-06-19 12:38:58 UTC (rev 1250)
@@ -46,7 +46,7 @@
 	rm -fv Rplots.pdf *.RData mach1* *txt *PHE
 	rm -fv *idx *ilg *ind *pdf verbinp
 	rm -fv *.4?? *.css *.idv *.lg *.tmp *.xref
-	rm -fv *.png *.html figures/*.png
+	rm -fv *.png *.html 
 	rm -rf GenABEL_tutorial_html
-	rm -rf RData
+	rm -rf RData figures
 	rm -fv *.tar.gz

Modified: tutorials/GenABEL_general/fetchData.Rnw
===================================================================
--- tutorials/GenABEL_general/fetchData.Rnw	2013-06-14 18:30:39 UTC (rev 1249)
+++ tutorials/GenABEL_general/fetchData.Rnw	2013-06-19 12:38:58 UTC (rev 1250)
@@ -17,8 +17,11 @@
 dir.create("RData")
 \end{verbatim}
 <<echo=FALSE>>=
+# operations with 'figures' hidden from user, for build
 unlink("RData",recursive=TRUE,force=TRUE)
+unlink("figures",recursive=TRUE,force=TRUE)
 dir.create("RData")
+dir.create("figures")
 @ 
 
 Now, fetch the necessary data from the server. First, define the download
@@ -98,7 +101,14 @@
 #Special case of figure(s), not shown to user
 baseLocal <- ""
 figuresFiles <- c(
-           "gwaa-data-class.pdf"
+           "gwaa-data-class.pdf",
+           "allelic_freq.pdf",
+           "HWE_under_inbreeding.pdf",
+           "inbred_family.pdf",
+           "inflation_on_freq.pdf",
+           "journal_pone_0005472.pdf",
+           "what_method.pdf",
+           "samples.pdf"
            )
 myDownloads(baseUrl,baseLocal,figuresFiles)
 @ 

Modified: tutorials/GenABEL_general/strat0.Rnw
===================================================================
--- tutorials/GenABEL_general/strat0.Rnw	2013-06-14 18:30:39 UTC (rev 1249)
+++ tutorials/GenABEL_general/strat0.Rnw	2013-06-19 12:38:58 UTC (rev 1250)
@@ -56,6 +56,1958 @@
 specific association tests which take possible genetic 
 structure into account (section \ref{sec:tests_in_structured_populations}).
 
+{\bf The text of this chapter is in large part based on a chapter of a
+book published by Elsevier. We thank Elsevier for the permission to reproduce this
+material. COPYRIGHT NOTICE: Reprinted from "Analysis of Complex Disease
+Association Studies: A Practical Guide", Yurii Aulchenko,
+chapter "Effects of Population Structure in Genome-wide Association Studies",
+123-156, Copyright 2011, with permission from Elsevier}
 
-{\bf The rest of this chapter is ?temporarily? deleted due to potential copyright issues}
+\section{Genetic structure of populations}
+\label{sec:genstruct}
 
+A major unit of genetic structure is a 
+genetic population. Different definitions 
+of genetic population are available,  
+for example 
+\href{http://en.wikipedia.org/wiki/Population}{Wikipedia 
+defines population (biol.)} 
+as ''the collection of inter-breeding organisms of a 
+particular species''. The genetics of populations is 
+\href{http://en.wikipedia.org/wiki/Population_genetics}{
+''the study of the allele frequency distribution and change 
+under the influence of \ldots evolutionary processes''}. 
+\index{population genetics}
+In the framework of population genetics, the main 
+characteristics of interest of a group of 
+individuals are their genotypes, frequencies of alleles
+in this group, and the dynamics of these distributions 
+in time. 
+While the units of interest of population genetics 
+are alleles, the units that evolutionary processes 
+act upon are organisms. 
+Therefore a definition of a genetic population should
+be based on the chance that different alleles, present 
+in the individuals in question, can mix together;
+if such chance is zero, 
+we may consider such groups as different populations, 
+each described by its own genotypic and allelic 
+frequencies and their dynamics.
+Based on these considerations, a genetic 
+population may be defined 
+in the following way: 
+
+\emph{
+Two individuals, $I_1$ and $I_2$, belong to the same 
+population if (a) the probability that they would 
+have an offspring in common is greater than zero and 
+(b) this probability is much higher than the probability 
+of $I_1$ and $I_2$ having an offspring in common with 
+some individual $I_3$, which is said to belong to another 
+genetic population.}
+\index{genetic population!prospective definition}
+
+Here, to have an offspring in common
+does not imply having a direct offspring, but rather a 
+common descendant in a number of generations. 
+
+However, in gene discovery in general and GWA studies 
+in particular we are usually not interested 
+in the future dynamics of allele and genotype distributions. 
+The matter of concern in genetic association 
+studies is potential common 
+ancestry -- that is, that individuals 
+may share common ancestors and thus share 
+alleles which are exact copies of the same ancestral 
+allele. Such alleles are called ''identical-by-descent'', 
+or IBD for short.\index{identity by descent}\index{IBD}
+If the chance of IBD is high, this reflects high degree 
+of genetic relationship. 
+As a rule, relatives 
+share many features, both environmental and genetic, 
+which may lead to confounding. 
+
+Genetic relationship between a pair of individuals 
+is quantified using the ''coefficient of kinship'', 
+which measures the chance that gametes, sampled 
+at random from these individuals, are IBD.\index{coefficient!of
+kinship}\index{kinship!coefficient}\label{def:kinship} 
+
+Thus for the purposes of gene discovery 
+we can define a genetic population 
+in retrospective terms, based on the 
+concept of IBD: 
+
+\emph{
+Two individuals, $I_1$ and $I_2$, belong to the same 
+genetic population if (a) their genetic relationship, measured 
+with the coefficient of kinship, 
+is greater than zero and (b) their kinship is much higher
+than kinship between them and some individual $I_3$, which is
+said to belong to another genetic population.}
+\index{genetic population!retrospective definition}
+\label{def:population}
+
+One can see that this definition is quantitative and 
+rather flexible (if not to say arbitrary): what we call 
+a ''population'' depends on the choice of the threshold 
+for the ''much-higher'' probability. Actually, what 
+you define as ''the same'' genetic population depends 
+in large part on the scope and aims of your study. 
+In human genetics literature you may find references to 
+a particular genetically isolated population, population of some 
+country (e.g. ''German population'', ''population of United Kingdom''), 
+European, Caucasoid or even general human population. Defining a 
+population is about deciding on some probability threshold. 
+
+In genetic association studies, it is frequently assumed that 
+study participants are ''unrelated'' and ''come from the same 
+genetic population''.  Here, ''unrelated'' means, that while 
+study participants come from the same population (so, there is 
+non-zero kinship between them!), this kinship is so low that it 
+has very little effect on the statistical testing procedures 
+used to study association between genes and phenotypes. 
+
+In the following sections we will consider the effects of population 
+structure on the distribution of genotypes in a study population. 
+We will start with the assumption of zero kinship between study 
+participants, which will allow us to formulate the Hardy-Weinberg principle
+(section \ref{subsec:HWE}). 
+In effect, there is no such thing as zero kinship between 
+any two organisms; however, when kinship is very low, the effects 
+of kinship on genotypic distribution are minimal, as we will see in 
+section \ref{subsec:inbreeding}. The effects of substructure -- 
+that is, when the study sample consists of several genetic populations -- 
+on the genotypic distribution will be considered in section \ref{subsec:wahlund}.
+Finally, we will generalize the obtained results to the 
+case of arbitrary structures and will see what the effects 
+of kinship are on the joint distribution of genotypes and phenotypes 
+in section \ref{subsec:phenocorr}. 
+
+\subsection{Hardy-Weinberg equilibrium}
+\label{subsec:HWE}
+To describe the genetic structure of populations 
+we will use a rather simplistic model
+approximating genetic processes in natural populations. Firstly, we will 
+assume that the population under consideration has infinitely 
+large size, which implies that we can work in terms of probabilities, 
+and no random processes take place. 
+Secondly, we accept the non-overlapping 
+
+$$\textrm{generation}  \Rightarrow \textrm{gametic pool} \Rightarrow \textrm{generation}$$
+\index{generation -- gametic pool -- generation model}
+\label{ggpg_model}
+
+\noindent model. This model assumes that a set of individuals 
+contributes gametes to the gametic pool, and dies out. The gametes 
+are sampled randomly from this pool in pairs to form individuals 
+of the second generation. Selection acts on individuals, while 
+mutation occurs when the gametic pool is formed. The key point 
+of this model is the abstraction of the gametic pool: if you use it, 
+you do not need to consider all pair-wise matings between male and 
+female individuals; you rather consider some abstract infinitely 
+large pool, to which gametes are contributed with a frequency 
+proportional to that in the previous generation. Interestingly, this 
+rather artificial construct has a great potential to describe 
+the phenomena we indeed observe in nature. 
+
+In this section, we will derive the Hardy-Weinberg law (the analog 
+of Mendel's laws for populations). The question to be 
+answered is: if some alleles at some locus segregate  
+according to Mendel's laws and aggregate totally at random, what 
+would be the genotypic distribution in a population? 
+
+Let us consider two alleles, wild type normal allele ($N$) and 
+a mutant ($D$), segregating at some locus in the population 
+and apply the ''generation $\Rightarrow$ gametic pool $\Rightarrow$ 
+generation'' model. 
+Let us denote the frequency of the $D$ allele in the gametic 
+pool as $q$, and the frequency of the other allele, $N$, as
+$p=1-q$.  
+Gametes containing alleles $N$ and $D$ are sampled at random to 
+form diploid individuals of the next generation. 
+The probability to sample a ''$N$'' gamete is $p$, and the 
+probability that the second sampled gamete is also ''$N$'' is 
+also $p$. According to the rule, which states that joint probability 
+of two independent events is a product of their probabilities, 
+the probability to sample ''$N$'' and ''$N$'' is 
+$p \cdot p = p^2$. In the same manner, the probability to 
+sample ''$D$'' and then ''$D$'' is $q \cdot q = q^2$. The 
+probability to sample first the mutant and then the normal allele 
+is $q \cdot p$; the same is the probability to 
+sample ''$N$'' first and ''$D$'' second. In most situations, we 
+do not (and can not) distinguish heterozygous genotypes $DN$ 
+and $ND$ and refer to both of them as ''$ND$''. In this 
+notation, frequency of $ND$ will be 
+$q \cdot p + p \cdot q = 2 \cdot p \cdot q $. 
+Thus, we have computed the genotypic distribution for a population 
+formed from a gametic pool in which the frequency of $D$ allele 
+was $q$. 
+
+To obtain the next generation, the next gametic pool is generated. 
+The frequency of $D$ in the next gametic pool is 
+$q^2 + \frac{1}{2}\cdot 2 \cdot p \cdot q$. 
+Here, $q^2$ is the probability that a gamete-contributing 
+individual has genotype $DD$; $2\cdot p \cdot q$ is the probability that 
+a gamete-contributing individual is $ND$, and $\frac{1}{2}$ is 
+the probability that $ND$ individual contributes $D$ allele
+(only half of the gametes contributed by individuals with $ND$ 
+genotype are $D$); see Figure \ref{fig:allelic_freq}. 
+Thus the frequency of $D$ in the gametic pool is 
+$q^2 + \frac{1}{2}\cdot 2 \cdot p \cdot q = q \cdot (q + p) = q$
+-- exactly the same as it was in previous gametic pool. 
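+
+As a quick numerical check of this result (the value $q=0.3$ is 
+an arbitrary choice made here for illustration only), the genotypic 
+frequencies and the allelic frequency in the next gametic pool are
+
+$$\begin{array}{ll}
+P(NN) &= 0.7^2 = 0.49, \quad P(ND) = 2 \cdot 0.7 \cdot 0.3 = 0.42, \quad P(DD) = 0.3^2 = 0.09, \\
+P(D) &= P(DD) + \frac{1}{2} \cdot P(ND) = 0.09 + 0.21 = 0.30,
+\end{array}$$
+
+\noindent so the frequency of $D$ is indeed unchanged.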
+
+\begin{figure}
+\center
+\includegraphics[width=0.80\textwidth]{allelic_freq}
+\caption{
+Genotypic and allelic frequency distribution in a 
+population; $q=P(D)=P(DD)+\frac{1}{2}\cdot P(DN)$.
+}
+\label{fig:allelic_freq}
+\end{figure}
+
+Thus, if the assumptions of random segregation 
+and aggregation hold, the expected frequencies of the $NN$, $ND$ 
+and $DD$ genotypes are stable over generations and 
+can be related to the allelic frequencies using the 
+following relation   
+
+\begin{equation}
+\label{eq:HWE2}
+\begin{array}{lll}
+P(NN) &= (1-q) \cdot (1-q) &= p^2, \\
+P(ND) &= q\cdot (1-q) + (1-q) \cdot q & = 2 \cdot p \cdot q, \\
+P(DD) &= q \cdot q &= q^2
+\end{array}
+\end{equation}
+which is known as the Hardy-Weinberg equilibrium (HWE) point.
+\label{Hardy-Weinberg equilibrium}
+
+
+There are many reasons why random segregation and 
+aggregation, and, consequently, Hardy-Weinberg equilibrium, 
+may be violated. It is very important to
+realize that, especially if the study participants are believed 
+to come from the same genetic population, most of the time when 
+deviation from HWE is detected, this 
+deviation is due to technical reasons, i.e. genotyping 
+error. Therefore testing for HWE is a part of the 
+genotypic quality control procedure in most studies. 
+Only when the possibility of technical errors is 
+eliminated may other possible explanations be 
+considered.
+In a case when deviation from HWE can not be explained 
+by technical reasons, the most frequent explanation would 
+be that the sample tested is composed of representatives 
+of different genetic populations, or has a more subtle 
+genetic structure. However, unless study participants 
+represent a mixture of very distinct genetic 
+populations -- which is unlikely to go unnoticed -- the 
+effects of genetic structure on HWE 
+are difficult to detect, at least for any single marker, 
+as you will see in the next sections. 
+\index{deviation from Hardy-Weinberg equilibrium}
+\index{Hardy-Weinberg equilibrium!deviation from}
+
+\subsection{Inbreeding}
+\label{subsec:inbreeding}
+
+Inbreeding is preferential breeding between (close) relatives.\index{inbreeding} 
+An extreme example of inbreeding is selfing, a breeding system 
+observed in some plants. Inbreeding is not uncommon in animal 
+and human populations. Here, the main reasons 
+for inbreeding are usually geographical (e.g. mice live in 
+very small interbred colonies -- demes -- which are usually 
+established by a few mice and are quite separated 
+from other demes) or cultural (e.g. noble families
+of Europe). 
+
+Clearly, such preferential breeding between relatives 
+violates the assumption of random aggregation underlying the 
+Hardy-Weinberg principle. Relatives are likely to share the 
+same alleles, inherited from common ancestors. Therefore 
+their progeny has an increased chance of being 
+\emph{autozygous}\index{autozygosity} -- that is to 
+inherit a copy of exactly the same ancestral allele 
+from both parents. An autozygous genotype is always 
+homozygous, therefore inbreeding should increase the 
+frequency of homozygous, and decrease the frequency of 
+heterozygous, genotypes.
+
+Inbreeding is quantified by the \emph{coefficient of 
+inbreeding},\index{coefficient!of inbreeding}\index{inbreeding!coefficient of}
+which is defined as the probability of autozygosity. 
+This coefficient may characterize an individual, or 
+a population in general, in which case it is the expectation 
+that a random individual from the population is 
+autozygous at a random locus. The coefficient of 
+inbreeding is closely related to the coefficient of 
+kinship, defined earlier for a pair of individuals as
+the probability that two alleles sampled 
+at random from these individuals, are IBD. It is easy to see
+that the coefficient of inbreeding for a person is 
+the same as the kinship between its parents.
+\index{coefficient!of inbreeding, relation to kinship}
+\index{coefficient!of kinship, relation to inbreeding}
+
+\begin{figure}
+\center
+\includegraphics[width=1.00\textwidth]{inbred_family}
+\caption{Inbred family structure (A) and probability of 
+individual ''J'' being autozygous for the ''Red'' ancestral 
+allele
+}
+\label{fig:inbred_family}
+\end{figure}
+
+Let us compute the inbreeding coefficient for the person {\bf J}
+depicted in Figure \ref{fig:inbred_family}. {\bf J} is a child 
+of {\bf G} and {\bf H}, who are cousins. {\bf J} could be autozygous 
+for, say, the ''red'' allele of the founder great-grandparent {\bf A}, 
+which could have been transmitted through the meioses 
+{\bf A $\Rightarrow$ D}, {\bf D $\Rightarrow$ G}, and
+{\bf G $\Rightarrow$ J}, and also through the path 
+{\bf A $\Rightarrow$ E}, {\bf E $\Rightarrow$ H}, and
+{\bf H $\Rightarrow$ J} (Figure \ref{fig:inbred_family} {\bf B}).
+What is the chance for {\bf J} to be autozygous for the 
+''red'' allele? The probability that this particular founder 
+allele is transmitted to {\bf D} is $1/2$, the same is the probability 
+that the allele is transmitted from {\bf D} to {\bf G}, and 
+the probability that the allele is transmitted from 
+{\bf G} to {\bf J}. Thus the probability that the ''red'' allele 
+is transmitted from {\bf A} to {\bf J} is $1/2 \cdot 1/2 \cdot 1/2 = 1/2^3 = 1/8$.
+The same is the chance that the allele is transmitted from 
+{\bf A} to {\bf E} to {\bf H} to {\bf J}, therefore the probability 
+that {\bf J} would be autozygous for the red allele is 
+$1/2^3 \cdot 1/2^3 = 1/2^6 = 1/64$. However, we are interested in 
+autozygosity for any founder allele; and there are four such 
+alleles (''red'', ''green'', ''yellow'' and ''blue'', figure 
+\ref{fig:inbred_family} {\bf B}). For any of these the probability 
+of autozygosity is the same, thus the total probability of 
+autozygosity for {\bf J} is $4\cdot 1/64 = 1/2^4 = 1/16$.  
+
+Now we shall estimate the expected genotypic probability 
+distribution for a person characterized with some 
+arbitrary coefficient of inbreeding, $F$ -- or for a population 
+in which average inbreeding is $F$. Consider a locus with two 
+alleles, $A$ and $B$, with frequency of $B$ denoted as $q$, and 
+frequency of $A$ as $p=1-q$. If the person is autozygous 
+for some founder allele, the founder allele may be either 
+$A$, leading to autozygous genotype $AA$, or the founder 
+allele may be $B$, leading to genotype $BB$. The chance that 
+the founder allele is $A$ is $p$, and the chance that the 
+founder allele is $B$ is $q$. If the person 
+is not autozygous, then the expected genotypic frequencies 
+follow HWE. Thus, the probability of genotype 
+$AA$ is $(1-F)\cdot p^2 + F\cdot p$, where the first term corresponds 
+to probability that the person is $AA$ given it is not inbred ($p^2$), 
+multiplied by the probability that it is not inbred ($1-F$), and 
+the second term corresponds to probability that a person is 
+$AA$ given it is inbred ($p$), multiplied by the probability that the 
+person is inbred ($F$). These computations can be easily done for all 
+genotypic classes leading to the expression for HWE under inbreeding.
+
+\begin{equation}
+\label{eq:HWE_inbreeding}
+\begin{array}{lll}
+P(AA) &=(1-F)\cdot p^2 + F \cdot p  &=p^2+p\cdot q\cdot F \\
+P(AB) &=(1-F)\cdot 2\cdot p\cdot q + F \cdot 0  &=2\cdot p\cdot q\cdot (1-F) \\
+P(BB) &=(1-F)\cdot q^2 + F\cdot q  &=q^2+p\cdot q\cdot F \\
+\end{array}
+\end{equation}
+\index{Hardy-Weinberg equilibrium!under inbreeding}
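+
+For illustration, a sketch of this calculation for $q=0.5$ and $F=0.05$ 
+(values chosen here only as an example) gives 
+
+$$\begin{array}{ll}
+P(AA) &= 0.25 + 0.5 \cdot 0.5 \cdot 0.05 = 0.2625, \\
+P(AB) &= 2 \cdot 0.5 \cdot 0.5 \cdot (1-0.05) = 0.4750, \\
+P(BB) &= 0.25 + 0.5 \cdot 0.5 \cdot 0.05 = 0.2625,
+\end{array}$$
+
+\noindent which sum to one; compared to HWE ($0.25$, $0.50$, $0.25$), 
+each homozygous class gains $p\cdot q\cdot F = 0.0125$ and the heterozygous 
+class loses $0.025$.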
+
+How much is inbreeding expected to modify genotypic distribution 
+in human populations? The levels of inbreeding observed in 
+human genetically isolated populations typically 
+vary between $0.001$ (low inbreeding) and $0.05$ (relatively high), 
+see \cite{rudan2003,pardo2005}. The genotypic distribution 
+for $q=0.5$ and different values of the inbreeding coefficient is 
+shown in Figure \ref{fig:HWE_under_inbreeding}.
+
+\begin{figure}
+\center
+\includegraphics[width=1.00\textwidth]{HWE_under_inbreeding}
+\caption{
+Genotypic probability distribution for a locus with 50\% frequency of 
+the $B$ allele; black bar, no inbreeding; red, $F=0.001$; green, $F=0.01$; 
+blue, $F=0.05$
+}
+\label{fig:HWE_under_inbreeding}
+\end{figure}
+
+What is the power to detect deviation from HWE due to inbreeding?
+For that, we need to estimate the expectation of 
+the $\chi^2$ statistic (the non-centrality parameter, NCP) used 
+to test for HWE. The test for HWE is performed using the standard formula
+
+\begin{equation}
+\label{eq:chi2}
+T^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}
+\end{equation}
+\index{Hardy-Weinberg equilibrium!$\chi^2$ test}
+\index{test!for Hardy-Weinberg equilibrium}
+where summation is performed over all classes (genotypes); $O_i$ is 
+the count observed in $i$-th class, and $E_i$ is the count expected 
+under the null hypothesis (HWE). Under the null hypothesis, this 
+test statistic is distributed as $\chi^2$ with number of degrees of 
+freedom equal to the number of genotypes minus the number of alleles.
+
+Thus the expectation of this test statistic for some $q$, $F$, and $N$
+(sample size) is 
+
+\begin{equation}
+\label{eq:exp_chi2_HWE_F}
+\begin{array}{ll}
+E[T^2] &= \frac{(N(q^2+p q F)-N q^2)^2}{N  q^2}
++ \frac{(N2pq(1-F)-N2pq)^2}{N2pq}
++ \frac{(N(p^2+p q F)-N p^2)^2}{N  p^2} \\
+ &= \frac{(NpqF)^2}{N  q^2} + \frac{(-2NpqF)^2}{N2pq} + \frac{(NpqF)^2}{N  p^2} \\
+ & = Np^2F^2+2NpqF^2+Nq^2F^2 \\
+ & = NF^2(p^2+2pq+q^2) \\
+ & = N\cdot F^2
+\end{array}
+\end{equation}
+
+Interestingly, the non-centrality parameter does not depend on the 
+allelic frequency. Given the non-centrality parameter, it is easy 
+to compute the power to detect deviation from HWE for any given $F$. 
+For example, to achieve the power of $>0.8$ at $\alpha=0.05$, for a test 
+with one degree of freedom the non-centrality parameter should 
+be $>7.85$. Thus, if $F=0.05$, to have 80\% power, 
+$N\cdot F^2 > 7.85$, that is the required sample size should be 
+$N > \frac{7.85}{F^2} = \frac{7.85}{0.0025} = 3140$ people.
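+
+The threshold of $7.85$ used above can be reproduced, as a sketch under 
+the usual normal approximation for a test with one degree of freedom (an 
+assumption made here, which ignores the negligible contribution of the 
+opposite tail), from the standard normal quantiles 
+$z_{1-\alpha/2}=z_{0.975}\approx 1.960$ and $z_{0.8}\approx 0.842$:
+
+$$\textrm{NCP} \approx (z_{1-\alpha/2} + z_{0.8})^2 \approx (1.960 + 0.842)^2 \approx 7.85 .$$
+
+\noindent With the same threshold, $F=0.01$ would require 
+$N > 7.85/0.0001 = 78500$ people.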
+
+Thus, even in populations with strong inbreeding, rather 
+large sample sizes are required to detect the effects 
+of inbreeding on HWE at a particular locus, even at the relatively 
+lenient significance level of 5\%.
+
+While the chance that deviation from HWE due to inbreeding 
+will be statistically significant is relatively small, 
+inbreeding may have clear effects on the results of HWE 
+testing in a GWA study. Basically, if testing is performed 
+at a threshold corresponding to nominal significance $\alpha$, 
+the proportion of markers which show significant deviation 
+will be larger than $\alpha$. Clearly, how large this proportion 
+will be depends on the inbreeding and on the size of the study -- the 
+expectation of $T^2$ is a function of both $N$ and $F$.
+The proportion of markers showing significant deviation from 
+HWE at different values of inbreeding, sample size, and 
+nominal significance threshold is shown in table 
+\ref{tab:t1e_hwe_underF}. 
+While deviation of this proportion from nominal one is 
+minimal at large $\alpha$'s and small sample sizes 
+and coefficients of inbreeding, it may be 10-fold and 
+even 100-fold higher than the nominal level at reasonable 
+values of $N$ and $F$ for smaller thresholds. 
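+
+The entries of table \ref{tab:t1e_hwe_underF} can be approximated, as a 
+sketch assuming that $T^2$ follows a non-central $\chi^2$ distribution with 
+one degree of freedom and non-centrality parameter $N\cdot F^2$, through the 
+standard normal distribution function $\Phi$ (the symbols $\Phi$ and 
+$z_{1-\alpha/2}$ are notation introduced here for illustration):
+
+$$P(T^2 > z_{1-\alpha/2}^2) = 1 - \Phi(z_{1-\alpha/2} - \sqrt{N}\cdot F) + \Phi(-z_{1-\alpha/2} - \sqrt{N}\cdot F).$$
+
+\noindent For example, for $N=1000$, $F=0.01$ and $\alpha=0.05$ we have 
+$\sqrt{N}\cdot F \approx 0.316$ and $z_{0.975}\approx 1.960$, so that the 
+proportion is approximately 
+$1-\Phi(1.644)+\Phi(-2.276) \approx 0.0501 + 0.0114 = 0.0615$. 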
+
+\begin{table} %\renewcommand{\arraystretch}{2}\addtolength{\tabcolsep}{-1pt}
+\centering
+\caption{Expected proportion of markers deviating from HWE in a sample 
+of $N$ people coming from a population with average
+inbreeding $F$. Proportion of markers is shown for 
+particular test statistic threshold, corresponding to 
+nominal significance $\alpha$.}
+\label{tab:t1e_hwe_underF}
+{\setlength{\tabcolsep}{3mm}
+\begin{tabular}{lcccc}
+\hline
+	&	&	& $\alpha$	&\\
+\cline{3-5}
+$N$	& $F$	& 0.05	& $10^{-4}$	& $5\cdot 10^{-8}$ \\
+\hline
+	& 0.001 & 0.0501 & $1.008\cdot 10^{-4}$ & $5.077\cdot 10^{-8}$ \\
+1,000	& 0.005 & 0.0529 & $1.205\cdot 10^{-4}$ & $7.025\cdot 10^{-8}$ \\
+	& 0.010 & 0.0615 & $1.885\cdot 10^{-4}$ & $14.503\cdot 10^{-8}$ \\
+\hline
+	& 0.001 & 0.0511 & $1.081\cdot 10^{-4}$ & $5.784\cdot 10^{-8}$ \\
+10,000	& 0.005 & 0.0790 & $3.544\cdot 10^{-4}$ & $36.991\cdot 10^{-8}$ \\
+	& 0.010 & 0.1701 & $19.231\cdot 10^{-4}$ & $426.745\cdot 10^{-8}$ \\
+\hline
+\end{tabular}
+}
+\end{table}
+
+
+\subsection{Mixture of genetic populations: Wahlund's effect}
+\label{subsec:wahlund}
+
+Consider the following artificial example. Imagine that 
+recruitment of study participants occurs at a hospital, 
+which serves two equally sized villages ($V_1$ and $V_2$); 
+however, the villages are very distinct because of cultural 
+reasons, and most marriages occur within a village. Thus 
+these two villages represent two genetically distinct 
+populations. Let us consider a locus with two alleles, 
+$A$ and $B$. The frequency of $A$ is $0.9$ in $V_1$ and 
+it is $0.2$ in $V_2$. In each population, marriages 
+occur at random, and HWE holds for the locus. What 
+genotypic distribution is expected in a sample 
+ascertained in the hospital, which represents a $1:1$ 
+mixture of the two populations?
+
+The expected genotypic proportions are presented in 
+table \ref{tab:mixpop}. First, assuming that HWE holds 
+for each of the populations, we can compute genotypic 
+proportions within these (rows 1 and 2 of table 
+\ref{tab:mixpop}). If our sample represents a 
+$1:1$ mixture of these populations, then the frequency 
+of some genotype is also $1:1$ mixture of the respective 
+frequencies. For example, frequency of $AA$ genotype 
+would be $\frac{0.81}{2} + \frac{0.04}{2} = 0.425$, 
+and so on. The frequency of the $A$ allele in pooled 
+sample will be $0.425 + \frac{0.25}{2} = 0.55$. Based 
+on this frequency we would expect, under HWE, a genotypic frequency 
+distribution of approximately $0.3$, $0.5$ and $0.2$ for $AA$, $AB$, and 
+$BB$, respectively. As you can see, the observed distribution 
+has much higher frequencies of homozygous genotypes -- an excess 
+of homozygotes.
+
+It is notable that the differences between the observed 
+homozygote frequencies and those expected under HWE 
+are both about 0.125 (more precisely 0.1225, if the expected frequencies 
+are not rounded), and, consequently, the observed heterozygosity 
+is less than that expected by about $0.125\cdot 2 = 0.25$. 
+
+The phenomenon of deviation from HWE due to the fact that 
+the population considered consists of two sub-populations 
+is known as 
+\href{http://en.wikipedia.org/wiki/Wahlund_effect}{''Wahlund's effect''}\index{Wahlund's effect}, 
+after the scientist who first considered and quantified  
+the genotypic distribution under such a model~\cite{wahlund1928}.
+
+\begin{table}
+\centering
+\caption{Genotypic proportions in a mixed population}
+\label{tab:mixpop}
+\begin{tabular}{lccccc}
+\hline
+Village	& \%Sample	& $p(A)$	& $P(AA)$	& $P(AB)$	& $P(BB)$	\\	
+\hline
+$V_1$	& 50		& 0.9		& 0.81		& 0.18		& 0.01		\\	
+$V_2$	& 50		& 0.2		& 0.04		& 0.32		& 0.64		\\	
+	& 		& 		& 		& Observed	& 		\\	
+\hline
+Pooled	& 100		& 0.55		& 0.425		& 0.25		& 0.325		\\	
+	& 		& 		& 		& Expected	& 		\\	
+\cline{4-6}
+	& 		& 		& 0.30		& 0.50		& 0.20		\\	
+	& 		& 		& 		& Difference	& 		\\	
+\cline{4-6}
+	& 		& 		& 0.125		& $-0.250$		& 0.125		\\	
+\hline
+\end{tabular}
+\end{table}
+
+Such marked differences between the observed distribution and that expected under HWE are 
+very easily detected; for the above example, a sample of $\approx 35$ people 
+is enough to reject the hypothesis of HWE (power $>80\%$ at 
+$\alpha = 0.05$). 
+
+However, the differences we can see in real life are not so marked. 
+For example, the common Pro allele at position 12 of the peroxisome 
+proliferator-activated receptor gamma is associated with increased 
+risk for type 2 diabetes. The frequency of the Pro allele is about 
+85\% in European populations and Caucasian-Americans, about 97\% in 
+Japan and 99\% in African-American (see table $1$ from \cite{Ruiz2005}). 
+Table \ref{tab:mixpop_pparg} shows hypothetical observed and expected 
+genotypic proportions in a sample composed of 50\% Caucasians and 
+50\% African-American.  
+
+\begin{table}
+\centering
+\caption{Genotypic proportions of $PPAR\gamma$ $Pro12Ala$ genotype in a mixed population}
+\label{tab:mixpop_pparg}
+\begin{tabular}{lccccc}
+\hline
+Ethnicity	& \%Sample	& $p(Pro)$	& $P(Pro/Pro)$	& $P(Pro/Ala)$	& $P(Ala/Ala)$	\\	
+\hline
+Caucasian	& 50		& 0.85		& 0.7225	& 0.2550	& 0.0225	\\	
+African-American	& 50		& 0.99		& 0.9801	& 0.0198	& 0.0001	\\	
+		& 		& 		& 		& Observed	& 		\\	
+\hline
+Pooled		& 100		& 0.92		& 0.8513	& 0.1374	& 0.0113	\\	
+		& 		& 		& 		& Expected	& 		\\	
+\cline{4-6}
+		& 		& 		& 0.8464	& 0.1472	& 0.0064	\\	
+		& 		& 		& 		& Difference	& 		\\	
+\cline{4-6}
+		& 		& 		& $0.0049$	& $-0.0098$	& $0.0049$	\\	
+\hline
+\end{tabular}
+\end{table}
+
+You can see that the observed distribution and the one expected 
+under HWE are very similar; only a sample as large as 1,800 
+people would allow detection of the deviation from HWE (power 
+$>80\%$ at $\alpha = 0.05$). The situation is similar for 
+most genes observed in real life -- while the frequencies 
+may (or may not) be very different for populations which 
+diverged a long time ago, for relatively close populations 
+the expected frequency differences are small and large sample 
+sizes are required to detect deviation from HWE due to 
+Wahlund's effect at a particular fixed locus. 
+
+Let us summarize what genotypic proportions are expected 
+in a sample which is a mixture of two populations. Let 
+each population be in HWE, and the frequency of the $B$ 
+allele be $q_1$ in population one and $q_2$ in population
+two. Let the proportion of individuals coming from 
+population one be $m$ in the mixed population, and consequently 
+the proportion of individuals from population two be $(1-m)$.
+The allelic frequencies and genotypic distributions 
+in the original and mixed populations are presented in 
+table \ref{tab:mixpop_general}.
+ 
+\begin{table}
+\centering
+\caption{Expected genotypic proportions in a mixed population; $F_{st}$ is 
+defined by equation \ref{eq:wahlundsD}}
+\label{tab:mixpop_general}
+\begin{tabular}{lccccc}
+\hline
+Population & Prop.	& $p(B)$	& $P(AA)$	& $P(AB)$	& $P(BB)$	\\	
+\hline
+$P_1$	& $m$		& $q_1$		& $p_1^2$	& $2 p_1  q_1$ & $q_1^2$		\\	
+$P_2$	& $(1-m)$	& $q_2$		& $p_2^2$	& $2 p_2  q_2$ & $q_2^2$		\\	
+	& 		& 		& 		& Observed	& 		\\	
+\hline
+Pooled	& 1.0		& $\overline{q}=m q_1 $	& $m p_1^2 $ & $2 m p_1  q_1$ & $m q_1^2 $ \\	
+& 		& $+ (1-m) q_2$;	& $+ (1-m) p_2^2$; & $+2 (1-m) p_2 q_2$; & $ + (1-m) q_2^2$ \\	
+	& 		& 		& 		& Expected	& 		\\	
+\cline{4-6}
+	& 		& 		& $\overline{p}^2$ & $2 \overline{p} \overline{q}$ & $\overline{q}^2$\\	
+	& 		& 		& 		& Difference	& 		\\	
+\cline{4-6}
+	& 		& 		& $\overline{p} \overline{q} F_{st}$		& $-2 \overline{p} \overline{q} F_{st}$		& $\overline{p} \overline{q} F_{st}$		\\	
+\hline
+\end{tabular}
+\end{table}
+
+The frequency of the $B$ allele in the mixed population is just the weighted 
+average of the allelic frequencies in the two populations, 
+$\overline{q}=m\cdot q_1+(1-m)\cdot q_2$. Let us denote the frequency 
+of the $A$ allele as $\overline{p}=1-\overline{q}$.
+It can be demonstrated that the genotypic frequency distribution in the 
+mixed sample is a function of the frequency of allele $B$ in the sample, 
+$\overline{q}$, and the ''disequilibrium'' parameter $F_{st}$:
+ 
+\begin{equation}
+\label{eq:HWE_wahlund}
+\begin{array}{ll}
+P(AA) &= \overline{p}^2+\overline{p}\cdot \overline{q}\cdot F_{st} \\
+P(AB) &= 2\cdot \overline{p}\cdot \overline{q}\cdot (1-F_{st}) \\
+P(BB) &= \overline{q}^2+\overline{p}\cdot \overline{q}\cdot F_{st} \\
+\end{array}
+\end{equation}
+\index{Hardy-Weinberg equilibrium!under Wahlund's effect}
+where 
+\begin{equation}
+\label{eq:wahlundsD}
+F_{st} = \frac{m\cdot (1-m) \cdot (q_1-q_2)^2}{\overline{p}\cdot \overline{q}}
+\end{equation}
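+
+The step from the pooled frequencies to this form can be sketched for the 
+heterozygous class (the intermediate algebra below is filled in here, with 
+$p_i = 1-q_i$ and $\overline{p}=m p_1+(1-m) p_2$, and uses 
+$p_1-p_2 = -(q_1-q_2)$):
+
+$$\begin{array}{ll}
+m p_1 q_1 + (1-m) p_2 q_2 - \overline{p}\cdot\overline{q} 
+ &= m (1-m) (p_1 q_1 + p_2 q_2 - p_1 q_2 - p_2 q_1) \\
+ &= m (1-m) (p_1-p_2)(q_1-q_2) \\
+ &= -m (1-m) (q_1-q_2)^2 = -\overline{p}\cdot\overline{q}\cdot F_{st},
+\end{array}$$
+
+\noindent so that $P(AB) = 2 m p_1 q_1 + 2 (1-m) p_2 q_2 = 2\cdot\overline{p}\cdot\overline{q}\cdot(1-F_{st})$.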
+
+You can see that equation \ref{eq:HWE_wahlund}, expressing the genotypic 
+frequency distribution under Wahlund's effect, is remarkably similar 
+(actually, is specifically re-written in a form similar) to the 
+equation \ref{eq:HWE_inbreeding}, expressing the genotypic proportions 
+under the effects of inbreeding. Again, the reason is that $F_{st}$ 
+(as well as $F$ of equation \ref{eq:HWE_inbreeding}) is easily estimated 
+from the data as the ratio between the observed and expected variances 
+of the genotypic distributions. Then the expected non-centrality 
+parameter for the test of HWE is simply $N\cdot F_{st}^2$, where 
+$N$ is the sample size. Therefore our results concerning the 
+proportion of tests expected to pass a particular significance threshold 
+when genome-wide data are analyzed (table \ref{tab:t1e_hwe_underF}) hold, 
+with replacement of $F$ with $F_{st}$.  
+
+We can compute that the values of $F_{st}$, corresponding to the 
+population mixtures presented in tables \ref{tab:mixpop} and 
+\ref{tab:mixpop_pparg} are 0.49 and 0.067, respectively, which 
+gives us a shortcut to estimate the sample size required to detect 
+deviation from HWE due to Wahlund's effect (at $\alpha=0.05$ and 
+power 80\%): $N > 7.85/0.49^2 \approx 32$ and $N > 7.85/0.067^2 \approx 1771$.
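+
+As a sketch of where these values come from, substituting the numbers of 
+tables \ref{tab:mixpop} and \ref{tab:mixpop_pparg} into equation 
+\ref{eq:wahlundsD} (with $m=0.5$ in both cases, and $q_1$, $q_2$ taken as 
+the frequencies of the less common allele in each table) gives
+
+$$\begin{array}{ll}
+F_{st} &= \frac{0.25 \cdot (0.1-0.8)^2}{0.55 \cdot 0.45} = \frac{0.1225}{0.2475} \approx 0.4949, \\
+F_{st} &= \frac{0.25 \cdot (0.15-0.01)^2}{0.92 \cdot 0.08} = \frac{0.0049}{0.0736} \approx 0.06658 .
+\end{array}$$
+
+\noindent The sample sizes quoted above correspond to these less rounded values: 
+$7.85/0.4949^2 \approx 32$ and $7.85/0.06658^2 \approx 1771$.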
+
+A typical value of $F_{st}$ for European populations is about 0.002 
+(up to 0.023\cite{nelis2009}); very large sample sizes are required to 
+detect deviation from HWE at any given locus at such small $F_{st}$'s. 
+However, the proportion of markers failing to 
+pass the HWE test in GWA may be visibly inflated (table \ref{tab:t1e_hwe_underF}). 
+
+
+
+
+
+
+
+
+
+
[TRUNCATED]

To get the complete diff run:
    svnlook diff /svnroot/genabel -r 1250

