[adegenet-forum] Genotypes assignment to clusters
Jombart, Thibaut
t.jombart at imperial.ac.uk
Mon Mar 21 16:34:56 CET 2011
Dear Sébastien,
thanks for this very interesting question. To rephrase it: "how to assign individuals from one space to groups defined in another, partially overlapping space"?
The problem is not trivial if we think of it in probabilistic terms. If you used Bayesian or likelihood-based clustering, clusters would be defined by the frequencies of a given set of alleles (say, 'S'). You can compute the probability that an individual comes from cluster xxx (or from a mixture of clusters xxx, yyy, zzz, etc. in admixture models) as long as this individual carries no novel allele (i.e., no allele outside 'S'). If it does, the probability of observing that new allele in the previously defined clusters is, by definition, zero, and thus P=0 for all clusters. Annoying.
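To make the P=0 issue concrete, here is a minimal base-R sketch. It assumes Hardy-Weinberg genotype probabilities within each cluster, which is my simplification and not necessarily what any given clustering program does. The frequencies are computed by hand from locus L1 of the toy data quoted further down, and the names `freq` and `geno_prob` are purely illustrative.

```r
# Allele frequencies at locus L1 in the two reference groups (hand-computed
# from the toy data; allele "220" is never seen there, so its frequency is 0).
freq <- rbind(
  cluster1 = c("220" = 0, "222" = 0.625, "224" = 0.250, "226" = 0.125),
  cluster2 = c("220" = 0, "222" = 0.125, "224" = 0.750, "226" = 0.125)
)

# Assumed model: Hardy-Weinberg proportions within a cluster,
# p^2 for a homozygote and 2pq for a heterozygote.
geno_prob <- function(p, a1, a2) {
  unname(if (a1 == a2) p[a1]^2 else 2 * p[a1] * p[a2])
}

# A genotype made only of alleles already seen in the reference groups
known <- apply(freq, 1, geno_prob, a1 = "222", a2 = "224")

# A genotype carrying the novel allele "220"
novel <- apply(freq, 1, geno_prob, a1 = "220", a2 = "222")

print(known)              # positive in both clusters
print(novel)              # zero in both clusters
print(novel / sum(novel)) # membership probabilities are undefined (0/0)
```

Whatever the other loci look like, a single zero factor of this kind drives the whole multilocus likelihood to zero for every cluster.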
Distance-based methods have a similar problem: if the spaces differ, it is much more difficult to compare one individual to another.
However, we can use the fact that one space is contained within the other: the alleles differentiating pop1 from pop2 are a subset of the alleles of the complete dataset. One approach is to use an analysis that is run on the entire dataset, but that discards everything specific to 'popx' and retains only the differences between 'pop1' and 'pop2'. This can be achieved by regressing the data onto a factor opposing 'popx' to 'non-popx' individuals and keeping the residuals.
####
X <- truenames(trial)$tab # extract table of allele frequencies
popx <- factor(pop(trial)=="popx") # popx vs non-popx
X.res <- apply(X,2, function(e) residuals(lm(e~popx))) # remove 'popx' effect
dapc1 <- dapc(X.res, pop(trial), n.pca=3, n.da=1) # perform dapc
scatter(dapc1)
assignplot(dapc1)
####
The DAPC aims to discriminate all populations of the dataset, but we have tricked the method by removing everything specific to "popx" beforehand. With the toy dataset you sent, "popx" still sits at one extreme of the cline, although I would expect genuinely hybrid populations to fall between the two parental populations.
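For readers who want to see what the regression step does numerically, here is a small base-R sketch on made-up data (the random matrix and the 8/4 split are assumptions for illustration, not the toy dataset). With a single two-level factor, taking residuals of `lm(e ~ popx)` amounts to centring each column within each group, so the mean difference between 'popx' and the rest vanishes in every column.

```r
set.seed(1)

# Hypothetical data: 12 individuals x 5 allele-frequency columns
X <- matrix(runif(60), nrow = 12)
popx <- factor(rep(c(FALSE, TRUE), times = c(8, 4)))  # last 4 rows play 'popx'

# Same step as in the code above: residuals of each column regressed on popx
X.res <- apply(X, 2, function(e) residuals(lm(e ~ popx)))

# Equivalent formulation: subtract the group mean from each value
X.centred <- apply(X, 2, function(e) e - ave(e, popx))

# The two matrices agree up to numerical precision
print(all.equal(unname(X.res), X.centred))

# Column sums within each group are (numerically) zero after the step
print(round(rowsum(X.res, popx), 10))
```

Note that this only removes the mean contrast between 'popx' and the other individuals; variation among individuals within each group is left in place for the DAPC to work with.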
Best regards
Thibaut.
--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - Faculty of Medicine
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Sébastien Puechmaille [s.puechmaille at gmail.com]
Sent: 21 March 2011 13:15
To: adegenet-forum at r-forge.wu-wien.ac.at
Subject: [adegenet-forum] Genotypes assignment to clusters
Dear Thibaut and Adegenet users,
I have a data set with 3 groups of samples (see below): 2 with samples of known origin (pop1 and pop2) and one (popx) with samples that I would like to assign to one of the 2 known populations (pop1 or pop2). For this, I want to run a DAPC on the 'pop1' and 'pop2' data and then assign individuals from 'popx' to either 'pop1' or 'pop2'.
However, individuals from the group to be assigned have some private alleles that are found in neither 'pop1' nor 'pop2', and therefore the assignment cannot work. What would be the best solution to get around this problem?
Should I create dummy individuals in 'pop1' and 'pop2' carrying the private alleles of 'popx'?
Hereafter is a reduced data set to illustrate the problem:
indiv pop L1 L2 L3
Indiv1 pop1 222224 232224 120122
Indiv2 pop1 222226 232226 118120
Indiv3 pop1 222222 232232 120120
Indiv4 pop1 222224 232224 124124
Indiv5 pop2 224224 224224 122122
Indiv6 pop2 224224 224224 124124
Indiv7 pop2 224226 224226 120120
Indiv8 pop2 222224 232224 122124
Indiv9 popx 220222 220232 116118
Indiv10 popx 222224 232224 118120
Indiv11 popx 222226 232226 120120
Indiv12 popx 224224 224224 124124
geno <- read.table("three-pop.txt", header=TRUE)
trial <- df2genind(geno[,3:5], missing=NA, ploidy=2, sep=NULL, ncode=6, ind.names=geno[,1], loc.names=colnames(geno[1,3:5]), pop=geno[,2])
trial@pop.names
split <- seppop(trial)
pop12 <- repool(split$pop1, split$pop2)
pop12@all.names
split$popx@all.names
In this case, 'pop12' has 10 columns in '@tab' while 'split$popx' has 13 columns in '@tab'.
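The 10 versus 13 column counts can be checked without adegenet. The sketch below is plain base R on the toy table above; the helper `alleles_of` is a hypothetical name introduced here, and it simply splits each 6-character genotype into two 3-character alleles.

```r
# Toy data from this message, read as character so genotypes stay intact
geno <- read.table(header = TRUE, colClasses = "character", text = "
indiv pop L1 L2 L3
Indiv1 pop1 222224 232224 120122
Indiv2 pop1 222226 232226 118120
Indiv3 pop1 222222 232232 120120
Indiv4 pop1 222224 232224 124124
Indiv5 pop2 224224 224224 122122
Indiv6 pop2 224224 224224 124124
Indiv7 pop2 224226 224226 120120
Indiv8 pop2 222224 232224 122124
Indiv9 popx 220222 220232 116118
Indiv10 popx 222224 232224 118120
Indiv11 popx 222226 232226 120120
Indiv12 popx 224224 224224 124124
")
loci <- c("L1", "L2", "L3")

# Distinct alleles in a subset of rows, labelled as locus.allele
alleles_of <- function(d) {
  unique(unlist(lapply(loci, function(l) {
    paste(l, c(substr(d[[l]], 1, 3), substr(d[[l]], 4, 6)), sep = ".")
  })))
}

ref     <- alleles_of(geno[geno$pop != "popx", ])  # pop1 + pop2 only
full    <- alleles_of(geno)                        # all three groups
private <- setdiff(alleles_of(geno[geno$pop == "popx", ]), ref)

print(length(ref))
print(length(full))
print(private)  # alleles carried by popx but absent from pop1 and pop2
```

The three private alleles listed by `private` account for the difference between the two column counts.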
Would anyone have a solution or any advice?
Thanks for your help,
Sébastien.