[adegenet-forum] Genotypes assignment to clusters

Mon Mar 21 20:12:13 CET 2011

Dear Thibaut,

Thanks very much for this solution. Indeed, the question was "how to assign
individuals from one space to groups defined in another, partially
overlapping space"?

I've run this analysis with the real dataset (not the toy dataset presented
in the previous e-mail) and compared the results with and without
regression. I should probably mention here that the group 'popx' is a mix
of:
- about 80% of individuals belonging to 'pop1'
- about 15% of individuals belonging to 'pop2'
- about 5% of hybrid individuals between 'pop1' and 'pop2'.

For the data set without regression, I performed a 'normal' DAPC with 3
predefined groups: 'pop1', 'pop2' and 'popx'.
- 'pop1' individuals are clearly differentiated from 'pop2' and 'popx'
- 'pop2' and 'popx' individuals are nearly indistinguishable along
discriminant function 1 (except for a few individuals from 'popx' that in
fact belong to 'pop1')
- the mean agreement between inferred groups and actual groups is 0.62
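
For reference, here is a minimal sketch of how such a score can be computed,
assuming it is the proportion of individuals whose inferred group matches
their actual group (as for the 0.62 above). The vectors 'actual' and
'inferred' are made-up illustrations, not the real data; with a DAPC object
the two would typically be taken from its 'grp' and 'assign' components.

```r
# Toy example: actual and inferred group labels (hypothetical values)
actual   <- factor(c("pop1", "pop1", "pop2", "pop2",
                     "popx", "popx", "popx", "popx"))
inferred <- factor(c("pop1", "pop1", "pop2", "popx",
                     "popx", "pop2", "pop1", "popx"))

# TRUE where the inferred group matches the actual group
match <- as.character(actual) == as.character(inferred)

# Overall proportion of individuals assigned to their actual group
agreement <- mean(match)

# Same proportion, computed separately for each actual group
agreement.by.group <- tapply(match, actual, mean)

print(agreement)
print(agreement.by.group)
```

The per-group values are often more informative than the overall mean, since
a single poorly resolved group can hide behind a good global score.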

For the data set with regression, I performed a 'normal' DAPC on the
residuals of the regression (as detailed in the previous e-mail).
- 'pop1' individuals are again clearly differentiated from 'pop2' and 'popx'
- 'pop2' and 'popx' individuals are much more differentiated along
discriminant function 1 than in the normal DAPC detailed above
- the mean agreement between inferred groups and actual groups is 0.81
(higher than with the normal DAPC)

This accentuated differentiation of 'pop2' and 'popx' individuals seems
rather unexpected, as most individuals from 'popx' are in fact from 'pop2'
(see details above). Also, after the DAPC, each individual gets a membership
probability for each of the 3 groups ('pop1', 'pop2' and 'popx') rather than
for the 2 groups ('pop1' and 'pop2') as intended.
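
A small sketch of what the regression step does to the data may help here.
It uses a made-up allele frequency matrix ('X', 'grp' and all values are
hypothetical) and base R only. Because the only predictor is the 'popx' vs
'non-popx' factor, the residuals are the data centred within each of these
two classes, so the mean of 'popx' and the mean of 'pop1' and 'pop2' taken
together both become zero for every allele.

```r
set.seed(1)  # hypothetical data; none of these values come from the real dataset

# 12 individuals x 4 alleles, frequencies in {0, 0.5, 1} as for diploids
grp <- factor(rep(c("pop1", "pop2", "popx"), each = 4))
X <- matrix(sample(c(0, 0.5, 1), 12 * 4, replace = TRUE), nrow = 12,
            dimnames = list(paste0("Indiv", 1:12), paste0("allele", 1:4)))

# Factor opposing 'popx' to all other individuals
popx <- factor(grp == "popx")

# Regress each allele on that factor and keep the residuals
X.res <- apply(X, 2, function(e) residuals(lm(e ~ popx)))

# Column means of the residuals within 'popx' and within 'non-popx'
means.popx    <- colMeans(X.res[popx == "TRUE", , drop = FALSE])
means.nonpopx <- colMeans(X.res[popx == "FALSE", , drop = FALSE])

print(round(means.popx, 10))
print(round(means.nonpopx, 10))
```

If the real data behave the same way, the regression moves 'popx' to the
joint centre of 'pop1' and 'pop2' instead of leaving it close to 'pop2',
which may be part of why the two appear more separated along discriminant
function 1.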

Given that the original alleles of 'popx' (present in 'popx' but absent
from both source populations, 'pop1' and 'pop2') won't give us any
information about the origin of the 'popx' individuals (i.e. whether they
come from 'pop1' or 'pop2'), could we 'simply' treat these alleles as
missing data when performing the 'normal' DAPC (without regression)? Would
there be an easy way to do that?
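
To make the idea concrete, here is a rough sketch in base R, using locus L1
of the reduced data set from my first e-mail. The 'counts' table only mimics
the individuals x alleles table stored in a genind object; the object names
and the two options shown are assumptions for illustration, not an existing
adegenet function.

```r
# Locus L1 of the reduced data set (6-digit diploid genotypes)
pop <- factor(rep(c("pop1", "pop2", "popx"), each = 4))
L1 <- c("222224", "222226", "222222", "222224",
        "224224", "224224", "224226", "222224",
        "220222", "222224", "222226", "224224")

# Split each genotype into its two 3-digit alleles
a1 <- substr(L1, 1, 3)
a2 <- substr(L1, 4, 6)
alleles <- sort(unique(c(a1, a2)))

# Individuals x alleles table of allele counts (0, 1 or 2)
counts <- sapply(alleles, function(a) (a1 == a) + (a2 == a))
rownames(counts) <- paste0("Indiv", seq_along(L1))

# Alleles never observed in the reference populations pop1 and pop2
is.ref <- pop %in% c("pop1", "pop2")
private <- colSums(counts[is.ref, , drop = FALSE]) == 0

# Option 1: drop these alleles altogether
counts.shared <- counts[, !private, drop = FALSE]

# Option 2: keep the columns but code the private alleles as missing
counts.na <- counts
counts.na[, private] <- NA

print(names(which(private)))
print(dim(counts.shared))
```

With option 1, individuals carrying a private allele no longer have counts
summing to the ploidy at that locus, so I am not sure either option is
statistically sound; the sketch only shows that the bookkeeping is simple.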

Thanks again for your help,

Sébastien.

On 21 March 2011 15:34, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:

> Dear Sébastien,
>
> thanks for this very interesting question. To rephrase it: "how to assign
> individuals from one space to groups defined in another, partially
> overlapping space"?
>
> The problem is not trivial if we think of it in probabilistic terms. If you
> used Bayesian/likelihood-based clustering, clusters would be defined in
> terms of frequencies of a given set of alleles (say, "S"). You can compute
> the probability for an individual to come from cluster xxx (or a mixture of
> clusters xxx, yyy, zzz etc in admixture models) as long as this individual
> does not possess any original allele (i.e., not in 'S'). Otherwise, the
> probability of observing a new allele in the previously defined clusters
> is, by definition, zero, and thus P=0 for all clusters. Annoying.
>
> Distance-based methods have a similar problem: if the spaces differ, it is
> much more difficult to compare one individual to another.
>
> However, we can use the fact that one space is contained within another,
> namely, the alleles differentiating pop1 /vs/ pop2 are a subset of the
> alleles of the complete dataset. One approach is to use an analysis that we
> could run on the entire dataset, but that would exclude all originality of
> 'popx', and only conserve differences between 'pop1' and 'pop2'. This can be
> achieved by regressing the data onto a factor opposing 'popx' to 'non-pop-x'
> individuals.
>
> ####
> X <- truenames(trial)$tab # extract table of allele frequencies
> popx <- factor(pop(trial)=="popx") # popx vs non-popx
> X.res <- apply(X,2, function(e) residuals(lm(e~popx))) # remove 'popx' effect
>
> dapc1 <- dapc(X.res, pop(trial), n.pca=3, n.da=1) # perform dapc
> scatter(dapc1)
> assignplot(dapc1)
> ####
>
> The DAPC aims to discriminate all populations of the dataset, but we
> tricked the method by removing all originality specific to "popx"
> beforehand. With the toy dataset you sent, "popx" is still at one
> extreme of the cline, but I suspect that in practice hybrid populations
> should fall between the two parental populations.
>
>
> Best regards
>
> Thibaut.
>
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - Faculty of Medicine
> St Mary’s Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
>
> ________________________________________
> From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [
> adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Sébastien
> Puechmaille [s.puechmaille at gmail.com]
> Sent: 21 March 2011 13:15
> To: adegenet-forum at r-forge.wu-wien.ac.at
> Subject: [adegenet-forum] Genotypes assignment to clusters
>
> Dear Thibaut and Adegenet users,
>
> I have a data set with 3 groups of samples (see below), 2 with samples of
> known origin (pop1 and pop2) and one (popx) with samples that I would like
> to assign to one of the 2 known populations (pop1 or pop2). For this, I want
> to run a DAPC with 'pop1' and 'pop2' data set and then, assign individuals
> from 'popx' to either 'pop1' or 'pop2'.
>
> However, individuals from the group to be assigned have some private
> alleles that are neither in 'pop1' nor in 'pop2' and therefore, the
> assignment cannot work. What would be the best solution to get around this
> problem?
> Should I create dummy individuals in 'pop1' and 'pop2' carrying the private
> alleles of 'popx'?
>
> Hereafter is a reduced data set to illustrate the problem:
> indiv    pop    L1    L2    L3
> Indiv1    pop1    222224    232224    120122
> Indiv2    pop1    222226    232226    118120
> Indiv3    pop1    222222    232232    120120
> Indiv4    pop1    222224    232224    124124
> Indiv5    pop2    224224    224224    122122
> Indiv6    pop2    224224    224224    124124
> Indiv7    pop2    224226    224226    120120
> Indiv8    pop2    222224    232224    122124
> Indiv9    popx    220222    220232    116118
> Indiv10    popx    222224    232224    118120
> Indiv11    popx    222226    232226    120120
> Indiv12    popx    224224    224224    124124
>
>
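> # read the genotype table (columns: indiv, pop, L1, L2, L3)
> geno <- read.table("three-pop.txt", header=TRUE)
>
> # convert the three loci to a genind object (6-digit diploid genotypes)
> trial <- df2genind(geno[,3:5], missing=NA, ploidy=2, sep=NULL, ncode=6,
> ind.names=geno[,1], loc.names=colnames(geno[1,3:5]), pop=geno[,2])
>
> trial@pop.names
> split <- seppop(trial) # one genind object per population
>
> pop12 <- repool(split$pop1, split$pop2) # pool the two reference populations
>
> # compare the alleles present in the reference populations and in 'popx'
> pop12@all.names
> split$popx@all.names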
>
> In this case, 'pop12' has 10 columns of '@tab' while 'split$popx' has 13
> columns of '@tab'.
>
> Would anyone have a solution or any advice?
>
> Thanks for your help,
>
> Sébastien.
>
>
*********************
Dr. Sébastien Puechmaille
UCD School of Biological and Environmental Sciences
University College Dublin (Zoology)
UCD Science and Education Research Center (West)
Belfield
Dublin 4
Ireland

and

Max Planck Institute for Ornithology
Sensory Ecology Group
Eberhard-Gwinner-Straße
Haus Nr. 11
82319 Seewiesen
Germany

http://batlab.ucd.ie/~spuechmaille/
http://www.ucd.ie/research/people/biologyenvscience/drsebastienpuechmaille/home/
*********************