[adegenet-forum] Genotypes assignment to clusters

Sébastien Puechmaille s.puechmaille at gmail.com
Wed Mar 23 13:08:16 CET 2011


Dear Thibaut,

I just used the toy dataset to split a genind object and repool it (see
below), and I think alleles are lost when using the 'repool' function.

> #Read the input and create a 'genind' object
> geno<-read.table("three-pop.txt",h=T)
> trial<-df2genind(geno[,3:5],missing=NA,ploidy=2,sep=NULL,ncode=6,ind.names=geno[,1],
+ loc.names=colnames(geno[1,3:5]),pop=geno[,2])
> #Total number of alleles for 'trial'
> length(trial@loc.fac)
[1] 13

> #Splits the genind object per population
> trial1=seppop(trial) #'drop=FALSE' is the default option
> #Total number of alleles for each population
> length(trial1$pop1@loc.fac)
[1] 13
> length(trial1$pop2@loc.fac)
[1] 13
> length(trial1$popx@loc.fac)
[1] 13
> #Total number of alleles for a 'repooled' genind object
> trial2=repool(trial1$pop1,trial1$pop2)
> length(trial2@loc.fac)
[1] 10
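
For what it's worth, the 13 vs 10 counts can be checked without adegenet. The base-R sketch below re-enters the toy genotypes from the original post and counts the distinct alleles per locus, first over all individuals and then over pop1 and pop2 only. The helper name `alleles_per_locus` is made up for this illustration and is not an adegenet function. It suggests that the three columns missing after pooling pop1 and pop2 correspond to the alleles private to popx.

```r
# Toy genotypes copied from the thread (two 3-digit alleles per locus).
geno <- data.frame(
  pop = rep(c("pop1", "pop2", "popx"), each = 4),
  L1 = c("222224", "222226", "222222", "222224",
         "224224", "224224", "224226", "222224",
         "220222", "222224", "222226", "224224"),
  L2 = c("232224", "232226", "232232", "232224",
         "224224", "224224", "224226", "232224",
         "220232", "232224", "232226", "224224"),
  L3 = c("120122", "118120", "120120", "124124",
         "122122", "124124", "120120", "122124",
         "116118", "118120", "120120", "124124"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distinct alleles observed at each locus in the rows of d.
alleles_per_locus <- function(d) {
  lapply(d[c("L1", "L2", "L3")], function(g) {
    sort(unique(c(substr(g, 1, 3), substr(g, 4, 6))))
  })
}

all_alleles   <- alleles_per_locus(geno)
known_alleles <- alleles_per_locus(geno[geno$pop != "popx", ])

n_all   <- sum(lengths(all_alleles))    # 13, as in length(trial@loc.fac)
n_known <- sum(lengths(known_alleles))  # 10, as in the repooled object

# Alleles seen only in popx, per locus.
private_popx <- Map(setdiff, all_alleles, known_alleles)
print(private_popx)
```

So the difference of three columns is accounted for by one popx-private allele at each locus (220 at L1, 220 at L2, 116 at L3).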

Thanks very much for implementing the predict methods for the DAPC.

Best wishes,

Sebastien.


On 23 March 2011 10:55, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:

> Dear Sébastien,
>
> actually I think repool doesn't need a drop option - alleles can't be lost
> by using repool, only by splitting the data. Here you may want to use
> directly seppop with the option 'drop=FALSE' - from the doc of seppop:
>
> drop: a logical stating whether alleles that are no longer present
>          in a subset of data should be discarded (TRUE) or kept anyway
>          (FALSE, default).
>
>
> I will implement a predict method for DAPC objects soon - it should be
> available from the devel version within two weeks.
>
> Best
>
> Thibaut
>
> ________________________________________
> From: Sébastien Puechmaille [s.puechmaille at gmail.com]
> Sent: 23 March 2011 10:47
> To: Jombart, Thibaut
> Cc: adegenet-forum at r-forge.wu-wien.ac.at
> Subject: Re: [adegenet-forum] Genotypes assignment to clusters
>
> Dear Thibaut,
>
> Sorry, I accidentally switched 'pop1' and 'pop2' in my description of
> 'popx'; it should read:
> 'popx' is a mix of individuals:
> - about 80% of individuals belonging to 'pop2'
> - about 15% of individuals belonging to 'pop1'
> - about 5% of hybrid individuals between 'pop1' and 'pop2'.
>
> For the reclassification of individuals from popx into pop1 and pop2,
> indeed, the second approach you propose seems the best. I was previously
> using the 'seppop' function to separate the genind object into 3 objects
> (corresponding to each population) and then using the 'repool' function to
> merge 2 of these objects. By doing this, alleles with no data are dropped
> when using the 'repool' function (the 'drop' option is not implemented in
> 'repool'). However, by subsetting directly 2 populations from the original
> genind object ('2pop=3pop[3pop@pop %in% c("P1","P2"),drop=FALSE]'), we do
> not need to use the 'repool' function so that alleles with no data are kept.
> Concerning the wrapper for 'predict.lda', I'm not too sure how best to code
> that.
>
> Thanks for your help,
>
> Sebastien.
>
> On 22 March 2011 17:36, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
> Hello,
>
> There is something that confuses me in your description. 80% of individuals
> in popx are from pop1, 15% from pop2, the rest are hybrids. So why is it
> unexpected that the distinction between pop2 and popx is made clearer on the
> 'partial DAPC' approach? On the contrary, you expect this analysis to
> distinguish pop1 from pop2, so if popx is mainly pop1, we expect differences
> between popx and pop2 to be emphasized.
>
> Concerning the probabilities of assignment of individuals to the three
> groups, this is because popx still contributes to the variability between
> groups - only there is no longer an effect of alleles that are specific to
> popx. If you want to reclassify individuals from popx into pop1 and pop2
> only, then a different and probably cleaner approach needs to be used.
> Alleles from popx that do not exist in pop1 and pop2 will not be missing
> data, but the analysis will need to be done without pruning these alleles
> (in the subset function "[" of genind object, there's an option 'drop' which
> needs to be set to FALSE). Then what you will need is a wrapper for
> 'predict.lda' for dapc objects. This does not exist yet, but it is fairly
> straightforward to code. Contribution welcome if you want to give it a go,
> otherwise I will likely sort this out over the coming days, as soon as I've
> got time to devote to adegenet that is.
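
A rough idea of what such a wrapper might boil down to, sketched with base R and MASS only (no adegenet): run a PCA on the reference individuals, fit an LDA on the retained PC scores, then project new individuals onto the same PCA axes and pass them to `predict` for `lda` objects. The data here are simulated and every object name (`ref`, `new_ind`, `n.pca`) is hypothetical; this is not how adegenet implements its own predict method, only an illustration of the principle. Note that it only works because the new individuals have exactly the same allele columns as the reference table, which is why alleles must not be dropped when subsetting.

```r
library(MASS)  # provides lda() and its predict method

set.seed(42)

# Hypothetical reference table: 40 individuals x 6 alleles, two known groups.
ref <- rbind(matrix(rnorm(20 * 6, mean = 0),   nrow = 20),
             matrix(rnorm(20 * 6, mean = 1.5), nrow = 20))
colnames(ref) <- paste0("allele", 1:6)
grp <- factor(rep(c("pop1", "pop2"), each = 20))

# Step 1: PCA on the reference individuals only.
pca <- prcomp(ref, center = TRUE, scale. = FALSE)
n.pca <- 3

# Step 2: discriminant analysis on the retained principal components.
fit <- lda(pca$x[, 1:n.pca], grouping = grp)

# Step 3: new individuals, described by the SAME allele columns.
new_ind <- rbind(rep(0, 6), rep(1.5, 6))
colnames(new_ind) <- colnames(ref)

# Project onto the reference PCA axes, then predict group membership.
new_scores <- predict(pca, newdata = new_ind)[, 1:n.pca, drop = FALSE]
pred <- predict(fit, new_scores)

print(pred$class)      # assigned group for each new individual
print(pred$posterior)  # membership probabilities for pop1 and pop2 only
```

With this design the posterior has one column per reference group, so individuals to be assigned never define a group of their own.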
>
> All the best
>
> Thibaut
>
>
>
>
> ________________________________________
> From: Sébastien Puechmaille [s.puechmaille at gmail.com]
> Sent: 21 March 2011 19:12
> To: Jombart, Thibaut
> Cc: adegenet-forum at r-forge.wu-wien.ac.at
> Subject: Re: [adegenet-forum] Genotypes assignment to clusters
>
> Dear Thibaut,
>
> Thanks very much for this solution. Indeed, the question was "how to assign
> individuals from one space to groups defined in another, partially
> overlapping space"?
>
> I've run this analysis with the real dataset (not the toy dataset presented
> in the previous e-mail) and compared the results with and without
> regression. What I should probably mention here is that the group 'popx' is
> a mix of individuals:
> - about 80% of individuals belonging to 'pop1'
> - about 15% of individuals belonging to 'pop2'
> - about 5% of hybrid individuals between 'pop1' and 'pop2'.
>
> For the data set without regression, I performed a 'normal' DAPC with 3
> predefined groups 'pop1', 'pop2' and 'popx'.
> - 'pop1' individuals are clearly differentiated from 'pop2' and 'popx'
> - 'pop2' and 'popx' individuals are nearly indiscernible along discriminant
> function 1 (except for a few individuals from 'popx' that in fact belong to
> 'pop1')
> - the mean comparison of inferred groups with actual groups is 0.62
>
> For the data set with regression, I performed a 'normal' DAPC on the
> residuals of the regression (as detailed in the previous e-mail)
> - 'pop1' individuals are again clearly differentiated from 'pop2' and 'popx'
> - 'pop2' and 'popx' individuals are much more differentiated along
> discriminant function 1 when compared to the normal DAPC detailed above.
> - the mean comparison of inferred groups with actual groups is 0.81 (more
> than the normal DAPC)
> This accentuated differentiation of 'pop2' and 'popx' individuals seems
> rather unexpected as most individuals from 'popx' are in fact from 'pop2'
> (see details above). Also, after the DAPC, each individual has a probability
> to belong to the 3 groups ('pop2', 'pop1' and 'popx') rather than 2 groups
> ('pop2' and 'pop1') as aimed.
>
> Given that the original alleles of 'popx' (those present in 'popx' but
> absent from both source populations, pop1 and pop2) won't give us any
> information about the origin of the 'popx' individuals (i.e. whether they
> come from 'pop1' or 'pop2') when performing the 'normal' DAPC (without
> regression), could we 'simply' consider these original alleles of 'popx'
> as missing data? Would there be an easy way to do that?
>
> Thanks again for your help,
>
> Sébastien.
>
> On 21 March 2011 15:34, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
> Dear Sébastien,
>
> thanks for this very interesting question. To rephrase it: "how to assign
> individuals from one space to groups defined in another, partially
> overlapping space"?
>
> The problem is not trivial if we think of it in probabilistic terms. If you
> used Bayesian/likelihood-based clustering, clusters would be defined in
> terms of frequencies of a given set of alleles (say, "S"). You can compute
> the probability for an individual to come from cluster xxx (or a mixture of
> clusters xxx, yyy, zzz etc in admixture models) as long as this individual
> does not possess any original allele (i.e., not in 'S'). Otherwise, the
> probability of observing the new allele in the previously defined clusters
> is, by definition, zero, and thus P=0 for all clusters. Annoying.
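
To make the P=0 problem concrete, here is a small base-R sketch using locus L1 of the toy dataset. The allele frequencies are simply counted from the pop1 and pop2 genotypes in the original post, and genotype likelihoods are computed under Hardy-Weinberg proportions, which is an assumption made for this illustration (as is the helper name `geno_lik`). It does not reproduce any particular clustering program.

```r
# Allele frequencies at locus L1, counted from the toy data (8 gene copies
# per population). Allele 220 is only found in popx, so its frequency is 0.
freq <- matrix(c(0, 5, 2, 1,
                 0, 1, 6, 1) / 8,
               nrow = 2, byrow = TRUE,
               dimnames = list(c("pop1", "pop2"),
                               c("220", "222", "224", "226")))

# Hypothetical helper: likelihood of a diploid genotype a1/a2 given allele
# frequencies p, assuming Hardy-Weinberg proportions.
geno_lik <- function(a1, a2, p) {
  p1 <- unname(p[a1])
  p2 <- unname(p[a2])
  if (a1 == a2) p1^2 else 2 * p1 * p2
}

# Indiv10 (222/224) carries only alleles known in pop1 and pop2.
lik_indiv10 <- apply(freq, 1, function(p) geno_lik("222", "224", p))
print(lik_indiv10)                     # pop1 = 0.3125, pop2 = 0.1875
print(lik_indiv10 / sum(lik_indiv10))  # pop1 = 0.625,  pop2 = 0.375

# Indiv9 (220/222) carries the popx-private allele 220.
lik_indiv9 <- apply(freq, 1, function(p) geno_lik("220", "222", p))
print(lik_indiv9)                      # 0 for both clusters
print(lik_indiv9 / sum(lik_indiv9))    # 0/0: the assignment is undefined
```

A single private allele is enough to zero out the likelihood for every cluster, whatever the individual's genotype at the other loci.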
>
> Distance-based methods have a similar problem: if the spaces differ, it is
> much more difficult to compare one individual to another.
>
> However, we can use the fact that one space is contained within another,
> namely, the alleles differentiating pop1 /vs/ pop2 are a subset of the
> alleles of the complete dataset. One approach is to use an analysis that we
> could run on the entire dataset, but that would exclude all originality of
> 'popx', and only conserve differences between 'pop1' and 'pop2'. This can be
> achieved by regressing the data onto a factor opposing 'popx' to 'non-pop-x'
> individuals.
>
> ####
> X <- truenames(trial)$tab # extract table of allele frequencies
> popx <- factor(pop(trial)=="popx") # popx vs non-popx
> X.res <- apply(X,2, function(e) residuals(lm(e~popx))) # remove 'popx' effect
>
> dapc1 <- dapc(X.res, pop(trial), n.pca=3, n.da=1) # perform dapc
> scatter(dapc1)
> assignplot(dapc1)
> ####
>
> The DAPC aims to discriminate all populations of the dataset, but we
> actually tricked the method by removing all originality specific to "popx"
> beforehand. With the toy dataset you sent, "popx" is actually still at one
> extreme of the cline, but I suspect that actually hybrid populations should
> fall between the two parental populations.
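
To see what the regression step does to the data, the following base-R sketch applies the same `residuals(lm(e ~ popx))` idea to a simulated table (random numbers standing in for allele frequencies; `X`, `grp` and `is_popx` are hypothetical names). Regressing each column on a two-level factor amounts to centring 'popx' and 'non-popx' individuals separately, so the mean difference between the two sets disappears while the difference between pop1 and pop2 is left untouched.

```r
set.seed(1)

# Hypothetical table: 12 individuals x 4 alleles, three groups of 4.
X <- matrix(runif(48), nrow = 12,
            dimnames = list(NULL, paste0("a", 1:4)))
grp <- factor(rep(c("pop1", "pop2", "popx"), each = 4))
is_popx <- factor(grp == "popx")  # popx vs non-popx

# Same operation as in the thread: keep the residuals of each column
# after regression on the popx / non-popx factor.
X.res <- apply(X, 2, function(e) residuals(lm(e ~ is_popx)))

# After the regression, both sets of individuals have mean zero per allele.
popx_means  <- colMeans(X.res[grp == "popx", ])
other_means <- colMeans(X.res[grp != "popx", ])
print(round(popx_means, 12))
print(round(other_means, 12))

# The pop1 vs pop2 contrast is unchanged by the regression.
diff_before <- colMeans(X[grp == "pop1", ]) - colMeans(X[grp == "pop2", ])
diff_after  <- colMeans(X.res[grp == "pop1", ]) -
               colMeans(X.res[grp == "pop2", ])
print(all.equal(diff_before, diff_after))
```

Only the mean effect of 'popx' is removed; individual-level variation within 'popx' remains, which is presumably why 'popx' can still be discriminated as a group afterwards.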
>
>
> Best regards
>
> Thibaut.
>
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - Faculty of Medicine
> St Mary’s Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
>
> ________________________________________
> From: adegenet-forum-bounces at r-forge.wu-wien.ac.at
> [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Sébastien
> Puechmaille [s.puechmaille at gmail.com]
> Sent: 21 March 2011 13:15
> To: adegenet-forum at r-forge.wu-wien.ac.at
> Subject: [adegenet-forum] Genotypes assignment to clusters
>
> Dear Thibaut and Adegenet users,
>
> I have a data set with 3 groups of samples (see below), 2 with samples of
> known origin (pop1 and pop2) and one (popx) with samples that I would like
> to assign to one of the 2 known populations (pop1 or pop2). For this, I want
> to run a DAPC with 'pop1' and 'pop2' data set and then, assign individuals
> from 'popx' to either 'pop1' or 'pop2'.
>
> However, individuals from the group to be assigned have some private
> alleles that are neither in 'pop1' nor in 'pop2' and therefore, the
> assignment cannot work. What would be the best solution to get around this
> problem?
> Should I create dummy individuals in 'pop1' and 'pop2' having the private
> alleles of 'popx'?
>
> Hereafter is a reduced data set to illustrate the problem:
> indiv    pop    L1    L2    L3
> Indiv1    pop1    222224    232224    120122
> Indiv2    pop1    222226    232226    118120
> Indiv3    pop1    222222    232232    120120
> Indiv4    pop1    222224    232224    124124
> Indiv5    pop2    224224    224224    122122
> Indiv6    pop2    224224    224224    124124
> Indiv7    pop2    224226    224226    120120
> Indiv8    pop2    222224    232224    122124
> Indiv9    popx    220222    220232    116118
> Indiv10    popx    222224    232224    118120
> Indiv11    popx    222226    232226    120120
> Indiv12    popx    224224    224224    124124
>
>
> geno<-read.table("three-pop.txt",h=T)
>
> trial<-df2genind(geno[,3:5],missing=NA,ploidy=2,sep=NULL,ncode=6,ind.names=geno[,1],
> loc.names=colnames(geno[1,3:5]),pop=geno[,2])
>
> trial@pop.names
> split <- seppop(trial)
>
> pop12 <- repool(split$pop1, split$pop2)
>
> pop12@all.names
> split$popx@all.names
>
> In this case, 'pop12' has 10 columns of '@tab' while 'split$popx' has 13
> columns of '@tab'.
>
> Would anyone have a solution or any advice?
>
> Thanks for your help,
>
> Sébastien.
>
>
> *********************
> Dr. Sébastien Puechmaille
> UCD School of Biological and Environmental Sciences
> University College Dublin (Zoology)
> UCD Science and Education Research Center (West)
> Belfield
> Dublin 4
> Ireland
>
> and
>
> Max Planck Institute for Ornithology
> Sensory Ecology Group
> Eberhard-Gwinner-Straße
> Haus Nr. 11
> 82319 Seewiesen
> Germany
>
> http://batlab.ucd.ie/~spuechmaille/
>
> http://www.ucd.ie/research/people/biologyenvscience/drsebastienpuechmaille/home/
> *********************
>
>