[adegenet-forum] Genotypes assignment to clusters

Thu Mar 31 01:52:05 CEST 2011

Hello,

I implemented a predict.dapc method yesterday, which should now be available in the devel version of adegenet (see the 'download' section on the website). It triggered more changes than I expected in the dapc functions, but all seems to be working fine after the tests I ran.

The method is documented in ?dapc, and I provided an example which uses simulated hybrids. There is also a possibility for plotting assignments of unknown individuals using assignplot (see example).

All the best

Thibaut

________________________________________
From: Sébastien Puechmaille [s.puechmaille at gmail.com]
Sent: 23 March 2011 12:08
To: Jombart, Thibaut
Cc: adegenet-forum at r-forge.wu-wien.ac.at
Subject: Re: [adegenet-forum] Genotypes assignment to clusters

Dear Thibaut,

I just used the toy dataset to split a genind object and repool it (see below) and I think the data is lost when using the 'repool' function.

> #Read the input and create a 'genind' object
> geno<-read.table("three-pop.txt",h=T)
> trial<-df2genind(geno[,3:5],missing=NA,ploidy=2,sep=NULL,ncode=6,ind.names=geno[,1], loc.names=colnames(geno[1,3:5]),pop=geno[,2])
> #Total number of alleles for 'trial'
> length(trial@loc.fac)
[1] 13

> #Split the genind object per population
> trial1=seppop(trial) #'drop=FALSE' is the default option
> #Total number of alleles for each population
> length(trial1$pop1@loc.fac)
[1] 13
> length(trial1$pop2@loc.fac)
[1] 13
> length(trial1$popx@loc.fac)
[1] 13
> #Total number of alleles for a 'repooled' genind object
> trial2=repool(trial1$pop1,trial1$pop2)
> length(trial2@loc.fac)
[1] 10

Thanks very much for implementing the predict methods for the DAPC.

Best wishes,

Sebastien.

On 23 March 2011 10:55, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
Dear Sébastien,

actually I think repool doesn't need a drop option - alleles can't be lost by using repool, only by splitting the data. Here you may want to use directly seppop with the option 'drop=FALSE' - from the doc of seppop:

drop: a logical stating whether alleles that are no longer present
         in a subset of data should be discarded (TRUE) or kept anyway
         (FALSE, default).

I will implement a predict method for DAPC objects soon - it should be available from the devel version within two weeks.

Best

Thibaut

________________________________________
From: Sébastien Puechmaille [s.puechmaille at gmail.com]
Sent: 23 March 2011 10:47
To: Jombart, Thibaut
Cc: adegenet-forum at r-forge.wu-wien.ac.at
Subject: Re: [adegenet-forum] Genotypes assignment to clusters

Dear Thibaut,

Sorry, I accidentally switched 'pop1' and 'pop2' in my description of 'popx'; it should read:
'popx' is a mix of:
- about 80% of individuals belonging to 'pop2'
- about 15% of individuals belonging to 'pop1'
- about 5% of hybrid individuals between 'pop1' and 'pop2'.

For the reclassification of individuals from popx into pop1 and pop2, indeed, the second approach you propose seems the best. I was previously using the 'seppop' function to separate the genind object into 3 objects (corresponding to each population) and then using the 'repool' function to merge 2 of these objects. By doing this, alleles with no data are dropped when using the 'repool' function (the 'drop' option is not implemented in 'repool'). However, by subsetting 2 populations directly from the original genind object ('2pop=3pop[3pop@pop %in% c("P1","P2"),drop=FALSE]'), we do not need to use the 'repool' function, so alleles with no data are kept.
Concerning the wrapper for 'predict.lda', I'm not too sure how best to code that.

Thanks for your help,

Sebastien.

On 22 March 2011 17:36, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
Hello,

There is something that confuses me in your description. 80% of individuals in popx are from pop1, 15% from pop2, the rest are hybrids. So why is it unexpected that the distinction between pop2 and popx is made clearer in the 'partial DAPC' approach? On the contrary, you expect this analysis to distinguish pop1 from pop2, so if popx is mainly pop1, we expect differences between popx and pop2 to be emphasized.

Concerning the probabilities of assignment of individuals to the three groups, this is because popx still contributes to the variability between groups - only there is no longer an effect of alleles that are specific to popx. If you want to reclassify individuals from popx into pop1 and pop2 only, then a different and probably cleaner approach needs to be used. Alleles from popx that do not exist in pop1 and pop2 will not be missing data, but the analysis will need to be done without pruning these alleles (in the subset function "[" of genind objects, there's an option 'drop' which needs to be set to FALSE). Then what you will need is a wrapper for 'predict.lda' for dapc objects. This does not exist yet, but it is fairly straightforward to code. Contribution welcome if you want to give it a go, otherwise I will likely sort this out over the coming days, as soon as I've got time to devote to adegenet that is.
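
To make the idea of such a wrapper concrete, here is a minimal sketch using only prcomp and MASS::lda on simulated numbers. It is not adegenet code: the data, the object names (ref, unknown, n.pca) and the choice of 3 retained PCs are all assumptions made for illustration. It assumes the unknown individuals have exactly the same columns as the reference individuals, which is what subsetting with drop=FALSE is meant to guarantee.

```r
library(MASS)  # provides lda() and its predict() method

set.seed(42)
# Hypothetical data: 20 reference individuals per group, 6 numeric columns
# standing in for allele columns, and 10 individuals of unknown origin.
n <- 20
ref <- rbind(matrix(rnorm(n * 6, mean = 0), n),
             matrix(rnorm(n * 6, mean = 2), n))
grp <- factor(rep(c("pop1", "pop2"), each = n))
unknown <- matrix(rnorm(10 * 6, mean = 2), 10)  # simulated like pop2
colnames(ref) <- colnames(unknown) <- paste0("a", 1:6)

# Step 1: PCA on the reference individuals only.
pca <- prcomp(ref, center = TRUE, scale. = FALSE)
n.pca <- 3
ref.pc <- pca$x[, 1:n.pca, drop = FALSE]

# Step 2: discriminant analysis on the retained principal components.
fit <- lda(ref.pc, grouping = grp)

# Step 3: project the unknown individuals onto the same principal axes
# (same centring, same loadings), then predict group membership.
new.pc <- predict(pca, newdata = unknown)[, 1:n.pca, drop = FALSE]
pred <- predict(fit, new.pc)

print(round(pred$posterior, 3))  # membership probabilities for pop1 and pop2 only
print(table(pred$class))
```

The point of the design is that the unknown individuals play no role in defining the axes or the groups, so they can only receive membership probabilities for pop1 and pop2.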

All the best

Thibaut

________________________________________
From: Sébastien Puechmaille [s.puechmaille at gmail.com]
Sent: 21 March 2011 19:12
To: Jombart, Thibaut
Cc: adegenet-forum at r-forge.wu-wien.ac.at
Subject: Re: [adegenet-forum] Genotypes assignment to clusters

Dear Thibaut,

Thanks very much for this solution. Indeed, the question was "how to assign individuals from one space to groups defined in another, partially overlapping space"?

I've run this analysis with the real dataset (not the toy dataset presented in the previous e-mail) and compared the results with and without regression. What I should probably mention here is that the group 'popx' is a mix of:
- about 80% of individuals belonging to 'pop1'
- about 15% of individuals belonging to 'pop2'
- about 5% of hybrid individuals between 'pop1' and 'pop2'.

For the data set without regression, I performed a 'normal' DAPC with 3 predefined groups 'pop1', 'pop2' and 'popx'.
- 'pop1' individuals are clearly differentiated from 'pop2' and 'popx'
- 'pop2' and 'popx' individuals are nearly indiscernible along discriminant function 1 (except for a few individuals from 'popx' that in fact belong to 'pop1')
- the mean comparison of inferred groups with actual groups is 0.62

For the data set with regression, I performed a 'normal' DAPC on the residuals of the regression (as detailed in the previous e-mail)
-'pop1' individuals are again clearly differentiated from 'pop2' and 'popx'
-'pop2' and 'popx' individuals are much more differentiated along discriminant function 1 when compared to the normal DAPC detailed above.
- the mean comparison of inferred groups with actual groups is 0.81 (more than the normal DAPC)
This accentuated differentiation of 'pop2' and 'popx' individuals seems rather unexpected as most individuals from 'popx' are in fact from 'pop2' (see details above). Also, after the DAPC, each individual has a probability to belong to the 3 groups ('pop2', 'pop1' and 'popx') rather than 2 groups ('pop2' and 'pop1') as aimed.

Given that original alleles of the 'popx' population, present in the 'popx' population but absent from either source population (pop1 or pop2), won't give us any information about the origin of the 'popx' individuals (i.e. whether they come from 'pop1' or 'pop2'), when performing the 'normal' DAPC (without regression), could we 'simply' consider these original alleles of 'popx' as missing data? Would there be an easy way to do that?

Thanks again for your help,

Sébastien.

On 21 March 2011 15:34, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:
Dear Sébastien,

thanks for this very interesting question. To rephrase it: "how to assign individuals from one space to groups defined in another, partially overlapping space"?

The problem is not trivial if we think of it in probabilistic terms. If you used Bayesian/likelihood-based clustering, clusters would be defined in terms of frequencies of a given set of alleles (say, "S"). You can compute the probability for an individual to come from cluster xxx (or a mixture of clusters xxx, yyy, zzz etc in admixture models) as long as this individual does not possess any original allele (i.e., not in 'S'). If it does, the probability of observing a new allele in the previously defined clusters is, by definition, zero and thus P=0 for all clusters. Annoying.
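
A small numerical illustration of this zero-probability problem, as a sketch only: the allele frequencies below are invented, genotype probabilities assume Hardy-Weinberg proportions at a single locus, and geno_prob is a hypothetical helper, not a function from any package. Allele names are borrowed from the toy dataset, where 220 occurs only in 'popx'.

```r
# Allele frequencies at one locus for two reference clusters (made-up values).
# The reference allele set is S = {222, 224, 226}.
freqs <- rbind(
  pop1 = c("222" = 0.6, "224" = 0.3, "226" = 0.1),
  pop2 = c("222" = 0.1, "224" = 0.7, "226" = 0.2)
)

# Probability of a diploid genotype given a vector of allele frequencies p.
# An allele that is not in the reference set S has frequency 0.
geno_prob <- function(a1, a2, p) {
  f <- function(a) if (a %in% names(p)) p[[a]] else 0
  if (a1 == a2) f(a1)^2 else 2 * f(a1) * f(a2)
}

# Genotype 222/224: both alleles are in S, so each cluster gives P > 0.
known <- sapply(rownames(freqs), function(k) geno_prob("222", "224", freqs[k, ]))

# Genotype 220/222: allele 220 is not in S, so P = 0 in every cluster
# and the individual cannot be assigned.
novel <- sapply(rownames(freqs), function(k) geno_prob("220", "222", freqs[k, ]))

print(known)
print(novel)
```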

Distance-based methods have a similar problem: if the spaces differ, it is much more difficult to compare one individual to another.

However, we can use the fact that one space is contained within another, namely, the alleles differentiating pop1 /vs/ pop2 are a subset of the alleles of the complete dataset. One approach is to use an analysis that we could run on the entire dataset, but that would exclude all originality of 'popx', and only conserve differences between 'pop1' and 'pop2'. This can be achieved by regressing the data onto a factor opposing 'popx' to 'non-pop-x' individuals.

####
X <- truenames(trial)$tab # extract table of allele frequencies
popx <- factor(pop(trial)=="popx") # popx vs non-popx
X.res <- apply(X,2, function(e) residuals(lm(e~popx))) # remove 'popx' effect

dapc1 <- dapc(X.res, pop(trial), n.pca=3, n.da=1) # perform dapc
scatter(dapc1)
assignplot(dapc1)
####

The DAPC aims to discriminate all populations of the dataset, but we actually tricked the method by removing all originality specific to "popx" beforehand. With the toy dataset you sent, "popx" is actually still at one extreme of the cline, but I suspect that actually hybrid populations should fall between the two parental populations.
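
To see what the regression step does, here is a base R sketch on simulated numbers (the matrix X, the grouping grp and the column names a1..a4 are made up for illustration, not taken from any real dataset). After regressing each column on the popx / non-popx factor, the mean difference between 'popx' and the rest is gone, while the mean difference between 'pop1' and 'pop2' is left untouched.

```r
set.seed(1)
# Hypothetical table: 12 individuals x 4 allele columns.
grp <- factor(rep(c("pop1", "pop2", "popx"), each = 4))
X <- matrix(runif(48), nrow = 12, dimnames = list(NULL, paste0("a", 1:4)))
# Make column a4 characteristic of popx.
X[grp == "popx", "a4"] <- X[grp == "popx", "a4"] + 2

# Same regression as in the code above: popx vs non-popx.
is_popx <- factor(grp == "popx")
X.res <- apply(X, 2, function(e) residuals(lm(e ~ is_popx)))

# Residual column means are zero within popx and within the rest,
# so nothing distinguishes popx from the rest on average any more.
m_popx <- colMeans(X.res[grp == "popx", ])
m_rest <- colMeans(X.res[grp != "popx", ])

# pop1 and pop2 had the same constant removed, so their difference is unchanged.
d_raw <- colMeans(X[grp == "pop1", ]) - colMeans(X[grp == "pop2", ])
d_res <- colMeans(X.res[grp == "pop1", ]) - colMeans(X.res[grp == "pop2", ])

print(round(rbind(m_popx, m_rest), 10))
print(rbind(d_raw, d_res))
```

Note that only the mean effect of 'popx' is removed; individuals within 'popx' still vary, which is why they can still spread along the discriminant function.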

Best regards

Thibaut.

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - Faculty of Medicine
St Mary’s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/

________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] On Behalf Of Sébastien Puechmaille [s.puechmaille at gmail.com]
Sent: 21 March 2011 13:15
To: adegenet-forum at r-forge.wu-wien.ac.at
Subject: [adegenet-forum] Genotypes assignment to clusters

Dear Thibaut and Adegenet users,

I have a data set with 3 groups of samples (see below), 2 with samples of known origin (pop1 and pop2) and one (popx) with samples that I would like to assign to one of the 2 known populations (pop1 or pop2). For this, I want to run a DAPC on the 'pop1' and 'pop2' data and then assign individuals from 'popx' to either 'pop1' or 'pop2'.

However, individuals from the group to be assigned have some private alleles that are neither in 'pop1' nor in 'pop2' and therefore, the assignment cannot work. What would be the best solution to get around this problem?
Shall I create dummy individuals in 'pop1' and 'pop2' having the private alleles of 'popx'?

Hereafter is a reduced data set to illustrate the problem:
indiv    pop    L1    L2    L3
Indiv1    pop1    222224    232224    120122
Indiv2    pop1    222226    232226    118120
Indiv3    pop1    222222    232232    120120
Indiv4    pop1    222224    232224    124124
Indiv5    pop2    224224    224224    122122
Indiv6    pop2    224224    224224    124124
Indiv7    pop2    224226    224226    120120
Indiv8    pop2    222224    232224    122124
Indiv9    popx    220222    220232    116118
Indiv10    popx    222224    232224    118120
Indiv11    popx    222226    232226    120120
Indiv12    popx    224224    224224    124124

geno<-read.table("three-pop.txt",h=T)

trial<-df2genind(geno[,3:5],missing=NA,ploidy=2,sep=NULL,ncode=6,ind.names=geno[,1], loc.names=colnames(geno[1,3:5]),pop=geno[,2])

trial@pop.names
split<- seppop(trial)

pop12 <- repool(split$pop1, split$pop2)

pop12@all.names
split$popx@all.names

In this case, 'pop12' has 10 columns of '@tab' while 'split$popx' has 13 columns of '@tab'.
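
The 13 versus 10 columns can be checked by hand without adegenet. The base R sketch below retypes the toy table and counts distinct alleles per locus, assuming each 6-digit genotype is two 3-digit alleles (as implied by ncode=6 and ploidy=2); count_alleles is a hypothetical helper written for this illustration.

```r
# Toy dataset from above, typed in directly.
geno <- data.frame(
  pop = rep(c("pop1", "pop2", "popx"), each = 4),
  L1 = c("222224", "222226", "222222", "222224",
         "224224", "224224", "224226", "222224",
         "220222", "222224", "222226", "224224"),
  L2 = c("232224", "232226", "232232", "232224",
         "224224", "224224", "224226", "232224",
         "220232", "232224", "232226", "224224"),
  L3 = c("120122", "118120", "120120", "124124",
         "122122", "124124", "120120", "122124",
         "116118", "118120", "120120", "124124"),
  stringsAsFactors = FALSE
)

# Number of distinct alleles per locus, summed over the three loci.
count_alleles <- function(d) {
  sum(sapply(d[c("L1", "L2", "L3")], function(g) {
    length(unique(c(substr(g, 1, 3), substr(g, 4, 6))))
  }))
}

n_all <- count_alleles(geno)                          # all three groups
n_pop12 <- count_alleles(geno[geno$pop != "popx", ])  # pop1 and pop2 only
n_private <- n_all - n_pop12                          # alleles seen only in popx

print(c(all = n_all, pop12 = n_pop12, private_to_popx = n_private))
```

The three alleles missing from pop1 and pop2 are 220 at L1, 220 at L2 and 116 at L3, all carried by Indiv9.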

Would anyone have a solution or any advice?

Thanks for your help,

Sébastien.

*********************
Dr. Sébastien Puechmaille
UCD School of Biological and Environmental Sciences
University College Dublin (Zoology)
UCD Science and Education Research Center (West)
Belfield
Dublin 4
Ireland

and

Max Planck Institute for Ornithology
Sensory Ecology Group
Eberhard-Gwinner-Straße
Haus Nr. 11
82319 Seewiesen
Germany

http://batlab.ucd.ie/~spuechmaille/
http://www.ucd.ie/research/people/biologyenvscience/drsebastienpuechmaille/home/
*********************

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum