From thomfromsea at gmail.com Tue Apr 2 10:10:19 2013
From: thomfromsea at gmail.com (Thomas Vignaud)
Date: Tue, 2 Apr 2013 10:10:19 +0200
Subject: [adegenet-forum] A few basic questions
Message-ID:
Hi everyone,
I find adegenet's DAPC very useful, yet I don't fully understand all its
subtleties.
I'll try to ask a few simple questions here, with associated screenshots.
I'll mostly use examples, as I believe that is an efficient way to ask.
(I'm working with 17 microsatellite loci on animals.)
I'm sorry if all this sounds newbie - please feel free to redirect me to
any .pdf I might have missed.
I believe the two main questions I want to answer with DAPC are :
1 - How different are my clusters? (I know this depends on a lot of things
and that I can't compare with other species/genes.)
I feel that one way to do it is to check whether a few components still
find a lot of structure.
Another is, using alpha scores and the whole classic process, to see
visually how strongly the individuals are assigned to their cluster.
2 - Are there any genetic sub-clusters in my sample? For example, I have
sampled 50 ids in the same location, but maybe there are two
(sub)populations there and I sampled 40 from the first and 10 from the
other. I want to see that (e.g. with compoplot), then go back to my data
and check whether I can find patterns related to what the genetics tells me.
Now here is my problem: depending on the number of discriminant functions
I use, I get totally different results with the same sub-dataset.
And, with the same number of discriminant functions but with another (very
structured) population added to my first sub-dataset, the results for the
first sub-dataset change again.
---> I'm a little lost as to what number of discriminant functions to
choose (I understand the alpha-score, but sometimes it will tell me "21"
when using only "5" gives me the exact same compoplot).
It would not be such a problem if the differences were small, but here is
the issue: often all my individuals are 100% one color, but never in the
same pattern.
In one compoplot, ids 1, 2, 5, 6 are 100% red and ids 3, 4, 7 are 100%
blue.
Then I redo the analysis, changing only the number of discriminant
functions, and I get 1, 3, 7 100% red and 2, 4, 5, 6 100% blue.
See attached screenshots A, B and C from the SAME dataset. (I'm trying to
use a small number of DFs as I don't like my ids to be 100% one color; I
feel I miss some information.)
---> The same thing happens if I add other populations: the whole pattern
changes again. See screenshot D.
So is there any guideline that would give me something a little less
absolute than totally different results?
If I want, for example, to note all my outliers (ids that do not belong to
their original geographic cluster) and check their characteristics (size,
sex, etc.), how am I supposed to do that if the outliers change depending
on priors? Especially with more than 700 individuals and 16 geographic
clusters.
If I want to account for how different 3 clusters are, and the optimal
alpha-score gives me three 100% differentiated clusters while a lower
number starts to create a mix between two of the clusters: can I just
decide to use many different numbers of discriminant functions to explore
the dataset? Or is that "wrong"?
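Additional information:
my 'exploring' workflow looks like this (lines starting with "x" are my
answers to the interactive prompts):
> grp <- find.clusters(obj, max.n.clust = 35)
x number of PCs retained: 50-150
x number of clusters: depends on what I want to see
> dapc1 <- dapc(obj, grp$grp)
x number of PCs retained: N/3, or 100 if N is large
x number of discriminant functions: either the alpha-score number, or
  smaller because I have strong structure
> compoplot(dapc1, grp$grp)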
Any input or help is more than welcome.
Best,
Thomas
[Attachments scrubbed by the list archive: Sshot A.jpg, Sshot B.jpg, Sshot C.jpg, Sshot D.jpg]
From t.jombart at imperial.ac.uk Thu Apr 4 11:27:00 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Thu, 4 Apr 2013 09:27:00 +0000
Subject: [adegenet-forum] A few basic questions
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657057A5C22E@icexch-m1.ic.ac.uk>
Hello,
Sorry in advance if I missed some of your points. Briefly, a few remarks that might help clarify the issues:
- the a-score helps decide how many PCs of the PCA to retain (not PCs of the DA, a.k.a. discriminant functions); there is currently no tool to decide how many discriminant functions (DFs) to retain.
- it is normal that assignments change when the number of DFs changes, as the space on which assignment is based changes. Think for instance of a very simple situation where each DF differentiates two populations; removing a given DF will erase the discrimination between those populations.
- about the instability you observe: this is quite possibly a sign of ad hoc discrimination caused by a discriminating space that is too large relative to the number of observations. Cross-validation would be the way to go, and should not be too much of a pain to implement. Basically, run DAPC on a random sample of the data, and validate the classification using the remaining individuals. Do this repeatedly with varying numbers of PCs (of the PCA) retained, and pick the number of components that optimizes cross-classification.
* message to the list * : I will offer a pint to the person who will implement this feature in adegenet; nothing complicated, but I just don't have time for this at the moment
- DAPC is good at finding an optimal typology of groups; cluster assignment is merely a by-product, useful but limited. This is where model-based classifiers will be better. I recommend using BAPS, especially on microsat data since it should run quite fast.
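[Editor's sketch] The cross-validation loop described above can be outlined roughly as follows. This is only an illustration under stated assumptions, not adegenet code: it uses base R's prcomp followed by MASS::lda as a stand-in for DAPC, and the data, the 70/30 split, the 20 replicates, the grid of PC numbers and the helper name xval_once are all hypothetical.

```r
# Stand-in for DAPC: PCA (prcomp) followed by discriminant analysis (MASS::lda).
# Data, split fraction, replicate count and grid are hypothetical choices.
library(MASS)

set.seed(1)
n_per_grp <- 40
grp <- factor(rep(c("A", "B", "C"), each = n_per_grp))
# Simulated data: 30 noisy variables, the first 3 carry a group signal
X <- matrix(rnorm(3 * n_per_grp * 30), ncol = 30)
X[, 1:3] <- X[, 1:3] + 3 * model.matrix(~ grp - 1)

# One round: fit on a random training set, classify the held-out individuals
xval_once <- function(X, grp, n_pca, train_frac = 0.7) {
  train <- sample(nrow(X), size = floor(train_frac * nrow(X)))
  pca <- prcomp(X[train, , drop = FALSE], center = TRUE, scale. = FALSE)
  scores_train <- pca$x[, seq_len(n_pca), drop = FALSE]
  # Project held-out individuals onto the PCs learned from the training set
  scores_test <- predict(pca, X[-train, , drop = FALSE])[, seq_len(n_pca), drop = FALSE]
  fit <- lda(scores_train, grouping = grp[train])
  pred <- predict(fit, scores_test)$class
  mean(pred == grp[-train])   # proportion of correct assignments
}

# Repeat for several numbers of retained PCs and average over replicates
n_pca_grid <- c(2, 5, 10, 20)
success <- sapply(n_pca_grid, function(k) {
  mean(replicate(20, xval_once(X, grp, n_pca = k)))
})
names(success) <- n_pca_grid
best_n_pca <- n_pca_grid[which.max(success)]
print(success)
```

The number of PCs with the highest mean success on held-out individuals is the one to retain; with real data, the group factor would be the prior populations or the clusters being validated.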
Cheers
Thibaut
--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
From f.calboli at imperial.ac.uk Thu Apr 4 11:36:18 2013
From: f.calboli at imperial.ac.uk (Federico Calboli)
Date: Thu, 4 Apr 2013 10:36:18 +0100
Subject: [adegenet-forum] A few basic questions
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657057A5C22E@icexch-m1.ic.ac.uk>
References:
<2CB2DA8E426F3541AB1907F98ABA657057A5C22E@icexch-m1.ic.ac.uk>
Message-ID:
On 4 Apr 2013, at 10:27, "Jombart, Thibaut" wrote:
> * message to the list * : I will offer a pint to the person who will implement this feature in adegenet; nothing complicated, but I just don't have time for this at the moment
I can come by your office tomorrow and see if it is something I could do in a nice and reasonably speedy way. Contact me off list. No beer required.
BW
F
--
Federico C. F. Calboli
Neuroepidemiology and Ageing Research
Imperial College, St. Mary's Campus
Norfolk Place, London W2 1PG
Tel +44 (0)20 75941602 Fax +44 (0)20 75943193
f.calboli [.a.t] imperial.ac.uk
f.calboli [.a.t] gmail.com
From vidalrussell at comahue-conicet.gob.ar Fri Apr 5 05:18:32 2013
From: vidalrussell at comahue-conicet.gob.ar (Romina Vidal Russell)
Date: Fri, 5 Apr 2013 00:18:32 -0300
Subject: [adegenet-forum] Inconsistent results
Message-ID: <1DA66815-E7DC-497C-AD5E-52891F896A4B@comahue-conicet.gob.ar>
I'm trying to perform a DAPC, but when I run the same script several times with the same number of retained PCs and DFs, I get different groupings and group memberships, and the scatter plot also differs. What could be happening? The number of PCs retained is lower than N/3.
Thanks for any help
Romina
From t.jombart at imperial.ac.uk Fri Apr 5 11:18:44 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Fri, 5 Apr 2013 09:18:44 +0000
Subject: [adegenet-forum] Inconsistent results
In-Reply-To: <1DA66815-E7DC-497C-AD5E-52891F896A4B@comahue-conicet.gob.ar>
References: <1DA66815-E7DC-497C-AD5E-52891F896A4B@comahue-conicet.gob.ar>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657057A5C6F6@icexch-m1.ic.ac.uk>
Hello,
what does the BIC plot look like? What you describe is typical of no structuring...
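[Editor's sketch] For readers hitting the same symptom: find.clusters relies on k-means, which starts from random centres, so repeated runs can return different partitions, especially when there is little structure to find. The base R illustration below uses simulated, unstructured data; the object and function names are hypothetical, and it only shows the role of the random seed, not adegenet's own output.

```r
# k-means run on simulated data with no real groups (hypothetical example)
set.seed(42)
X <- matrix(rnorm(100 * 10), ncol = 10)   # 100 individuals, 10 variables

run_kmeans <- function(seed) {
  set.seed(seed)                 # fixes the random starting centres
  kmeans(X, centers = 3)$cluster
}

a <- run_kmeans(1)
b <- run_kmeans(2)
a_again <- run_kmeans(1)

print(table(a, b))               # cross-tabulate the two partitions
print(identical(a, a_again))     # same seed, same data: same partition
```

Calling set.seed() before the clustering step makes a script reproducible, but it does not make the groups meaningful: if the partitions change a lot from one seed to another, that is itself a hint that the data carry little structure.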
Cheers
Thibaut
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
From vanesse.labeyrie at cirad.fr Mon Apr 15 13:53:17 2013
From: vanesse.labeyrie at cirad.fr (Vanesse Labeyrie)
Date: Mon, 15 Apr 2013 13:53:17 +0200
Subject: [adegenet-forum] sPCA: Calculation of spatial autocorrelation for
PA data
Message-ID: <516BEA2D.7060005@cirad.fr>
Hi,
I did not find details concerning the calculation of the spatial
autocorrelation index for presence/absence ("PA") data in sPCA.
I want to use sPCA on a species inventory table (presence of the
species coded 1 and absence 0), and I would like to know how the spatial
autocorrelation is calculated for this kind of data, as I thought
Moran's I was used for continuous numerical data only.
Many thanks for your answer
--
Vanesse LABEYRIE
CIRAD Montpellier
From t.jombart at imperial.ac.uk Mon Apr 15 16:13:12 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 15 Apr 2013 14:13:12 +0000
Subject: [adegenet-forum] sPCA: Calculation of spatial autocorrelation
for PA data
In-Reply-To: <516BEA2D.7060005@cirad.fr>
References: <516BEA2D.7060005@cirad.fr>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657057A68239@icexch-m1.ic.ac.uk>
Hello,
Moran's I is computed for binary data as for continuous characters. While developed for continuous traits originally, the index makes perfect sense for binary data. The lag vector is the mean presence over neighbouring sites, and Moran's index is a dot product between presence at sites and these lagged values (divided by the variance).
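[Editor's sketch] The computation described above can be written out in a few lines of base R. The transect of 6 sites, the adjacency matrix and the function name morans_i are hypothetical, and the weights are assumed to be row-standardised so that the lag is a plain mean over neighbours.

```r
# Moran's I for presence/absence data on a hypothetical transect of 6 sites,
# where the neighbours of a site are the adjacent sites.
morans_i <- function(x, W) {
  W <- W / rowSums(W)          # row-standardise: each row sums to 1
  z <- x - mean(x)             # centred presence/absence
  lag <- as.vector(W %*% z)    # mean (centred) presence over neighbouring sites
  sum(z * lag) / sum(z^2)      # dot product, scaled by the sum of squares
}

n <- 6
A <- matrix(0, n, n)
A[abs(row(A) - col(A)) == 1] <- 1   # adjacency along the transect

clustered   <- c(1, 1, 1, 0, 0, 0)
alternating <- c(1, 0, 1, 0, 1, 0)

print(morans_i(clustered, A))     # positive: presences sit next to presences
print(morans_i(alternating, A))   # negative: presences sit next to absences
```

Nothing in the formula requires continuous values: with 0/1 data the lag is simply the proportion of neighbouring sites where the species is present, minus the overall proportion.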
Cheers
Thibaut
From stefan.prost at anatomy.otago.ac.nz Tue Apr 23 04:48:13 2013
From: stefan.prost at anatomy.otago.ac.nz (Stefan Prost)
Date: Tue, 23 Apr 2013 14:48:13 +1200
Subject: [adegenet-forum] DAPC stats
Message-ID:
Hello,
I've run a DAPC analysis on geographically pre-defined populations and I
would like to provide some stats about population differentiation and
shared ancestry. Which ones would you suggest using?
Best,
Stefan
From t.jombart at imperial.ac.uk Tue Apr 23 08:27:46 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 23 Apr 2013 06:27:46 +0000
Subject: [adegenet-forum] DAPC stats
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657057A6D3F3@icexch-m1.ic.ac.uk>
Dear Stefan,
if you have pre-defined populations and want to measure differentiation, I would recommend using usual measures - Fst, pairwise Fst, or any of the 5 population distances implemented in dist.genpop. DAPC will complement these results, by describing how the diversity of these populations is organized.
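[Editor's sketch] As an illustration of what a between-population distance measures, here is Nei's (1972) standard genetic distance computed by hand in base R from a table of allele frequencies. The frequencies and the function name nei_dist are hypothetical. In adegenet itself the equivalent would presumably be dist.genpop(genind2genpop(obj), method = 1); check ?dist.genpop before relying on that.

```r
# Hypothetical allele frequencies: rows = populations, columns = alleles
# (2 loci with 2 alleles each; frequencies sum to 1 within each locus).
freq <- rbind(
  popA = c(L1.a = 0.8, L1.b = 0.2, L2.a = 0.5, L2.b = 0.5),
  popB = c(0.7, 0.3, 0.6, 0.4),
  popC = c(0.1, 0.9, 0.9, 0.1)
)

# Nei's standard genetic distance between all pairs of populations
nei_dist <- function(freq) {
  cross <- freq %*% t(freq)      # sum over alleles of p_a * p_b, per pair
  homo <- sqrt(diag(cross))      # sqrt of the summed squared frequencies
  as.dist(-log(cross / outer(homo, homo)))
}

d <- nei_dist(freq)
print(d)
```

Here popA and popB have similar frequencies and end up close, while popC is far from both, which is the kind of pairwise summary that complements a DAPC scatter plot.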
Best
Thibaut
From nathan.truelove at manchester.ac.uk Tue Apr 23 14:46:27 2013
From: nathan.truelove at manchester.ac.uk (Nathan Truelove)
Date: Tue, 23 Apr 2013 12:46:27 +0000
Subject: [adegenet-forum] Detecting Genetically Unique Individuals in a Well
Mixed Population
Message-ID:
Dear Thibaut and Adegenet Users,
I would like to begin by thanking Thibaut and everyone else who created Adegenet; it has to be the most useful data analysis tool that I have used for my PhD research.
I am a PhD student working on the population genetics of Caribbean spiny lobster using 16 microsatellite markers. The species has a huge potential for migration since it can spend up to a year floating/swimming in ocean currents before settling in shallow coastal habitat. Adults can also migrate 10s to 100s of km. It's no big surprise that I am finding very little differentiation in PCA, PCoA, and DAPC analyses.
The trend that comes out in all these analyses is that ~80% of individuals from all sampling sites fall within the inertia ellipse (s.class) or the contour polygon (s.chull). Several of the individuals outside the inertia ellipse (or polygons) are located quite far away from the "core" of individuals within the ellipse. These outlier individuals are not associated with any particular site; however, at the spatial level, there appear to be more outliers in southern sites than in northern sites. I've been trying a variety of techniques to figure out the ecological importance of these outlier individuals.
For example, a recent paper by Elphie et al. entitled "Detecting immigrants in a highly genetically homogeneous spiny lobster population (Palinurus elephas) in the northwest Mediterranean Sea" explores a similar issue in a different species of lobster. In this paper the authors use non-metric multidimensional scaling to separate out the genetic distances of their individuals in multivariate space. They then classified all individuals within a 50% radius of the barycentre as the "reference population" and all individuals outside the 50% radius as an "assignment population". They then used Geneclass2 to run assignment tests, and any individuals with a p-value < 0.05 were considered "genetically different". The authors argue that the most likely explanation for the genetic differences is that the genetically unique individuals detected in Geneclass are migrants from populations that have genetically diverged.
I imagine there are several other ecological or selective processes that could also lead to genetically unique individuals, so calling them migrants is up for debate.
For my data I ran a similar analysis in Adegenet using the functions s.class and s.chull along with dudi.pca to select the reference and assignment populations for Geneclass2. I compared these results to a similar analysis using non-metric multidimensional scaling in the Vegan package. The Adegenet PCA analyses placed about twice as many individuals in the reference population as the nMDS technique, yet the overall trend of Geneclass finding more unique individuals in the south than in the north was consistent among all techniques. Also, most of the distant outliers in the PCA analysis in Adegenet were also significantly different in the Geneclass analysis.
It would be excellent to get your opinions on this technique and discuss potential options for improving it:
1) Would it be possible to get additional information using Adegenet on how different the outliers in PCA are from the "core" of individuals inside the inertia ellipse? It would be nice to run the entire analysis in Adegenet and not have to use Geneclass2 at all.
2) Is there a simple way to identify each individual within an inertia ellipse? I have been using the function identify to select the individuals that are located within the ellipse, yet it is rather clunky since you have to click on every point.
3) Any additional advice concerning how to detect genetic outliers in homogeneous populations using Adegenet would be greatly appreciated.
Thank you very much for your time.
Best Wishes,
Nate
From t.jombart at imperial.ac.uk Tue Apr 30 12:14:27 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 30 Apr 2013 10:14:27 +0000
Subject: [adegenet-forum] Detecting Genetically Unique Individuals in a
Well Mixed Population
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657057A75AF1@icexch-m1.ic.ac.uk>
Dear Nate,
the problem here is that it is not clear what is meant by 'outliers'. If we're talking about a few migrants from another population, then they should fall in a small cluster of their own (e.g. using find.clusters). If the definition is spatial, then 'outliers' may be individuals that are genetically distinct from their neighbours (without having to be migrants from another population). Or, 'outliers' can be individuals with rare/original alleles (without having to be any of the above). Or 'outliers' can be whatever does not fall within the inertia ellipse, and in this case you will always have 'outliers' with the default parameters of s.class.
All of these definitions of 'outliers' would require different techniques to pin them down. I would really avoid anything based on the distance from the centroid. This implies that the cloud of points of the population is well represented in only 2D and, more importantly, is spherical, which is very unlikely. Detection based on inertia ellipses (note the spelling: inertia is the squared length of a vector, which in PCA is the variance of the corresponding scores) is bound to fail too. There the assumption is that the cloud of points of the population is bivariate normal, which again is unlikely. But if it is the case, the default inertia ellipse in s.class contains about 2/3 of the points. It would be far-fetched to call the remaining third 'outliers'. One can change this parameter, but again, that means arbitrarily deciding on a fixed proportion of outliers.
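[Editor's sketch] To make the two-thirds figure concrete, the base R snippet below flags which points of a simulated bivariate normal cloud fall inside an ellipse of size 1.5, without clicking on the plot. It assumes that the s.class ellipse corresponds to a Mahalanobis distance equal to its size coefficient (1.5 being ade4's default, as far as I know); the scores and labels are simulated stand-ins for PCA coordinates.

```r
# Simulated 2D scores standing in for PCA coordinates (hypothetical data)
set.seed(123)
scores <- cbind(PC1 = rnorm(500, sd = 2), PC2 = rnorm(500, sd = 1))
rownames(scores) <- paste0("ind", seq_len(nrow(scores)))

cellipse <- 1.5   # assumed to match the default ellipse size in s.class
# Squared Mahalanobis distance of each point to the centroid
d2 <- mahalanobis(scores, center = colMeans(scores), cov = cov(scores))
inside <- d2 <= cellipse^2

print(mean(inside))                     # observed proportion inside the ellipse
print(pchisq(cellipse^2, df = 2))       # expected under bivariate normality, ~0.675
print(head(rownames(scores)[!inside]))  # labels of the points outside
```

Roughly a third of perfectly ordinary points land outside the ellipse by construction, which is why membership of the ellipse says little about whether an individual is genetically unusual.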
But again, the problem here as I understand it is not technical (for now) - what is meant by 'outliers' needs to be clarified first.
All the best
Thibaut