From Aaron.Adamack at canberra.edu.au Wed Nov 6 13:21:45 2013
From: Aaron.Adamack at canberra.edu.au (Aaron.Adamack)
Date: Wed, 6 Nov 2013 12:21:45 +0000
Subject: [adegenet-forum] Screeplot showing spatial and variance components
of eigenvalues fails due to complex numbers
Message-ID: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
Hi, I'm trying to perform a sPCA and am getting an error when I attempt to make a screeplot showing the spatial and variance components of the eigenvalues. The error seems to be coming from the summary command that gets run within screeplot as I get the following error message:
> summary(possum.spca2)
Spatial principal component analysis
Call: spca(obj = nonapossum, cn = possum.graph, scannf = FALSE, nfposi = 2,
nfnega = 0)
Error in min(eigL) : invalid 'type' (complex) of argument
Looking at the step in summary just before it breaks, all (or nearly all) values of eigL are complex numbers (e.g. 1.025750e+00+0.000000e+00i).
Other than this, I am able to go through all of the steps in the examples provided in adegenet-spca.pdf, so I'm not sure if this is a sign of problems with my data set or if it could be something else? I am pointing to problems in my data set as there is quite a bit of missing data in my genotypes (~12.4%) and I have 1605 individuals.
The code I'm running is:
...
data organization steps to prepare my genind object dpossum
...
nonapossum<-na.replace(dpossum,met=0)
possum.graph<-chooseCN(nonapossum$other$xy,type=5,d1=0,d2=5000,plot=FALSE,res="listw")
possum.spca2<-spca(nonapossum,cn=possum.graph,scannf=FALSE,nfposi=2,nfnega=0)
screeplot(possum.spca2)
Any help in solving this would be greatly appreciated.
-Aaron
p.s. I think there may be a small typo on page 3 of the manual (adegenet-spca.pdf), I think the page reference for Numerical Ecology should be pp. 752-756 rather than pp. 572-576.
From t.jombart at imperial.ac.uk Wed Nov 6 17:13:38 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Wed, 6 Nov 2013 16:13:38 +0000
Subject: [adegenet-forum] Screeplot showing spatial and variance
components of eigenvalues fails due to complex numbers
In-Reply-To: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
References: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390984A@icexch-m1.ic.ac.uk>
Hello there,
thanks for reporting the error. I confess it is beyond me how one can get complex eigenvalues in sPCA. As this is the first time it happens, there may be something quirky about this particular dataset.
I would need a reproducible example to possibly try and understand what is going on. Thanks for the typo, fixed on the devel now.
Cheers
Thibaut
From danica_714 at hotmail.com Wed Nov 6 17:22:31 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Wed, 6 Nov 2013 16:22:31 +0000
Subject: [adegenet-forum] read.plink: Multiple cores is not supported on
Windows
Message-ID:
Hi,
I am trying to upload my SNP dataset in the PLINK format, however I get the following error message:
Reading PLINK raw format into a genlight object...
Loading required package: parallel
Reading loci information...
Reading and converting genotypes...
.Error in mclapply(txt, function(e) new("SNPbin", snp = e, ploidy = 2), :
'mc.cores' > 1 is not supported on Windows
Is there a solution to this problem?
Thanks,
Danica
From caitiecollins17 at gmail.com Wed Nov 6 19:30:53 2013
From: caitiecollins17 at gmail.com (Caitlin Collins)
Date: Wed, 6 Nov 2013 18:30:53 +0000
Subject: [adegenet-forum] Fwd: read.plink: Multiple cores is not supported
on Windows
In-Reply-To:
References:
Message-ID:
---------- Forwarded message ----------
From: Danica Fabrigar
Date: Wed, Nov 6, 2013 at 6:19 PM
Subject: RE: [adegenet-forum] read.plink: Multiple cores is not supported
on Windows
To: Caitlin Collins
Hi Caitlin,
That did the trick.
Thank you,
Danica
------------------------------
Date: Wed, 6 Nov 2013 17:57:23 +0000
Subject: Re: [adegenet-forum] read.plink: Multiple cores is not supported
on Windows
From: caitiecollins17 at gmail.com
To: danica_714 at hotmail.com
Hi Danica,
While I cannot be certain without knowing precisely what you did to
initiate the upload, I will say that this error message is usually resolved
by adding the argument *parallel=FALSE* to the argument list of the
function you called.
(Note: in older versions of adegenet this used to be multicore=FALSE).
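As a minimal sketch of that suggestion (the original call was not posted, so the file name "mydata.raw" and the variable names below are placeholders, not taken from this thread), the parallel option can be switched off only where it is unsupported:

```r
# Sketch only: "mydata.raw" is a placeholder path, not a file from this thread.
# mclapply() cannot use more than one core on Windows, so turn the
# parallel option off there and leave it on elsewhere.
use_parallel <- .Platform$OS.type != "windows"

raw_file <- "mydata.raw"

# Guarded so the snippet runs even without adegenet or the input file.
if (requireNamespace("adegenet", quietly = TRUE) && file.exists(raw_file)) {
  snps <- adegenet::read.PLINK(raw_file, parallel = use_parallel)
  print(snps)
}
```

Passing parallel=FALSE directly, as described above, has the same effect on Windows; the platform check just keeps multi-core reading available on other systems.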
Hope that helps.
Best,
Caitlin.
From danica_714 at hotmail.com Thu Nov 7 14:30:46 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Thu, 7 Nov 2013 13:30:46 +0000
Subject: [adegenet-forum] read.plink: no position read from .map file
Message-ID:
Hi,
I am trying to load genome information using the read.PLINK feature. The data uploads fine with no error messages, however when I examine the @other slot, I see that the SNP positions from the map file are not uploaded. I've checked that my .map file contains all the necessary columns and all the information is there.
> chr2L <- read.PLINK("2L.raw", map.file="2L_hwe_cleaned.map", chunkSize=10000, parallel=FALSE)
> chr2L$other$position
NULL
Any ideas?
Thanks,
Danica
From t.jombart at imperial.ac.uk Mon Nov 11 10:34:40 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 11 Nov 2013 09:34:40 +0000
Subject: [adegenet-forum] Screeplot showing spatial and variance
components of eigenvalues fails due to complex numbers
In-Reply-To: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
References: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390B16B@icexch-m1.ic.ac.uk>
Hello,
The bug was not coming from adegenet, but from an oddity in 'eigen', which for some large symmetric matrices returns complex eigenvalues with imaginary parts equal to zero. This is fixed in the attached patch, which is on sourceforge now and will be integrated into the next stable CRAN release.
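The attached patch is not reproduced in this archive, so the following is only a sketch of one common way to guard against this behaviour, not the actual fix. The helper name real_eigenvalues and its tolerance are assumptions made for illustration.

```r
# Hypothetical helper (not part of adegenet): return eigenvalues as plain
# numerics when their imaginary parts are zero up to a tolerance.
real_eigenvalues <- function(values, tol = 1e-12) {
  if (is.complex(values)) {
    if (any(abs(Im(values)) > tol)) {
      stop("eigenvalues have non-negligible imaginary parts")
    }
    values <- Re(values)
  }
  values
}

# Values of the kind reported in this thread: complex type, zero imaginary part.
eigL <- complex(real = c(1.02575, 0.5, -0.2), imaginary = 0)

# min() is not defined for complex vectors; capture the error message.
res <- tryCatch(min(eigL), error = function(e) conditionMessage(e))

# After coercion the usual summaries work again.
eig_real <- real_eigenvalues(eigL)
range(eig_real)
```

Another option, when the matrix is known to be symmetric, is to call eigen(x, symmetric = TRUE), which skips the symmetry test and returns real eigenvalues.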
Best
Thibaut
--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spca.R
Type: application/octet-stream
Size: 12866 bytes
Desc: spca.R
URL:
From M.Coulson at MARLAB.AC.UK Mon Nov 11 10:50:08 2013
From: M.Coulson at MARLAB.AC.UK (Mark Coulson)
Date: Mon, 11 Nov 2013 09:50:08 -0000
Subject: [adegenet-forum] identification of hybrids
Message-ID: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
Hello,
I am attempting to use adegenet in a similar fashion to how one may use
STRUCTURE to identify hybrids/admixed individuals. I know the compoplot
function will allow for a STRUCTURE-like bar plot but my question is
given the differences between STRUCTURE and compoplot, can one still
make the same inferences about the identification of hybrids? In
STRUCTURE I have been using a q-value cut-off from known individuals to
identify possible hybrids (also simulating known hybrids) so that
individuals falling below the q-value for 'pure species membership'
would fall into this category. Given compoplot is a probability rather
than a membership coefficient, is this type of an approach valid?
Best,
Mark
______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________
From t.jombart at imperial.ac.uk Mon Nov 11 11:17:14 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 11 Nov 2013 10:17:14 +0000
Subject: [adegenet-forum] identification of hybrids
In-Reply-To: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>
Hello,
STRUCTURE uses a mixture model to partition each genotype into membership to the different populations, which is probably what one is looking for when investigating hybridization. However, this depends on STRUCTURE actually detecting the population structure in the first place, which it may fail to do, especially when the system departs from a standard island model.
DAPC is usually better at finding the existing population structure, but the group membership probabilities are not derived from a population genetic model. These values are derived from the position of the genotypes on the discriminant factors. This can be practical, but is slightly less satisfying from a theoretical point of view. Still, one expects hybrids to fall between their parental groups, so it should work.
The important point to be careful about is that these membership probabilities will change if the discriminant functions change (i.e. if a different number of PCA axes is retained). I strongly recommend using cross-validation to choose that number (see function xvalDapc). Then, if you can find a DAPC giving satisfactory group prediction, the compoplot should indeed point out hybrids.
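To make the dependence on the number of retained axes concrete, here is a rough stand-in written with base R and MASS rather than adegenet: a PCA followed by a linear discriminant analysis, cross-validated over the number of PCs. The iris data, the function name xval_success and the 90% training fraction are all assumptions for illustration; this is not the xvalDapc implementation.

```r
library(MASS)  # lda(); MASS ships with standard R installations

set.seed(1)
X   <- as.matrix(iris[, 1:4])   # stand-in for an allele table
grp <- iris$Species             # stand-in for known groups

# Mean proportion of held-out individuals assigned to their true group
# when n_pca principal components are retained.
xval_success <- function(n_pca, n_rep = 20, train_frac = 0.9) {
  mean(replicate(n_rep, {
    train <- sample(nrow(X), size = round(train_frac * nrow(X)))
    # PCA and discriminant functions are fitted on the training set only
    pca <- prcomp(X[train, ], center = TRUE, scale. = FALSE)
    fit <- lda(pca$x[, 1:n_pca, drop = FALSE], grouping = grp[train])
    # Held-out individuals are projected into the training PCA space
    test_scores <- predict(pca, newdata = X[-train, , drop = FALSE])
    pred <- predict(fit, test_scores[, 1:n_pca, drop = FALSE])$class
    mean(pred == grp[-train])
  }))
}

success <- sapply(1:4, xval_success)
names(success) <- paste0("PCs=", 1:4)
success
```

With real genotype data the candidate range would be much wider, and xvalDapc is the function to use on genind objects; the sketch only shows why prediction success has to be compared across numbers of PCs before reading a compoplot.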
Sébastien Devillard has worked on exactly these issues, but I am unsure whether the paper has been published; I'll let him comment on that.
Best
Thibaut
From M.Coulson at MARLAB.AC.UK Mon Nov 11 13:47:14 2013
From: M.Coulson at MARLAB.AC.UK (Mark Coulson)
Date: Mon, 11 Nov 2013 12:47:14 -0000
Subject: [adegenet-forum] identification of hybrids
References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>
Message-ID: <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk>
Hi Dr. Jombart,
Many thanks for your quick reply. I will try out the xvalDapc option, but I have a question about it. I ran the example provided for this function and found that both fewer and many more components gave a higher variance in success than, say, ~50-70. Why would more components have a higher variance? I would have thought that this many might actually overfit the data.
Furthermore, I should clarify that I have three known baselines, and these will routinely be used to compare individuals of unknown origin to identify possible hybrids. Is it therefore possible to bring in the unknowns as a separate file and have them imposed upon the discriminant space provided by the baseline (i.e. similar to pre-specifying the origin of some individuals to assist with clustering of unknowns in STRUCTURE)?
Many thanks,
Mark
From t.jombart at imperial.ac.uk Mon Nov 11 16:06:47 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 11 Nov 2013 15:06:47 +0000
Subject: [adegenet-forum] identification of hybrids
In-Reply-To: <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk>
References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>,
<1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390C32C@icexch-m1.ic.ac.uk>
Hi again,
There can be multiple explanations for the overfitting patterns you observe, some of which could well lie within the data themselves (e.g. outliers, or groups defined by few individuals). The main expectation is that there should be a number of PCs which is optimal in terms of prediction; there may be many drivers for the variance in non-optimal solutions.
As for the second point, yes, this is exactly the projection of supplementary individuals described at the end of the DAPC vignette. You calibrate the DAPC with individuals from known groups, and predict the group membership of the supplementary individuals.
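The calibrate-then-predict idea can be sketched outside adegenet with base R and MASS. Everything here is an assumption for illustration: the iris data stand in for genotypes, the 120/30 split into baseline and unknowns is arbitrary, and three PCs are retained without cross-validation.

```r
library(MASS)  # lda(); MASS ships with standard R installations

set.seed(42)
X   <- as.matrix(iris[, 1:4])   # stand-in for an allele table
grp <- iris$Species             # stand-in for the three baselines

# Hypothetical split: 120 baseline individuals of known origin, 30 unknowns.
unknown  <- sample(nrow(X), 30)
baseline <- setdiff(seq_len(nrow(X)), unknown)

# Calibrate on the baseline only.
pca   <- prcomp(X[baseline, ], center = TRUE, scale. = FALSE)
n_pca <- 3
fit   <- lda(pca$x[, 1:n_pca, drop = FALSE], grouping = grp[baseline])

# Project the unknowns into the baseline PCA space, then onto the
# discriminant functions fitted on the baseline.
sup_scores <- predict(pca, newdata = X[unknown, , drop = FALSE])
pred <- predict(fit, sup_scores[, 1:n_pca, drop = FALSE])

# Membership probabilities: one row per unknown, one column per baseline group.
round(head(pred$posterior), 3)
```

In adegenet the same workflow is, as far as I know, handled by the predict method for dapc objects with the supplementary individuals passed as new data; the vignette section mentioned above is the reference for the exact call.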
Cheers
Thibaut
From M.Coulson at MARLAB.AC.UK Tue Nov 12 14:01:14 2013
From: M.Coulson at MARLAB.AC.UK (Mark Coulson)
Date: Tue, 12 Nov 2013 13:01:14 -0000
Subject: [adegenet-forum] identification of hybrids
References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>,
<1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390C32C@icexch-m1.ic.ac.uk>
<5281F54A.2030202@univ-lyon1.fr>
Message-ID: <1BA13B469D9E89408AAA651AC9B309160103F257@sose0009g.marlab.ac.uk>
Many thanks for the addition re: the comparison between STRUCTURE and adegenet. I am working with three distinct groups, and STRUCTURE has a hard time separating groups 2 and 3 (so it really only identifies 2 groups). The third group is a much smaller sample (n=75) compared to the other two baselines (100s-1000s), and I suspect that is having an effect as described in Kalinowski 2011. If one uses supplementary individuals to assign to these three groups, what would happen if some of the individuals were from a 4th distinct group that had not been sampled in the baseline? In other words, can the posterior probabilities fail to assign this individual to any of the three represented groups (or at least assign it with poor probability), so that it can be considered excluded from these baselines?
Thanks,
Mark
-----Original Message-----
From: Sebastien Devillard [mailto:sebastien.devillard at univ-lyon1.fr]
Sent: Tue 11/12/2013 09:30
To: Jombart, Thibaut; Mark Coulson; adegenet-forum at lists.r-forge.r-project.org
Subject: Re: identification of hybrids
Hi,
Just a small addition to Thibaut's answer.
From my own unpublished experience in comparing/interpreting results
from STRUCTURE and DAPC in identifying hybrids of different generations
(simulated microsatellite genotypes), I recorded a clear tendency towards
a less continuous distribution of "individual introgression"
coefficients (namely q score in STRUCTURE and membership probability in
DAPC) in DAPC. In other words, higher scores to one of the parental
populations are more often found in DAPC than in STRUCTURE; hence, the
population hybridization rate tends to be lower in DAPC than in
STRUCTURE (although I never made simulations to check whether STRUCTURE
or DAPC is closer to the truth). As Thibaut underlined, there is in
STRUCTURE a genetic model which is not present in DAPC, and it is likely
the origin of the difference.
Hope this helps
Sébastien
Le 11/11/2013 16:06, Jombart, Thibaut a ?crit :
> Hi again,
>
> there can be multiple explanation for the overfitting patterns you observe, so of which could well lie within the data themself (e.g. outliers, or groups defined by few individuals). The main expectation is that there should be a number of PCs which is optimal in terms of prediction; there may be many drivers for the variance in non-optimal solutions.
>
> As for the second point, yes, this is exactly the projection of supplementary individuals described at the end of the DAPC vignette. You calibrate the DAPC with individuals from known groups, and predict the group membership of the supplementary individuals.
>
> Cheers
> Thibaut
>
>
> ________________________________________
> From: Mark Coulson [M.Coulson at MARLAB.AC.UK]
> Sent: 11 November 2013 12:47
> To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
> Cc: sebastien.devillard at univ-lyon1.fr
> Subject: RE: identification of hybrids
>
> Hi Dr. Jombart,
>
> Many thanks for your quick reply and I will try out the xvalDapc option, however, I have a question on this. I did the example for this option provided and found that both fewer and many more components had a higher variance in success than say ~ 50-70. Why would more components have a higher variance, as I would have thought this many might actually overfit the data?
>
> furthermore, I should clarify that I have three known baselines (and these will routinely be used to compare individuals of unknown origin to identify possible hybrids. Therefore is it possible to bring in the unknowns as a separate file and to have them be imposed upon the discriminant space provided by the baseline (i.e. similar to pre-specifying the origin of some individuals to assist with clustering of unknowns in STRUCTURE).
>
> Many thanks,
>
> Mark
>
>
>
>
>
> -----Original Message-----
> From: Jombart, Thibaut [mailto:t.jombart at imperial.ac.uk]
> Sent: Mon 11/11/2013 10:17
> To: Mark Coulson; adegenet-forum at lists.r-forge.r-project.org
> Cc: sebastien.devillard at univ-lyon1.fr
> Subject: RE: identification of hybrids
>
> Hello,
>
> STRUCTURE uses a mixture model to partition each genotype into membership to the different populations, which is probably what one is looking for when investigating hybridization. However, this depends on STRUCTURE actually detecting the population structuring in the first place, which it may fail to do, especially when the system departs from a standard island model.
>
> DAPC is usually better at finding the existing population structure, but the group membership probabilities are not derived from a population genetic model. These values are derived from the position of the genotypes on the discriminant factors. This can be practical, but is slightly less satisfying from a theoretical point of view. Still, one expects hybrids to fall between their parental groups, so it should work.
>
> The important point one needs to be careful about is that these membership probabilities will change if the discriminant functions change (i.e. if different numbers of PCA axes are retained). I strongly recommend using cross-validation for this purpose (see function xvalDapc). Then, if you can find a DAPC giving satisfying group prediction, the compoplot should indeed point out hybrids.
>
> Sébastien Devillard has worked on exactly these issues, but I am unsure whether the paper has been published - I'll leave it to him to comment on that.
>
> Best
> Thibaut
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary's Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Mark Coulson [M.Coulson at MARLAB.AC.UK]
> Sent: 11 November 2013 09:50
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] identification of hybrids
>
> Hello,
>
> I am attempting to use adegenet in a similar fashion to how one may use STRUCTURE to identify hybrids/admixed individuals. I know the compoplot function will allow for a STRUCTURE-like bar plot, but my question is: given the differences between STRUCTURE and compoplot, can one still make the same inferences about the identification of hybrids? In STRUCTURE I have been using a q-value cut-off from known individuals to identify possible hybrids (also simulating known hybrids), so that individuals falling below the q-value for 'pure species membership' would fall into this category. Given that compoplot shows a probability rather than a membership coefficient, is this type of approach valid?
>
> Best,
>
> Mark
>
>
> ______________________________________________________________________
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
> ______________________________________________________________________
--
Sébastien Devillard, PhD, Associate Professor
UMR 5558 "Biometry and Evolutionary Biology"
43 bd du 11 novembre 1918,
69622 Villeurbanne cedex
France
Phone :+33 (0)4 72 44 81 70
Fax : +33 (0)4 72 43 13 88
sebastien.devillard at univ-lyon1.fr
http://lbbe.univ-lyon1.fr/-Devillard-Sebastien-.html
http://sebastien.devillard.perso.sfr.fr
From t.jombart at imperial.ac.uk Tue Nov 12 21:07:17 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 12 Nov 2013 20:07:17 +0000
Subject: [adegenet-forum] identification of hybrids
In-Reply-To: <1BA13B469D9E89408AAA651AC9B309160103F257@sose0009g.marlab.ac.uk>
References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>,
<1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390C32C@icexch-m1.ic.ac.uk>
<5281F54A.2030202@univ-lyon1.fr>,
<1BA13B469D9E89408AAA651AC9B309160103F257@sose0009g.marlab.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390C794@icexch-m1.ic.ac.uk>
Hi there,
by definition, no, the analysis cannot assign new individuals to a group that was not part of the 'training' set.
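To illustrate the point above, here is a minimal sketch; it is not adegenet's own code. It assumes the MASS package is available and uses prcomp() followed by MASS::lda() as a stand-in for dapc() and predict.dapc(). The simulated data, the group labels A to D and all variable names are hypothetical. Because posterior membership probabilities are normalised over the training groups only, they sum to one for every supplementary individual, including individuals drawn from an unsampled fourth group.

```r
library(MASS)  # assumed available; provides lda()

set.seed(1)
n <- 50   # individuals per baseline group (hypothetical)
p <- 20   # number of variables, standing in for allele frequencies

# Three baseline groups A, B, C with different centres
centres <- rbind(A = rep(0, p),
                 B = c(rep(3, p / 2), rep(0, p / 2)),
                 C = c(rep(0, p / 2), rep(3, p / 2)))
grp <- factor(rep(rownames(centres), each = n))
X <- centres[as.character(grp), ] + matrix(rnorm(3 * n * p), ncol = p)

# Supplementary individuals from a fourth group D, absent from the baseline
D <- matrix(rnorm(10 * p, mean = -3), ncol = p)
colnames(X) <- colnames(D) <- paste0("snp", seq_len(p))

# PCA on the baseline, then discriminant analysis on the retained PCs
n_pca <- 5
pca <- prcomp(X, center = TRUE, scale. = FALSE)
fit <- lda(pca$x[, 1:n_pca], grouping = grp)

# Project the supplementary individuals and predict their membership
D_scores <- predict(pca, newdata = D)[, 1:n_pca, drop = FALSE]
pred <- predict(fit, D_scores)

rowSums(pred$posterior)  # each row sums to 1 over A, B and C
table(pred$class)        # only A, B and C can ever be predicted
```

In this sketch every individual from group D is still assigned to A, B or C. Looking at where supplementary individuals fall on the discriminant axes may help spot such cases, but that is a judgment call rather than a formal exclusion test.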
Cheers
Thibaut
________________________________________
From: Mark Coulson [M.Coulson at MARLAB.AC.UK]
Sent: 12 November 2013 13:01
To: sebastien.devillard at univ-lyon1.fr; Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: RE: identification of hybrids
Many thanks for the addition re: the comparison between STRUCTURE and adegenet. I am working with three distinct groups and STRUCTURE has a hard time separating groups 2 and 3 (thereby really only identifying 2 groups). The third group is a much smaller sample (n=75) compared to the other two baselines (100s-1000s) and I suspect that is having an effect as described in Kalinowski 2011. If one uses supplementary individuals to assign to these three groups, what would happen if some of the individuals were from a 4th distinct group that had not been sampled in the baseline? In other words, can the posterior probabilities fail to assign such an individual to any of the three represented groups (or at least assign it only with poor probability), so that it can be considered excluded from these baselines?
Thanks,
Mark
-----Original Message-----
From: Sebastien Devillard [mailto:sebastien.devillard at univ-lyon1.fr]
Sent: Tue 11/12/2013 09:30
To: Jombart, Thibaut; Mark Coulson; adegenet-forum at lists.r-forge.r-project.org
Subject: Re: identification of hybrids
Hi,
Just a small addition to Thibaut's answer.
From my own unpublished experience in comparing/interpreting results
from STRUCTURE and DAPC in identifying hybrids of different generations
(simulated microsatellite genotypes), I recorded a clear tendency towards
a less continuous distribution of "individual introgression"
coefficients (namely q score in STRUCTURE and membership probability in
DAPC) in DAPC. In other words, higher scores to one of the parental
populations are more often found in DAPC than in STRUCTURE; hence, the
population hybridization rate tends to be lower in DAPC than in
STRUCTURE (although I never made simulations to check whether STRUCTURE
or DAPC is closer to the truth). As Thibaut underlined, STRUCTURE
includes a genetic model which is not present in DAPC, and this is likely
the origin of the difference.
Hope this helps
Sébastien
From fernando.cruz at ebd.csic.es Fri Nov 15 19:53:12 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Fri, 15 Nov 2013 19:53:12 +0100
Subject: [adegenet-forum] Request an example of genetic distance among two
individuals
Message-ID: <52866D98.6040106@ebd.csic.es>
Hi Thibaut,
I built an NJ tree using 1M SNPs with 10 samples, following the
instructions in the documentation. However, I would like to know exactly
how the genetic distance among individuals is calculated. Is it based on the
number of shared alleles?
Could you provide a simple example? For instance, for these two individuals using
5 SNPs:
Ind1 00122
Ind2 02210
Using the binary information, they share 2+0+1+1+0= 4 alleles out of 10
Thanks in advance,
Fernando Cruz
--
****************************************
Dr. Fernando Cruz
Estación Biológica de Doñana (EBD-CSIC)
Avd. Americo Vespucio s/n
41092-Seville (Spain)
Tel. +34 954466700/Ext. 1079
Fax: +34 95 4621125
Room: 0/12
e-mail: fernando.cruz at ebd.csic.es
Website: http://openwetware.org/wiki/User:Fernando_Cruz
Web EcoGenes EU-FP7: http://www.ebd.csic.es/ecogenes/news.html
****************************************
From t.jombart at imperial.ac.uk Sun Nov 17 16:07:32 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 15:07:32 +0000
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <52866D98.6040106@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
Hello there,
there are many different distances that can be computed between allelic profiles, but at the individual level there are somewhat fewer options.
One is the Hamming distance, which you mention here (D=6), and which you can deduce from 'propShared'.
The usual Euclidean distance is different though. Between two vectors of allelic profiles x=[x_i] and y=[y_i], the Euclidean distance is given by (using LaTeX notation):
D(x,y) = || x - y || = sqrt{ (x-y)^T (x-y)} = sqrt(\sum_i (x_i - y_i)^2
Using your example:
> x <- c(0,0,1,2,2)
> y <- c(0,2,2,1,0)
> sqrt(sum((x-y)^2))
[1] 3.162278
> dist(rbind.data.frame(x,y))
1
2 3.162278
Note that in adegenet, data in genind objects are standardized to relative frequencies, so that the distance would be different:
> x.rel <- x/2
> y.rel <- y/2
> dist(rbind.data.frame(x.rel,y.rel))
1
2 1.581139
That is, the distance between the raw allele count profiles divided by the ploidy.
As a last note, there is a particular case for haploid data, where the Hamming distance equals the squared Euclidean distance (it follows that a PCA on the covariance matrix is also the best reduced-space representation of Hamming distances).
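As a hedged illustration of the distances discussed above, the base R sketch below reuses the x and y vectors from the example. It assumes that each value is the count of one allele at a biallelic diploid SNP, so that the number of unshared alleles at a locus is |x_i - y_i|. The haploid profiles h1 and h2 and the genotypes g0, g1 and g2 are made up for illustration, and none of this is adegenet's internal code.

```r
x <- c(0, 0, 1, 2, 2)
y <- c(0, 2, 2, 1, 0)
ploidy <- 2

# Allele-sharing view: unshared alleles summed over loci (the D = 6 above)
unshared <- sum(abs(x - y))
shared <- ploidy * length(x) - unshared        # 4 alleles out of 10
prop_shared <- shared / (ploidy * length(x))   # 0.4

# Euclidean distance on counts and on relative frequencies
d_counts <- sqrt(sum((x - y)^2))
d_freq <- sqrt(sum((x / ploidy - y / ploidy)^2))
all.equal(d_freq, d_counts / ploidy)           # TRUE

# Haploid 0/1 profiles: Hamming distance equals squared Euclidean distance
h1 <- c(0, 1, 1, 0, 1)
h2 <- c(1, 1, 0, 0, 0)
hamming <- sum(h1 != h2)
sq_euclid <- sum((h1 - h2)^2)

# Same number of unshared alleles, different Euclidean distances
g0 <- c(0, 0)
g1 <- c(2, 0)   # two unshared alleles at one locus
g2 <- c(1, 1)   # one unshared allele at each of two loci
c(sum(abs(g0 - g1)), sum(abs(g0 - g2)))               # both equal 2
c(sqrt(sum((g0 - g1)^2)), sqrt(sum((g0 - g2)^2)))     # 2 and sqrt(2)
```

The last two lines show why the two measures are related but not interchangeable for diploid data: the Euclidean distance squares the per-locus differences, so it weights a homozygote-to-homozygote difference more heavily than two heterozygote differences.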
Cheers
Thibaut
--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
From t.jombart at imperial.ac.uk Sun Nov 17 16:23:52 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 15:23:52 +0000
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
References: <52866D98.6040106@ebd.csic.es>,
<2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>
Just realized a typo:
sqrt(\sum_i (x_i - y_i)^2
should read
sqrt{ \sum_i (x_i - y_i)^2 }
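With this correction, the full expression from the previous message can be typeset as follows (a LaTeX rendering of the same formula, with no new terms added):

```latex
D(x,y) = \| x - y \| = \sqrt{(x-y)^{T}(x-y)} = \sqrt{\sum_i (x_i - y_i)^2}
```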
Cheers
Thibaut
From fernando.cruz at ebd.csic.es Sun Nov 17 16:41:36 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Sun, 17 Nov 2013 16:41:36 +0100
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>
References: <52866D98.6040106@ebd.csic.es>,
<2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>
Message-ID: <5288E3B0.8080206@ebd.csic.es>
Thanks Thibaut,
This clarifies things. In both the Euclidean and the Hamming distances, the
distance between a pair of individuals depends on the number of
"unshared alleles".
By the way, the standardized distance is then what is plotted in the NJ tree,
instead of the Saitou & Nei (1987) one used by the APE library, right?
Cheers,
Fernando
--
****************************************
Dr. Fernando Cruz
Estación Biológica de Doñana (EBD-CSIC)
Avd. Americo Vespucio s/n
41092-Seville (Spain)
Tel. +34 954466700/Ext. 1079
Fax: +34 95 4621125
Room: 0/12
e-mail:fernando.cruz at ebd.csic.es
Website:http://openwetware.org/wiki/User:Fernando_Cruz
Web EcoGenes EU-FP7:http://www.ebd.csic.es/ecogenes/news.html
****************************************
From t.jombart at imperial.ac.uk Sun Nov 17 16:45:51 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 15:45:51 +0000
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <5288E3B0.8080206@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>,
<2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>,
<5288E3B0.8080206@ebd.csic.es>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>
Hi there,
I'm not sure which tree you are referring to.
Cheers
Thibaut
________________________________________
From: Fernando Cruz [fernando.cruz at ebd.csic.es]
Sent: 17 November 2013 15:41
To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] Request an example of genetic distance among two individuals
Thanks Thibaut,
This clarifies things. In both the Euclidean and the Hamming distances, the
distance between a pair of individuals depends on the number of
"unshared alleles".
By the way, the standardized distance is then what is plotted in the NJ tree,
instead of the Saitou & Nei (1987) one used by the APE library, right?
Cheers,
Fernando
On 11/17/13 4:23 PM, Jombart, Thibaut wrote:
> Just realized a typo:
>
> sqrt(\sum_i (x_i - y_i)^2
>
> should read
>
> sqrt{ \sum_i (x_i - y_i)^2 }
>
> Cheers
> Thibaut
> ________________________________________
> From:adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Jombart, Thibaut [t.jombart at imperial.ac.uk]
> Sent: 17 November 2013 15:07
> To: Fernando Cruz;adegenet-forum at lists.r-forge.r-project.org
> Subject: Re: [adegenet-forum] Request an example of genetic distance among two individuals
>
> Hello there,
>
> there are many different distances that can be computed between allelic profiles, but at an individual levels there is somewhat less options.
>
> One is the Hamming distance, which you mention here (D=6), and which you can deduce from 'propShared'.
>
> The usual Euclidean distance is different though. Between two vectors of allelic profiles x=[x_i] and y=[y_i], the Euclidean distance is given by (using latex notations):
>
> D(x,y) = || x - y || = sqrt{ (x-y)^T (x-y)} = sqrt(\sum_i (x_i - y_i)^2
>
> Using your example:
>> x <- c(0,0,1,2,2)
>> y <- c(0,2,2,1,0)
>> sqrt(sum((x-y)^2))
> [1] 3.162278
>> dist(rbind.data.frame(x,y))
> 1
> 2 3.162278
>
>
> Note that in adegenet, data in genind objects are standardized to relative frequencies, so that the distance would be different:
>> x.rel <- x/2
>> y.rel <- y/2
>> dist(rbind.data.frame(x.rel,y.rel))
> 1
> 2 1.581139
>
> That is, the distance between the raw allele count profiles divided by the ploidy.
>
> As a last note, there is a particular case for haploid data, where the Hamming distance equals the squared Euclidean distance (it follows that a PCA on the covariance matrix is also the best reduced-space representation of Hamming distances).
>
> Cheers
>
> Thibaut
>
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary's Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From:adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Fernando Cruz [fernando.cruz at ebd.csic.es]
> Sent: 15 November 2013 18:53
> To:adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] Request an example of genetic distance among two individuals
>
> Hi Thibaut,
>
> I performed a NJ Tree using 1M SNPs with 10 samples, following the
> instructions in the documentation. However I would like to know exactly
> how the genetic distance among individuals is calculated. Is it based on the
> number of shared alleles?
>
> Could you provide a simple example? Like for these two individuals using
> 5 SNPs:
> Ind1 00122
> Ind2 02210
>
> Using the binary information, they share 2+0+1+1+0= 4 alleles out of 10
>
> Thanks in advance,
> Fernando Cruz
>
>
> --
> ****************************************
> Dr. Fernando Cruz
> Estación Biológica de Doñana (EBD-CSIC)
> Avd. Americo Vespucio s/n
> 41092-Seville (Spain)
> Tel. +34 954466700/Ext. 1079
> Fax: +34 95 4621125
> Room: 0/12
>
> e-mail:fernando.cruz at ebd.csic.es
> Website:http://openwetware.org/wiki/User:Fernando_Cruz
> Web EcoGenes EU-FP7:http://www.ebd.csic.es/ecogenes/news.html
> ****************************************
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
--
****************************************
Dr. Fernando Cruz
Estación Biológica de Doñana (EBD-CSIC)
Avd. Americo Vespucio s/n
41092-Seville (Spain)
Tel. +34 954466700/Ext. 1079
Fax: +34 95 4621125
Room: 0/12
e-mail:fernando.cruz at ebd.csic.es
Website:http://openwetware.org/wiki/User:Fernando_Cruz
Web EcoGenes EU-FP7:http://www.ebd.csic.es/ecogenes/news.html
****************************************
From fernando.cruz at ebd.csic.es Sun Nov 17 17:03:21 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Sun, 17 Nov 2013 17:03:21 +0100
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>
References: <52866D98.6040106@ebd.csic.es>,
<2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>,
<5288E3B0.8080206@ebd.csic.es>
<2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>
Message-ID: <5288E8C9.6050505@ebd.csic.es>
Hi Thibaut,
The nj tree of APE. What I basically did was:
mygenlight <- read.snp("/Users/Nando/Documents/mydata.snp", chunk=2)
x<- seploc(k31_13c_lp23,n.block=100) # ~10000 SNPs each
library(ape)
# dist is used within an lapply loop to compute pairwise distances between individuals for each block
lD<-lapply(x, function(e) dist(as.matrix(e)))
class(lD[[1]])
#The general distance matrix is obtained by summing these:
D <- Reduce("+", lD)
plot (nj(D), type="fan")
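As a side note on what D ends up containing, here is a minimal sketch on made-up data (the matrix `snp`, the block layout and all object names below are hypothetical, not taken from the data above). It assumes each block is a plain numeric matrix of allele counts. Summing the per-block Euclidean distances gives a usable distance, but it is not the same number as the Euclidean distance computed on all SNPs at once; summing the squared block distances and taking the square root does recover it.

```r
set.seed(1)

# Hypothetical toy data: 4 individuals x 12 SNPs, coded as 0/1/2 allele counts
snp <- matrix(sample(0:2, 4 * 12, replace = TRUE), nrow = 4)

# Split the 12 SNP columns into 3 blocks of 4, mimicking a block-wise workflow
blocks <- split(seq_len(ncol(snp)), rep(1:3, each = 4))

# Euclidean distance between individuals within each block
lD <- lapply(blocks, function(idx) dist(snp[, idx, drop = FALSE]))

# Sum of the block-wise Euclidean distances
D_sum <- Reduce("+", lD)

# Euclidean distance on the full matrix
D_full <- dist(snp)

# Square root of the summed squared block distances
D_from_squares <- sqrt(Reduce("+", lapply(lD, function(d) d^2)))

# The squared-sum version matches the full-data distance
same_as_full <- isTRUE(all.equal(as.vector(D_full), as.vector(D_from_squares)))

# The plain sum is never smaller than the full-data distance
sum_not_smaller <- all(as.vector(D_sum) >= as.vector(D_full) - 1e-12)
```

Either D_sum or D_from_squares can be given to nj(); the choice only changes which distance the tree is built from.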
Cheers,
Fernando
On 11/17/13 4:45 PM, Jombart, Thibaut wrote:
> Hi there,
>
> I'm not sure which tree you are referring to.
>
> Cheers
> Thibaut
> ________________________________________
> From: Fernando Cruz [fernando.cruz at ebd.csic.es]
> Sent: 17 November 2013 15:41
> To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
> Subject: Re: [adegenet-forum] Request an example of genetic distance among two individuals
>
> Thanks Thibaut,
>
> This clarifies things. In both the Euclidean and the Hamming distances, the
> distance between a pair of individuals depends on the number of
> "unshared alleles".
> By the way, the standardized distance is then the one plotted in the NJ tree,
> instead of the Saitou & Nei (1987) one used by the APE library, right?
>
> Cheers,
> Fernando
>
From fernando.cruz at ebd.csic.es Sun Nov 17 17:13:29 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Sun, 17 Nov 2013 17:13:29 +0100
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <5288E8C9.6050505@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>,
<2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>,
<5288E3B0.8080206@ebd.csic.es>
<2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>
<5288E8C9.6050505@ebd.csic.es>
Message-ID: <5288EB29.80202@ebd.csic.es>
Well, there's a typo, sorry: "k31_13c_lp23" is the same as "mygenlight".
Thanks,
Fernando
From t.jombart at imperial.ac.uk Sun Nov 17 19:52:59 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 18:52:59 +0000
Subject: [adegenet-forum] Request an example of genetic distance among
two individuals
In-Reply-To: <5288E8C9.6050505@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>,
<2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>,
<5288E3B0.8080206@ebd.csic.es>
<2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>,
<5288E8C9.6050505@ebd.csic.es>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D738@icexch-m1.ic.ac.uk>
Hello,
just to clarify, 'nj' from APE is agnostic with respect to the distance used.
Here in your code you are using 'dist', thus the Euclidean distance between SNP profiles.
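To make the difference between the distances concrete, here is a short base-R sketch using the two 5-SNP profiles from the original question. It assumes genotypes are coded as counts of the second allele with ploidy 2; the haploid vectors `xh` and `yh` and all variable names are hypothetical, made up for illustration only.

```r
# The two diploid profiles from the question (Ind1 00122, Ind2 02210)
x <- c(0, 0, 1, 2, 2)
y <- c(0, 2, 2, 1, 0)
ploidy <- 2

# Alleles shared per locus are ploidy - |x_i - y_i|; the rest are unshared
shared  <- sum(ploidy - abs(x - y))   # 4 alleles shared out of 10
hamming <- sum(abs(x - y))            # 6 alleles not shared

# Euclidean distance on raw counts, as computed by dist()
euclid      <- sqrt(sum((x - y)^2))
euclid_dist <- as.numeric(dist(rbind(x, y)))

# Euclidean distance on relative frequencies: raw distance divided by ploidy
euclid_rel <- as.numeric(dist(rbind(x / ploidy, y / ploidy)))

# Haploid special case (made-up 0/1 profiles): Hamming equals squared Euclidean
xh <- c(0, 1, 1, 0, 1)
yh <- c(1, 1, 0, 0, 0)
hamming_h   <- sum(xh != yh)
sq_euclid_h <- sum((xh - yh)^2)
```

Passing the output of dist() to nj() therefore builds the tree from the Euclidean values, not from the count of unshared alleles.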
Cheers
Thibaut
From katherine.miller at students.tamuk.edu Sat Nov 23 22:36:57 2013
From: katherine.miller at students.tamuk.edu (katherine.miller)
Date: Sat, 23 Nov 2013 21:36:57 +0000
Subject: [adegenet-forum] UTMs vs lat/lon,
and inverse distance network confusion
Message-ID: <8DECF27DB3B2534F8C9873C9B734441F713E9765@BY2PRD0810MB392.namprd08.prod.outlook.com>
Greetings,
I am new to SPCA analysis, and haven't used R that much either. I have 2 somewhat related questions:
1) I read somewhere that lat and lon have to be converted to UTMs prior to analysis. I have a rather large spatial area (Iowa to Texas), so it spans 14R to 15T regions. I've looked at this page, http://www.inside-r.org/packages/cran/PBSmapping/docs/convUL , and I'm wondering about the erroneous results and how this will work. Has anyone out there done SPCA with a large area?
2) Additionally, I've tried to duplicate a set of data on a smaller scale (XY data is already in UTMs, alleles from 14 loci are in a .gen file), and when I am prompted to choose a network, I choose 7 (inverse distance, the locations are not evenly distributed). I am prompted to choose an exponent and a minimum distance. I understand the exponent of 1, I think, but either my understanding of the min distance is wrong, or something else is producing this error:
error in if (any(x < 0)) stop("values in x cannot be negative") : missing value where TRUE/FALSE needed
I have read through the tutorials for spca and adegenet, but clearly I'm still confused. Any help would be appreciated!
Katherine S. Miller
Ph.D. candidate
Caesar Kleberg Wildlife Research Institute
Texas A&M University-Kingsville
MSC 218, 700 University Blvd
Kingsville, TX 78363
(361) 593-4486, office
From t.jombart at imperial.ac.uk Sun Nov 24 15:45:30 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 24 Nov 2013 14:45:30 +0000
Subject: [adegenet-forum] UTMs vs lat/lon,
and inverse distance network confusion
In-Reply-To: <8DECF27DB3B2534F8C9873C9B734441F713E9765@BY2PRD0810MB392.namprd08.prod.outlook.com>
References: <8DECF27DB3B2534F8C9873C9B734441F713E9765@BY2PRD0810MB392.namprd08.prod.outlook.com>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657063918FF6@icexch-m1.ic.ac.uk>
Dear Katherine,
sPCA uses a rather crude model of spatial proximities (most commonly a binary connection network), so that conversion from latitudes/longitudes, even at that regional scale, should not be much of an issue.
As for the choice of network or the error you report, it is difficult to provide advice / guess the origin of the error without a spatial distribution of your locations, or a sample of data and code reproducing the error. In general, I would advocate using a binary connection network (e.g. Delaunay's triangulation, Gabriel's graph) where possible. If the sampling design is very uneven, treating clusters as populations (possibly with finer scale, within-population geographic analyses) might be an option.
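As a first check before building any network, it can help to look at the coordinates themselves. The sketch below is only a guess at where to start, assuming the coordinates sit in a two-column matrix or data frame; `check_xy` is a hypothetical helper, not an adegenet or spdep function, and the toy coordinates are made up. In R, `if (any(x < 0))` fails with "missing value where TRUE/FALSE needed" when `x` contains NA and no negative value, so missing coordinates are one possible cause worth ruling out.

```r
# Hypothetical helper: summarise basic properties of a coordinate table
check_xy <- function(xy) {
  xy <- as.matrix(xy)
  list(
    n_points     = nrow(xy),                   # number of sampled locations
    n_columns    = ncol(xy),                   # should be 2 (x and y)
    is_numeric   = is.numeric(xy),             # FALSE if read as text
    n_missing    = sum(!complete.cases(xy)),   # rows with at least one NA
    n_duplicated = sum(duplicated(xy))         # rows repeating an earlier location
  )
}

# Made-up UTM-like coordinates with one missing value and one repeated location
xy <- cbind(x = c(500000, 500100, NA, 500100),
            y = c(3000000, 3000200, 3000400, 3000200))

res <- check_xy(xy)
str(res)
```

If the summary shows missing values, non-numeric columns or an unexpected number of rows or columns, fixing the coordinate table is probably the first thing to try before changing the network type.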
Cheers
Thibaut
--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
From katherine.miller at students.tamuk.edu Sun Nov 24 19:35:31 2013
From: katherine.miller at students.tamuk.edu (katherine.miller)
Date: Sun, 24 Nov 2013 18:35:31 +0000
Subject: [adegenet-forum] UTMs vs lat/lon,
and inverse distance network confusion
Message-ID: <8DECF27DB3B2534F8C9873C9B734441F713E978F@BY2PRD0810MB392.namprd08.prod.outlook.com>
Thank you so much for your input.
When I try to do the Delaunay triangulation it tells me:
"Error in tri2nb(xy) : too few coordinates"
I'm wondering if the problem is the XY data. I followed the format from Robinson et al. 2012: The walk is never random: subtle landscape effects shape gene flow in a continuous white-tailed deer population in the Midwestern United States. http://datadryad.org/resource/doi:10.5061/dryad.p7639
I'm trying to duplicate Robinson et al.'s R script, uploaded with my data here:
https://www.dropbox.com/sh/8bckmlf2nbx5hdh/75HaBklOlG
This includes my genetic data, the .gen file, and the locations. The genetic data represents northern bobwhite genetic samples, each line a new sample, and alleles at 13 loci.
In response to the finer scale approach, I've been setting up my data for a spatial regression in spdep, but I would like to get the spatial pca to run first.
Comments and suggestions are definitely appreciated! Thank you in advance!
Katherine S. Miller
Ph.D. candidate
Caesar Kleberg Wildlife Research Institute
Texas A&M University-Kingsville
MSC 218, 700 University Blvd
Kingsville, TX 78363
(361) 593-4486, office
________________________________________
From: Jombart, Thibaut [t.jombart at imperial.ac.uk]
Sent: Sunday, November 24, 2013 8:45 AM
To: katherine.miller; adegenet-forum at lists.r-forge.r-project.org
Subject: RE: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion
Dear Katherine,
sPCA uses a rather crude model of spatial proximities (most commonly a binary connection network), so that conversion from latitudes/longitudes, even at that regional scale, should not be much of an issue.
As for the choice of network or the error you report, it is difficult to provide advice / guess the origin of the error without a spatial distribution of your locations, or a sample of data and code reproducing the error. In general, I would advocated using a binary connection network (e.g. Delaunay's triangulation, Gabriel's graph) where possible. If the sampling design is very uneven, treating clusters as populations (possibly with finer scale, within-population geographic analyses) might be an option.
Cheers
Thibaut
--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary?s Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of katherine.miller [katherine.miller at students.tamuk.edu]
Sent: 23 November 2013 21:36
To: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion
Greetings,
I am new to SPCA analysis, and haven't used R that much either. I have 2 somewhat related questions:
1) I read somewhere that lat and lon have to be converted to UTMs prior to analysis. I have a rather large spatial area (Iowa to Texas), so it spans 14R to 15T regions. I've looked at this page, http://www.inside-r.org/packages/cran/PBSmapping/docs/convUL , and I'm wondering about the erroneous results and how this will work. Has anyone out there done SPCA with a large area?
2) Additionally, I've tried to duplicate a set of data on a smaller scale (XY data is already in UTMs, alleles from 14 loci are in a .gen file), and when I am prompted to choose a network, I choose 7 (inverse distance, the locations are not evenly distributed). I am prompted to choose an exponent and a minimum distance. I understand the exponent of 1, I think, but either my understanding of the min distance is wrong, or something else is producing this error:
error in if (any(x < 0)) stop("values in x cannot be negative") : missing value where TRUE/FALSE needed
I have read through the tutorials for spca and adegenet, but clearly I'm still confused. Any help would be appreciated!
Katherine S. Miller
Ph.D. candidate
Caesar Kleberg Wildlife Research Institute
Texas A&M University-Kingsville
MSC 218, 700 University Blvd
Kingsville, TX 78363
(361) 593-4486, office
From RoyFrancis.Mathew at agrsci.dk Mon Nov 25 12:32:05 2013
From: RoyFrancis.Mathew at agrsci.dk (Roy Mathew Francis)
Date: Mon, 25 Nov 2013 11:32:05 +0000
Subject: [adegenet-forum] UTMs vs lat/lon,
and inverse distance network confusion
In-Reply-To: <8DECF27DB3B2534F8C9873C9B734441F713E978F@BY2PRD0810MB392.namprd08.prod.outlook.com>
References: <8DECF27DB3B2534F8C9873C9B734441F713E978F@BY2PRD0810MB392.namprd08.prod.outlook.com>
Message-ID:
Hi,
I am not an expert on this, but I have done some sPCA over large areas. If your area spans more than one UTM zone, just pick the single zone that fits best. You will still get coordinates for points outside that zone, since the projection is based on one central meridian. When you later plot the results on a background map, make sure the background map uses the same UTM coordinates. For points outside the zone, distances might be fine but the projection would be distorted.
Regarding the minimum distance, I always thought that was the distance your individuals could migrate (maybe the average or maximum distance). But maybe that applies to the neighbourhood-by-distance network (type 5) instead.
Roy
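On the minimum distance question, a minimal base-R sketch may help. As far as I understand the chooseCN documentation, for inverse distances (type 7) the minimum distance is a floor applied to the pairwise distances so that coincident points do not get infinite weights, rather than a dispersal distance. The helper inv_dist_weights and the coordinates below are hypothetical, and this is an illustration of the idea, not adegenet's internal code. The second half shows one way the reported "missing value where TRUE/FALSE needed" error could arise: a missing location makes the test any(x < 0) return NA.

```r
# Hypothetical coordinates in metres; points 2 and 3 share a location
xy <- cbind(x = c(0, 1000, 1000, 4000),
            y = c(0, 0, 0, 3000))

# Illustrative helper (assumption, not an adegenet function):
# inverse-distance weights with exponent a and distances floored at dmin
inv_dist_weights <- function(xy, a = 1, dmin = 1) {
  d <- as.matrix(dist(xy))   # pairwise Euclidean distances
  d[d < dmin] <- dmin        # floor small distances to avoid dividing by 0
  w <- 1 / d^a               # inverse-distance proximities
  diag(w) <- 0               # no point is its own neighbour
  w / rowSums(w)             # row-standardise the weights
}

w <- inv_dist_weights(xy, a = 1, dmin = 100)
print(round(w, 3))

# Same data with one location entirely missing
xy_bad <- xy
xy_bad[4, ] <- NA
w_bad <- inv_dist_weights(xy_bad, a = 1, dmin = 100)
print(any(w_bad < 0))        # NA, which if() cannot evaluate
print(anyNA(xy_bad))         # quick check to run on the coordinates first
```

If anyNA() returns TRUE for the real coordinates, removing or fixing those samples before building the network would be the first thing to try.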
-----Original Message-----
From: adegenet-forum-bounces at lists.r-forge.r-project.org [mailto:adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of katherine.miller
Sent: 24 November 2013 19:36
To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion
Thank you so much for your input.
When I try to do the Delaunay triangulation it tells me:
"Error in tri2nb(xy) : too few coordinates"
I'm wondering if the problem is the XY data. I followed the format from Robinson et al. 2012: The walk is never random: subtle landscape effects shape gene flow in a continuous white-tailed deer population in the Midwestern United States. http://datadryad.org/resource/doi:10.5061/dryad.p7639
I'm trying to duplicate Robinson et al.'s R script, uploaded with my data here:
https://www.dropbox.com/sh/8bckmlf2nbx5hdh/75HaBklOlG
This includes my genetic data, the .gen file, and the locations. The genetic data are northern bobwhite samples, one sample per line, with alleles at 13 loci.
In response to the finer scale approach, I've been setting up my data for a spatial regression in spdep, but I would like to get the spatial pca to run first.
Comments and suggestions are definitely appreciated! Thank you in advance!
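Regarding the "too few coordinates" error quoted above, the message suggests that the object passed as xy holds fewer usable points than the triangulation needs, so inspecting the coordinates before building the network seems a reasonable first step. The sketch below uses only base R; the function name check_xy and the example values are hypothetical, and the real coordinates would come from the genind object or the locations file.

```r
# Hypothetical coordinate table, as it might look after reading a file:
# rows 3 and 4 are duplicates and row 5 has a missing x value
xy_raw <- data.frame(x = c(500100, 500900, 501500, 501500, NA),
                     y = c(3100200, 3100800, 3101900, 3101900, 3102500))

# Illustrative helper (assumption, not from adegenet or spdep):
# summarises properties of xy that commonly break network construction
check_xy <- function(xy) {
  xy <- as.matrix(xy)
  list(
    is_numeric   = is.numeric(xy),             # FALSE if read as text
    n_points     = nrow(xy),
    n_columns    = ncol(xy),                   # expected: 2 (x and y)
    n_missing    = sum(!complete.cases(xy)),   # rows containing NA
    n_duplicated = sum(duplicated(xy))         # repeated locations
  )
}

report <- check_xy(xy_raw)
str(report)

# Keep complete, distinct locations only
xy_mat <- as.matrix(xy_raw)
xy_clean <- unique(xy_mat[complete.cases(xy_mat), , drop = FALSE])
print(nrow(xy_clean))
```

A triangulation needs at least three distinct points, so a very small row count here would point to a problem in how the coordinates were read or stored.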
Katherine S. Miller
Ph.D. candidate
Caesar Kleberg Wildlife Research Institute
Texas A&M University-Kingsville
MSC 218, 700 University Blvd
Kingsville, TX 78363
(361) 593-4486, office