From Aaron.Adamack at canberra.edu.au Wed Nov 6 13:21:45 2013
From: Aaron.Adamack at canberra.edu.au (Aaron.Adamack)
Date: Wed, 6 Nov 2013 12:21:45 +0000
Subject: [adegenet-forum] Screeplot showing spatial and variance components of eigenvalues fails due to complex numbers
Message-ID: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>

Hi,

I'm trying to perform a sPCA and am getting an error when I attempt to make a screeplot showing the spatial and variance components of the eigenvalues. The error seems to be coming from the summary command that gets run within screeplot as I get the following error message:

> summary(possum.spca2)
Spatial principal component analysis

Call: spca(obj = nonapossum, cn = possum.graph, scannf = FALSE, nfposi = 2, nfnega = 0)
Error in min(eigL) : invalid 'type' (complex) of argument

Looking at the step in summary just before it breaks, all (or nearly all) values of eigL are complex numbers (e.g. 1.025750e+00+0.000000e+00i). Other than this, I am able to go through all of the steps in the examples provided in adegenet-spca.pdf, so I'm not sure if this is a sign of problems with my data set or if it could be something else? I am pointing to problems in my data set as there is quite a bit of missing data in my genotypes (~12.4%) and I have 1605 individuals.

The code I'm running is:

... data organization steps to prepare my genind object dpossum ...
nonapossum<-na.replace(dpossum,met=0)
possum.graph<-chooseCN(nonapossum$other$xy,type=5,d1=0,d2=5000,plot=FALSE,res="listw")
possum.spca2<-spca(nonapossum,cn=possum.graph,scannf=FALSE,nfposi=2,nfnega=0)
screeplot(possum.spca2)

Any help in solving this would be greatly appreciated.

-Aaron

p.s. I think there may be a small typo on page 3 of the manual (adegenet-spca.pdf), I think the page reference for Numerical Ecology should be pp. 752-756 rather than pp. 572-576.
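The failure reported above can be reproduced outside sPCA: base R's min() rejects any complex vector, even when every imaginary part is zero. What follows is a minimal base-R sketch, not adegenet code. Only the first eigenvalue is taken from the report, the others are made up, and to_real() is a hypothetical helper that drops the imaginary parts only when they are negligible.

```r
# Eigenvalues stored as complex numbers with zero imaginary parts,
# mimicking the eigL vector described in the report (values 2 and 3 are made up).
eigL <- complex(real = c(1.025750, 0.8, 0.3), imaginary = 0)

# min() is not defined for complex input, whatever the imaginary parts are.
err <- tryCatch(min(eigL), error = function(e) conditionMessage(e))
print(err)

# Hypothetical helper: convert to real only if the imaginary parts are negligible.
to_real <- function(z, tol = 1e-8) {
  if (!is.complex(z)) return(z)
  if (max(abs(Im(z))) > tol) stop("non-negligible imaginary parts")
  Re(z)
}

eig_real <- to_real(eigL)
print(min(eig_real))
```

Whether discarding the imaginary parts is appropriate depends on why they appeared in the first place; the tolerance of 1e-8 is an arbitrary choice for illustration.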
From t.jombart at imperial.ac.uk Wed Nov 6 17:13:38 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Wed, 6 Nov 2013 16:13:38 +0000
Subject: [adegenet-forum] Screeplot showing spatial and variance components of eigenvalues fails due to complex numbers
In-Reply-To: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
References: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390984A@icexch-m1.ic.ac.uk>

Hello there,

thanks for reporting the error. I confess it is beyond me how one can get complex eigenvalues in sPCA. As this is the first time it has happened, there may be something quirky about this particular dataset. I would need a reproducible example to try to understand what is going on.

Thanks for the typo, fixed on the devel now.

Cheers
Thibaut

From danica_714 at hotmail.com Wed Nov 6 17:22:31 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Wed, 6 Nov 2013 16:22:31 +0000
Subject: [adegenet-forum] read.plink: Multiple cores is not supported on Windows
Message-ID:

Hi,

I am trying to upload my SNP dataset in the PLINK format; however, I get the following error message:

Reading PLINK raw format into a genlight object...
Loading required package: parallel
Reading loci information...
Reading and converting genotypes...
.Error in mclapply(txt, function(e) new("SNPbin", snp = e, ploidy = 2), :
  'mc.cores' > 1 is not supported on Windows

Is there a solution to this problem?

Thanks,
Danica
From caitiecollins17 at gmail.com Wed Nov 6 19:30:53 2013
From: caitiecollins17 at gmail.com (Caitlin Collins)
Date: Wed, 6 Nov 2013 18:30:53 +0000
Subject: [adegenet-forum] Fwd: read.plink: Multiple cores is not supported on Windows
In-Reply-To:
References:
Message-ID:

---------- Forwarded message ----------
From: Danica Fabrigar
Date: Wed, Nov 6, 2013 at 6:19 PM
Subject: RE: [adegenet-forum] read.plink: Multiple cores is not supported on Windows
To: Caitlin Collins

Hi Caitlin,

That did the trick. Thank you,
Danica

------------------------------
Date: Wed, 6 Nov 2013 17:57:23 +0000
Subject: Re: [adegenet-forum] read.plink: Multiple cores is not supported on Windows
From: caitiecollins17 at gmail.com
To: danica_714 at hotmail.com

Hi Danica,

While I cannot be certain without knowing precisely what you did to initiate the upload, this error message is usually resolved by adding the argument *parallel=FALSE* to the argument list of the function you called. (Note: in older versions of adegenet this argument was multicore=FALSE.)

Hope that helps.

Best,
Caitlin.

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
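The error in this thread is raised by mclapply() from the base 'parallel' package, which relies on forking and therefore only accepts mc.cores = 1 on Windows. Below is a minimal sketch of a portable guard using only 'parallel'. The variable names and the choice of 2 cores are illustrative assumptions, and the commented read.PLINK() call uses a placeholder file name.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to a single core there.
on_windows <- .Platform$OS.type == "windows"
n_cores <- if (on_windows) 1L else 2L

# With mc.cores = 1, mclapply() runs serially, so this works on every platform.
squares <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
print(unlist(squares))

# With adegenet, the equivalent switch is the 'parallel' argument suggested
# in the reply above, e.g. (placeholder file name, not run here):
# snps <- read.PLINK("mydata.raw", parallel = !on_windows)
```

Passing parallel=FALSE unconditionally, as suggested in the thread, is the simpler fix; the guard only matters for scripts shared between Windows and other systems.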
From danica_714 at hotmail.com Thu Nov 7 14:30:46 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Thu, 7 Nov 2013 13:30:46 +0000
Subject: [adegenet-forum] read.plink: no position read from .map file
Message-ID:

Hi,

I am trying to load genome information using the read.PLINK function. The data load fine with no error messages; however, when I examine the @other slot, I see that the SNP positions from the map file have not been loaded. I've checked that my .map file contains all the necessary columns and that all the information is there.

> chr2L<-read.PLINK("2L.raw", map.file="2L_hwe_cleaned.map", chunkSize=10000, parallel=FALSE)
> chr2L$other$position
NULL

Any ideas?

Thanks,
Danica

From t.jombart at imperial.ac.uk Mon Nov 11 10:34:40 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 11 Nov 2013 09:34:40 +0000
Subject: [adegenet-forum] Screeplot showing spatial and variance components of eigenvalues fails due to complex numbers
In-Reply-To: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
References: <0996D44934151041B8D7623B1DC805E8A6411288@genoa.ucstaff.win.canberra.edu.au>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390B16B@icexch-m1.ic.ac.uk>

Hello,

the bug was not coming from adegenet, but from an oddity in 'eigen', which for some large symmetric matrices returns complex eigenvalues with imaginary parts equal to zero. This is fixed in the attached patch. It is on sourceforge now, and will be integrated into the next stable CRAN release.

Best
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/

-------------- next part --------------
A non-text attachment was scrubbed...
Name: spca.R
Type: application/octet-stream
Size: 12866 bytes
Desc: spca.R
URL:

From M.Coulson at MARLAB.AC.UK Mon Nov 11 10:50:08 2013
From: M.Coulson at MARLAB.AC.UK (Mark Coulson)
Date: Mon, 11 Nov 2013 09:50:08 -0000
Subject: [adegenet-forum] identification of hybrids
Message-ID: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk>

Hello,

I am attempting to use adegenet in a similar fashion to how one may use STRUCTURE to identify hybrids/admixed individuals. I know the compoplot function will allow for a STRUCTURE-like bar plot, but my question is: given the differences between STRUCTURE and compoplot, can one still make the same inferences about the identification of hybrids? In STRUCTURE I have been using a q-value cut-off from known individuals to identify possible hybrids (also simulating known hybrids), so that individuals falling below the q-value for 'pure species membership' would fall into this category. Given that compoplot shows a probability rather than a membership coefficient, is this type of approach valid?

Best,

Mark

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________
URL: From t.jombart at imperial.ac.uk Mon Nov 11 11:17:14 2013 From: t.jombart at imperial.ac.uk (Jombart, Thibaut) Date: Mon, 11 Nov 2013 10:17:14 +0000 Subject: [adegenet-forum] identification of hybrids In-Reply-To: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk> References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk> Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk> Hello, STRUCTURE uses a mixture model to partition each genotype into membership to the different populations, which is probably what one is looking for when investigating hybridization. However, this is pending that STRUCTURE actually detects the population structuring in the first place, which it may fail to do, especially when the system departs from a standard island model. DAPC is usually better at finding the existing population structure, but the group membership probabilities are not derived from a population genetic model. These values are derived from the position of the genotypes on the discriminant factors. This can be practical, but is slightly less satisfying from a theoretical point of view. Still, one expects hybrids to fall between their parental groups, so it should work. The important point one needs to be careful about is the fact that these will change if the discriminant functions change (i.e. if different numbers of PCA axes are retained). I strongly recommend using cross validation for this purpose (see function xvalDapc). Then, if you can find a DAPC giving satisfying group prediction, the compoplot should indeed point out hybrids. S?bastien Devillard has worked on exactly these issues, but I am unsure if the paper has been published - I'll leave him comment on that. 
Best Thibaut -- ###################################### Dr Thibaut JOMBART MRC Centre for Outbreak Analysis and Modelling Department of Infectious Disease Epidemiology Imperial College - School of Public Health St Mary?s Campus Norfolk Place London W2 1PG United Kingdom Tel. : 0044 (0)20 7594 3658 t.jombart at imperial.ac.uk http://sites.google.com/site/thibautjombart/ http://adegenet.r-forge.r-project.org/ ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Mark Coulson [M.Coulson at MARLAB.AC.UK] Sent: 11 November 2013 09:50 To: adegenet-forum at lists.r-forge.r-project.org Subject: [adegenet-forum] identification of hybrids Hello, I am attempting to use adegenet in a similar fashion to how one may use STRUCTURE to identify hybrids/admixed individuals. I know the compoplot function will allow for a STRUCTURE-like bar plot but my question is given the differences between STRUCTURE and compoplot, can one still make the same inferences about the identification of hybrids? In STRUCTURE I have been using a q-value cut-off from known individuals to identify possible hybrids (also simulating known hybrids) so that individuals falling below the q-value for ?pure species membership? would fall into this category. Given compoplot is a probability rather than a membership coefficient, is this type of an approach valid? Best, Mark ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. 
For more information please visit http://www.symanteccloud.com ______________________________________________________________________ From M.Coulson at MARLAB.AC.UK Mon Nov 11 13:47:14 2013 From: M.Coulson at MARLAB.AC.UK (Mark Coulson) Date: Mon, 11 Nov 2013 12:47:14 -0000 Subject: [adegenet-forum] identification of hybrids References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk> Message-ID: <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk> Hi Dr. Jombart, Many thanks for your quick reply and I will try out the xvalDapc option, however, I have a question on this. I did the example for this option provided and found that both fewer and many more components had a higher variance in success than say ~ 50-70. Why would more components have a higher variance, as I would have thought this many might actually overfit the data? furthermore, I should clarify that I have three known baselines (and these will routinely be used to compare individuals of unknown origin to identify possible hybrids. Therefore is it possible to bring in the unknowns as a separate file and to have them be imposed upon the discriminant space provided by the baseline (i.e. similar to pre-specifying the origin of some individuals to assist with clustering of unknowns in STRUCTURE). Many thanks, Mark -----Original Message----- From: Jombart, Thibaut [mailto:t.jombart at imperial.ac.uk] Sent: Mon 11/11/2013 10:17 To: Mark Coulson; adegenet-forum at lists.r-forge.r-project.org Cc: sebastien.devillard at univ-lyon1.fr Subject: RE: identification of hybrids Hello, STRUCTURE uses a mixture model to partition each genotype into membership to the different populations, which is probably what one is looking for when investigating hybridization. 
However, this is pending that STRUCTURE actually detects the population structuring in the first place, which it may fail to do, especially when the system departs from a standard island model. DAPC is usually better at finding the existing population structure, but the group membership probabilities are not derived from a population genetic model. These values are derived from the position of the genotypes on the discriminant factors. This can be practical, but is slightly less satisfying from a theoretical point of view. Still, one expects hybrids to fall between their parental groups, so it should work. The important point one needs to be careful about is the fact that these will change if the discriminant functions change (i.e. if different numbers of PCA axes are retained). I strongly recommend using cross validation for this purpose (see function xvalDapc). Then, if you can find a DAPC giving satisfying group prediction, the compoplot should indeed point out hybrids. S?bastien Devillard has worked on exactly these issues, but I am unsure if the paper has been published - I'll leave him comment on that. Best Thibaut -- ###################################### Dr Thibaut JOMBART MRC Centre for Outbreak Analysis and Modelling Department of Infectious Disease Epidemiology Imperial College - School of Public Health St Mary's Campus Norfolk Place London W2 1PG United Kingdom Tel. 
: 0044 (0)20 7594 3658 t.jombart at imperial.ac.uk http://sites.google.com/site/thibautjombart/ http://adegenet.r-forge.r-project.org/ ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Mark Coulson [M.Coulson at MARLAB.AC.UK] Sent: 11 November 2013 09:50 To: adegenet-forum at lists.r-forge.r-project.org Subject: [adegenet-forum] identification of hybrids Hello, I am attempting to use adegenet in a similar fashion to how one may use STRUCTURE to identify hybrids/admixed individuals. I know the compoplot function will allow for a STRUCTURE-like bar plot but my question is given the differences between STRUCTURE and compoplot, can one still make the same inferences about the identification of hybrids? In STRUCTURE I have been using a q-value cut-off from known individuals to identify possible hybrids (also simulating known hybrids) so that individuals falling below the q-value for 'pure species membership' would fall into this category. Given compoplot is a probability rather than a membership coefficient, is this type of an approach valid? Best, Mark ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. 
For more information please visit http://www.symanteccloud.com ______________________________________________________________________ -------------- next part -------------- An HTML attachment was scrubbed... URL: From t.jombart at imperial.ac.uk Mon Nov 11 16:06:47 2013 From: t.jombart at imperial.ac.uk (Jombart, Thibaut) Date: Mon, 11 Nov 2013 15:06:47 +0000 Subject: [adegenet-forum] identification of hybrids In-Reply-To: <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk> References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>, <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk> Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390C32C@icexch-m1.ic.ac.uk> Hi again, there can be multiple explanation for the overfitting patterns you observe, so of which could well lie within the data themself (e.g. outliers, or groups defined by few individuals). The main expectation is that there should be a number of PCs which is optimal in terms of prediction; there may be many drivers for the variance in non-optimal solutions. As for the second point, yes, this is exactly the projection of supplementary individuals described at the end of the DAPC vignette. You calibrate the DAPC with individuals from known groups, and predict the group membership of the supplementary individuals. Cheers Thibaut ________________________________________ From: Mark Coulson [M.Coulson at MARLAB.AC.UK] Sent: 11 November 2013 12:47 To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org Cc: sebastien.devillard at univ-lyon1.fr Subject: RE: identification of hybrids Hi Dr. Jombart, Many thanks for your quick reply and I will try out the xvalDapc option, however, I have a question on this. I did the example for this option provided and found that both fewer and many more components had a higher variance in success than say ~ 50-70. 
Why would more components have a higher variance, as I would have thought this many might actually overfit the data? furthermore, I should clarify that I have three known baselines (and these will routinely be used to compare individuals of unknown origin to identify possible hybrids. Therefore is it possible to bring in the unknowns as a separate file and to have them be imposed upon the discriminant space provided by the baseline (i.e. similar to pre-specifying the origin of some individuals to assist with clustering of unknowns in STRUCTURE). Many thanks, Mark -----Original Message----- From: Jombart, Thibaut [mailto:t.jombart at imperial.ac.uk] Sent: Mon 11/11/2013 10:17 To: Mark Coulson; adegenet-forum at lists.r-forge.r-project.org Cc: sebastien.devillard at univ-lyon1.fr Subject: RE: identification of hybrids Hello, STRUCTURE uses a mixture model to partition each genotype into membership to the different populations, which is probably what one is looking for when investigating hybridization. However, this is pending that STRUCTURE actually detects the population structuring in the first place, which it may fail to do, especially when the system departs from a standard island model. DAPC is usually better at finding the existing population structure, but the group membership probabilities are not derived from a population genetic model. These values are derived from the position of the genotypes on the discriminant factors. This can be practical, but is slightly less satisfying from a theoretical point of view. Still, one expects hybrids to fall between their parental groups, so it should work. The important point one needs to be careful about is the fact that these will change if the discriminant functions change (i.e. if different numbers of PCA axes are retained). I strongly recommend using cross validation for this purpose (see function xvalDapc). 
Then, if you can find a DAPC giving satisfying group prediction, the compoplot should indeed point out hybrids. S?bastien Devillard has worked on exactly these issues, but I am unsure if the paper has been published - I'll leave him comment on that. Best Thibaut -- ###################################### Dr Thibaut JOMBART MRC Centre for Outbreak Analysis and Modelling Department of Infectious Disease Epidemiology Imperial College - School of Public Health St Mary's Campus Norfolk Place London W2 1PG United Kingdom Tel. : 0044 (0)20 7594 3658 t.jombart at imperial.ac.uk http://sites.google.com/site/thibautjombart/ http://adegenet.r-forge.r-project.org/ ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Mark Coulson [M.Coulson at MARLAB.AC.UK] Sent: 11 November 2013 09:50 To: adegenet-forum at lists.r-forge.r-project.org Subject: [adegenet-forum] identification of hybrids Hello, I am attempting to use adegenet in a similar fashion to how one may use STRUCTURE to identify hybrids/admixed individuals. I know the compoplot function will allow for a STRUCTURE-like bar plot but my question is given the differences between STRUCTURE and compoplot, can one still make the same inferences about the identification of hybrids? In STRUCTURE I have been using a q-value cut-off from known individuals to identify possible hybrids (also simulating known hybrids) so that individuals falling below the q-value for 'pure species membership' would fall into this category. Given compoplot is a probability rather than a membership coefficient, is this type of an approach valid? Best, Mark ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. 
For more information please visit http://www.symanteccloud.com ______________________________________________________________________ ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ From M.Coulson at MARLAB.AC.UK Tue Nov 12 14:01:14 2013 From: M.Coulson at MARLAB.AC.UK (Mark Coulson) Date: Tue, 12 Nov 2013 13:01:14 -0000 Subject: [adegenet-forum] identification of hybrids References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>, <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390C32C@icexch-m1.ic.ac.uk> <5281F54A.2030202@univ-lyon1.fr> Message-ID: <1BA13B469D9E89408AAA651AC9B309160103F257@sose0009g.marlab.ac.uk> Many thanks for the addition re: the comparison between STRUCTURE and adegenet. I am working with three distinct groups and STRUCTURE has a hard time separating groups 2 and 3 (so thereby really only identifying 2 groups). The third group is a much smaller sample (n=75) compared to the other two baselines (100s-1000s) and I suspect that is having an effect as described in Kalinowski 2011. If one uses supplementary individuals to assign to these three groups, what would happen if some of the individuals were from a 4th distinct group that had not been sampled in the baseline. 
In other words, can the posterior probabilities not assign this individual to any of the three represented groups (or at least with poor probability) and thereby be considered excluded from these baselines? Thanks, Mark -----Original Message----- From: Sebastien Devillard [mailto:sebastien.devillard at univ-lyon1.fr] Sent: Tue 11/12/2013 09:30 To: Jombart, Thibaut; Mark Coulson; adegenet-forum at lists.r-forge.r-project.org Subject: Re: identification of hybrids hi, just a small add to the Thibaut's answer. From my own unpublished experience in comparing /interpreting results from STRUCTURE and DAPC in identifying hybrids of different generations (simulated microsatellite genotypes), I recorded a clear tendancy of having a less continous distribution of "individual introgression" coefficients (namely q score in STRUCTURE and membership probability in DAPC) in DAPC. In other words, higher scores to one of the parental populations are more often found in DAPC than in STRUCTURE, hence, the population hybridization rate tends to be lower in DAPC than in STRUCTURE (although I never made simulations to check whether STRUCTURE or DAPC is closer to the truth) . As Thibaut underlined, there is in STRUCTURE a genetic model which is not present in DAPC and it is likely the origin of the difference. Hope this helps S?bastien Le 11/11/2013 16:06, Jombart, Thibaut a ?crit : > Hi again, > > there can be multiple explanation for the overfitting patterns you observe, so of which could well lie within the data themself (e.g. outliers, or groups defined by few individuals). The main expectation is that there should be a number of PCs which is optimal in terms of prediction; there may be many drivers for the variance in non-optimal solutions. > > As for the second point, yes, this is exactly the projection of supplementary individuals described at the end of the DAPC vignette. 
You calibrate the DAPC with individuals from known groups, and predict the group membership of the supplementary individuals. > > Cheers > Thibaut > > > ________________________________________ > From: Mark Coulson [M.Coulson at MARLAB.AC.UK] > Sent: 11 November 2013 12:47 > To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org > Cc: sebastien.devillard at univ-lyon1.fr > Subject: RE: identification of hybrids > > Hi Dr. Jombart, > > Many thanks for your quick reply and I will try out the xvalDapc option, however, I have a question on this. I did the example for this option provided and found that both fewer and many more components had a higher variance in success than say ~ 50-70. Why would more components have a higher variance, as I would have thought this many might actually overfit the data? > > furthermore, I should clarify that I have three known baselines (and these will routinely be used to compare individuals of unknown origin to identify possible hybrids. Therefore is it possible to bring in the unknowns as a separate file and to have them be imposed upon the discriminant space provided by the baseline (i.e. similar to pre-specifying the origin of some individuals to assist with clustering of unknowns in STRUCTURE). > > Many thanks, > > Mark > > > > > > -----Original Message----- > From: Jombart, Thibaut [mailto:t.jombart at imperial.ac.uk] > Sent: Mon 11/11/2013 10:17 > To: Mark Coulson; adegenet-forum at lists.r-forge.r-project.org > Cc: sebastien.devillard at univ-lyon1.fr > Subject: RE: identification of hybrids > > Hello, > > STRUCTURE uses a mixture model to partition each genotype into membership to the different populations, which is probably what one is looking for when investigating hybridization. However, this is pending that STRUCTURE actually detects the population structuring in the first place, which it may fail to do, especially when the system departs from a standard island model. 
>
> DAPC is usually better at finding the existing population structure, but the group membership probabilities are not derived from a population genetic model. These values are derived from the position of the genotypes on the discriminant functions. This can be practical, but is slightly less satisfying from a theoretical point of view. Still, one expects hybrids to fall between their parental groups, so it should work.
>
> The important point one needs to be careful about is that these probabilities will change if the discriminant functions change (i.e. if different numbers of PCA axes are retained). I strongly recommend using cross-validation for this purpose (see function xvalDapc). Then, if you can find a DAPC giving satisfying group prediction, the compoplot should indeed point out hybrids.
>
> Sébastien Devillard has worked on exactly these issues, but I am unsure if the paper has been published - I'll let him comment on that.
>
> Best
> Thibaut
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary's Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Mark Coulson [M.Coulson at MARLAB.AC.UK]
> Sent: 11 November 2013 09:50
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] identification of hybrids
>
> Hello,
>
> I am attempting to use adegenet in a similar fashion to how one may use STRUCTURE to identify hybrids/admixed individuals.
> I know the compoplot function will produce a STRUCTURE-like bar plot, but my question is: given the differences between STRUCTURE and compoplot, can one still make the same inferences about the identification of hybrids? In STRUCTURE I have been using a q-value cut-off derived from known individuals to identify possible hybrids (also simulating known hybrids), so that individuals falling below the q-value for 'pure species membership' would fall into this category. Given that compoplot shows a probability rather than a membership coefficient, is this type of approach valid?
>
> Best,
>
> Mark
>
> ______________________________________________________________________
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
> ______________________________________________________________________
>

--
Sébastien Devillard, PhD, Associate Professor
UMR 5558 "Biometry and Evolutionary Biology"
43 bd du 11 novembre 1918, 69622 Villeurbanne cedex
France
Phone : +33 (0)4 72 44 81 70
Fax : +33 (0)4 72 43 13 88
sebastien.devillard at univ-lyon1.fr
http://lbbe.univ-lyon1.fr/-Devillard-Sebastien-.html
http://sebastien.devillard.perso.sfr.fr

From t.jombart at imperial.ac.uk Tue Nov 12 21:07:17 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 12 Nov 2013 20:07:17 +0000
Subject: [adegenet-forum] identification of hybrids
In-Reply-To: <1BA13B469D9E89408AAA651AC9B309160103F257@sose0009g.marlab.ac.uk>
References: <1BA13B469D9E89408AAA651AC9B3091601092550@sose0009g.marlab.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390B1D1@icexch-m1.ic.ac.uk>, <1BA13B469D9E89408AAA651AC9B309160103F255@sose0009g.marlab.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390C32C@icexch-m1.ic.ac.uk> <5281F54A.2030202@univ-lyon1.fr>, <1BA13B469D9E89408AAA651AC9B309160103F257@sose0009g.marlab.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390C794@icexch-m1.ic.ac.uk>

Hi there,

by definition, no, the analysis cannot assign new individuals to a group that was not part of the 'training' set.

Cheers
Thibaut
________________________________________
From: Mark Coulson [M.Coulson at MARLAB.AC.UK]
Sent: 12 November 2013 13:01
To: sebastien.devillard at univ-lyon1.fr; Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: RE: identification of hybrids

Many thanks for the addition re: the comparison between STRUCTURE and adegenet. I am working with three distinct groups and STRUCTURE has a hard time separating groups 2 and 3 (so thereby really only identifying 2 groups).
The third group is a much smaller sample (n=75) compared to the other two baselines (100s-1000s) and I suspect that is having an effect as described in Kalinowski 2011. If one uses supplementary individuals to assign to these three groups, what would happen if some of the individuals were from a 4th distinct group that had not been sampled in the baseline? In other words, can the posterior probabilities not assign this individual to any of the three represented groups (or at least with poor probability) and thereby be considered excluded from these baselines?

Thanks,
Mark

From fernando.cruz at ebd.csic.es Fri Nov 15 19:53:12 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Fri, 15 Nov 2013 19:53:12 +0100
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
Message-ID: <52866D98.6040106@ebd.csic.es>

Hi Thibaut,

I performed a NJ Tree using 1M SNPs with 10 samples, following the instructions in the documentation. However I would like to know exactly how the genetic distance among individuals is calculated. Is it based on the number of shared alleles?

Could you provide a simple example? Like for these two individuals using 5 SNPs:
Ind1 00122
Ind2 02210

Using the binary information, they share 2+0+1+1+0 = 4 alleles out of 10

Thanks in advance,
Fernando Cruz

--
****************************************
Dr. Fernando Cruz
Estación Biológica de Doñana (EBD-CSIC)
Avd. Americo Vespucio s/n
41092-Seville (Spain)
Tel. +34 954466700/Ext.
1079
Fax: +34 95 4621125
Room: 0/12

e-mail: fernando.cruz at ebd.csic.es
Website: http://openwetware.org/wiki/User:Fernando_Cruz
Web EcoGenes EU-FP7: http://www.ebd.csic.es/ecogenes/news.html
****************************************

From t.jombart at imperial.ac.uk Sun Nov 17 16:07:32 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 15:07:32 +0000
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <52866D98.6040106@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>

Hello there,

there are many different distances that can be computed between allelic profiles, but at the individual level there are somewhat fewer options.

One is the Hamming distance, which you mention here (D=6, i.e. the 6 alleles out of 10 that are not shared), and which you can deduce from 'propShared'.

The usual Euclidean distance is different though. Between two vectors of allelic profiles x=[x_i] and y=[y_i], the Euclidean distance is given by (using latex notations):

D(x,y) = || x - y || = sqrt{ (x-y)^T (x-y)} = sqrt(\sum_i (x_i - y_i)^2

Using your example:
> x <- c(0,0,1,2,2)
> y <- c(0,2,2,1,0)
> sqrt(sum((x-y)^2))
[1] 3.162278
> dist(rbind.data.frame(x,y))
         1
2 3.162278

Note that in adegenet, data in genind objects are standardized to relative frequencies, so that the distance would be different:
> x.rel <- x/2
> y.rel <- y/2
> dist(rbind.data.frame(x.rel,y.rel))
         1
2 1.581139

That is, the distance between the raw allele count profiles divided by the ploidy.

As a last note, there is a particular case for haploid data, where the Hamming distance equals the squared Euclidean distance (it follows that a PCA on the covariance matrix is also the best reduced-space representation of Hamming distances).
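
The arithmetic in this reply can be checked with a few lines of base R. This is only a sketch: the names d_unshared, n_shared, ploidy, h1 and h2 are illustrative, the haploid vectors are made up for the example, and nothing here calls adegenet itself.

```r
# Allele counts (0, 1 or 2 copies of the second allele) at 5 SNPs,
# copied from the example in this thread.
x <- c(0, 0, 1, 2, 2)  # Ind1
y <- c(0, 2, 2, 1, 0)  # Ind2
ploidy <- 2            # assumed diploid, as in the example

# Number of unshared alleles (the D = 6 quoted above):
# the sum of absolute differences in allele counts.
d_unshared <- sum(abs(x - y))
# Shared alleles, out of length(x) * ploidy = 10 alleles in total.
n_shared <- length(x) * ploidy - d_unshared

# Euclidean distance on raw counts, and on relative frequencies (counts / ploidy).
d_counts <- sqrt(sum((x - y)^2))
d_freq   <- sqrt(sum((x / ploidy - y / ploidy)^2))

# Hypothetical haploid profiles (0/1 coding): the number of mismatches
# equals the squared Euclidean distance.
h1 <- c(0, 1, 1, 0, 1)
h2 <- c(1, 1, 0, 0, 1)
d_hamming_hap   <- sum(h1 != h2)
d_sq_euclid_hap <- sum((h1 - h2)^2)

print(c(unshared = d_unshared, shared = n_shared,
        euclid_counts = d_counts, euclid_freq = d_freq))
```

The unshared count is the Manhattan distance between the two count vectors, which is why it differs from the Euclidean value returned by dist() with its default method.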

Cheers

Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

From t.jombart at imperial.ac.uk Sun Nov 17 16:23:52 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 15:23:52 +0000
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
References: <52866D98.6040106@ebd.csic.es>, <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>

Just realized a typo:

sqrt(\sum_i (x_i - y_i)^2

should read

sqrt{ \sum_i (x_i - y_i)^2 }

Cheers
Thibaut

From fernando.cruz at ebd.csic.es Sun Nov 17 16:41:36 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Sun, 17 Nov 2013 16:41:36 +0100
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>
References: <52866D98.6040106@ebd.csic.es>, <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>
Message-ID: <5288E3B0.8080206@ebd.csic.es>

Thanks Thibaut,

This clarifies things. In both the Euclidean and the Hamming distances, the distance between a pair of individuals depends on the number of "unshared alleles".
By the way, so the standardized distance is what is plotted in the NJ tree, instead of using the Saitou & Nei (1987) method used by the APE library, right?

Cheers,
Fernando

From t.jombart at imperial.ac.uk Sun Nov 17 16:45:51 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 15:45:51 +0000
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <5288E3B0.8080206@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>, <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>, <5288E3B0.8080206@ebd.csic.es>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>

Hi there,

I'm not sure which tree you are referring to.

Cheers
Thibaut

From fernando.cruz at ebd.csic.es Sun Nov 17 17:03:21 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Sun, 17 Nov 2013 17:03:21 +0100
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>
References: <52866D98.6040106@ebd.csic.es>, <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>, <5288E3B0.8080206@ebd.csic.es> <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>
Message-ID: <5288E8C9.6050505@ebd.csic.es>

Hi Thibaut,

The nj tree of APE.
What I basically did was:

library(adegenet)
mygenlight <- read.snp("/Users/Nando/Documents/mydata.snp", chunk=2)

# k31_13c_lp23 is the genlight object read above (same as mygenlight)
x <- seploc(k31_13c_lp23, n.block=100) # ~10000 SNPs each

library(ape)
# dist is used within a lapply loop to compute pairwise distances
# between individuals for each block
lD <- lapply(x, function(e) dist(as.matrix(e)))
class(lD[[1]])

# The general distance matrix is obtained by summing these:
D <- Reduce("+", lD)
plot(nj(D), type="fan")

Cheers,
Fernando

From fernando.cruz at ebd.csic.es Sun Nov 17 17:13:29 2013
From: fernando.cruz at ebd.csic.es (Fernando Cruz)
Date: Sun, 17 Nov 2013 17:13:29 +0100
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <5288E8C9.6050505@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>, <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>, <5288E3B0.8080206@ebd.csic.es> <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk> <5288E8C9.6050505@ebd.csic.es>
Message-ID: <5288EB29.80202@ebd.csic.es>

Well, there's a typo, sorry. "k31_13c_lp23" is the same as "mygenlight".

Thanks,
Fernando

From t.jombart at imperial.ac.uk Sun Nov 17 19:52:59 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 17 Nov 2013 18:52:59 +0000
Subject: [adegenet-forum] Request an example of genetic distance among two individuals
In-Reply-To: <5288E8C9.6050505@ebd.csic.es>
References: <52866D98.6040106@ebd.csic.es>, <2CB2DA8E426F3541AB1907F98ABA65706390D61C@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65706390D656@icexch-m1.ic.ac.uk>, <5288E3B0.8080206@ebd.csic.es> <2CB2DA8E426F3541AB1907F98ABA65706390D694@icexch-m1.ic.ac.uk>, <5288E8C9.6050505@ebd.csic.es>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65706390D738@icexch-m1.ic.ac.uk>

Hello,

just to clarify, 'nj' from APE is agnostic with respect to the distance used. Here in your code you are using 'dist', thus the Euclidean distance between SNP profiles.

Cheers
Thibaut
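To make the point about 'dist' concrete, here is a minimal base R sketch on made-up data. The matrix `snp` and the two-block split are assumptions for illustration only; they stand in for the genlight blocks returned by seploc and are not Fernando's data. It suggests that summing per-block Euclidean distances, as in the block-wise code in this thread, gives a matrix that is generally not the Euclidean distance over all SNPs, whereas summing the squared block distances and taking the square root recovers it.

```r
# Toy data (hypothetical): 3 individuals x 6 SNPs, allele counts 0/1/2
snp <- rbind(ind1 = c(0, 0, 1, 2, 2, 1),
             ind2 = c(0, 2, 2, 1, 0, 1),
             ind3 = c(2, 2, 0, 0, 1, 0))

# Split the SNPs into two blocks of three columns each
blocks <- list(snp[, 1:3], snp[, 4:6])

# Per-block Euclidean distances (the default method of dist())
lD <- lapply(blocks, dist)

# Sum of the per-block distances
D_sum <- Reduce("+", lD)

# Euclidean distance over all SNPs: add squared block distances, then take the root
D_all <- sqrt(Reduce("+", lapply(lD, function(d) d^2)))

# D_all agrees with dist() on the full matrix
same_as_full <- isTRUE(all.equal(as.vector(D_all), as.vector(dist(snp))))

# D_sum is never smaller than D_all
sum_is_larger <- all(as.vector(D_sum) >= as.vector(D_all) - 1e-12)
```

Either matrix can be passed to a tree-building function, since such a function only sees the distances; which one is appropriate depends on the distance one intends to represent.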
> > Cheers > Thibaut > ________________________________________ > From: Fernando Cruz [fernando.cruz at ebd.csic.es] > Sent: 17 November 2013 15:41 > To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org > Subject: Re: [adegenet-forum] Request an example of genetic distance among two individuals > > Thanks Tibaut, > > This clarifies. In both the euclidean and the Hamming distances, the > distance between a pair of individuals depends on the number of > "unshared alleles". > By the way, then the standardized distance is plot in the NJ Tree > instead of using the Saitou & Nei (1987) used by APE library, right? > > Cheers, > Fernando > > On 11/17/13 4:23 PM, Jombart, Thibaut wrote: >> Just realized a typo: >> >> sqrt(\sum_i (x_i - y_i)^2 >> >> should read >> >> sqrt{ \sum_i (x_i - y_i)^2 } >> >> Cheers >> Thibaut >> ________________________________________ >> From:adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Jombart, Thibaut [t.jombart at imperial.ac.uk] >> Sent: 17 November 2013 15:07 >> To: Fernando Cruz;adegenet-forum at lists.r-forge.r-project.org >> Subject: Re: [adegenet-forum] Request an example of genetic distance among two individuals >> >> Hello there, >> >> there are many different distances that can be computed between allelic profiles, but at an individual levels there is somewhat less options. >> >> One is the Hamming distance, which you mention here (D=6), and which you can deduce from 'propShared'. >> >> The usual Euclidean distance is different though. 
Between two vectors of allelic profiles x=[x_i] and y=[y_i], the Euclidean distance is given by (using latex notations): >> >> D(x,y) = || x - y || = sqrt{ (x-y)^T (x-y)} = sqrt(\sum_i (x_i - y_i)^2 >> >> Using your example: >>> x <- c(0,0,1,2,2) >>> y <- c(0,2,2,1,0) >>> sqrt(sum((x-y)^2)) >> [1] 3.162278 >>> dist(rbind.data.frame(x,y)) >> 1 >> 2 3.162278 >> >> >> Note that in adegenet, data in genind objects are standardized to relative frequencies, so that the distance would be different: >>> x.rel <- x/2 >>> y.rel <- y/2 >>> dist(rbind.data.frame(x.rel,y.rel)) >> 1 >> 2 1.581139 >> >> That is, the distance between the raw allele count profiles divided by the ploidy. >> >> As a last note, there is a particular case for haploid data, where the Hamming distance equals the squared Euclidean distance (it follows that a PCA on the covariance matrix is also the best reduced-space representation of Hamming distances). >> >> Cheers >> >> Thibaut >> >> >> -- >> ###################################### >> Dr Thibaut JOMBART >> MRC Centre for Outbreak Analysis and Modelling >> Department of Infectious Disease Epidemiology >> Imperial College - School of Public Health >> St Mary?s Campus >> Norfolk Place >> London W2 1PG >> United Kingdom >> Tel. : 0044 (0)20 7594 3658 >> t.jombart at imperial.ac.uk >> http://sites.google.com/site/thibautjombart/ >> http://adegenet.r-forge.r-project.org/ >> ________________________________________ >> From:adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Fernando Cruz [fernando.cruz at ebd.csic.es] >> Sent: 15 November 2013 18:53 >> To:adegenet-forum at lists.r-forge.r-project.org >> Subject: [adegenet-forum] Request an example of genetic distance among two individuals >> >> Hi Thibaut, >> >> I performed a NJ Tree using 1M SNPs with 10 samples, following the >> instructions in the documentation. 
However I would like to know exactly >> the genetic distance among individuals is calculated. Is it based on the >> number of shared alleles? >> >> Could you provide a simple example? Like for this two individuals using >> 5 SNPs: >> Ind1 00122 >> Ind2 02210 >> >> Using the binary information, they share 2+0+1+1+0= 4 alleles out of 10 >> >> Thanks in advance, >> Fernando Cruz >> >> >> -- >> **************************************** >> Dr. Fernando Cruz >> Estaci?n Biol?gica de Do?ana (EBD-CSIC) >> Avd. Americo Vespucio s/n >> 41092-Seville (Spain) >> Tel. +34 954466700/Ext. 1079 >> Fax: +34 95 4621125 >> Room: 0/12 >> >> e-mail:fernando.cruz at ebd.csic.es >> Website:http://openwetware.org/wiki/User:Fernando_Cruz >> Web EcoGenes EU-FP7:http://www.ebd.csic.es/ecogenes/news.html >> **************************************** >> >> _______________________________________________ >> adegenet-forum mailing list >> adegenet-forum at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum >> _______________________________________________ >> adegenet-forum mailing list >> adegenet-forum at lists.r-forge.r-project.org >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum > > -- > **************************************** > Dr. Fernando Cruz > Estaci?n Biol?gica de Do?ana (EBD-CSIC) > Avd. Americo Vespucio s/n > 41092-Seville (Spain) > Tel. +34 954466700/Ext. 1079 > Fax: +34 95 4621125 > Room: 0/12 > > e-mail:fernando.cruz at ebd.csic.es > Website:http://openwetware.org/wiki/User:Fernando_Cruz > Web EcoGenes EU-FP7:http://www.ebd.csic.es/ecogenes/news.html > **************************************** > -- **************************************** Dr. Fernando Cruz Estaci?n Biol?gica de Do?ana (EBD-CSIC) Avd. Americo Vespucio s/n 41092-Seville (Spain) Tel. +34 954466700/Ext. 
1079 Fax: +34 95 4621125 Room: 0/12 e-mail: fernando.cruz at ebd.csic.es Website: http://openwetware.org/wiki/User:Fernando_Cruz Web EcoGenes EU-FP7: http://www.ebd.csic.es/ecogenes/news.html **************************************** From katherine.miller at students.tamuk.edu Sat Nov 23 22:36:57 2013 From: katherine.miller at students.tamuk.edu (katherine.miller) Date: Sat, 23 Nov 2013 21:36:57 +0000 Subject: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion Message-ID: <8DECF27DB3B2534F8C9873C9B734441F713E9765@BY2PRD0810MB392.namprd08.prod.outlook.com> Greetings, I am new to SPCA analysis, and haven't used R that much either. I have 2 somewhat related questions: 1) I read somewhere that lat and lon have to be converted to UTMs prior to analysis. I have a rather large spatial area (Iowa to Texas), so it spans 14R to 15T regions. I've looked at this page, http://www.inside-r.org/packages/cran/PBSmapping/docs/convUL , and I'm wondering about the erroneous results and how this will work. Has anyone out there done SPCA with a large area? 2) Additionally, I've tried to duplicate a set of data on a smaller scale (XY data is already in UTMs, alleles from 14 loci are in a .gen file), and when I am prompted to choose a network, I choose 7 (inverse distance, the locations are not evenly distributed). I am prompted to choose an exponent and a minimum distance. I understand the exponent of 1, I think, but either my understanding of the min distance is wrong, or something else is producing this error: error in if (any(x < 0)) stop("values in x cannot be negative") : missing value where TRUE/FALSE needed I have read through the tutorials for spca and adegenet, but clearly I'm still confused. Any help would appreciated! Katherine S. Miller Ph.D. 
candidate Caesar Kleberg Wildlife Research Institute Texas A&M University-Kingsville MSC 218, 700 University Blvd Kingsville, TX 78363 (361) 593-4486, office -------------- next part -------------- An HTML attachment was scrubbed... URL: From t.jombart at imperial.ac.uk Sun Nov 24 15:45:30 2013 From: t.jombart at imperial.ac.uk (Jombart, Thibaut) Date: Sun, 24 Nov 2013 14:45:30 +0000 Subject: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion In-Reply-To: <8DECF27DB3B2534F8C9873C9B734441F713E9765@BY2PRD0810MB392.namprd08.prod.outlook.com> References: <8DECF27DB3B2534F8C9873C9B734441F713E9765@BY2PRD0810MB392.namprd08.prod.outlook.com> Message-ID: <2CB2DA8E426F3541AB1907F98ABA657063918FF6@icexch-m1.ic.ac.uk> Dear Katherine, sPCA uses a rather crude model of spatial proximities (most commonly a binary connection network), so that conversion from latitudes/longitudes, even at that regional scale, should not be much of an issue. As for the choice of network or the error you report, it is difficult to provide advice / guess the origin of the error without a spatial distribution of your locations, or a sample of data and code reproducing the error. In general, I would advocated using a binary connection network (e.g. Delaunay's triangulation, Gabriel's graph) where possible. If the sampling design is very uneven, treating clusters as populations (possibly with finer scale, within-population geographic analyses) might be an option. Cheers Thibaut -- ###################################### Dr Thibaut JOMBART MRC Centre for Outbreak Analysis and Modelling Department of Infectious Disease Epidemiology Imperial College - School of Public Health St Mary?s Campus Norfolk Place London W2 1PG United Kingdom Tel. 
: 0044 (0)20 7594 3658 t.jombart at imperial.ac.uk http://sites.google.com/site/thibautjombart/ http://adegenet.r-forge.r-project.org/ ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of katherine.miller [katherine.miller at students.tamuk.edu] Sent: 23 November 2013 21:36 To: adegenet-forum at lists.r-forge.r-project.org Subject: Re: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion Greetings, I am new to SPCA analysis, and haven't used R that much either. I have 2 somewhat related questions: 1) I read somewhere that lat and lon have to be converted to UTMs prior to analysis. I have a rather large spatial area (Iowa to Texas), so it spans 14R to 15T regions. I've looked at this page, http://www.inside-r.org/packages/cran/PBSmapping/docs/convUL , and I'm wondering about the erroneous results and how this will work. Has anyone out there done SPCA with a large area? 2) Additionally, I've tried to duplicate a set of data on a smaller scale (XY data is already in UTMs, alleles from 14 loci are in a .gen file), and when I am prompted to choose a network, I choose 7 (inverse distance, the locations are not evenly distributed). I am prompted to choose an exponent and a minimum distance. I understand the exponent of 1, I think, but either my understanding of the min distance is wrong, or something else is producing this error: error in if (any(x < 0)) stop("values in x cannot be negative") : missing value where TRUE/FALSE needed I have read through the tutorials for spca and adegenet, but clearly I'm still confused. Any help would appreciated! Katherine S. Miller Ph.D. 
candidate Caesar Kleberg Wildlife Research Institute Texas A&M University-Kingsville MSC 218, 700 University Blvd Kingsville, TX 78363 (361) 593-4486, office From katherine.miller at students.tamuk.edu Sun Nov 24 19:35:31 2013 From: katherine.miller at students.tamuk.edu (katherine.miller) Date: Sun, 24 Nov 2013 18:35:31 +0000 Subject: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion Message-ID: <8DECF27DB3B2534F8C9873C9B734441F713E978F@BY2PRD0810MB392.namprd08.prod.outlook.com> Thank you so much for your input. When I try to do the Delaunay triangulation it tells me: "Error in tri2nb(xy) : too few coordinates" I'm wondering if the problem is the XY data. I followed the format from Robinson et al. 2012: The walk is never random: subtle landscape effects shape gene flow in a continuous white-tailed deer population in the Midwestern United States. http://datadryad.org/resource/doi:10.5061/dryad.p7639 I'm trying to duplicate Robinson et al.'s R script, uploaded with my data here: https://www.dropbox.com/sh/8bckmlf2nbx5hdh/75HaBklOlG This includes my genetic data, the .gen file, and the locations. The genetic data represents northern bobwhite genetic samples, each line a new sample, and alleles at 13 loci. In response to the finer scale approach, I've been setting up my data for a spatial regression in spdep, but I would like to get the spatial pca to run first. Comments and suggestions are definitely appreciated! Thank you in advance! Katherine S. Miller Ph.D. 
candidate Caesar Kleberg Wildlife Research Institute Texas A&M University-Kingsville MSC 218, 700 University Blvd Kingsville, TX 78363 (361) 593-4486, office ________________________________________ From: Jombart, Thibaut [t.jombart at imperial.ac.uk] Sent: Sunday, November 24, 2013 8:45 AM To: katherine.miller; adegenet-forum at lists.r-forge.r-project.org Subject: RE: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion Dear Katherine, sPCA uses a rather crude model of spatial proximities (most commonly a binary connection network), so that conversion from latitudes/longitudes, even at that regional scale, should not be much of an issue. As for the choice of network or the error you report, it is difficult to provide advice / guess the origin of the error without a spatial distribution of your locations, or a sample of data and code reproducing the error. In general, I would advocated using a binary connection network (e.g. Delaunay's triangulation, Gabriel's graph) where possible. If the sampling design is very uneven, treating clusters as populations (possibly with finer scale, within-population geographic analyses) might be an option. Cheers Thibaut -- ###################################### Dr Thibaut JOMBART MRC Centre for Outbreak Analysis and Modelling Department of Infectious Disease Epidemiology Imperial College - School of Public Health St Mary?s Campus Norfolk Place London W2 1PG United Kingdom Tel. 
: 0044 (0)20 7594 3658 t.jombart at imperial.ac.uk http://sites.google.com/site/thibautjombart/ http://adegenet.r-forge.r-project.org/ ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of katherine.miller [katherine.miller at students.tamuk.edu] Sent: 23 November 2013 21:36 To: adegenet-forum at lists.r-forge.r-project.org Subject: Re: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion Greetings, I am new to SPCA analysis, and haven't used R that much either. I have 2 somewhat related questions: 1) I read somewhere that lat and lon have to be converted to UTMs prior to analysis. I have a rather large spatial area (Iowa to Texas), so it spans 14R to 15T regions. I've looked at this page, http://www.inside-r.org/packages/cran/PBSmapping/docs/convUL , and I'm wondering about the erroneous results and how this will work. Has anyone out there done SPCA with a large area? 2) Additionally, I've tried to duplicate a set of data on a smaller scale (XY data is already in UTMs, alleles from 14 loci are in a .gen file), and when I am prompted to choose a network, I choose 7 (inverse distance, the locations are not evenly distributed). I am prompted to choose an exponent and a minimum distance. I understand the exponent of 1, I think, but either my understanding of the min distance is wrong, or something else is producing this error: error in if (any(x < 0)) stop("values in x cannot be negative") : missing value where TRUE/FALSE needed I have read through the tutorials for spca and adegenet, but clearly I'm still confused. Any help would appreciated! Katherine S. Miller Ph.D. 
candidate Caesar Kleberg Wildlife Research Institute Texas A&M University-Kingsville MSC 218, 700 University Blvd Kingsville, TX 78363 (361) 593-4486, office

From RoyFrancis.Mathew at agrsci.dk Mon Nov 25 12:32:05 2013
From: RoyFrancis.Mathew at agrsci.dk (Roy Mathew Francis)
Date: Mon, 25 Nov 2013 11:32:05 +0000
Subject: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion
In-Reply-To: <8DECF27DB3B2534F8C9873C9B734441F713E978F@BY2PRD0810MB392.namprd08.prod.outlook.com>
References: <8DECF27DB3B2534F8C9873C9B734441F713E978F@BY2PRD0810MB392.namprd08.prod.outlook.com>
Message-ID:

Hi,

I am not an expert on this, but I have done some sPCA using large areas. If your area spans more than one UTM zone, just use whichever single zone fits best. You will still get coordinates for points outside that zone, since the projection is based on one transverse meridian. When you later plot the results on a background map, make sure the background map uses the same UTM coordinates. For points outside the UTM zone, distances might be fine but the projection would be distorted.

Regarding the minimum distance, I always thought it was the distance your individuals could migrate (maybe the average or maximum distance). But maybe that applies to the neighbourhood-by-distance network (type 5) instead.

Roy

-----Original Message-----
From: adegenet-forum-bounces at lists.r-forge.r-project.org [mailto:adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of katherine.miller
Sent: 24 November 2013 19:36
To: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] UTMs vs lat/lon, and inverse distance network confusion

Thank you so much for your input. When I try to do the Delaunay triangulation it tells me:

"Error in tri2nb(xy) : too few coordinates"

I'm wondering if the problem is the XY data. I followed the format from Robinson et al.
2012: The walk is never random: subtle landscape effects shape gene flow in a continuous white-tailed deer population in the Midwestern United States. http://datadryad.org/resource/doi:10.5061/dryad.p7639

I'm trying to duplicate Robinson et al.'s R script, uploaded with my data here: https://www.dropbox.com/sh/8bckmlf2nbx5hdh/75HaBklOlG This includes my genetic data, the .gen file, and the locations. The genetic data represents northern bobwhite genetic samples, each line a new sample, and alleles at 13 loci.

In response to the finer scale approach, I've been setting up my data for a spatial regression in spdep, but I would like to get the spatial pca to run first. Comments and suggestions are definitely appreciated! Thank you in advance!

Katherine S. Miller
Ph.D. candidate
Caesar Kleberg Wildlife Research Institute
Texas A&M University-Kingsville
MSC 218, 700 University Blvd
Kingsville, TX 78363
(361) 593-4486, office

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
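
Both errors reported in this thread ("too few coordinates" from tri2nb and "missing value where TRUE/FALSE needed" with the inverse-distance network) are consistent with a problem in the coordinate table rather than in the genetic data, although that is only a guess without the actual files. Below is a minimal sketch in base R for checking the coordinates before building a network. The helper check_xy and the toy coordinates are hypothetical and not part of adegenet or spdep. The last step assumes that the minimum distance acts as a floor on pairwise distances so that inverse-distance weights stay finite; the exact behaviour of chooseCN should be checked against its help page.

```r
# Hypothetical helper (not part of adegenet or spdep): summarise common
# problems in a coordinate table before building a connection network.
check_xy <- function(xy) {
  xy <- as.matrix(xy)
  complete <- stats::complete.cases(xy)  # TRUE for rows without any NA
  list(
    n_rows      = nrow(xy),              # a triangulation needs several points
    n_cols      = ncol(xy),              # expected: 2 (x and y)
    is_numeric  = is.numeric(xy),        # FALSE if coordinates were read as text
    n_missing   = sum(!complete),        # rows with at least one NA
    n_duplicate = sum(duplicated(xy[complete, , drop = FALSE]))
  )
}

# Toy coordinates (made up): one missing value and one repeated location.
xy <- cbind(x = c(500000, 500100, 500100, 500400, NA),
            y = c(3000000, 3000050, 3000050, 3000300, 3000200))
report <- check_xy(xy)
print(report)

# Inverse-distance weights computed by hand on the cleaned coordinates.
# Assumption: distances below dmin are raised to dmin, so that no weight
# becomes infinite; the exponent is 1 as in the original question.
xy_ok <- unique(xy[stats::complete.cases(xy), , drop = FALSE])
dmin <- 50
d <- as.matrix(dist(xy_ok))
d[d < dmin] <- dmin
w <- 1 / d^1
diag(w) <- 0  # no self-connections
print(round(w, 5))
```

If the report shows missing values, text columns, or fewer rows than expected, fixing the coordinate file first is probably more productive than changing the network type. With clean coordinates, all weights in w are finite and non-negative, which is the condition the reported error message appears to be testing.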