From Mark.Coulson.ic at uhi.ac.uk Thu Feb 1 18:01:46 2018
From: Mark.Coulson.ic at uhi.ac.uk (Mark Coulson)
Date: Thu, 1 Feb 2018 17:01:46 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>

Hi Ben,

I have used allelotype data, with the input as a matrix of the frequency of the A allele in each group, to run DAPC and it worked well. My groups were already defined, but could the same type of input not be used with find.clusters?

Mark

-----Original Message-----
From: adegenet-forum-bounces at lists.r-forge.r-project.org [mailto:adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of Benjamin Dauphin
Sent: 31 January 2018 09:18
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data

Dear all,

I have recently started working on pool sequencing data and I simply wonder if I can use kmeans (find.clusters) and DAPC to investigate population structure from poolseq data (allele frequencies). How can find.clusters deal with allele frequencies?

Dataset: 7 pools and 100,000 SNPs

Any comment or help would be much appreciated.

Best regards
Ben

_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

Inverness College UHI, a partner in the University of the Highlands and Islands www.inverness.uhi.ac.uk Board of Management of Inverness College (known as Inverness College UHI), Scottish Charity No SC021197.
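
The matrix-based workflow discussed in this thread (a groups-by-SNPs matrix of allele frequencies passed to find.clusters and then to dapc) might look roughly like the sketch below. The data are simulated, and the object names, the 30 x 200 dimensions and the values of n.pca, n.clust and n.da are all assumptions chosen so that the calls run without prompting; none of them come from the posters' data.

```r
library(adegenet)

set.seed(1)

# Hypothetical stand-in data: 30 groups (rows) x 200 SNPs (columns),
# each cell being the frequency of one allele in that group.
freq <- matrix(runif(30 * 200), nrow = 30,
               dimnames = list(paste0("pool", 1:30), paste0("snp", 1:200)))

# Non-interactive k-means clustering: the number of retained PCs and the
# number of clusters are fixed up front instead of being chosen from plots.
grp <- find.clusters(freq, n.pca = 20, n.clust = 2)

# DAPC on the same matrix, using the groups inferred above.
res <- dapc(freq, grp$grp, n.pca = 5, n.da = 1)

# Group membership probabilities for the first few rows.
head(round(res$posterior, 3))
```

Leaving out n.pca and n.clust makes find.clusters interactive, which is how a curve of the clustering criterion against the number of clusters would normally be inspected.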
From Mark.Coulson.ic at uhi.ac.uk Thu Feb 1 21:36:12 2018
From: Mark.Coulson.ic at uhi.ac.uk (Mark Coulson)
Date: Thu, 1 Feb 2018 20:36:12 +0000
Subject: [adegenet-forum] How to interpret Density Plot for K=2

Hi Nikki,

Your interpretation of the plot seems correct; however, I'd ask whether you ran the xvalDapc cross-validation. It may be that you have kept too many PCs and so are overfitting the data. xvalDapc will find the optimal number of PCs to retain for your two groups. Then use this number of PCs to run a new DAPC. It will likely result in more overlap between the two groups, which would then be more consistent with the low differentiation you are seeing based on FST.

Hope this helps.

Mark

From: adegenet-forum-bounces at lists.r-forge.r-project.org [mailto:adegenet-forum-bounces at lists.r-forge.r-project.org] On Behalf Of Nikki Vollmer
Sent: 30 January 2018 18:08
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] How to interpret Density Plot for K=2

Hi,

I am trying to analyze ~200 RADseq loci for ~200 individuals. STRUCTURE results suggest the best number of populations given the data is 2. Pairwise Fst values are quite low for my taxa (<0.003) with p-value 0.01802. I was trying to do a DAPC on this same data to compare results. DAPC similarly suggested the best number of clusters is 2, and I was able to plot a 1-dimensional density plot for the one DF I kept (attached). However, I am not sure how to interpret the plot. Is it correct to say that because the two peaks do not overlap, the 2 clusters are quite differentiated from one another (similar to two clusters on a scatter plot being in opposite quadrants)? (...or is that logic flawed?)

I am trying to figure out if these 2 groups are very genetically differentiated or not, and I am not clear what the density plot is supporting/suggesting.

I very much appreciate any guidance on this matter!
Thank you,
Nikki

From thibautjombart at gmail.com Fri Feb 2 17:53:34 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 16:53:34 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>

Hi there,

find.clusters is implemented for matrices as well, and should deal nicely with any kind of quantitative data. So it should apply readily to your data. Same for DAPC.

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
WHO Consultant - outbreak analysis
https://thibautjombart.netlify.com
Twitter: @TeebzR
+44(0)20 7594 3658

From benjamin.dauphin at wsl.ch Fri Feb 2 10:07:00 2018
From: benjamin.dauphin at wsl.ch (Benjamin Dauphin)
Date: Fri, 2 Feb 2018 10:07:00 +0100
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch>
Message-ID: <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>

Hi Mark,

Thanks for your response. I've run find.clusters() with the matrix of allele frequencies as input, and then run the DAPC, still using the matrix (not a genind or genlight object), by assigning the groups generated with kmeans (grp$grp). It works, but I get a strange "inverted parabolic curve" for the kmeans analysis. Is this a common picture for poolseq data?

Thanks,
Ben

-------------- next part --------------
A non-text attachment was scrubbed...
Name: kmean_HJ_cohorts.pdf
Type: application/pdf
Size: 5078 bytes

From thibautjombart at gmail.com Fri Feb 2 18:17:47 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 17:17:47 +0000
Subject: [adegenet-forum] snapclust

Hi there,

I would analyse the empirical data separately. If you have clearly identified parental populations (i.e. prior knowledge, not identified by the method), sure, you can benchmark the method using simulated hybrids. Otherwise, simulations will be of less interest.

How would you go about bootstrapping the final probabilities?

Best
Thibaut

On 31 January 2018 at 00:18, Danielle Louise wrote:

> Hello. I am looking at implementing your snapclust function, and I am reading through your recent paper.
>
> I have a few questions regarding incorporating empirical data. I have simulated data sets with parental, F1, F2 and BC classes and I am wondering how to incorporate the empirical data - do I add it to the simulated data and measure the accuracy of the assignment to classes, to then determine the reliability of detection of hybrids in the empirical data? The tutorial gives a good outline of using the simulated data, but I think I am missing something when it comes to checking the empirical data, so I am asking for some really practical advice about how to incorporate the empirical data.
> Also, should we bootstrap the final probabilities to clarify the results?
>
> Thanks
> Dan

From thibautjombart at gmail.com Fri Feb 2 18:22:30 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 17:22:30 +0000
Subject: [adegenet-forum] How to interpret Density Plot for K=2

Hi there,

I would definitely second Mark's comment and use cross-validation here. Also for the clustering, I would give snapclust a try - I have just pushed a new version on github which is now properly documented.
Especially check what the 'optimal k' is according to the various goodness of fit stats (snapclust.choose.k) - AIC, AICc, BIC, KIC.

Best
Thibaut
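
The two suggestions in this thread, cross-validating the number of retained PCs with xvalDapc and comparing values of k with snapclust.choose.k, could be tried along the lines of the sketch below. It uses sim2pop, a simulated two-population dataset shipped with adegenet, as a stand-in for real data; the values of n.pca.max, n.rep and the maximum k are arbitrary assumptions, and snapclust requires adegenet 2.1.1 or later.

```r
library(adegenet)

set.seed(42)
data(sim2pop)  # simulated example data with two known populations

# Cross-validation: how many PCs should DAPC retain for these groups?
x <- tab(sim2pop, NA.method = "mean")
xval <- xvalDapc(x, pop(sim2pop), n.pca.max = 50, n.rep = 5,
                 xval.plot = FALSE)

# xvalDapc also returns a DAPC fitted with the retained number of PCs.
best_dapc <- xval$DAPC
best_dapc$n.pca

# Compare k = 1..4 with snapclust, here using BIC (lower is better).
bic <- snapclust.choose.k(4, sim2pop, IC = BIC)
which.min(bic)
```

Passing IC = AIC (the default), AICc or KIC instead of BIC gives the other criteria mentioned above, so agreement between them can be checked directly.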

From thibautjombart at gmail.com Fri Feb 2 18:25:45 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 2 Feb 2018 17:25:45 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch> <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>

Hi again,

such a plot typically indicates no clustering. Just to confirm: are we talking about 7 rows and 100,000 columns?

If so, your pools are technically your statistical individuals, and the method explores clustering solutions for 1-6 clusters for 7 individuals, which won't go far - not enough individuals to detect clustering really. Apologies if I misunderstood.

Best
Thibaut

From benjamin.dauphin at unine.ch Fri Feb 2 22:01:40 2018
From: benjamin.dauphin at unine.ch (DAUPHIN Benjamin)
Date: Fri, 2 Feb 2018 21:01:40 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch> <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch>
Message-ID: <40a77a4d0903435b96871c3582004ee1@vRana01.UNINE.CH>

Thanks Thibaut.

Yes, I have 7 pools (= 7 rows, or 7 individuals in the analysis), and I expect two clusters representing two already characterized lineages. I have found 4 likely clusters based on HCPC, but I want to double-check this with kmeans if possible.

Best
Ben

From thibautjombart at gmail.com Mon Feb 5 12:32:59 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 5 Feb 2018 11:32:59 +0000
Subject: [adegenet-forum] Kmeans and DAPC on poolSeq data
In-Reply-To: <40a77a4d0903435b96871c3582004ee1@vRana01.UNINE.CH>
References: <22A6ABF6-1D2B-4DB6-9D52-5899300649A8@wsl.ch> <77640E6F-D646-48E2-86BB-6FE894DBDD54@wsl.ch> <40a77a4d0903435b96871c3582004ee1@vRana01.UNINE.CH>

Hi Ben,

while I'm not aware of hard rules for the number of individuals needed to detect a specific number of clusters, and I appreciate it will depend on how clear-cut the differences are, I don't think it is realistic to look for 4 clusters amongst 7 observations. Even 2 clusters will already be a stretch, unless differences are really very obvious.

Cheers
Thibaut
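
The limitation discussed in this thread (7 pools means 7 statistical individuals, so only 1 to 6 clusters can be examined) can be illustrated with a base-R sketch. Everything here is an assumption made for illustration: the frequencies are simulated with no group structure, stats::kmeans stands in for find.clusters, and the BIC-style criterion is a generic form rather than a statement about adegenet's internals.

```r
set.seed(7)

# Hypothetical stand-in for pool-seq data: 7 pools (rows) x 1000 SNPs,
# all drawn from the same distribution, i.e. no real group structure.
n_pools <- 7
freq <- matrix(rbeta(n_pools * 1000, 2, 2), nrow = n_pools)

# Total within-cluster sum of squares for k = 1..6.
# With 7 rows, k = 7 would simply put each pool in its own cluster.
wss <- sapply(1:(n_pools - 1), function(k) {
  kmeans(freq, centers = k, nstart = 10)$tot.withinss
})

# BIC-style criterion: a fit term plus a penalty that grows with k.
bic <- n_pools * log(wss / n_pools) + log(n_pools) * seq_along(wss)
round(bic, 1)
```

With so few rows, every value of the curve rests on a handful of points, which is one way to see why its shape says little about real structure.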

From thibautjombart at gmail.com Wed Feb 7 19:57:28 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 7 Feb 2018 18:57:28 +0000
Subject: [adegenet-forum] new release, snapclust, podcast

Dear all,

a new version of adegenet (2.1.1) has now been released on CRAN. This version implements snapclust, a fast maximum-likelihood genetic clustering method, recently published in Methods in Ecology and Evolution.

snapclust is presented in the following podcast:
https://www.youtube.com/watch?v=Vl3cf0XHG7Q

Comments and feedback welcome!

Best
Thibaut

From 17197751 at students.latrobe.edu.au Fri Feb 9 01:48:39 2018
From: 17197751 at students.latrobe.edu.au (JEREMY SAMUEL BENWELL-CLARKE)
Date: Fri, 9 Feb 2018 00:48:39 +0000
Subject: [adegenet-forum] Analysing mixed ploidy datasets

Hi everyone,

I'm trying to analyse a mixed ploidy dataset. The majority of my samples are diploid but there are a few triploids too. To make ploidy even in my raw data matrix I added zeros to all my diploid samples, which makes them triploids. I then use the read.genalex function from the poppr package to read in my data, setting ploidy=3. However, I don't want '0' to be recognised as an extra allele and I want the true diploid samples to be separate from the true triploid samples. Therefore, I use the recode_polyploids function from poppr and set newploidy=T. Here is my code:

library(poppr)
# read the GenAlEx file as triploid, then recode so the padding zeros are not counted as alleles
genclone <- read.genalex("C:/Users/...", ploidy = 3)
genclone <- recode_polyploids(genclone, newploidy = TRUE)

I'm wondering if this is the right way (statistically speaking) to analyse mixed ploidy datasets in R?
Will my estimates of genetic diversity and structure be accurate?

I know there is the POLYSAT package, which seems to have been developed specifically for dealing with mixed ploidy datasets; however, I would rather stick to using adegenet and poppr, as I'm familiar with the functions and the structure of genind and genclone objects.

Any help would be much appreciated!

Cheers,

Jeremy

From zkamvar at gmail.com Mon Feb 12 22:10:16 2018
From: zkamvar at gmail.com (Zhian Kamvar)
Date: Mon, 12 Feb 2018 15:10:16 -0600
Subject: [adegenet-forum] Analysing mixed ploidy datasets

Hi Jeremy,

> I'm wondering if this is the right way (statistically speaking) to analyse mixed ploidy datasets in R? Will my estimates of genetic diversity and structure be accurate?

I think the answer is... it depends. Meirmans, Liu, and van Tienderen just came out with a review on this topic: https://doi.org/10.1093/jhered/esy006 *. For multivariate analyses (sPCA, PCA, DAPC, etc.), you will want to reduce the effect of polyploidy by converting your data to allele frequencies with:

tab(myData, freq = TRUE)

If you have an organism that changes ploidy by life stage, you may need to analyze the stages separately. Moreover, if you use Bruvo's distance, the choice of model is very important.

> I know there is the POLYSAT package, which seems to have been developed specifically for dealing with mixed ploidy datasets; however, I would rather stick to using adegenet and poppr, as I'm familiar with the functions and the structure of genind and genclone objects.

I know conversion is a PITA, but I would highly recommend using POLYSAT. There are far more tools to deal with polyploidy in that package that can complement any analyses you would perform with poppr or adegenet.
A few years ago, I wrote a tiny function to help with conversion from genind to polysat data: https://gist.github.com/zkamvar/aeaff83b9d126d55aade

In fact, Clark and Schreier came out with a paper a year ago talking about how to resolve ambiguous ploidy, and it's only available in polysat: http://onlinelibrary.wiley.com/doi/10.1111/1755-0998.12639/abstract

Hope that helps,
Zhian

* The last figure in this paper will change slightly since they used an old version of poppr to calculate Bruvo's distance, which had a silent bug.

-----
Zhian N. Kamvar, Ph. D.
Postdoctoral Researcher (Everhart Lab)
Department of Plant Pathology
University of Nebraska-Lincoln
ORCID: 0000-0003-1458-7108
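
The allele-frequency conversion recommended in this thread for multivariate analyses might be used as in the sketch below. The nancycats dataset (diploid microsatellite data shipped with adegenet) is only a stand-in for a mixed-ploidy genind or genclone object, and the choice of NA.method and of three retained axes is an assumption.

```r
library(adegenet)

data(nancycats)  # stand-in for your own genind / genclone object

# Convert allele counts to within-locus allele frequencies, so that
# individuals of different ploidy end up on a comparable scale;
# missing values are replaced by the mean frequency of each allele.
X <- tab(nancycats, freq = TRUE, NA.method = "mean")

# PCA on the frequencies (centred, not scaled), keeping 3 axes.
pca1 <- ade4::dudi.pca(X, center = TRUE, scale = FALSE,
                       scannf = FALSE, nf = 3)
head(pca1$li)
```

The same matrix X can be passed to other multivariate methods that accept a numeric table, which is the reason for converting once up front.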

From briank.lists at gmail.com Fri Feb 16 16:57:46 2018
From: briank.lists at gmail.com (brian knaus)
Date: Fri, 16 Feb 2018 07:57:46 -0800
Subject: [adegenet-forum] snapclust when HW is not expected

Hi and congrats on your snapclust paper! I was thinking of trying the method on a couple of projects I'm working on. However, I work with fungi and fungus-like plant pathogens that exhibit a mixture of reproductive modes (e.g., selfing, clonality, mitotic reproduction). This means that we do not necessarily expect Hardy-Weinberg assumptions to be met. Your manual seems to come out pretty early stating that HW is important. I would guess that linkage disequilibrium (non-independence of loci) may be an issue also. So this raises my question: in systems where HW may not be assumed and where there may be linkage disequilibrium, would I be better off using DAPC than snapclust?

Thanks!
Brian

From thibautjombart at gmail.com Fri Feb 16 18:24:54 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 16 Feb 2018 17:24:54 +0000
Subject: [adegenet-forum] snapclust when HW is not expected
In-Reply-To:
References:
Message-ID:

Hi Brian

thanks for reposting your question here. I am assuming that by 'DAPC' you mean the K-means clustering presented in the DAPC paper, not the factorial method itself. It is an interesting topic, and there are many possible answers. I'll try to mention a few.

snapclust uses HW to compute the likelihood, like most other model-based (likelihood, Bayesian) clustering methods I know of. Similarly, it assumes independence of loci, so that: (global log-likelihood) = sum(log-likelihood of every locus)

Deviation from HW and linkage between loci will have the same kind of effect: the computed likelihood will be an approximation of the true, unknown likelihood. How good is the approximation in a particular case? I don't think we know, in general, but I'd like to see such a study published. And then, the next question is: how does it change the clustering solution? Again, more work would be interesting on this topic.

I suspect attitudes will vary, pretty much depending on whether one decides to be purist or pragmatic. As an anecdote, while developing various Bayesian or ML methods, I realised several times that the likelihood was 'wrong' (coding error), sometimes even with one full component of the likelihood entirely left out, and the reason I had not flagged it before was that results were still okay. Similarly, a linear regression may still give sensible results despite non-normally distributed residuals. k-means clustering is often used without checking that groups have similar within-group variances. And ML phylogenies from full alignments are commonplace, while the likelihood also assumes independence of loci - see Joe Felsenstein's cheeky comment on that in his pruning algorithm paper.

In short: it could be a problem, but we (at least, I) don't know which impact it'll have. I know, disappointing. My 2 cents would be:
- fairly evenly distributed LD: snapclust should be fine
- a bit of clonality mixed up with some recombination / sexual reproduction: should be worth looking at
- full clonality: work on haplotype frequencies / MLST type of markers (see apex package), and then snapclust will be fine
- never rely on a single method if you can avoid it; I like using a hierarchical clustering and further exploration using factorial methods (PCA, DAPC) as a complement

Please feel free to comment / discuss, everyone. I might put this in a podcast, time allowing.

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
WHO Consultant - outbreak analysis
https://thibautjombart.netlify.com
Twitter: @TeebzR
+44(0)20 7594 3658
From briank.lists at gmail.com Fri Feb 16 23:56:06 2018
From: briank.lists at gmail.com (brian knaus)
Date: Fri, 16 Feb 2018 14:56:06 -0800
Subject: [adegenet-forum] snapclust when HW is not expected
In-Reply-To:
References:
Message-ID:

Thank you for a very thoughtful response! I think a summary is that we can bend the rules, just try not to break things. And I think that was a message expressed by Pritchard's group. They had a paper where they used STRUCTURE on Helicobacter pylori. I think an issue though is that there are many in the biological community who do not understand the methods well enough to know if and when they may have gone too far. I appreciate your recommendations, but for many of these projects we have a reason to expect mixed mating modes without knowing how much of any particular mode to expect. In fact, the research goal is frequently to infer mating mode. Or perhaps which groups of samples may be outcrossing and which are not. I suspect that might be a lot to ask for.

I appreciate your insights! And I find it encouraging that you would like to see more work on this. Perhaps we'll get to that one day?
Brian
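For readers following the thread, the relation stated earlier, (global log-likelihood) = sum of per-locus log-likelihoods under HW, can be sketched in a few lines of base R. This is only a toy illustration and not snapclust's actual code: the function names, the restriction to diploid biallelic loci, and the data values are all assumptions made up for the example.

```r
# Hypothetical helper (not part of adegenet): log-likelihood of one diploid,
# biallelic genotype under Hardy-Weinberg, given the frequency p of the
# reference allele. n_ref is the number of reference alleles (0, 1 or 2).
# Under HW the genotype probabilities q^2, 2pq, p^2 are Binomial(2, p).
hw_loglik_locus <- function(n_ref, p) {
  dbinom(n_ref, size = 2, prob = p, log = TRUE)
}

# Independence of loci: the global log-likelihood is simply the sum of the
# per-locus log-likelihoods.
hw_loglik <- function(genotype, freqs) {
  sum(hw_loglik_locus(genotype, freqs))
}

# Toy data (made up): one individual typed at 3 loci.
genotype <- c(0, 1, 2)        # reference-allele counts per locus
freqs    <- c(0.2, 0.5, 0.9)  # reference-allele frequencies in one cluster

hw_loglik(genotype, freqs)
```

When loci are linked or genotypes deviate from HW, this sum is still computable, but it only approximates the true likelihood, which is the point made in the discussion above.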
From thibautjombart at gmail.com Mon Feb 19 09:41:54 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 19 Feb 2018 08:41:54 +0000
Subject: [adegenet-forum] snapclust when HW is not expected
In-Reply-To:
References:
Message-ID:

Again, it'd be fun to give it a try on an actual case study ;)

If one treats clonal data as independent loci (rather than as a single one), this will result in fully correlated allele frequencies, but this shouldn't change the clustering itself. It will change summary statistics (AIC etc), but as both the deviance and the number of parameters will be overestimated, I am not sure how much this would impact the choice of the 'optimal K'.

Best
Thibaut

From jdpalacio at utexas.edu Tue Feb 20 15:04:54 2018
From: jdpalacio at utexas.edu (Juan D Palacio Mejia)
Date: Tue, 20 Feb 2018 08:04:54 -0600
Subject: [adegenet-forum] snapclust, table.value
Message-ID: <4E649EFD-5585-4F1D-9E51-E9D1D06AB055@utexas.edu>

Hi there,

I have a silly question: when I run a clustering using snapclust or find.clusters, the legend in the table.value graph gives me square sizes using odd numbers, such as 0.5, 1.5. How can I change it to whole numbers?

Thanks guys,

Juan

From thibautjombart at gmail.com Thu Feb 22 12:24:24 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Thu, 22 Feb 2018 11:24:24 +0000
Subject: [adegenet-forum] snapclust, table.value
In-Reply-To: <4E649EFD-5585-4F1D-9E51-E9D1D06AB055@utexas.edu>
References: <4E649EFD-5585-4F1D-9E51-E9D1D06AB055@utexas.edu>
Message-ID:

Hi Juan

The basic graphics of ade4 are not that customisable. Maybe try using adegraphics?
https://cran.r-project.org/web/packages/adegraphics/vignettes/adegraphics.html

Best
Thibaut
From arsalane at protonmail.com Thu Feb 22 12:27:19 2018
From: arsalane at protonmail.com (Arsalan Emami-Khoyi)
Date: Thu, 22 Feb 2018 06:27:19 -0500
Subject: [adegenet-forum] Spatial Data with large DNA alignement
Message-ID:

Hello Thibaut,
I hope that you are fine and all goes well.
A quick question:
When I import a large alignment of DNA sequences using the different tools, how can I efficiently assign a geographical location to each individual or population?
E.g., an alignment of 1000 sequences from 10 populations, each sampled for 100 animals.
What I do at the moment is to make a 1000-line text file with the x and y coordinates of each individual.
Of course, each block of 100 rows has exactly the same coordinates.
I guess there should be a more efficient way of doing that, am I correct?
Heaps of thanks in advance
Regards

Arsalan Emami-Khoyi
Postdoctoral Research Fellow in Wildlife Genomics
University of Johannesburg_Center for Ecological Genomics and Wildlife Conservation
Auckland Park 2006
South Africa
Email : Arsalane at uj.ac.za
Phone :+27 (0)11 559 3373
Cellphone:+27 79 88 14 628
Website :https://sites.google.com/site/drpeterteske/postdocs
From thibautjombart at gmail.com Thu Feb 22 15:00:51 2018
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Thu, 22 Feb 2018 14:00:51 +0000
Subject: [adegenet-forum] Spatial Data with large DNA alignement
In-Reply-To:
References:
Message-ID:

Hello,

by the sound of it, base R will make your life easier - have a look at ?rep

For instance:

> rep(c("foo", "bar"), each = 10)
 [1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar"
[13] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "bar"
>

Cheers
Thibaut
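Applied to the question asked (10 populations, 100 sequences each), the ?rep suggestion could look roughly like the sketch below. The coordinates and the names pop_xy, n_per_pop and ind_xy are made up for illustration, and the first approach assumes the sequences are stored in population order.

```r
# Hypothetical coordinates for 10 populations (made-up values).
pop_xy <- data.frame(
  pop = paste0("pop", 1:10),
  x   = seq(10, 28, by = 2),
  y   = seq(-34, -25, by = 1)
)

# Assumption: sequences are sorted by population, 100 per population.
n_per_pop <- 100
idx <- rep(seq_len(nrow(pop_xy)), each = n_per_pop)

# One row of coordinates per individual (1000 rows, 2 columns).
ind_xy <- pop_xy[idx, c("x", "y")]
rownames(ind_xy) <- NULL

# If individuals are not sorted, index by a per-individual population label
# instead; pop_label here is a stand-in for the real labels.
pop_label <- rep(pop_xy$pop, each = n_per_pop)
ind_xy2 <- pop_xy[match(pop_label, pop_xy$pop), c("x", "y")]
rownames(ind_xy2) <- NULL
```

Only the 10-row table of population coordinates then needs to be typed by hand; the 1000-row version is generated.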
From dollykc at gmail.com Thu Feb 22 21:34:15 2018
From: dollykc at gmail.com (Katharine Coykendall)
Date: Thu, 22 Feb 2018 15:34:15 -0500
Subject: [adegenet-forum] different number of individuals in fastas
Message-ID:

Hello,
I've been following the instructions found at http://popgen.nescent.org/PopDiffSequenceData.html to do a hierarchical F analysis on my data. I have two fasta files with the same 117 individuals in each file. The files are different genes. When I put them together using the read.multiFASTA command, it tells me that there are 120 sequences. When I turn it into a genind object it says there are 120 individuals and 19 alleles. I checked both files to make sure that the sample names are the same between them and they are. I was wondering if it had to do with add.gaps=TRUE so I tried to set it to false, but I get an error:

Error in file(con, "rb") : cannot open the connection
In addition: Warning message:
In file(con, "rb") : cannot open file 'FALSE': No such file or directory

Is there an easy way to look at the multiFASTA or genind object to see where the extra three individuals are coming from?

Thanks,
Katharine

From dollykc at gmail.com Thu Feb 22 23:02:55 2018
From: dollykc at gmail.com (Katharine Coykendall)
Date: Thu, 22 Feb 2018 17:02:55 -0500
Subject: [adegenet-forum] different number of individuals in fastas
In-Reply-To:
References:
Message-ID:

Figured it out!
In one of my fasta files, three of the sample names were capitalized, while in the other file they weren't.
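For anyone hitting the same symptom, a quick check along these lines may help locate the extra individuals. It is only a sketch in base R: the two name vectors are made up, and in practice they would be the sequence labels of each file (for example, names() of the object returned by ape::read.FASTA, which is an assumption about how the files are read).

```r
# Hypothetical sample names as read from the two FASTA files (made up).
names_gene1 <- c("ind01", "ind02", "ind03", "ind04")
names_gene2 <- c("ind01", "IND02", "ind03", "Ind04")

# Names present in only one file; each one ends up as an extra "individual"
# when the files are combined by label.
only_in_1 <- setdiff(names_gene1, names_gene2)
only_in_2 <- setdiff(names_gene2, names_gene1)

# Mismatches that disappear when case is ignored point to capitalisation.
case_only <- only_in_2[tolower(only_in_2) %in% tolower(names_gene1)]

only_in_1
only_in_2
case_only
```

If case_only is non-empty, harmonising the labels (for instance with tolower() on both sets before combining) should bring the count back to the expected number of individuals.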