From thibautjombart at gmail.com  Wed Jul 12 16:21:59 2017
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 12 Jul 2017 15:21:59 +0100
Subject: [adegenet-forum] dapc on allele frequencies
In-Reply-To:
References:
Message-ID:

Hi Mark,

in principle you could use genlight, setting the ploidy for each pool to
(the number of individuals) * ploidy. It should still be quite efficient in
terms of memory savings, and run decently fast for a small number of pools
(<100).

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
WHO Consultant - outbreak analysis
sites.google.com/site/thibautjombart/
Twitter: @TeebzR
+44(0)20 7594 3658

On 17 May 2017 at 16:48, Mark Coulson wrote:

> Hi
>
> I have allele frequency data for pools of individuals (no individual
> genotype data) for >500,000 SNPs. I know I can do a dapc on allele
> frequencies directly but given this many SNPs should I be using a
> 'genlight' object or is this only for individual genotypes?
>
> Thanks,
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

From thibautjombart at gmail.com  Wed Jul 12 16:25:44 2017
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 12 Jul 2017 15:25:44 +0100
Subject: [adegenet-forum] SNP data
In-Reply-To: <13d20f53cc324ca99abc89d6633c8bac@PSU.EDU>
References: <13d20f53cc324ca99abc89d6633c8bac@PSU.EDU>
Message-ID:

Hi there,

there are different ways you can go about this. If RAM isn't an issue, and
some loci have more than 2 alleles, then you can use df2genind on a
data.frame where alleles are coded with letters separated by a character,
e.g. "a / t / a / g", or even "atag", i.e.
no separator, but then you'll need to specify the ncode argument. If RAM is
an issue, genlight will let you store information on polyploids (see the
tutorial on 'genomics'). The input would be a matrix of integers, each
giving the number of copies of the second allele (individuals in rows, SNPs
in columns).

Best
Thibaut

On 30 May 2017 at 20:06, Weiya Xue wrote:

> Hi,
>
> I want to use adegenet for SNP data analysis in polyploids. How should I
> prepare the data?
>
> Does anyone have the syntax of the input file?
>
> Thanks,
>
> Weiya Xue

From thibautjombart at gmail.com  Wed Jul 12 16:29:52 2017
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 12 Jul 2017 15:29:52 +0100
Subject: [adegenet-forum] When cross-validating DAPC (using web server), is it best to use 'group' or 'overall'?
In-Reply-To:
References:
Message-ID:

Hi Stephanie,

do your groups have very different sizes? If so this would explain the
discrepancy. When optimizing cross validation on each group, you basically
make sure that every group is predicted as well as possible. When using
overall classification, the largest group really is what gets optimized.
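
To make the difference between the two criteria concrete, here is a small
base-R sketch. It uses made-up group sizes and a deliberately naive
classifier (both are illustrative assumptions, not data from this thread)
and compares the two ways of summarising assignment success. In adegenet
the corresponding choice is the result argument of xvalDapc ("groupMean"
or "overall").

```r
# Toy example (hypothetical data): three groups of very unequal size.
truth <- factor(rep(c("A", "B", "C"), times = c(80, 15, 5)))

# A deliberately naive classifier that assigns everyone to the largest group.
predicted <- factor(rep("A", length(truth)), levels = levels(truth))

# Overall success: proportion of all individuals assigned correctly.
overall_success <- mean(predicted == truth)

# Group-wise success: proportion assigned correctly within each group,
# then averaged so that every group carries the same weight.
per_group_success <- tapply(predicted == truth, truth, mean)
group_mean_success <- mean(per_group_success)

print(overall_success)      # dominated by the largest group
print(per_group_success)    # one value per group
print(group_mean_success)   # pulled down by the badly predicted small groups
```

With unequal group sizes, a rule that only serves the largest group can
still score well overall, which is why the group-wise criterion tends to
be the safer choice in that situation.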

Best
Thibaut

On 29 June 2017 at 22:20, Stephanie Coster wrote:

> Thanks in advance!
>
> I have a dataset of genotypes from microsatellite loci and I'm looking to
> analyze population structure. Program STRUCTURE shows essentially no
> clusters, and I'd like to use DAPC to get another perspective.
>
> I've run the 'find.clusters' code and the BIC suggests K=2, but the
> assignments are unreliable (equal assignments across all sites to both
> clusters). I am interpreting this to mean that K=1 and all samples likely
> form a single cluster.
>
> Now, I'd like to use my a priori site groupings to draw a scatterplot and
> am using the DAPC web server to cross-validate and suggest the number of
> PCs to retain. I get notably different scatterplots depending on whether I
> choose 'group' or 'overall' to assess. The sites have more spatial
> differentiation when using 'group', and essentially all overlap when using
> 'overall'. I understand that success is calculated by my groups or overall
> depending on the choice, but what does this mean in application? Can
> someone help explain why these plots differ and which is better to use?
>
> Many thanks!
>
> Stephanie

From thibautjombart at gmail.com  Wed Jul 12 16:27:07 2017
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 12 Jul 2017 15:27:07 +0100
Subject: [adegenet-forum] Varimax
In-Reply-To: <9DA8692B-B1B4-43FD-B568-C961F263308C@inra.fr>
References: <9DA8692B-B1B4-43FD-B568-C961F263308C@inra.fr>
Message-ID:

Dear JL,

this is a good idea, but I must confess I have never used varimax rotation
myself, so I can't advise on this. This said, it would be nice to see a
documented example on genetic data, if it turns out to be useful.

Best
Thibaut

On 9 June 2017 at 16:08, Jean-Luc Legras wrote:

> Hi all
>
> Adegenet is a fantastic tool for population genetic analysis.
> I used a PCA on my dataset and populations are clearly clustered.
> However, in this case they do not align with the axes. I was wondering if
> it was possible to set up a Varimax procedure after the PCA (here centered,
> unscaled PCA) in order to make the main population dispersal fit the axes...
>
> Best regards.
>
> JL
>
> scatter plot;

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Graphe-PCA.png
Type: image/png
Size: 53065 bytes
Desc: not available

From thibautjombart at gmail.com  Wed Jul 12 17:54:26 2017
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 12 Jul 2017 16:54:26 +0100
Subject: [adegenet-forum] When cross-validating DAPC (using web server), is it best to use 'group' or 'overall'?
In-Reply-To:
References:
Message-ID:

Good, so then the best choice sounds like the group optimization (rather
than the overall optimization).

Best
Thibaut

On 12 July 2017 at 16:02, Stephanie Coster wrote:

> Yes, they do have different sizes. Thanks, that helps explain it.
>
> Stephanie

From Mark.Coulson.ic at uhi.ac.uk  Wed Jul 12 17:58:21 2017
From: Mark.Coulson.ic at uhi.ac.uk (Mark Coulson)
Date: Wed, 12 Jul 2017 15:58:21 +0000
Subject: [adegenet-forum] dapc on allele frequencies
In-Reply-To:
References:
Message-ID:

Hi Thibaut,

I have been using it just as a normal matrix (i.e. not a genlight object)
and it is pretty fast on a decent CPU. However, I still have an outstanding
question on the DAPC itself. My earlier post was as follows:

I'm using DAPC to try to discriminate between two groups. However, the data
are not individual genotypes, but rather the result of genotyping pools of
samples. There are 20 individual pools in each of the two groups. So
basically I am providing the analysis with a frequency of the A allele (all
dimorphic SNPs) for each pool. There are ~600,000 SNPs in the dataset. I ran
the xvalDapc function and it identified 20 PCs as the optimum. However, when
I run the DAPC on the 20, I get the following warning:

Warning message:
In dapc.data.frame(as.data.frame(x), ...)
: number of retained PCs of PCA may be too large (> N /3)
  results may be unstable

What does this mean in terms of my discrimination, which is pretty good
among the two groups? In other analyses such as ranking SNPs according to
FST, outlier analyses, etc. the separation is pretty good but not as clear
as with DAPC overall. Therefore I am not sure if (1) DAPC is genuinely doing
a better job at separating the groups or (2) there is still over-fitting of
the data with DAPC given the large number of variables and am I simply
finding a solution (which may not be real?)

Also, I have a question on the xvalDapc function. When I run the following

xval1 <- xvalDapc(FD_t, group, n.pca.max=40, result="groupMean", center=TRUE, scale=FALSE, xval.plot=TRUE)

I get results back at 5, 10, 15, 20, 25, 30, 35

However, when I run (on the same dataset)

xval1a <- xvalDapc(FD_t, group, n.pca.max=40, result="groupMean", training.set=0.7, center=TRUE, scale=FALSE, xval.plot=TRUE)

I get results back at 13 different PCA axes levels, roughly by increments
of 2

Also, I am looking to specify the increments so tried something like the
following:

xval2 <- xvalDapc(FD_t, group, n.pca.max=40, result="groupMean", training.set=0.7, center=TRUE, scale=FALSE, n.pca=seq(5, by=5,to=40),xval.plot=TRUE)

but I don't get these exact increments. So what determines the scale of the
x-axis?

Any thoughts would be helpful

Dr Mark Coulson
Researcher -
Rivers and Lochs Institute
T: 01463 273576 / 279477
Normal working days: Tues-Friday
1 Inverness Campus
Inverness IV2 5NA
www.inverness.uhi.ac.uk

Inverness College UHI, a partner in the University of the Highlands and
Islands
Board of Management of Inverness College (known as Inverness College UHI),
Scottish Charity No SC021197.

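
The genlight suggestion made for this thread (treat each pool as one
"individual" whose ploidy is the number of individuals times the organism's
ploidy) could be sketched roughly as follows. This is only an illustration
under stated assumptions: the pool sizes, the diploid ploidy and the random
frequencies are invented, and rounding frequencies to integer allele counts
is one possible conversion, not something prescribed in the thread. The
genlight object is only built if adegenet is installed.

```r
# Hypothetical pooled data: 4 pools (rows) x 6 SNPs (columns); each entry
# is the frequency of the second allele in that pool.
set.seed(1)
n_ind  <- c(20, 20, 25, 30)   # individuals per pool (assumed known)
ploidy <- 2                   # diploid organisms (assumption)
freq <- matrix(runif(4 * 6), nrow = 4,
               dimnames = list(paste0("pool", 1:4), paste0("snp", 1:6)))

# Each pool is treated as one "individual" whose ploidy is the number of
# chromosome copies it contains.
pool_ploidy <- as.integer(n_ind * ploidy)

# Convert frequencies to integer counts of the second allele.
# pool_ploidy is recycled down the rows, so row i is scaled by pool i.
counts <- round(freq * pool_ploidy)
storage.mode(counts) <- "integer"

# Sanity check: no count may exceed the number of copies in its pool.
stopifnot(all(counts >= 0), all(counts <= pool_ploidy))

# Build the genlight object only if adegenet is available.
if (requireNamespace("adegenet", quietly = TRUE)) {
  library(adegenet)
  x <- new("genlight", gen = counts, ploidy = pool_ploidy, parallel = FALSE)
  print(x)
}
```

Whether this pays off compared with a plain matrix of frequencies depends
on the number of pools and SNPs; with a few dozen pools the main gain is
expected to be memory rather than speed.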
From thibautjombart at gmail.com  Wed Jul 12 19:03:37 2017
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Wed, 12 Jul 2017 18:03:37 +0100
Subject: [adegenet-forum] dapc on allele frequencies
In-Reply-To:
References:
Message-ID:

Hi Mark,

sorry about the delay in the reply.

> I'm using DAPC to try to discriminate between two groups. However, the
> data are not individual genotypes, but rather the result of genotyping
> pools of samples. There are 20 individual pools in each of the two groups.
> So basically I am providing the analysis with a frequency of the A allele
> (all dimorphic SNPs) for each pool. There are ~600,000 SNPs in the
> dataset. I ran the xvalDapc function and it identified 20 PC as the
> optimum. However when I run the DAPC on the 20, I get the following
> warning:
>
> Warning message:
> In dapc.data.frame(as.data.frame(x), ...) :
>   number of retained PCs of PCA may be too large (> N /3)
>   results may be unstable
>
> What does this mean in terms of my discrimination, which is pretty good
> among the two groups? In other analyses such as ranking SNPs according to
> FST, outlier analyses, etc. the separation is pretty good but not as clear
> as with DAPC overall.

You can safely ignore the warning.
I think it's been removed from the devel version, and will be gone in the
next CRAN release.

> Therefore I am not sure if (1) DAPC is genuinely doing a better job at
> separating the groups or (2) there is still over-fitting of the data with
> DAPC given the large number of variables and am I simply finding a
> solution (which may not be real?)

I would just examine the results of the cross validation for this. Are the
predictions significantly better than the random expectation (dashed
horizontal lines)?

> Also, I have a question on the xvalDapc function.
> When I run the following
>
> xval1 <- xvalDapc(FD_t, group, n.pca.max=40, result="groupMean", center=TRUE, scale=FALSE, xval.plot=TRUE)
>
> I get results back at 5, 10, 15, 20, 25, 30, 35
> However, when I run (on the same dataset)
>
> xval1a <- xvalDapc(FD_t, group, n.pca.max=40, result="groupMean", training.set=0.7, center=TRUE, scale=FALSE, xval.plot=TRUE)
>
> I get results back at 13 different PCA axes levels, roughly by increments
> of 2
> Also, I am looking to specify the increments so tried something like the
> following:
>
> xval2 <- xvalDapc(FD_t, group, n.pca.max=40, result="groupMean", training.set=0.7, center=TRUE, scale=FALSE, n.pca=seq(5, by=5,to=40),xval.plot=TRUE)
>
> but I don't get these exact increments.
> So what determines the scale of the x-axis?

Can you try with the current devel version? I suspect it might have been a
bug which has been fixed since the last release.

Best
Thibaut

From pskipwith at gmail.com  Sat Jul 22 06:11:36 2017
From: pskipwith at gmail.com (Phillip Skipwith)
Date: Fri, 21 Jul 2017 21:11:36 -0700
Subject: [adegenet-forum] Individuals for genind not plotting dapc
Message-ID:

Hi,

I'm pretty new to Adegenet, but I have been through the tutorials and have
been more or less successful getting it to work on my empirical data. This
is a phylogenomic dataset of 83 individuals from eight clades and 4,268 loci
(I'm using 4,035 SNPs for ordination, etc.).
I realize the sample size is small, but this is hard-earned field data. The
problem arises when I'm trying to use dapc after find.clusters on the below
genind object.

gen.struct
/// GENIND OBJECT /////////

 // 83 individuals; 4,035 loci; 8,341 alleles; size: 4.5 Mb

 // Basic content
   @tab:  83 x 8341 matrix of allele counts
   @loc.n.all: number of alleles per locus (range: 2-4)
   @loc.fac: locus factor for the 8341 columns of @tab
   @all.names: list of allele names for each locus
   @ploidy: ploidy of each individual  (range: 2-2)
   @type:  codom
   @call: read.structure(file = "final_Struct_good_maybe.str", n.ind = 83,
    n.loc = 4035, onerowperind = F, col.lab = 1, col.pop = 2, row.marknames = 0,
    ask = F)

 // Optional content
   @pop: population of each individual (group size range: 2-27)

grp <- find.clusters(gen.struct, max.n.clust=35)
Choose the number PCs to retain (>=1): 80
Choose the number of clusters (>=2: 9

dapc1 <- dapc(gen.struct, grp$grp)
dapc1
	#################################################
	# Discriminant Analysis of Principal Components #
	#################################################
class: dapc
$call: dapc.genind(x = gen.struct, pop = grp$grp)

$n.pca: 60 first PCs of PCA used
$n.da: 4 discriminant functions saved
$var (proportion of conserved variance): 0.946

$eig (eigenvalues): 182000 71010 34130 20710 16790 ...

  vector    length content
1 $eig      8      eigenvalues
2 $grp      83     prior group assignment
3 $prior    9      prior group probabilities
4 $assign   83     posterior group assignment
5 $pca.cent 8341   centring vector of PCA
6 $pca.norm 8341   scaling vector of PCA
7 $pca.eig  82     eigenvalues of PCA

  data.frame    nrow ncol content
1 $tab          83   60   retained PCs of PCA
2 $means        9    60   group means
3 $loadings     60   4    loadings of variables
4 $ind.coord    83   4    coordinates of individuals (principal components)
5 $grp.coord    9    4    coordinates of groups
6 $posterior    83   9    posterior membership probabilities
7 $pca.loadings 8341 60   PCA loadings of original variables
8 $var.contr    8341 4    contribution of original variables

Choose the number PCs to retain (>=1): 60
Choose the number discriminant functions to retain (>=1): 4

scatter(dapc1, scree.da = T)

The end result is a plot with the centroid points for each of the clusters
but not the individuals. I know there is probably something simple that I'm
missing or there's something intrinsically wrong with my code and/or data.
I've perused the forum for similar issues and nothing is quite spot on to
what I'm asking here. Any help would be greatly appreciated.

Best,
Phillip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: daps_plot_failed.pdf
Type: application/pdf
Size: 24356 bytes
Desc: not available
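
For anyone trying to reproduce a problem like this one, a scripted version
of the same workflow with the prompt answers written into the calls can
help. The sketch below is not a diagnosis of the missing points. It assumes
adegenet is installed and uses the package's bundled dapcIllus example data
instead of the poster's file; the values 200, 6, 40 and 5 are illustrative
only. For the data above, the values typed at the prompts (80 PCs and 9
clusters for find.clusters, 60 PCs and 4 discriminant functions for dapc)
would take their place.

```r
# Sketch only: runs if adegenet is installed, on its bundled example data.
if (requireNamespace("adegenet", quietly = TRUE)) {
  library(adegenet)
  data(dapcIllus)
  x <- dapcIllus$a

  # Supplying n.pca and n.clust avoids the interactive prompts.
  grp <- find.clusters(x, n.pca = 200, n.clust = 6)

  # Likewise, n.pca and n.da record the DAPC settings in the script itself.
  dapc1 <- dapc(x, grp$grp, n.pca = 40, n.da = 5)

  # The individual points drawn by scatter() come from this table, so it
  # should hold one row per individual.
  print(dim(dapc1$ind.coord))
  print(table(grp$grp))

  scatter(dapc1, scree.da = TRUE)
}
```

Checking the dimensions of ind.coord and the cluster sizes before plotting
separates a data problem from a display problem.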