From t.jombart at imperial.ac.uk Mon Jun 2 18:20:53 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 2 Jun 2014 16:20:53 +0000
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>

Hi Manuela,

thanks for re-posting on the forum. In this case, it seems that locations are very aggregated: a lot of genotypes were sampled at roughly the same place. Monmonier is unlikely to do well under such circumstances. The algorithm is very sensitive to local differences, and these are unstable for this kind of spatial distribution. I would recommend other approaches.

For instance, if you want to define spatial clusters, you could use a basic clustering algorithm based on the principal components of a PCA (if the spatial structure is obvious) or of an sPCA (if it is not obvious, but there is still a spatial structure). Assuming 'foo' is your analysis (PCA or sPCA), one example would be something along the lines of:

    h1 <- hclust(dist(foo$li)^2)
    plot(h1)
    cutree(h1, k = 4)  # cutree() needs k (or h); 4 is only an example value

Etc.
Check ?hclust for different clustering methods.

Cheers
Thibaut

________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Manuela [manuelacorreia2 at gmail.com]
Sent: 31 May 2014 21:46
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Monmonier algorithm and individual scores

Dear colleagues of Adegenet forum,

First of all, I must congratulate Doctor Thibaut for the wonderful work he has developed so far. Following his own suggestion, I am sharing with you a specific issue raised by the output of the Monmonier algorithm used for boundary detection.

I have a sample of 170 individuals, collected at 9 different places and genotyped for 19 SNPs by real-time PCR. Before showing the line I ran in my R script, let me explain each of its arguments:

    mon1 <- monmonier(xy, D, gab)

    xy:  spatial coordinates (UTM, km)
    D:   pairwise allele sharing distance ('prabclus' package)
    gab: gab <- chooseCN(xy, ask = FALSE, type = 1)  (Delaunay triangulation)

    plot(mon1, 1:170, method = "greylevel", add.arr = FALSE, bwd = 6, col = "red")

From the output produced, it can clearly be seen that there are 4 clusters of individuals with four scores (50, 100, 150, 200). But I can't find a way to access the individual scores. As a matter of fact, I went through all the arguments of the plot function in detail, but none of them seemed to let me extract the individual scores (IS).

I'm wondering if you could give me a hint about it. Any help will be appreciated.

Kind regards,
Manuela (Biochemist)

From manuelacorreia2 at gmail.com Tue Jun 3 11:01:18 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Tue, 3 Jun 2014 10:01:18 +0100
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>
References: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>
Message-ID:

Doctor Thibaut and dear colleagues,

I would like to thank you for your valuable criticism of this output. The idea behind the IS was solely to get a first draft of the georeferenced clusters. I am well aware that, in species with very low or no mobility, several different genotypes at the same coordinates could be a strong indication that the genetic variability is due only to the environment, while high genetic diversity nearby may result from short-range, highly spatially correlated dispersal. This needs further confirmation by sPCA and/or clustering techniques.

The identification of spatial clusters by PCA, and particularly by sPCA, is no doubt more reliable than with the Monmonier algorithm in this case.
But I would rather study more closely each of the 3 methods (distance-based methods, parsimony and maximum likelihood) proposed in your "Trees" tutorial: first, to check whether they are appropriate for this dataset; second, to see whether they give different information, perhaps with higher resolution, than the classic NJ tree, after validation by bootstrap. Eventually, if none is appropriate, I will always be able to rely on the clustering techniques better suited to qualitative data that are available in the "cluster" package, and to validate them with "clValid" following several criteria.

From a very simplistic point of view, PCA (not scaled) might provide information on the genetic variability, whereas sPCA informs on the significance of local and global structures. But, on the whole, the information provided by these two analyses (Moran's index, variance and allele loadings) enables us to discriminate the loci that are more informative on genetic variability but not spatially structured from those whose variability is spatially structured. This is to be further confirmed through biplots.

Another challenge lies ahead: figuring out how to select the PCs that have biological meaning, which are most probably not those associated with the highest eigenvalues, particularly in the absence of trait or phenotype information.

Please feel free to make more comments or to give other suggestions.

Cheers,
Manuela

From t.jombart at imperial.ac.uk Tue Jun 3 11:26:20 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 3 Jun 2014 09:26:20 +0000
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEE2E3@icexch-m1.ic.ac.uk>

Hi there,

I would not recommend using all three phylogenetic reconstruction methods, even if with 19 SNPs there shouldn't be major differences. I covered maximum parsimony for historical reasons, but I can't see it being useful here.

Other clustering approaches sound like a good idea. If you ever fancy documenting how to use them on genetic data in a small tutorial, I think that would be very handy to others ;)

As for your last question, it makes a lot of sense, but you will need external information for this. Eigenvalue selection procedures based on inertia will basically fail to detect the structures you talk about. So you will need to be able to test e.g. the correlation of your PCs to a set of traits, or their spatial distribution, etc.

Cheers
Thibaut

From manuelacorreia2 at gmail.com Tue Jun 3 15:27:32 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Tue, 3 Jun 2014 14:27:32 +0100
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA657087BEE2E3@icexch-m1.ic.ac.uk>
Message-ID:

Doctor Thibaut and dear colleagues,

Deal :). I'll do my best.

About the PCs with biological meaning but without trait/phenotypic information: later on, I'll explain to you why I think this "crazy" idea might be feasible in this case.

Thank you once more for the helpful suggestions.

Cheers,
Manuela

2014-06-03 12:47 GMT+01:00 Manuela :
> Doctor Thibaut and dear colleagues,
>
> Deal :). I'll do my best.
>
> About the PCs with biological meaning but without trait/phenotypic
> information: later I'll explain to you the reason why I insist on using the
> software you have developed for PCA and sPCA to go on with this "crazy"
> idea.
>
> Thank you once more for the helpful suggestions.
>
> Cheers,
> Manuela
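
The clustering suggestion made earlier in this thread (hierarchical clustering on the principal components) can be sketched in base R. This is only an illustration under stated assumptions: the genotypes are simulated, the three groups and their allele frequencies are hypothetical, and prcomp() stands in for 'foo' (with a PCA or sPCA object as described in the thread, the scores would be foo$li rather than pca$x). The linkage method and the value of k are arbitrary choices.

```r
# Sketch only: simulated 0/1/2 genotypes, prcomp() standing in for 'foo'
set.seed(1)
n_ind <- 60
n_snp <- 19

# Three hypothetical groups of 20 individuals with different allele frequencies
freqs <- rbind(rep(0.1, n_snp), rep(0.5, n_snp), rep(0.9, n_snp))
true_grp <- rep(1:3, each = 20)
geno <- t(sapply(true_grp, function(g) rbinom(n_snp, size = 2, prob = freqs[g, ])))

# Unscaled PCA; keep the scores on the first two axes
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]

# Hierarchical clustering on squared Euclidean distances between scores
h1 <- hclust(dist(scores)^2, method = "average")
clusters <- cutree(h1, k = 3)   # k is an assumption; inspect plot(h1) to choose it

# Cross-tabulate inferred clusters against the simulated groups
table(clusters, true_grp)
```

The vector returned by cutree() gives one cluster label per individual, which is the kind of per-individual output the original question was after.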

From apcar at deakin.edu.au Fri Jun 6 04:44:07 2014
From: apcar at deakin.edu.au (ADAM PETER CARDILINI)
Date: Fri, 6 Jun 2014 02:44:07 +0000
Subject: [adegenet-forum] read.PLINK error
Message-ID:

G'day Everyone,

I have recently produced a .vcf file for a set of SNPs obtained using genotyping-by-sequencing. The .vcf file is the final output of the TASSEL pipeline, which takes in fastq sequence files. I converted my .vcf file to .ped and .map files using vcftools, and then converted the .ped file to .raw so that I could load it into R using the 'adegenet' function 'read.PLINK'. The Linux vcftools and plink commands were as follows:

    vcftools --vcf myfile.vcf --out myfile.plink --plink
    plink --file myfile.plink --out myfile.plink --recodeA

I successfully loaded my unaltered file into R using 'adegenet'. However, it has far too many SNPs that I am not interested in (because they have only been sequenced for a couple of individuals), so I thought I would filter my .vcf SNP file using vcftools. I filtered my original file so that only SNPs that were sequenced in >90% of samples remained. This significantly reduced the number of SNPs and produced a new .vcf file. I then converted this file to .ped and .map, and then .ped to .raw, so I could bring it into R and have a quick look.

When I tried to import the new, filtered .raw file using 'read.PLINK' I got the following error:

    Reading PLINK raw format into a genlight object...

    Reading loci information...

    Reading and converting genotypes...
    .Error in (function (classes, fdef, mtable) :
      unable to find an inherited method for function 'nLoc' for signature '"try-error"'
    In addition: Warning message:
    In mclapply(txt, function(e) new("SNPbin", snp = e, ploidy = 2), :
      9 function calls resulted in an error

It seems as if something went wrong when I produced the new .vcf file during filtering. I was wondering if anyone might know what I have done wrong, what these error messages mean, and whether there is a fix I can try?

Thanks in advance for your time and help, I appreciate it.

Kind regards,

Adam Cardilini
PhD Candidate
School of Life and Environmental Sciences,
Deakin University,
75 Pigdons Rd, Waurn Ponds, Vic, Australia, 3217
Mob: 0431 566 340
Email: apcar at deakin.edu.au

From guillaumelouvel at hotmail.fr Sun Jun 8 14:57:39 2014
From: guillaumelouvel at hotmail.fr (Guillaume Louvel)
Date: Sun, 8 Jun 2014 14:57:39 +0200
Subject: [adegenet-forum] relevant way to compare posterior probabilities between DAPC with the same prior groups and the same individuals
Message-ID:

Hi everyone,

I have performed DAPC on a set of 934 individuals, using 10 predefined groups. I did this with different sets of SNPs (coming from epigenetics assays in different tissues); now I would like to compare the posterior assignments to know whether the tissue has an effect, and I don't know what the best way would be. I have thought about the following:

1- Compare the slot assign.per.pop of summary(dapc), which is the percentage of individuals assigned a posteriori to their original prior group, for each group. For me this is a vector of 10 values. To make it clearer, what I want to compare is something like this:

                prior1  prior2  ...  prior j  ...  prior10
    tissue 1    p1,1    p1,2    ...           ...  p1,10
    ...
    tissue i                         pi,j

where pi,j is the proportion of individuals from prior j correctly assigned to j, using tissue i.

I cannot really use an anova, because I have only one value per group per tissue. I think it is useless to repeat the DAPC in order to get several values for each category to be able to do an anova, because if the results come from multiple simulations, I suppose they would be really close. So I don't know what the error on this proportion of correct reassignment would be. Maybe if I knew the error associated with these proportions I could conclude.

I started doing chi-squared tests on the posterior group sizes, but this is not really relevant because the posterior groups are a mix of correct and wrong assignments.

2- Compare the probabilities of assignment at the level of the individual. That is, create a table with these fields:

    individual - priorgrp - post proba of assignment to prior grp - tissue

and then do something like a glm( post proba ~ priorgrp + tissue ).

I cannot do an anova because, for one cluster and one tissue, the probabilities do not follow a normal distribution, so I assume a generalized linear model is better.

Or, use a manova: same as the glm, except that instead of taking only the posterior probability of assignment to the prior group, I take the vector of probabilities of assignment to every group. For now I have not clearly found the conditions for applying a manova, so I am not sure whether I can apply it with the distribution I have.

How would you compare posterior probabilities of DAPC?

I hope this is not too unclear.

Thank you in advance,

Guillaume

PS: I have not been able to find the information, but how are the posterior probabilities of assignment established? By simulation or analytically? If by simulation, how many iterations are performed?
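
The two tables described in this message can be sketched in base R as follows. This is a hedged illustration, not a recommended analysis: the posterior matrices are simulated, post_own_group() and sim_posterior() are hypothetical helpers, and the only assumption about real data is that each DAPC run provides an individuals-by-groups matrix of posterior probabilities plus the prior group factor (for a dapc object these would be its posterior and grp components).

```r
# Hypothetical helper: posterior probability of each individual's own prior group.
# 'posterior' is an n x K matrix with columns named after the groups,
# 'grp' is the factor of prior groups.
post_own_group <- function(posterior, grp) {
  col <- match(as.character(grp), colnames(posterior))
  posterior[cbind(seq_len(nrow(posterior)), col)]
}

# Simulated stand-in for two DAPC runs (two tissues) on the same individuals
set.seed(42)
n <- 90
K <- 3
grp <- factor(rep(paste0("g", 1:K), each = n / K))
sim_posterior <- function(strength) {
  m <- matrix(runif(n * K), n, K, dimnames = list(NULL, levels(grp)))
  own <- cbind(seq_len(n), as.integer(grp))
  m[own] <- m[own] + strength          # favour the prior group
  m / rowSums(m)                       # rows sum to 1, like posteriors
}
post <- list(tissue1 = sim_posterior(2), tissue2 = sim_posterior(0.5))

# Long table: individual - prior group - posterior for prior group - tissue
long <- do.call(rbind, lapply(names(post), function(tis) {
  assigned <- levels(grp)[max.col(post[[tis]], ties.method = "first")]
  data.frame(individual = seq_len(n),
             priorgrp   = grp,
             post_own   = post_own_group(post[[tis]], grp),
             correct    = assigned == as.character(grp),
             tissue     = tis)
}))

# Tissue x prior-group table of the proportion correctly reassigned (the p_i,j)
prop_correct <- with(long, tapply(correct, list(tissue, priorgrp), mean))
prop_correct
```

Because the same individuals appear under every tissue, the rows of the long table are paired by individual, which any model fitted on it would need to take into account.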

From t.jombart at imperial.ac.uk Sun Jun 8 19:41:26 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 8 Jun 2014 17:41:26 +0000
Subject: [adegenet-forum] read.PLINK error
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEF8DB@icexch-m1.ic.ac.uk>

Hello,

what command line did you use to read the data?

Cheers
Thibaut

From t.jombart at imperial.ac.uk Sun Jun 8 19:44:09 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 8 Jun 2014 17:44:09 +0000
Subject: [adegenet-forum] relevant way to compare posterior probabilities between DAPC with the same prior groups and the same individuals
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEF8EB@icexch-m1.ic.ac.uk>

Hello,

I don't have time for a long answer now and had to go through the question quickly, but it will probably be useful to have a look at the DAPC tutorial, and at the following functions for dapc objects: summary, predict, a.score, xvalDapc

Cheers
Thibaut
For now I haven't clearly found the conditions to apply a manova, so I am not sure if I can apply it with the distribution I have.

How would you compare posterior probabilities of DAPC?

Hope this is not too unclear. Thank you in advance,

Guillaume

PS: I have not been able to find the information, but how are the posterior probabilities of assignment established? By simulation or analytically? If by simulation, how many iterations are performed?
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

From apcar at deakin.edu.au Mon Jun 9 00:24:53 2014
From: apcar at deakin.edu.au (ADAM PETER CARDILINI)
Date: Sun, 8 Jun 2014 22:24:53 +0000
Subject: [adegenet-forum] read.PLINK error
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657087BEF8DB@icexch-m1.ic.ac.uk>
References: , <2CB2DA8E426F3541AB1907F98ABA657087BEF8DB@icexch-m1.ic.ac.uk>
Message-ID:

G'day Thibaut,

Sorry I should have included that in the original email. The code I use to read the data was:

dat <- read.PLINK('myfiltered_plinkconvertedfile.raw', map.file = 'myfiltered_plinkconvertedfile.map')

This command line worked on the unfiltered data files, just not the ones I got after filtering in vcftools.

Cheers,

Adam

Sent from my iPad

> On 9 Jun 2014, at 3:42 am, "Jombart, Thibaut" wrote:
>
> Hello,
>
> what command line did you use to read the data?
>
> Cheers
> Thibaut
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of ADAM PETER CARDILINI [apcar at deakin.edu.au]
> Sent: 06 June 2014 03:44
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] read.PLINK error
>
> G'day Everyone,
>
> I have recently produced a .vcf file for a set of SNPs obtained using Genotype-by-sequencing.
> The .vcf file is the final output from the TASSEL pipeline, which takes in fastq sequence files. I converted my .vcf file to .ped and .map files using vcftools and then converted the .ped file to .raw so that I could load it into R using the 'adegenet' function 'read.PLINK'. The Linux vcftools and plink code was as follows:
>
> vcftools --vcf myfile.vcf --out myfile.plink --plink
> plink --file myfile.plink --out myfile.plink --recodeA
>
> I successfully loaded my unaltered file into R using 'adegenet'; however, it has far too many SNPs that I am not interested in (because they have only been sequenced for a couple of individuals), so I thought I would filter my .vcf SNP file using vcftools. I filtered my original file so that only SNPs that were sequenced from >90% of samples remained. This significantly reduced the number of SNPs I had and produced a new .vcf file. I then converted this file to .ped and .map, and then .ped to .raw so I could bring it into R and have a quick look.
>
> When I tried to import the new, filtered .raw file using 'read.PLINK' I got the following error:
>
> > Reading PLINK raw format into a genlight object...
> > Reading loci information...
> > Reading and converting genotypes...
> .Error in (function (classes, fdef, mtable) :
>   unable to find an inherited method for function ‘nLoc’ for signature ‘"try-error"’
> In addition: Warning message:
> In mclapply(txt, function(e) new("SNPbin", snp = e, ploidy = 2), :
>   9 function calls resulted in an error
>
> It seems as if something has gone wrong when I have produced the new .vcf file during filtering. I was wondering if anyone might know what I have done wrong, what these error messages mean and whether there is a fix I can try?
>
> Thanks in advance for your time and help, I appreciate it.
>
> Kind regards,
>
> Adam Cardilini
> PhD Candidate
> Schools of Life and Environmental Sciences,
> Deakin University, 75 Pigdons Rd,
> Waurn Ponds, Vic, Australia, 3217
> Mob: 0431 566 340
> Email: apcar at deakin.edu.au
>

From emmanuel.wicker at cirad.fr Mon Jun 9 17:23:47 2014
From: emmanuel.wicker at cirad.fr (Emmanuel WICKER)
Date: Mon, 9 Jun 2014 19:23:47 +0400 (RET)
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement to Genlight
In-Reply-To: <1464298593.12352.1402326964959.JavaMail.root@cirad.fr>
Message-ID: <1340384164.12422.1402327427518.JavaMail.root@cirad.fr>

Hi all

I tried to convert a fasta alignment to a genlight object, and I had the following message:

> toto=fasta2genlight("EGL_ARB_originaux_160913_TRIM.fas")#my command
Converting FASTA alignment into a genlight object...
Loading required package: parallel
Looking for polymorphic positions...
..........
Error in mclapply(txt, function(e) strsplit(paste(e[-1], collapse = ""), :
  'mc.cores' > 1 is not supported on Windows

Any help?
I run R under Windows 7, adegenet version 1.4.2
Thank you
Manu

From t.jombart at imperial.ac.uk Mon Jun 9 17:35:55 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 9 Jun 2014 15:35:55 +0000
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement to Genlight
In-Reply-To: <1340384164.12422.1402327427518.JavaMail.root@cirad.fr>
References: <1464298593.12352.1402326964959.JavaMail.root@cirad.fr>, <1340384164.12422.1402327427518.JavaMail.root@cirad.fr>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEFDB0@icexch-m1.ic.ac.uk>

Hi

can you try
parallel = FALSE
as argument?

Cheers
Thibaut

From emmanuel.wicker at cirad.fr Mon Jun 9 18:02:40 2014
From: emmanuel.wicker at cirad.fr (Emmanuel WICKER)
Date: Mon, 9 Jun 2014 20:02:40 +0400 (RET)
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement to Genlight
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657087BEFDB0@icexch-m1.ic.ac.uk>
Message-ID: <636378527.12663.1402329760033.JavaMail.root@cirad.fr>

Hi Thibaut

I already tested that, but still it doesn't work.
For that command, and also for read.snp of a DNAbin object (same error message)

Cheers
Manu

From caitiecollins at gmail.com Mon Jun 9 19:32:19 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Mon, 9 Jun 2014 18:32:19 +0100
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement to Genlight
In-Reply-To: <636378527.12663.1402329760033.JavaMail.root@cirad.fr>
References: <2CB2DA8E426F3541AB1907F98ABA657087BEFDB0@icexch-m1.ic.ac.uk> <636378527.12663.1402329760033.JavaMail.root@cirad.fr>
Message-ID:

Hi Emmanuel,

I'm running adegenet on a Windows computer, and I've previously had the same error message that you're currently experiencing (see below, first example).
For all the instances you have mentioned, however, I usually find that adding the argument parallel=FALSE does the trick. Would you be able to copy and paste the following example (the line below starting with myPath, and then the line from the second example starting with obj) and then report back with the outcome? Thanks very much.

myPath <- system.file("files/usflu.fasta",package="adegenet")

# without the parallel argument --> same error message you are getting:
> obj <- fasta2genlight(myPath, chunk=10) # process 10 sequences at a time
Converting FASTA alignment into a genlight object...
Loading required package: parallel
Looking for polymorphic positions...
..........
Error in mclapply(txt, function(e) strsplit(paste(e[-1], collapse = ""), :
  'mc.cores' > 1 is not supported on Windows

# WITH the parallel=FALSE argument:
obj <- fasta2genlight(myPath, chunk=10, parallel=FALSE) # process 10 sequences at a time
Converting FASTA alignment into a genlight object...
Looking for polymorphic positions...
..........
Extracting SNPs from the alignment...
..........
Building final object...
...done.

Cheers,
Caitlin.

From patriciasalerno at gmail.com Fri Jun 13 23:27:10 2014
From: patriciasalerno at gmail.com (Patricia Salerno)
Date: Fri, 13 Jun 2014 16:27:10 -0500
Subject: [adegenet-forum] DAPC: loadings of original variables as table?
Message-ID:

Hi everyone,

I'm using PCA/DAPC as well as STRUCTURE with a SNP matrix. I'm getting different results with the two approaches, and the DAPC results are much more logical, biologically speaking (some individuals of a very well-supported cluster in DAPC are being assigned to the other cluster, even though the separation in PC1 is enormous!). Thus, I want to see if the discrepancies of population assignment in STRUCTURE are due to the fact that the DAPC initially transforms the data into vectors that maximize variation, thus effectively weighing my variables differently, while STRUCTURE weighs all SNPs equally. The only strategy I've come up with to investigate this issue further is to generate a table of the loadings of the SNP variables (the original, not the transformed variables after PCA), and prune my matrix to only keep the SNPs with sufficient contributions (setting some post-hoc cutoff). However, I cannot figure out how to print a table of the SNP loadings after the DAPC, or if it's even possible. What I would want is a printed matrix of two columns, one with the SNP names, and another with the contributions/loadings. Could anyone help me with this? Or, does anyone have another suggestion for approaching this issue?

Thank you!!

~patricia.

--
Patricia Salerno
PhD Candidate
Ecology Evolution and Behavior
Section of Integrative Biology
University of Texas at Austin

From goatsrunfaster at gmail.com Sat Jun 14 15:11:55 2014
From: goatsrunfaster at gmail.com (Spencer Bruce)
Date: Sat, 14 Jun 2014 09:11:55 -0400
Subject: [adegenet-forum] Identifying clusters / Error in row names
Message-ID:

Hello All,

I am trying to run a DAPC on some microsatellite data, and have had no problems going through the tutorial using the tutorial data, but I am immediately running into problems after converting my STRUCTURE file to a genind object.
Given that as a first step I would like to identify clusters using my entire data set, I do the following, and receive the following error message: > x <- obj1 > x ##################### ### Genind object ### ##################### - genotypes of individuals - S4 class: genind @call: read.structure(file = file, missing = missing, quiet = quiet) @tab: 990 x 118 matrix of genotypes @ind.names: vector of 990 individual names @loc.names: vector of 11 locus names @loc.nall: number of alleles per locus @loc.fac: locus factor for the 118 columns of @tab @all.names: list of 11 components yielding allele names for each locus @ploidy: 2 @type: codom Optional contents: @pop: - empty - @pop.names: - empty - @other: - empty - > grp <- find.clusters(x, max.n.clust=41) Error in `row.names<-.data.frame`(`*tmp*`, value = c("001", "003", "005", : duplicate 'row.names' are not allowed In addition: Warning messages: 1: In data.row.names(row.names, rowsi, i) : some row.names duplicated: 497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,7 [... 
truncated]
2: non-unique values when setting 'row.names':

This is what my original data set looks like in the STRUCTURE file (a first row of loci names, and then 2 rows of fragment lengths for each individual with no labels):

SfoB52  SfoC24  SfoC28  SfoC38  SfoC86  SfoC88  SfoC113  SfoC129  SfoD75  SfoD91  SfoD100
203     113     179     143     101     181     133      221      188     228     230
225     113     191     143     116     184     139      230      208     236     238
215     113     183     143     110     184     133      230      180     212     214
219     122     191     143     116     184     139      230      188     220     214
211     113     179     143     101     184     142      230      180     212     214
219     113     191     143     110     190     151      230      204     228     214
etc.

Any help would be very greatly appreciated, as I'm new to using R, but am excited about the possibilities!

Best,
Spencer

--
Spencer A Bruce
200 Washington St.
Troy, NY 12180
518 225 0787

From manuelacorreia2 at gmail.com Sat Jun 14 17:56:52 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Sat, 14 Jun 2014 16:56:52 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:

Patrícia,

I made a small test with the example suggested in the sPCA tutorial (http://adegenet.r-forge.r-project.org/) and it seems that you can get the SNP loadings after modelling your dataset by DAPC. The values you want are stored in the slot pca.loadings.

Just try these two command lines:

A<-dapc1$pca.loadings
write.table(A,file="A")

And afterwards open it in Excel. By default a file named "A" is saved in the MyDocuments folder. But if you have any trouble opening it, please let me know directly by e-mail. Anyway, I'm sure Dr. Thibaut will soon confirm this information.

Hoping to be helpful,

M.
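A minimal sketch of the two-column table asked for in this thread (variable name, loading or contribution), written in base R so that it does not depend on any particular adegenet version. The helper name loadings_table, the toy matrix and the output file name are assumptions made for illustration only. As far as the dapc documentation describes it, pca.loadings holds the loadings of the initial PCA step, whereas var.contr holds the contributions of the original variables (alleles) to the discriminant functions, so it may be worth checking which of the two is actually wanted before pruning SNPs. Note also that write.table() and write.csv() write to R's current working directory (see getwd()) unless a full path is given.

```r
# Hypothetical helper (not part of adegenet): turn one column of a
# loadings/contribution matrix into a two-column table, largest value first.
loadings_table <- function(mat, axis = 1) {
  mat <- as.matrix(mat)
  stopifnot(!is.null(rownames(mat)), axis >= 1, axis <= ncol(mat))
  out <- data.frame(variable = rownames(mat),
                    value = unname(mat[, axis]),
                    stringsAsFactors = FALSE)
  out <- out[order(out$value, decreasing = TRUE), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy stand-in for dapc1$var.contr (made-up numbers, for illustration only).
toy <- matrix(c(0.50, 0.10, 0.40,
                0.20, 0.70, 0.10),
              ncol = 2,
              dimnames = list(c("snp1.A", "snp1.T", "snp2.C"),
                              c("LD1", "LD2")))

tab <- loadings_table(toy, axis = 1)

# Write the table to an explicit location rather than relying on getwd().
out_file <- file.path(tempdir(), "snp_contributions.csv")
write.csv(tab, out_file, row.names = FALSE)

# With a real analysis this would be called as, for example:
# tab <- loadings_table(dapc1$var.contr, axis = 1)
```

The resulting CSV file opens directly in a spreadsheet, and the sorted table makes it straightforward to apply a post-hoc cutoff on the value column.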
From manuelacorreia2 at gmail.com Sat Jun 14 18:00:03 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Sat, 14 Jun 2014 17:00:03 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:

Sorry, I meant DAPC tutorial (March 24, 2014).

Cheers,
M.
2014-06-14 16:56 GMT+01:00 Manuela : > Patr?cia, > > > > I made a small test with example suggested on sPCA tutorial ( > http://adegenet.r-forge.r-project.org/) and apparently it seems that you > can get the SNP loadings after modelling yout dataset by DAPC. The values > you want are stored in the slot pca.loadings. > > > Just try these two command lines: > > A<-dapc1$pca.loadings > > write.table(A,file=?A?) > > > And afterwards open it in Excel. By default a file named ?A? is saved on > MyDocuments folder. But if you have any trouble on open it please let me > now directly to my e-mail. Anyway, I?m sure Dr. Thimbault will soon confirm > this information. > > > Hoping to be helpful, > > M. > > > 2014-06-14 11:00 GMT+01:00 < > adegenet-forum-request at lists.r-forge.r-project.org>: > > Send adegenet-forum mailing list submissions to >> adegenet-forum at lists.r-forge.r-project.org >> >> To subscribe or unsubscribe via the World Wide Web, visit >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum >> >> or, via email, send a message with subject or body 'help' to >> adegenet-forum-request at lists.r-forge.r-project.org >> >> You can reach the person managing the list at >> adegenet-forum-owner at lists.r-forge.r-project.org >> >> When replying, please edit your Subject line so it is more specific >> than "Re: Contents of adegenet-forum digest..." >> >> >> Today's Topics: >> >> 1. DAPC: loadings of original variables as table? (Patricia Salerno) >> >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Fri, 13 Jun 2014 16:27:10 -0500 >> From: Patricia Salerno >> To: adegenet-forum at lists.r-forge.r-project.org >> Subject: [adegenet-forum] DAPC: loadings of original variables as >> table? >> Message-ID: >> > 531Ejp3VQEw at mail.gmail.com> >> Content-Type: text/plain; charset="utf-8" >> >> Hi everyone, >> >> I'm using PCA/DAPC as well as STRUCTURE with a SNP matrix. 
I'm getting >> different results with the two approaches, and the DAPC results are much >> more logical, biologically speaking (some individuals of a very >> well-supported cluster in DAPC are being assigned to the other cluster, >> even though the separation in PC1 is enormous!). Thus, I want to see if >> the >> discrepancies of population assignment in STRUCTURE are due to the fact >> that the DAPC initially transforms the data into vectors that maximize >> variation, thus effectively weighing my variables differently, while >> STRUCTURE weighs all SNPs equally. The only strategy I've come up with to >> investigate this issue further is to generate a table of the loadings of >> the SNP variables (the original, not the transformed variables after PCA), >> and prune my matrix to only keep the SNPs with sufficient contributions >> (setting some post-hoc cutoff). However, I cannot figure out how to print >> a >> table of the SNP loadings after the DAPC, or if it's even possible. What I >> would want is a printed matrix of two columns, one with the SNP names, and >> another with the contributions/loadings. Could anyone help me with this? >> Or, does anyone have another suggestion for approaching this issue? >> >> Thank you!! >> >> ~patricia. >> >> >> -- >> Patricia Salerno >> PhD Candidate >> Ecology Evolution and Behavior >> Section of Integrative Biology >> University of Texas at Austin >> -------------- next part -------------- >> An HTML attachment was scrubbed... 
From manuelacorreia2 at gmail.com Sat Jun 14 18:09:28 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Sat, 14 Jun 2014 17:09:28 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:

Patrícia,

On this same subject, I would like to recommend an article I read some time ago.

Reference:
Kalinowski, ST (2011) "The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure", Heredity, 106: 625-632

Cheers,
M.

2014-06-14 17:00 GMT+01:00 Manuela :

> Sorry, I meant DAPC tutorial (March 24, 2014).
>
> Cheers,
> M.
>
> 2014-06-14 16:56 GMT+01:00 Manuela :
>
>> Patrícia,
>>
>> I ran a small test with the example suggested in the sPCA tutorial (http://adegenet.r-forge.r-project.org/), and it seems that you can get the SNP loadings after modelling your dataset with DAPC. The values you want are stored in the slot pca.loadings.
>>
>> Just try these two command lines:
>>
>> A <- dapc1$pca.loadings
>> write.table(A, file="A")
>>
>> Then open the file in Excel. By default a file named "A" is saved in the My Documents folder. If you have any trouble opening it, please let me know directly by e-mail. Anyway, I'm sure Dr. Thibaut will soon confirm this information.
>>
>> Hoping to be helpful,
>>
>> M.
>>
>> 2014-06-14 11:00 GMT+01:00 <adegenet-forum-request at lists.r-forge.r-project.org>:
>>
>>> Today's Topics:
>>>
>>> 1. DAPC: loadings of original variables as table? (Patricia Salerno)
>>>
>>> Message: 1
>>> Date: Fri, 13 Jun 2014 16:27:10 -0500
>>> From: Patricia Salerno
>>> To: adegenet-forum at lists.r-forge.r-project.org
>>> Subject: [adegenet-forum] DAPC: loadings of original variables as table?
>>>
>>> Hi everyone,
>>>
>>> I'm using PCA/DAPC as well as STRUCTURE with a SNP matrix. I'm getting different results with the two approaches, and the DAPC results are much more logical, biologically speaking (STRUCTURE assigns some individuals of a very well-supported DAPC cluster to the other cluster, even though the separation in PC1 is enormous!). Thus, I want to see if the discrepancies of population assignment in STRUCTURE are due to the fact that the DAPC initially transforms the data into vectors that maximize variation, thus effectively weighting my variables differently, while STRUCTURE weights all SNPs equally. The only strategy I've come up with to investigate this issue further is to generate a table of the loadings of the SNP variables (the original, not the transformed variables after PCA), and prune my matrix to only keep the SNPs with sufficient contributions (setting some post-hoc cutoff). However, I cannot figure out how to print a table of the SNP loadings after the DAPC, or if it's even possible. What I would want is a printed matrix of two columns, one with the SNP names, and another with the contributions/loadings. Could anyone help me with this? Or, does anyone have another suggestion for approaching this issue?
>>>
>>> Thank you!!
>>>
>>> ~patricia.
>>>
>>> --
>>> Patricia Salerno
>>> PhD Candidate
>>> Ecology Evolution and Behavior
>>> Section of Integrative Biology
>>> University of Texas at Austin

From patriciasalerno at gmail.com Sat Jun 14 20:10:04 2014
From: patriciasalerno at gmail.com (Patricia Salerno)
Date: Sat, 14 Jun 2014 13:10:04 -0500
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:

Thank you so much, Manuela, for the tip and for the reference!! Very helpful... it worked just fine with my data.

Cheers!!

~patricia.
--
Patricia Salerno
PhD Candidate
Ecology Evolution and Behavior
Section of Integrative Biology
University of Texas at Austin
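The export suggested in this thread (take a loadings matrix and write it out with write.table) can be sketched in base R as follows. This is only an illustration: the matrix `A` below is made-up data standing in for something like `dapc1$pca.loadings`, and the axis name, file name, and use of `tempdir()` are assumptions chosen for the example. It produces the two-column layout asked for in the original question (variable name, loading) for one chosen axis.

```r
# Toy stand-in for a loadings matrix such as dapc1$pca.loadings:
# rows are alleles/SNPs, columns are retained axes. Values are made up.
A <- matrix(c(0.62, -0.10, 0.05, 0.71,
              0.08,  0.55, -0.40, 0.02),
            nrow = 4,
            dimnames = list(c("snp1.A", "snp1.G", "snp2.C", "snp2.T"),
                            c("PC1", "PC2")))

# Two-column table: variable name and its loading on one chosen axis.
axis <- "PC1"
tab <- data.frame(SNP = rownames(A), loading = A[, axis],
                  row.names = NULL, stringsAsFactors = FALSE)

# Sort by absolute loading so the strongest contributors come first.
tab <- tab[order(abs(tab$loading), decreasing = TRUE), ]

# Write a tab-delimited file; an explicit path avoids guessing where it lands.
out <- file.path(tempdir(), "loadings_PC1.txt")
write.table(tab, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the round trip.
back <- read.table(out, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
print(back)
```

Without an explicit path, write.table saves into R's current working directory, which getwd() reports. If I recall the adegenet documentation correctly, a dapc object also carries a var.contr component holding the contributions of the original variables to the discriminant functions, which may be closer to what the question asks for than the PCA loadings; ?dapc is the place to check.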
From t.jombart at imperial.ac.uk Sat Jun 14 22:15:06 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sat, 14 Jun 2014 20:15:06 +0000
Subject: [adegenet-forum] Identifying clusters / Error in row names
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BF13E2@icexch-m1.ic.ac.uk>

Hi there,

can you try replacing the individual labels? Duplicated labels would cause problems there. E.g.:

indNames(x) <- 1:nInd(x)

Cheers
Thibaut
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Spencer Bruce [goatsrunfaster at gmail.com]
Sent: 14 June 2014 14:11
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Identifying clusters / Error in row names

Hello All,

I am trying to run a DAPC on some microsatellite data. I had no problems going through the tutorial with the tutorial data, but I run into problems immediately after converting my STRUCTURE file to a genind object.
As a first step I would like to identify clusters using my entire data set, so I run the following and receive this error message:

> x <- obj1
> x

   #####################
   ### Genind object ###
   #####################
- genotypes of individuals -

S4 class: genind
@call: read.structure(file = file, missing = missing, quiet = quiet)

@tab: 990 x 118 matrix of genotypes

@ind.names: vector of 990 individual names
@loc.names: vector of 11 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the 118 columns of @tab
@all.names: list of 11 components yielding allele names for each locus
@ploidy: 2
@type: codom

Optional contents:
@pop: - empty -
@pop.names: - empty -

@other: - empty -

> grp <- find.clusters(x, max.n.clust=41)
Error in `row.names<-.data.frame`(`*tmp*`, value = c("001", "003", "005", :
  duplicate 'row.names' are not allowed
In addition: Warning messages:
1: In data.row.names(row.names, rowsi, i) :
  some row.names duplicated: 497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,7 [...
truncated]
2: non-unique values when setting 'row.names':

This is what my original data set looks like in the STRUCTURE file (a first row of locus names, then 2 rows of fragment lengths for each individual, with no labels):

SfoB52 SfoC24 SfoC28 SfoC38 SfoC86 SfoC88 SfoC113 SfoC129 SfoD75 SfoD91 SfoD100
203 113 179 143 101 181 133 221 188 228 230
225 113 191 143 116 184 139 230 208 236 238
215 113 183 143 110 184 133 230 180 212 214
219 122 191 143 116 184 139 230 188 220 214
211 113 179 143 101 184 142 230 180 212 214
219 113 191 143 110 190 151 230 204 228 214
etc.

Any help would be very greatly appreciated, as I'm new to using R, but am excited about the possibilities!

Best,

Spencer

--
Spencer A Bruce
200 Washington St.
Troy, NY 12180
518 225 0787

From neagef at gmail.com Tue Jun 17 11:12:40 2014
From: neagef at gmail.com (Andrea Garavito)
Date: Tue, 17 Jun 2014 11:12:40 +0200
Subject: [adegenet-forum] SNP alleles
Message-ID:

Hi everybody!

I'm currently trying to run a PCA on a SNP matrix from a diploid organism; most of the SNPs are bi-allelic. Although the results I obtain are logical given previous knowledge of the groups, I'm confused by the genind object I obtain, and I want to be sure about what's going on in the analysis.

My data file is formatted using the nucleotides as alleles with a "/" separator, and missing data coded as "NA".

ind   mk1  mk2
ind1  G/A  C/T
ind2  G/G  C/T

After loading my data matrix with the df2genind function, my data is stored as a matrix with four times the number of columns of the original file:

ind   mk1.A  mk1.G  mk1.A  mk1.G  mk2.C  mk2.T  mk2.C  mk2.T
ind1  0.5    0.0    0      0.5    0.0    0.5    0.5    0
ind2  0.0    0.5    0      0.5    0.0    0.5    0.5    0

Is that correct? I thought I would get two columns per marker locus instead of 4. From there I obtain doubled statistics for each of the alleles. Since I don't know the phase, an A/G is the same as a G/A, so how can I get unified statistics for each allele?
Thank you for your answer

Best regards
Andrea

From caitiecollins at gmail.com Tue Jun 17 13:36:18 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Tue, 17 Jun 2014 12:36:18 +0100
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
Message-ID:

Hi Andrea,

I'm afraid that without seeing the exact code you used to generate the results you presented, it is a bit difficult to say for certain what the origin of your problem is. So please forgive me if the following suggestion misses the mark. (If so, can I ask you to reply with the functions and arguments you used to generate that output?)

I notice you've stated that your original data file is formatted using a "/" separator. One way of getting the df2genind output you are seeing is by neglecting to tell the df2genind function that you are using that separator. If you have not done so already, try adding the argument sep="/" to the arguments passed to df2genind.

Let me know if that does the trick. If not, please post back with the code you are using and we can go from there.

Best,
Caitlin.

From t.jombart at imperial.ac.uk Tue Jun 17 13:59:04 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 11:59:04 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>

Hi there,

yes, as Caitlin said, something probably went wrong in the conversion.
I get:

> dat=data.frame(mk1=c("G/A","G/G"), km2=c("C/T","C/T"))
> dat
  mk1 km2
1 G/A C/T
2 G/G C/T
> x=df2genind(dat,sep="/",ploidy=2)
> truenames(x)
  mk1.A mk1.G km2.C km2.T
1   0.5   0.5   0.5   0.5
2   0.0   1.0   0.5   0.5
>

Cheers
Thibaut

From neagef at gmail.com Tue Jun 17 14:47:13 2014
From: neagef at gmail.com (Andrea Garavito)
Date: Tue, 17 Jun 2014 14:47:13 +0200
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
Message-ID:

Hi Caitlin and Thibaut,

Thanks for your answers. I did use the sep argument. My code to generate the genind object is:

> myData_genid <- df2genind(myData, sep="/")

The weird thing is that when I try the same code with a test object that I created:

> dat = data.frame(loc1=c("A/A","T/A","T/A","T/T","T/A","A/T"), loc2=c("C/G","G/C","C/C","G/G","C/G","G/C"))
> x=df2genind(dat, sep="/")

I get two columns per locus (as Thibaut does):

> truenames(x)
  loc1.A loc1.T loc2.C loc2.G
1    1.0    0.0    0.5    0.5
2    0.5    0.5    0.5    0.5
3    0.5    0.5    1.0    0.0
4    0.0    1.0    0.0    1.0
5    0.5    0.5    0.5    0.5
6    0.5    0.5    0.5    0.5

But when I test a subset of my data

> test<-myData[1:10,1:10]
> test
  loc_29  loc_7   loc_43  etc...
1 "G / A" "C / T" "T / T"
2 "G / G" "C / T" "T/ T"
etc...
> test_genid <- df2genind(test,sep="/")

I again get three or four columns:

> truenames(test_genid)
  loc_29.A loc_29.G loc_29.G loc_7.C loc_7.T loc_7.C loc_43.C loc_43.T loc_43.C loc_43.T etc..
1      0.5      0.0      0.5     0.0     0.5     0.5      0.0      0.5      0.0      0.5
2      0.0      0.5      0.5     0.0     0.5     0.5      0.0      0.5      0.0      0.5
etc...

When I carry out my PCA with all my data:

> X <- scaleGen(myData_genid, scale=F, missing="mean")
> pca_myData<-dudi.pca(X,center=F,scale=F)

I get the following message:

In data.row.names(row.names, rowsi, i) :
  some row.names duplicated: 3,4,...

I really don't understand what is causing that. Is there a hidden character in my data file that makes df2genind divide my columns? Does that affect the results I get afterwards?

By the way, I tried scale=F and scale=T in the scaleGen function, but I get two radically different results. With scale=T my individuals are separated into only two groups, while with scale=F individuals are more "harmoniously" distributed over the 2 axes. Which one would be more appropriate for my data type? Because both seem in agreement with the origin of the individuals, I'm not sure which one represents the "real picture".

Thanks for your comments

Andrea

From t.jombart at imperial.ac.uk Tue Jun 17 14:57:33 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 12:57:33 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A57D@icexch-m1.ic.ac.uk>

What is "myData"? BTW it is safer to specify the ploidy when constructing a genind. Try:

alleles(test_genid) # btw the name is 'genind' - genotype of individuals

to see if it is a problem of empty characters.

Cheers
Thibaut

From t.jombart at imperial.ac.uk Tue Jun 17 15:24:07 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 13:24:07 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk> <2CB2DA8E426F3541AB1907F98ABA65709A12A57D@icexch-m1.ic.ac.uk>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A5ED@icexch-m1.ic.ac.uk>

Me neither. But so far:
- the instructions we can verify all behave normally
- we don't have reproducible code for the stated problem

If you can send a small subset of the data and the command line used to create *myData*, and the commands showing the problem for this dataset, then we can try and figure it out.

Best
Thibaut
________________________________________
From: Andrea Garavito [neagef at gmail.com]
Sent: 17 June 2014 14:15
To: Jombart, Thibaut
Subject: Re: [adegenet-forum] SNP alleles

Hi Thibaut,

myData is a matrix of 162 individuals with 10806 biallelic SNPs, coded as I already mentioned. I've run df2genind with both ploidy=as.integer(2) and ploidy=2 and I get exactly the same result. It doesn't seem to be an empty character problem. I really don't understand.

> alleles(test_genid)
$L01
  1   2   3
"A" "G" "G"

$L02
  1   2   3
"C" "T" "C"

$L03
  1   2
"G" "C"

$L04
  1   2   3
"A" "C" "A"

$L05
  1   2
"G" "A"

$L06
  1   2
"G" "C"

$L07
  1   2   3   4
"C" "T" "C" "T"

$L08
  1   2   3
"C" "C" "T"

$L09
  1   2   3   4
"G" "T" "G" "T"

$L10
  1   2   3
"C" "T" "T"

Thanks again
Andrea
From m.navascues at gmail.com Tue Jun 17 16:01:02 2014
From: m.navascues at gmail.com (Miguel Navascués)
Date: Tue, 17 Jun 2014 16:01:02 +0200
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
Message-ID: <53A04A1E.5040409@gmail.com>

In one of your messages (below) there seem to be spaces in addition to "/" separating the alleles. It may be worth checking whether that causes the problem.

Best

Miguel

On 17/06/14 14:47, Andrea Garavito wrote:
> >test<-myData[1:10,1:10]
> >test
>   loc_29  loc_7   loc_43  etc...
> 1 "G / A" "C / T" "T / T"
> 2 "G / G" "C / T" "T/ T"
> etc...

--
Miguel NAVASCUÉS, PhD
Chargé de Recherche (CR2) INRA
UMR CBGP Centre de Biologie pour la Gestion des Populations
Institut National de la Recherche Agronomique
Campus International de Baillarguet, CS 30016
34988 Montferrier-sur-Lez (France)
phone: +33(0)4.99.62.33.70
fax: +33(0)4.99.62.33.45
e-mail: miguel.navascues AT supagro.inra.fr
e-mail: m.navascues AT gmail.com
Skype: m.navascues
web: http://www1.montpellier.inra.fr/cbgp/
web: http://sites.google.com/site/navascuesresearch/

From t.jombart at imperial.ac.uk Tue Jun 17 16:08:16 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 14:08:16 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <53A04A1E.5040409@gmail.com>
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk> <53A04A1E.5040409@gmail.com>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>

Ahah, well spotted! I totally missed it.

Yep, open your file, remove all white spaces, and it should fly.
Cheers
Thibaut

From caitiecollins at gmail.com Tue Jun 17 16:28:17 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Tue, 17 Jun 2014 15:28:17 +0100
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk> <53A04A1E.5040409@gmail.com> <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
Message-ID:

For this purpose, it would also be adequate to just change sep from "/" to " / ", but I suppose there may be other reasons to want to remove the white spaces.

Cheers,
Caitlin.
URL: From t.jombart at imperial.ac.uk Tue Jun 17 16:36:33 2014 From: t.jombart at imperial.ac.uk (Jombart, Thibaut) Date: Tue, 17 Jun 2014 14:36:33 +0000 Subject: [adegenet-forum] SNP alleles In-Reply-To: References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk> <53A04A1E.5040409@gmail.com> <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>, Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A69E@icexch-m1.ic.ac.uk> Yup. ________________________________________ From: Caitlin Collins [caitiecollins at gmail.com] Sent: 17 June 2014 15:28 To: Jombart, Thibaut Cc: Miguel Navascu?s; adegenet-forum at lists.r-forge.r-project.org Subject: Re: [adegenet-forum] SNP alleles For this purpose, it would also be adequate to just change sep from "/" to " / ", but I suppose there may be other reasons to want to remove the white spaces. Cheers, Caitlin. On Tue, Jun 17, 2014 at 3:08 PM, Jombart, Thibaut > wrote: Ahah, well spotted! I totally missed it. Yep, open your file, remove all white spaces, and it should fly. Cheers Thibaut ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Miguel Navascu?s [m.navascues at gmail.com] Sent: 17 June 2014 15:01 To: adegenet-forum at lists.r-forge.r-project.org Subject: Re: [adegenet-forum] SNP alleles In one of your messages (below) there seem to be spaces in addition to "/" separating the alleles. May be worth to check if that can cause the problem. Best Miguel On 17/06/14 14:47, Andrea Garavito wrote: > >test<-myData[1:10,1:10] > >test > loc_29 loc_7 loc_43 etc... > 1 "G / A" "C / T" "T / T" > 2 "G / G" "C / T" "T/ T" > etc... -- Miguel NAVASCU?S, PhD Charg? 
de Recherche (CR2) INRA UMR CBGP Centre de Biologie pour la Gestion des Populations Institut National de la Recherche Agronomique Campus International de Baillarguet, CS 30016 34988 Montferrier-sur-Lez (France) phone: +33(0)4.99.62.33.70 fax: +33(0)4.99.62.33.45 e-mail: miguel.navascues AT supagro.inra.fr e-mail: m.navascues AT gmail.com Skype: m.navascues web: http://www1.montpellier.inra.fr/cbgp/ web: http://sites.google.com/site/navascuesresearch/ _______________________________________________ adegenet-forum mailing list adegenet-forum at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum _______________________________________________ adegenet-forum mailing list adegenet-forum at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum From neagef at gmail.com Tue Jun 17 17:24:57 2014 From: neagef at gmail.com (Andrea Garavito) Date: Tue, 17 Jun 2014 17:24:57 +0200 Subject: [adegenet-forum] SNP alleles In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk> References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk> <53A04A1E.5040409@gmail.com> <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk> Message-ID: Thanks Miguel, You found the problem! I searched and replaced the space characters, redo the analysis et voila! I have my two columns per marker. With all the reformatting needed to obtain the A/T format from the original excell file, no wonder how those spaces got into the data! This allows me to rephrase my other question, that got lost in the discussion: I tried the scale=F and scale=T in the scaleGen function but I get two radically different results. With scale=T my individuals get separated into only two groups; while with scale=F, individuals get more "harmoniously" distributed over the 2 first PC axis. Which one would be more appropriate according to my data type? 
Because both seem to agree with the origin of the individuals, I'm not sure which one better represents the "real picture".

Thank you all for the help
Andrea

From t.jombart at imperial.ac.uk Tue Jun 17 17:38:41 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 15:38:41 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk> <53A04A1E.5040409@gmail.com> <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A754@icexch-m1.ic.ac.uk>

scale=TRUE will give a lot more weight to rare alleles, so it depends on how much you want to trust these. I usually go for no scaling (scale=FALSE), so that alleles with low variability are not given an exaggerated weight.

Cheers
Thibaut

From manuelacorreia2 at gmail.com Wed Jun 18 19:51:53 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Wed, 18 Jun 2014 18:51:53 +0100
Subject: [adegenet-forum] set.seeds in DAPC
Message-ID:

Hi there,

I'd like to understand the role of set.seed() and the criteria behind the seeds chosen in the two examples presented in the latest version of the DAPC tutorial.

I am used to seeing set.seed(N) in the context of significance tests and bootstrap or Monte Carlo procedures, but not within multivariate techniques or even with datasets.

On page 20 of the DAPC tutorial there is a set.seed(4) before getting the loadingplot. There is another example on page 39, before splitting the microbov dataset into two parts. And by the way, what is the 20 in sample(e,20....)? 20 individuals picked at random from all microbov populations?

So, I have two questions. The first is: why use them here, in these particular examples? The second is: what criteria were behind the choice of the number 4 in the former case, and the number 2 in the latter?

How do I know which seed will be the best one for my dataset if I need to produce the loadingplot?

Thanks in advance,
M.

From caitiecollins at gmail.com Wed Jun 18 20:48:33 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Wed, 18 Jun 2014 19:48:33 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:

Hi,

Glad to see you've been reading the tutorial in such detail!

These are great questions, and the way you have asked them actually hints at the answer: set.seed() is not inherently linked to multivariate techniques or datasets, but rather to random number generation (more specifically, to getting *reproducible* results from "random" processes). This is probably why you have seen set.seed() come up in the context of bootstrap Monte Carlo procedures!
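
The link between set.seed() and bootstrap procedures mentioned above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the vector `x` holds made-up measurements, and `boot_mean_ci()` is a hypothetical helper written for this note, not a function from adegenet or the tutorial.

```r
# Made-up measurements, used only to have something to resample.
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7)

# Hypothetical helper: percentile bootstrap interval for the mean.
boot_mean_ci <- function(x, n_boot = 1000) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  quantile(boot_means, c(0.025, 0.975))
}

# Same seed before each call: same resamples, hence the same interval.
set.seed(1)
ci_first <- boot_mean_ci(x)
set.seed(1)
ci_again <- boot_mean_ci(x)

identical(ci_first, ci_again)  # TRUE
```

Without the two set.seed(1) calls, the two intervals would typically differ slightly from run to run, which is exactly the behaviour the seed is there to remove.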

Essentially, when R is asked to generate a "random" number, it actually generates a pseudo-random number by taking some input and generating an output that seems random. Without being given an input, R builds its starting point from your computer's clock (the current time, combined with the ID of the R process), and generates a seemingly random number from there. You would not get the same random number at a different time, so we find this adequate to call the process "random" number generation, BUT if in fact you generated two "random" numbers from exactly the same starting point, you would actually get the exact same "random" number. (Note: I have glossed over a lot of really interesting things about this process, so if you want to know more about random number generation, please read on here: http://cran.r-project.org/web/packages/randtoolbox/vignettes/fullpres.pdf ).

This potential problem with random number generation can occasionally be quite useful in cases where we want to run something that requires random number generation but where we would also like to get the same result each time. set.seed() is the way we control this. With set.seed(), the "seed" is used as the input to our random number generation (instead of the clock), which allows you to get *reproducible* "random" numbers.

Try this example:

    rnorm(3)
    rnorm(3)

    set.seed(1)
    rnorm(3)

    set.seed(1) # note: to get the same numbers again, call set.seed() again right before each random draw you want to reproduce.
    rnorm(3)

Neat! Having established this, we can now answer your questions about why we use set.seed() where we do in the DAPC tutorial.

On page 20, we use it before creating a loading plot. This is just because we use the argument lab.jitter to move the labels around a bit. Jitter works by adding random noise, so we can control it with set.seed(). We have chosen to use set.seed(4) simply because it "randomly" put the labels in a nice enough place.
Arguably, set.seed(6) would have done a better job (next time!), but it's a good thing we didn't use set.seed(2).

If you would like, you can see for yourself:

    library(adegenet)
    data(H3N2)
    pop(H3N2) <- factor(H3N2$other$epid)
    dapc.flu <- dapc(H3N2, n.pca=30, n.da=10)

    set.seed(4)
    contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, lab.jitter=1)

    set.seed(6)
    contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, lab.jitter=1)

    set.seed(2)
    contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, lab.jitter=1)

Finally, we use set.seed(2) on page 39 to get a "random" sample of 20 individuals (you were right about that) to serve as our "supplementary individuals" for that exercise. Here, the use of set.seed(2) just ensures that no matter how many times we edit and re-build that tutorial, we will always get the same set of 20 individuals, which is useful for consistency's sake.

All in all, I apologise for the long response that was possibly less related to DAPC than you might have expected, but I hope that helped answer your question!

Best,
Caitlin.

From manuelacorreia2 at gmail.com Thu Jun 19 01:17:42 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Thu, 19 Jun 2014 00:17:42 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:

Dear Caitlin,

Thank you for such a clear response and, at the same time, for being so knowledgeable. It was quite interesting to get a glimpse of how the adegenet team decided to use set.seed() to obtain consistent results, as well as (that was just brilliant!) to control lab.jitter.

As you point out with the 3 examples, it's better to try several seeds in order to find the best label positions for our dataset. And when we reach the final stage of cross-validation, we ought to choose one seed to ensure that the training set of supplementary individuals (no matter the number (10%, 20%)) will always be made up of the same set of individuals.

Thank you. I've learnt so much from this long response.

Cheers,
M.

From caitiecollins at gmail.com Thu Jun 19 02:32:04 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Thu, 19 Jun 2014 01:32:04 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:

Hi Manuela,

Glad to hear I could help a bit!

I should stress that our use of set.seed() in the tutorial has been mainly for the purpose of making the tutorial, as a document, consistent and identically reproducible. In an experimental context, however, e.g. in the case of selecting supplementary individuals, if you are truly attempting to test a concept (for example, in validating a model), you would actually *want* random behaviour (i.e. an effectively random sample). This is particularly the case if you are performing repeated sampling, as one often does with supplementary individuals. So be careful to only set the seed when you do NOT want a random sample; otherwise, just leave out set.seed() from the process and let the computer pick a sample at random.

Best,
Caitlin.

On Thu, Jun 19, 2014 at 12:17 AM, Manuela wrote:
> Dear Caitlin,
>
> Thank you for such a clear response and at same time for being so
> knowledgeable.

From mcialdini at gmail.com Thu Jun 19 14:27:39 2014
From: mcialdini at gmail.com (Manuela)
Date: Thu, 19 Jun 2014 13:27:39 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 16
In-Reply-To:
References:
Message-ID:

Hi Caitlin.

Good point! In fact, I hadn't noticed the subtle nuance that distinguishes the rationale behind cross-validation, which uses a stratified sample of 10% of individuals (the validation set) in the well-exemplified nancycats dataset, through the cyclic process of PC retention, sampling and DAPC for each number of PCs retained, BUT not the same set of individuals in each round, from the second approach, based on supplementary individuals used for predicting results. Also, the way they were selected was different. They result from a split of the original sample into a stratified "testing sample" of X individuals, BUT using a non-random sample as defined by the set.seed() function.

Later, I'll present a new set of questions raised by clines, to be thoroughly evaluated when modelling with DAPC.

Cheers,
M.
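
The contrast drawn in this exchange, a fixed seeded split versus fresh stratified draws in every round, can be sketched in base R. This is a sketch under stated assumptions, not the tutorial's code: the population labels are invented, and `draw_validation()` is a hypothetical helper written for this note. One nuance worth keeping in mind is that a seeded sample is still drawn pseudo-randomly; the seed only makes the same draw repeatable.

```r
# Invented population labels: 3 populations of 20 individuals each.
pop <- factor(rep(c("A", "B", "C"), each = 20))
ind <- seq_along(pop)

# (1) Seeded split: the same 20 supplementary individuals on every run.
set.seed(2)
supp_first <- sample(ind, 20)
set.seed(2)
supp_again <- sample(ind, 20)
stopifnot(identical(supp_first, supp_again))

# (2) Hypothetical helper: stratified validation draw, taking roughly
#     `prop` of the individuals from each population.
draw_validation <- function(pop, prop = 0.1) {
  idx_by_pop <- split(seq_along(pop), pop)
  unlist(lapply(idx_by_pop, function(idx) {
    n_take <- max(1, round(prop * length(idx)))
    idx[sample.int(length(idx), n_take)]
  }), use.names = FALSE)
}

# No set.seed() here: each round gets a fresh stratified draw.
rounds <- replicate(5, draw_validation(pop), simplify = FALSE)
```

Each element of `rounds` holds two individuals per population, and the sets will generally differ between rounds, which is the behaviour one wants when repeated sampling is used to validate a model.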
2014-06-19 11:00 GMT+01:00 < adegenet-forum-request at lists.r-forge.r-project.org>: > Send adegenet-forum mailing list submissions to > adegenet-forum at lists.r-forge.r-project.org > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum > > or, via email, send a message with subject or body 'help' to > adegenet-forum-request at lists.r-forge.r-project.org > > You can reach the person managing the list at > adegenet-forum-owner at lists.r-forge.r-project.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of adegenet-forum digest..." > > > Today's Topics: > > 1. Re: set.seeds in DAPC (Caitlin Collins) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Thu, 19 Jun 2014 01:32:04 +0100 > From: Caitlin Collins > To: Manuela > Cc: "adegenet-forum at lists.r-forge.r-project.org" > > Subject: Re: [adegenet-forum] set.seeds in DAPC > Message-ID: > < > CAMon0MDGDDZmFji6_T2McFtsqTzNmr7ENTE0Fj1rXiFYP_P_9g at mail.gmail.com> > Content-Type: text/plain; charset="utf-8" > > Hi Manuela, > > Glad to hear I could help a bit! > > I should stress that our use of set.seed() in the tutorial has been mainly > for the purpose of making the tutorial, as a document, consistent and > identically reproducible. In an experimental context, however, eg. in the > case of selecting supplementary individuals, if you are truly attempting to > test a concept (for example, in validating a model), you would actually > *want* random behaviour (ie. an effectively random sample). This is > particularly the case if you are performing repeated sampling, as one often > does with supplementary individuals. So be careful to only set the seed > when you do NOT want a random sample; otherwise, just leave out set.seed() > from the process and let the computer pick a sample at random. > > Best, > Caitlin. 
> > > On Thu, Jun 19, 2014 at 12:17 AM, Manuela > wrote: > > > Dear Caitlin, > > > > > > Thank you for such a clear response and at same time for being so > > knowledgeable. It was quiet interesting to have a glimpse on the way how > > the Adegenet team decided to use the set.seeds to obtain consistent > > results, as well as (that was just brilliant!) to control the lab. > jitter. > > > > As you point up with the 3 examples its better to try several set.seeds > in > > order to find out the best labels position with our dataset. And when we > > reach the final stage of cross-validation we ought to choose one seed to > > ensure that the training set of supplementary individuals (no matter the > > number (10%, 20%)) will always made up of the same set of individuals. > > > > Thank you. I've learnt so much with this long response. > > > > Cheers, > > M. > > > > > > 2014-06-18 19:48 GMT+01:00 Caitlin Collins : > > > > Hi, > >> > >> Glad to see you've been reading the tutorial in such detail! > >> > >> These are great questions, and the way you have asked them actually > hints > >> at the answer: set.seed() is not inherently linked to multivariate > >> techniques or datasets, but rather with random number generation (more > >> specifically, with getting *reproducible* results from "random" > >> processes). This is probably why you have seen set.seed come up in the > >> context of bootstrap Monte Carlo procedures! > >> > >> Essentially, when R is asked to generate a "random" number, it actually > >> generates a pseudo-random number by taking some input and generating an > >> output that seems random. Without being given an input, R does this by > >> using your computer's clock and using the current time as its starting > >> point, from which it generates a seemingly random number. 
You would not > get > >> the same random number at a different time, so we find this adequate to > >> call the process "random" number generation, BUT if in fact you tried to > >> generate two "random" numbers at the exact same time (down to the > >> millisecond), you would actually get the exact same "random" number. > (Note: > >> I have glossed over a lot of really interesting things about this > process, > >> so if you want to know more about random number generation, please read > on > >> here: > >> > http://cran.r-project.org/web/packages/randtoolbox/vignettes/fullpres.pdf > >> ). > >> > >> This potential problem with random number generation can occasionally be > >> quite useful in cases where we want to run something that requires > random > >> number generation but where we would also like to get the same result > each > >> time. > >> set.seed() is the way we control this. With set.seed(), the "seed" is > >> used as the input to our random number generation (instead of the > clock), > >> which allows you to get *reproducible *"random" numbers. > >> > >> Try this example: > >> > >> rnorm(3) > >> rnorm(3) > >> > >> set.seed(1) > >> rnorm(3) > >> > >> set.seed(1) # note: for set.seed() to work, you need to use it before > >> every instance of random number generation. > >> rnorm(3) > >> > >> Neat! Having established this, we can now answer your questions about > why > >> we use set.seed() where we do in the DAPC tutorial. > >> > >> On page 20, we use it before creating a loading plot. This is just > >> because we use the argument lab.jitter to move the labels around a bit. > >> Jitter works by adding random noise, so we can control it with > set.seed(). > >> We have chosen to use set.seed(4) simply because it "randomly" put the > >> labels in a nice enough place. Arguably, set.seed(6) would have done a > >> better job (next time!), but it's a good thing we didn't use > set.seed(2). 
> >> > >> If you would like, you can see for yourself: > >> > >> data(H3N2) > >> pop(H3N2) <- factor(H3N2$other$epid) > >> dapc.flu <- dapc(H3N2, n.pca=30,n.da=10) > >> > >> set.seed(4) > >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, > >> lab.jitter=1) > >> > >> set.seed(6) > >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, > >> lab.jitter=1) > >> > >> set.seed(2) > >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, > >> lab.jitter=1) > >> > >> Finally, we use set.seed(2) on page 39 to get a "random" sample of 20 > >> individuals (you were right about that) to serve as our "supplementary > >> individuals" for that exercise. Here, the use of set.seed(2) just > ensures > >> that no matter how many times we edit and re-build that tutorial, we > will > >> always get the same set of 20 individuals, which is useful for > >> consistency's sake. > >> > >> All in all, I apologise for the long response that was possibly less > >> related to DAPC than you might have expected, but I hope that helped > answer > >> your question! > >> > >> Best, > >> Caitlin. > >> > >> > >> > >> > >> On Wed, Jun 18, 2014 at 6:51 PM, Manuela > >> wrote: > >> > >>> Hi there, > >>> > >>> > >>> I'd like to understand the role of set.seeds and the criteria chosen > >>> in the DAPC examples according to the two examples presented in the > >>> lattested version of DAPC tutorial. > >>> > >>> I used to see set. seeds(N?) in the context of significance as well as > >>> bootstrap Monte Carlo procedures, but not within multivariate > techniques or > >>> even with datasets. > >>> > >>> At page 20 from DAPC tutorial there is a set. seed(4) before getting > the > >>> loadingplot. Also, another example at page 39, before split the dataset > >>> microbov in two parts. And by the way, what is 20 in the > sample(e,20....)? > >>> 20 individuals picked at random from all microbov populations? > >>> > >>> > >>> So, I do have two questions. 
> >>> One is "why use them?" here, in these particular examples.
> >>> The second one is "what criteria were behind the choice of the number
> >>> 4 in the former case, and the number 2 in the latter?"
> >>>
> >>> How do I know which seed will be the best one for my dataset in case I
> >>> need to have the loadingplot?
> >>>
> >>> Thanks in advance,
> >>> M.

From manuelacorreia2 at gmail.com  Thu Jun 19 14:48:54 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Thu, 19 Jun 2014 13:48:54 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:

Hi Caitlin.

Good point! In fact, I didn't notice the small nuance that distinguishes the
rationale behind cross-validation from that of the second example.
Cross-validation uses a stratified sample of 10% of individuals (the
validation set) in the well-exemplified nancycats dataset, through the cyclic
process of PC retention, sampling and DAPC for each number of PCs retained,
BUT not the same set of individuals in each round. The second example is
based on supplementary individuals used for predicting results. Also, the way
they were selected was different.
They result from a split of the original sample into a stratified "testing
sample" of X individuals, BUT using a non-random sample as defined by the
set.seed() function.

Later, I'll present you with a new set of questions raised by clines, to be
thoroughly evaluated when modelling with DAPC.

Cheers,
M.

2014-06-19 1:32 GMT+01:00 Caitlin Collins :

> Hi Manuela,
>
> Glad to hear I could help a bit!
>
> I should stress that our use of set.seed() in the tutorial has been mainly
> for the purpose of making the tutorial, as a document, consistent and
> identically reproducible. In an experimental context, however, e.g. in the
> case of selecting supplementary individuals, if you are truly attempting to
> test a concept (for example, in validating a model), you would actually
> *want* random behaviour (i.e. an effectively random sample). This is
> particularly the case if you are performing repeated sampling, as one often
> does with supplementary individuals. So be careful to only set the seed
> when you do NOT want a random sample; otherwise, just leave out set.seed()
> from the process and let the computer pick a sample at random.
>
> Best,
> Caitlin.
>
> On Thu, Jun 19, 2014 at 12:17 AM, Manuela wrote:
>
>> Dear Caitlin,
>>
>> Thank you for such a clear response and at the same time for being so
>> knowledgeable. It was quite interesting to have a glimpse of how the
>> adegenet team decided to use set.seed() to obtain consistent results, as
>> well as (that was just brilliant!) to control the lab.jitter.
>>
>> As you point out with the 3 examples, it's better to try several seeds
>> in order to find the best label positions with our dataset. And when we
>> reach the final stage of cross-validation we ought to choose one seed to
>> ensure that the training set of supplementary individuals (no matter the
>> number (10%, 20%)) will always be made up of the same set of individuals.
>>
>> Thank you. I've learnt so much from this long response.
>>
>> Cheers,
>> M.
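
The distinction drawn in this exchange (fix the seed when a document must be rebuilt identically, leave it alone when a genuinely random resample is wanted) can be sketched in base R. This is only an illustration: the individual labels `ids`, the sample size of 20 and the five replicates are made-up values, not taken from the tutorial.

```r
# Hypothetical labels standing in for the individuals of a dataset
ids <- paste0("ind", 1:100)

# Tutorial-style use: reset the seed so the same 20 individuals are drawn
# every time the document is rebuilt
set.seed(2)
a <- sample(ids, 20)
set.seed(2)
b <- sample(ids, 20)
identical(a, b)   # TRUE: same seed, same call, same sample

# Validation-style use: no seed reset between draws, so each replicate is
# a fresh random sample of supplementary individuals
reps <- replicate(5, sample(ids, 20), simplify = FALSE)
```

Resetting the seed inside a resampling loop would make every replicate identical, which defeats the purpose of repeated sampling.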
From kelly.bennett at manchester.ac.uk  Wed Jun 18 13:45:00 2014
From: kelly.bennett at manchester.ac.uk (Kelly Bennett)
Date: Wed, 18 Jun 2014 11:45:00 +0000
Subject: [adegenet-forum] confusing p value in mantel test
Message-ID:

Hello,

I have run a mantel test with the following code

dna <- read.dna(file = "dna_manteltest.fasta", format = "fasta")
dna.dists <- dist(dna, method = "euclidean")
as.matrix(dna.dists)[1:5, 1:5]
geo <- read.csv(file = "geo_matrix.csv")
geo[1:2, 1:2]
geo.dists <- dist(geo, method = "euclidean")
as.matrix(geo.dists)[1:5, 1:5]
mantelresult<-mantel.rtest(dna.dists, geo.dists, nrepet = 9999)
cor.test(geo.dists, dna.dists)
plot(mantelresult <- mantel.rtest(dna.dists, geo.dists), main = "Mantel's test")
mantelresult

From my plot it looks like there should be isolation by distance and a
correlation test shows a significant association but my p value for the
Monte Carlo test = 1

Does anyone have any ideas about this contradiction? I have attached the
plot to this email

Thank you very much,

Kelly

From vojta at trapa.cz  Mon Jun 23 14:58:17 2014
From: vojta at trapa.cz (Vojtěch Zeisek)
Date: Mon, 23 Jun 2014 14:58:17 +0200
Subject: [adegenet-forum] confusing p value in mantel test
In-Reply-To:
References:
Message-ID: <15094289.eak1km79LX@veles.site>

Hello

On Wed, 18 June 2014 11:45:00, Kelly Bennett wrote:
> Hello,
>
> I have run a mantel test with the following code
>
> dna <- read.dna(file = "dna_manteltest.fasta", format = "fasta")
> dna.dists <- dist(dna, method = "euclidean")

Why do you use the function dist() and not dist.dna() (package ape), which
offers various mutation models?
IMHO, Euclidean distance is not the best for nucleotide data, I'd use it for
fragmentation data, but not here.

> as.matrix(dna.dists)[1:5, 1:5]
> geo <- read.csv(file = "geo_matrix.csv")
> geo[1:2, 1:2]
> geo.dists <- dist(geo, method = "euclidean")
> as.matrix(geo.dists)[1:5, 1:5]
> mantelresult<-mantel.rtest(dna.dists, geo.dists, nrepet = 9999)
> cor.test(geo.dists, dna.dists)
> plot(mantelresult <- mantel.rtest(dna.dists, geo.dists), main = "Mantel's test")
> mantelresult
>
> From my plot it looks like there should be isolation by distance and a
> correlation test shows a significant association but my p value for the
> Monte Carlo test = 1
>
> Does anyone have any ideas about this contradiction? I have attached the
> plot to this email

Well, it will produce some result every time you give it some data, even if
they are wrongly used. Right now it might be the case.

> Thank you very much,
>
> Kelly

Sincerely,
Vojtěch

--
Vojtěch Zeisek
http://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University in Prague
Benátská 2, Prague, 12801, CZ
http://botany.natur.cuni.cz/en/

Institute of Botany, Academy of Science
Zámek 1, Průhonice, 25243, CZ
http://www.ibot.cas.cz/en/

Czech Republic

From m.navascues at gmail.com  Mon Jun 23 15:06:35 2014
From: m.navascues at gmail.com (Miguel Navascues)
Date: Mon, 23 Jun 2014 15:06:35 +0200
Subject: [adegenet-forum] confusing p value in mantel test
In-Reply-To:
References:
Message-ID: <53A8265B.7050005@supagro.inra.fr>

Hello Kelly,

It looks like there is a NEGATIVE correlation between genetic and
geographical distance, no isolation by distance...
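
A negative correlation would reconcile the two results if the Monte Carlo p-value is computed one-sidedly, as the share of permuted correlations at least as large as the observed one (an assumption about the test used in this thread, worth checking in its documentation), while cor.test() is two-sided by default. The base R sketch below illustrates the effect on simulated data. `mantel_sketch` is a hypothetical helper written for this note, not the ade4 implementation, and the coordinates and "genetic" distances are invented.

```r
# One-sided ("greater") permutation Mantel test on two square distance
# matrices. Illustrative helper only, not the ade4 code.
mantel_sketch <- function(m1, m2, nrepet = 999) {
  lower <- lower.tri(m1)
  obs <- cor(m1[lower], m2[lower])
  n <- nrow(m1)
  sim <- replicate(nrepet, {
    p <- sample(n)                      # permute rows and columns together
    cor(m1[p, p][lower], m2[lower])
  })
  # share of permuted correlations at least as large as the observed one
  list(obs = obs, pvalue = (sum(sim >= obs) + 1) / (nrepet + 1))
}

set.seed(42)
xy    <- matrix(runif(40), ncol = 2)    # 20 hypothetical sampling sites
geo   <- as.matrix(dist(xy))
noise <- as.matrix(dist(matrix(rnorm(40, sd = 0.05), ncol = 2)))
gen   <- max(geo) - geo + noise         # invented distances that shrink with geography
diag(gen) <- 0

res <- mantel_sketch(gen, geo)
res$obs      # strongly negative correlation
res$pvalue   # close to 1 under the "greater" alternative

# The default two-sided correlation test flags the same association as significant
two_sided <- cor.test(gen[lower.tri(gen)], geo[lower.tri(geo)])$p.value
```

Note that cor.test() on pairwise distances ignores their non-independence, so its p-value is only shown here to illustrate the one-sided versus two-sided difference.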
Best,

Miguel

--
Miguel NAVASCUÉS, PhD
Chargé de Recherche (CR2) INRA
UMR CBGP Centre de Biologie pour la Gestion des Populations
Institut National de la Recherche Agronomique
Campus International de Baillarguet, CS 30016
34988 Montferrier-sur-Lez (France)
phone: +33(0)4.99.62.33.70
fax: +33(0)4.99.62.33.45
e-mail: miguel.navascues AT supagro.inra.fr
e-mail: m.navascues AT gmail.com
Skype: m.navascues
web: http://www1.montpellier.inra.fr/cbgp/
web: http://sites.google.com/site/navascuesresearch/

From schwarcz.kaiser at gmail.com  Wed Jun 25 21:27:41 2014
From: schwarcz.kaiser at gmail.com (Kaiser Schwarcz)
Date: Wed, 25 Jun 2014 16:27:41 -0300
Subject: [adegenet-forum] adegenet with chloroplast
Message-ID:

Is there a way to analyse chloroplast microsatellite data with adegenet?
I have a str file with my data for STRUCTURE but I don't know how to import
it to genind because my data is neither "codom" nor "PA".

Is there a way to do it?

*Kaiser Dias Schwarcz*
Me. Biologia Molecular e Evolução
Unicamp - Brasil

From sonofvin at gmail.com  Wed Jun 25 23:17:09 2014
From: sonofvin at gmail.com (Vinson Doyle)
Date: Wed, 25 Jun 2014 17:17:09 -0400
Subject: [adegenet-forum] adegenet with chloroplast
In-Reply-To:
References:
Message-ID:

Treat it as codom and import using read.table. Then convert to genind with
df2genind.

-Vinson

From bonanomi.sara85 at gmail.com  Thu Jun 26 15:52:20 2014
From: bonanomi.sara85 at gmail.com (Sara Bonanomi)
Date: Thu, 26 Jun 2014 15:52:20 +0200
Subject: [adegenet-forum] convert genind object in data frame (A:C A:T T:G...)
Message-ID:

Dear Thibaut,

I don't understand how to convert a genepop file or a genind object into a
data frame, so that I could get, for instance, a csv table with my genotypes
in bases (e.g. A:G, A:C...).

Thank you,

Best regards

Sara
From vojta at trapa.cz  Thu Jun 26 15:59:24 2014
From: vojta at trapa.cz (Vojtěch Zeisek)
Date: Thu, 26 Jun 2014 15:59:24 +0200
Subject: [adegenet-forum] convert genind object in data frame (A:C A:T T:G...)
In-Reply-To:
References:
Message-ID: <2448216.6dLDrNeryg@veles.site>

Hello

On Thu, 26 June 2014 15:52:20, Sara Bonanomi wrote:
> Dear Thibaut,
>
> I don't get how you could convert genepop file or a genind object into a
> dataframe, so I could get for instance a csv table with my genotypes in
> bases (e.g A:G , A:C...).

Maybe I am missing something, but I'd guess you converted your data from a
data frame to a genind object, right? Then I'd just pick those original data.
If this is not your case, check the functions genind2genotype and genind2df.
I don't think there is a way to reconstruct a genind from a genpop, as a
genpop (as far as I know) doesn't store all the information needed to
correctly assign alleles to the original individuals.

> Thank you,
>
> Best regards
>
> Sara

All the best,
Vojtěch

--
Vojtěch Zeisek
http://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University in Prague
Benátská 2, Prague, 12801, CZ
http://botany.natur.cuni.cz/en/

Institute of Botany, Academy of Science
Zámek 1, Průhonice, 25243, CZ
http://www.ibot.cas.cz/en/

Czech Republic

From t.jombart at imperial.ac.uk  Sun Jun 29 19:30:51 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 29 Jun 2014 17:30:51 +0000
Subject: [adegenet-forum] convert genind object in data frame (A:C A:T T:G...)
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12E30F@icexch-m1.ic.ac.uk>

Hello,

check out genind2df.
All in the basics tutorial.

Cheers
Thibaut
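
As a rough sketch of what the genind2df route suggested in this thread could look like, the example below builds a toy genind object and converts it back to a table of "A:G"-style genotypes. The three individuals, the two SNP names and the output file name are invented for illustration, and the argument names (`sep`, `ploidy`) are those of recent adegenet releases, so they may differ in older versions.

```r
library(adegenet)

# Invented genotypes: 3 diploid individuals, 2 SNPs, alleles separated by ":"
geno <- data.frame(
  snp1 = c("A:G", "A:A", "G:G"),
  snp2 = c("C:T", "C:C", "T:T"),
  row.names = c("ind1", "ind2", "ind3"),
  stringsAsFactors = FALSE
)

# Data frame -> genind (stand-in for an object imported from a genepop file)
obj <- df2genind(geno, sep = ":", ploidy = 2)

# genind -> data frame, one column per locus, alleles joined by ":"
tab <- genind2df(obj, sep = ":")

# Write the table to a csv file (temporary directory used here)
write.csv(tab, file = file.path(tempdir(), "genotypes.csv"))
```

The order of the two alleles within a heterozygous genotype is not guaranteed to match the input, so "A:G" and "G:A" should be treated as equivalent when comparing tables.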