From nathan.truelove at manchester.ac.uk  Tue Sep  3 14:44:06 2013
From: nathan.truelove at manchester.ac.uk (Nathan Truelove)
Date: Tue, 3 Sep 2013 12:44:06 +0000
Subject: [adegenet-forum] $li in sPCA analysis
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>
References: <2CB2DA8E426F3541AB1907F98ABA6570638B5234@icexch-m1.ic.ac.uk>,
	<2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>
Message-ID:

Hi Adegenet Forum,

Thanks in advance to anyone who has some advice to share with the forum on sPCA. If you're in a rush just read the parts in bold.

*I've been using sPCA to look at spatial genetics patterns among lobster populations*. I found positive local structure with the function local.rtest and no global structure using global.rtest. I've followed Thibaut's advice in his previous sPCA email to the forum and used $li to interpret local structure. I selected the local eigenvalue that had the highest levels of negative spatial autocorrelation and genetic variance for interpretation using the screeplot function. The $li values from this eigenvalue were then used to create an interpolated map.

*My question for the forum is*: *What do the positive and negative $li values associated with the local eigenvalue mean?* Do they correspond to levels of local (positive) and global (negative) scores at each location? Or are the $li values associated with the local eigenvalues simply a score for detecting local spatial genetic structure among sites, with nothing to do with global structure?

Best Wishes,

Nate

On Aug 11, 2013, at 4:35 PM, Jombart, Thibaut wrote:

Hello,

I think you attached the wrong file.

Negative values and local structure are not related. Local structure = sharp differences between neighbours. These would be overlooked by the lagged vector.

If the structure is clear enough, use $li.

As you have many overlapping points, s.value is suboptimal. You should consider using the colorplot, or interpolated maps.
See the tutorial on sPCA for some examples:
http://cran.r-project.org/web/packages/adegenet/vignettes/adegenet-spca.pdf

Best
Thibaut
________________________________________
From: dooshra at gmail.com [dooshra at gmail.com] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 12:19
To: Jombart, Thibaut
Subject: Re: [adegenet-forum] li vs. ls in sPCA analysis

Hello Thibaut,
Thank you for the response.
In the file I have attached I see that with the $li variable there are no negative values in the southern sites, while with the $ls values there are negative values in the south. It seems that I see more local spatial structure with $ls than with $li. When I tested the data with the local test I got significant results. Which variable is better to present in a paper?
Thank you
Hanan

Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel

hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458 , local U.S 17203600603
Fax: 972-3-6407857

On Sun, Aug 11, 2013 at 12:37 PM, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:

Hello,

the lagged vector is the spatially weighted average of the original vector. That is, the value of the score at a given location is the weighted average of the neighbouring values. This basically smooths the patterns so that they can be detected / visualized more easily.

Cheers
Thibaut.

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 06:21
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] li vs. ls in sPCA analysis

Hello
I have plotted the first PC of the sPCA analysis using s.value, once with z=my.pca$li[,1] and once with z=my.pca$ls[,1]. The patterns seem to differ (see attached file). I do not understand what the lagged PC is representing. What is the meaning of "denoisified" in the practical day presentation (Google does not know)? How do I interpret the difference? Please explain.
Thank you

Mr. Hanan Sela Ph.D.

On Thu, Aug 1, 2013 at 7:15 PM, <adegenet-forum-request at lists.r-forge.r-project.org> wrote:

Send adegenet-forum mailing list submissions to
	adegenet-forum at lists.r-forge.r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
	https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
or, via email, send a message with subject or body 'help' to
	adegenet-forum-request at lists.r-forge.r-project.org

You can reach the person managing the list at
	adegenet-forum-owner at lists.r-forge.r-project.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of adegenet-forum digest..."

Today's Topics:

   1. Fwd: Question about pre-processing of SNP data for machine learning (Daniel Murrell)
   2. Re: Fwd: Question about pre-processing of SNP data for machine learning (Jombart, Thibaut)
   3.
Re: Fwd: Question about pre-processing of SNP data for machine learning (Daniel Murrell)

----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Aug 2013 15:26:00 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning
Message-ID:
Content-Type: text/plain; charset="windows-1252"

Hi All

This is my first time using adegenet. I'm trying to perform some pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a machine learning task. My data was stored in a format which had to be converted to a genlight object. The data was split so that the information for the SNPs in each chromosome was in a separate file. I've read each file in, converted that to a genlight object and then concatenated the genlight objects using cbind. All of that seems to work ok (except the position and chromosome data went back to NULL during the concatenation and I had to reset it on the combined genlight object).

So, now I want to do my own processing on each SNP and when I try to access the information for this SNP over the 800 individuals, it takes ages to extract. Is this because the encoding is done row wise, and so the whole object needs to be decoded for me to get out the information I require? Is there a way to transpose this genlight object so that I can access the data for a single SNP across all individuals quickly?

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell <dsm38 at cam.ac.uk>

Dear Daniel,

yes, adegenet is designed for that kind of task. Please look at the tutorial on adegenet-basics where you'll find examples of dimension reduction for SNP data, to be found on:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see contacts on the website).

Best
Thibaut
________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell [dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have is that there is too much of it and I need a way to reduce the number or the dimensionality of the data points so that I can use them as input to machine learning algorithms (genome wide, 1.3 million SNPs, 800 individuals). I've done some searching and found this paper:
http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).

I also found your adegenet package and wondered if it's designed for doing something like this? I'm not from this field and I'm having some trouble working this out. Can you point me to anything that might help?

I'm not sure whether I should be keeping a subset of SNPs and how to find that subset from the 1.3 million, or whether I should be reducing the dimensionality.

Thank you
Daniel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20130801/a331daec/attachment-0001.html>

------------------------------

Message: 2
Date: Thu, 1 Aug 2013 15:22:27 +0000
From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
To: Daniel Murrell <dsm38 at cam.ac.uk>, "adegenet-forum at lists.r-forge.r-project.org" <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
Content-Type: text/plain; charset="Windows-1252"

Dear Daniel,

the loss of attributes after cbind indeed is a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise so the conversion is done many times over. There's no option for transposing the data, but one solution would be converting your data to integers by blocks so that conversion takes place less often, while still keeping RAM requirements reasonable.

All the best
Thibaut

------------------------------

Message: 3
Date: Thu, 1 Aug 2013 17:14:37 +0100
From: Daniel Murrell <dsm38 at cam.ac.uk>
To: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
Cc: "adegenet-forum at lists.r-forge.r-project.org" <adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning
Message-ID:
Content-Type: text/plain; charset="windows-1252"

Dear Thibaut

Ok, I could try that. I could also try to use the genlight object in a transposed manner, just for the purpose of holding the data, so that I can access individual SNPs easily. I mean nothing else would work except the containment.

Thanks for the help
Regards
Daniel

On Thu, Aug 1, 2013 at 4:22 PM, Jombart, Thibaut <t.jombart at imperial.ac.uk> wrote:

Dear Daniel,

the loss of attributes after cbind indeed is a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise so the conversion is done many times over. There's no option for transposing the data, but one solution would be converting your data to integers by blocks so that conversion takes place less often, while still keeping RAM requirements reasonable.
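
The block-wise conversion suggested above could look roughly like the following. This is only a sketch in base R, not adegenet's own method: `process_by_block`, `decode_block` and `per_snp` are hypothetical names introduced here, and a plain integer matrix stands in for the real data. For an actual genlight object `x`, the decoder is assumed to be something like `function(cols) as.matrix(x[, cols])`, which is not exercised here.

```r
# Process SNPs block by block, so the packed object is decoded once per block
# rather than once per SNP. `decode_block` is a hypothetical helper: given a
# vector of column indices it must return an individuals x SNPs integer matrix.
process_by_block <- function(n_snp, decode_block, per_snp, block_size = 10000L) {
  starts <- seq(1L, n_snp, by = block_size)
  out <- vector("list", length(starts))
  for (b in seq_along(starts)) {
    cols <- starts[b]:min(starts[b] + block_size - 1L, n_snp)
    block <- decode_block(cols)            # one decoding step per block
    out[[b]] <- apply(block, 2, per_snp)   # cheap per-SNP (column) access
  }
  unlist(out)
}

# Toy stand-in for the real data: 8 individuals x 25 SNPs, genotypes 0/1/2.
set.seed(1)
geno <- matrix(sample(0:2, 8 * 25, replace = TRUE), nrow = 8)

# Example per-SNP statistic: allele frequency of the second allele.
allele_freq <- process_by_block(
  n_snp        = ncol(geno),
  decode_block = function(cols) geno[, cols, drop = FALSE],
  per_snp      = function(g) mean(g, na.rm = TRUE) / 2,
  block_size   = 10L
)
```

The block size trades memory for speed: larger blocks mean fewer decoding steps but a bigger integer matrix held in RAM at any one time.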
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

End of adegenet-forum Digest, Vol 60, Issue 2
*********************************************

From Jutta.Geismar at senckenberg.de  Wed Sep  4 15:03:35 2013
From: Jutta.Geismar at senckenberg.de (Jutta Geismar)
Date: Wed, 04 Sep 2013 15:03:35 +0200
Subject: [adegenet-forum] Question about genetic structure in admixed populations
Message-ID: <52274BC7020000CB0000539A@snggwia.senckenberg.de>

Dear Mr Jombart and DAPC users,

I used DAPC to analyze genetic structure in a small region with 20 microsatellite markers. I analyzed 330 individuals (14 sampling sites) and found little genetic differentiation (FST, Jost's D), but a significant isolation by distance pattern. A cluster analysis in STRUCTURE resulted in four clusters (STRUCTURE Harvester), but all individuals had more or less equal posterior probability in all of the four inferred clusters. Therefore I assume a panmictic population structure.

Since STRUCTURE is known to have some problems analyzing datasets under IBD, I analyzed the data with DAPC. DAPC resulted in 3 or 4 clusters (I tested up to K=7 to be sure), but in both cases these were randomly distributed among all individuals without a geographic context. Only 94 individuals were not assigned to one cluster with more than 90% and therefore would be counted as "admixed" (example in DAPC tutorial). For me the results of STRUCTURE and DAPC are in conflict with each other, but I don't know what a panmictic population would look like in DAPC.
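
The 90% criterion mentioned above can be sketched in a few lines of base R. The membership probabilities below are made up purely for illustration; with a real analysis they would presumably come from the `$posterior` slot of a `dapc` object. The names `threshold`, `best`, `assigned_cluster` and `admixed` are introduced here and are not part of adegenet.

```r
# Made-up membership probabilities: 5 individuals (rows) x 3 clusters (columns).
posterior <- matrix(c(0.95, 0.03, 0.02,
                      0.40, 0.35, 0.25,
                      0.10, 0.88, 0.02,
                      0.01, 0.01, 0.98,
                      0.34, 0.33, 0.33),
                    ncol = 3, byrow = TRUE)

threshold <- 0.9                                     # assignment cut-off
best <- apply(posterior, 1, max)                     # highest probability per individual
assigned_cluster <- apply(posterior, 1, which.max)   # cluster achieving that maximum
admixed <- best < threshold                          # no cluster reaches the cut-off

sum(admixed)                        # number of individuals counted as admixed
table(assigned_cluster[!admixed])   # cluster sizes among clearly assigned individuals
```

Note that clear assignment to a cluster says nothing by itself about geography; whether the clusters map onto sampling sites has to be checked separately.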
Distances between sites are small and it is very likely that gene flow occurs among my sampling points, which might cause problems in genetic cluster analyses.

I don't know if I made any mistake in my thinking; that's why I want to explain my procedure briefly:

1. I used dapc and chose 1/3 of the sample size as the number of PCs (as suggested) and counted DAs in the plot (100% of the variability was included, 110 PCs, 13 DAs).
2. To reduce variability I used optim.a.score (smart = FALSE). The best a-score was around 0.2 (61 PCs).
3. After that I wanted to estimate the number of clusters with find.clusters; I used the number of PCs suggested by the a-score and repeated the dapc (conserved variance was still 98%, 61 PCs, 2 DAs).

In the BIC values I chose the k after which the decrease was smaller than the previous one, not the k with the lowest BIC.

If I have some mistakes in my procedure I would appreciate some advice. But even if the procedure is okay, I cannot explain the contradiction between these two analyses.

Thanks a lot in advance for some help.

Jutta Geismar
PhD student
Germany

From mirainoshojo at gmail.com  Wed Sep  4 16:45:40 2013
From: mirainoshojo at gmail.com (Valeria Montano)
Date: Wed, 4 Sep 2013 16:45:40 +0200
Subject: [adegenet-forum] $li in sPCA analysis
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA6570638B5234@icexch-m1.ic.ac.uk>
	<2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>
Message-ID:

Hi Nate,

the $li scores are the scores of each locality on a given component, the same as you have in classic PCA; that is, they are simply the coordinates of the entities on the component you are interested in. As the component is centred on zero, the values are both positive and negative and represent the position of a specific location along that component. That is valid for both positive and negative eigenvalues, respectively associated with global and local spatial structure.
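
To make the distinction concrete, here is a toy illustration in base R (not adegenet code). The six-site transect, the weight matrix `W` and the helper `morans_I` are all assumptions made up for this sketch. It shows that a set of scores contains both positive and negative values whether the pattern is global or local; what separates the two cases is the sign of the spatial autocorrelation (Moran's I), which is what gives an sPCA eigenvalue its sign.

```r
# Toy transect of 6 sites; each site's neighbours are the adjacent sites.
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  nb <- c(i - 1, i + 1)
  nb <- nb[nb >= 1 & nb <= n]
  W[i, nb] <- 1 / length(nb)   # row-standardised weights
}

# Moran's I for row-standardised weights: the usual n / sum(W) factor is 1.
morans_I <- function(scores, W) {
  z <- scores - mean(scores)
  sum(z * (W %*% z)) / sum(z^2)
}

gradient    <- c(-2.5, -1.5, -0.5, 0.5, 1.5, 2.5)  # smooth cline: global pattern
alternating <- c(1, -1, 1, -1, 1, -1)              # neighbours differ: local pattern

I_gradient    <- morans_I(gradient, W)     # positive autocorrelation
I_alternating <- morans_I(alternating, W)  # negative autocorrelation

range(gradient)      # scores span negative and positive values
range(alternating)   # so do these
```

In both vectors the sign of an individual score only says on which side of zero a site sits along that component; it does not say whether that site is "local" or "global".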
A significant structure, whether global (positive) or local (negative), is currently evaluated by the global and local rtests on the basis of the overall genetic correlation with the spatial distribution of the localities. Each positive and negative component (with its own amount of genetic variance and Moran's index explained) is thus a partial representation of the global and local spatial structure. So in your case, since you have a significant local structure, you may plot the first, second, third etc. negative components one by one and see what the pattern looks like according to each component. Sometimes there's interesting info in the smaller components.

Ehm, as usual it's a bit of a messy explanation (I am not good at explaining), but I hope this helps. Otherwise I hope you will get better replies.

Ciao

Valeria

On 3 September 2013 14:44, Nathan Truelove wrote:

> Hi Adegenet Forum,
>
> Thanks in advance to anyone who has some advice to share with the forum
> on sPCA. If you're in a rush just read the parts in bold.
>
> *I've been using sPCA to look at spatial genetics patterns among lobster
> populations*. I found positive local structure with the function
> local.rtest and no global structure using global.rtest. I've followed
> Thibaut's advice in his previous sPCA email to the forum and used $li to
> interpret local structure. I selected the local eigenvalue that had the
> highest levels of negative spatial autocorrelation and genetic variance for
> interpretation using the screeplot function. The $li values from this
> eigenvalue were then used to create an interpolated map.
>
> *My question for the forum is*: *What do the positive and negative $li
> values associated with the local eigenvalue mean?* Do they correspond to
> levels of local (positive) and global (negative) scores at each location?
> Or are the $li values associated with the local eigenvalues simply a score
> for detecting local spatial genetic structure among sites, with nothing
> to do with global structure?
> > Best Wishes, > > Nate > > On Aug 11, 2013, at 4:35 PM, Jombart, Thibaut wrote: > > > Hello, > > I think you attached the wrong file. > > Negative values and local structure are not related. Local structure = > sharp differences between neighours. These would be overlooked by the > lagged vector. > > If the structure is clear enough, use $li. > > As you have many overlapping points, s.value is suboptimal. You should > consider using the colorplot, or interpolated maps. See the tutorial on > sPCA for some example: > http://cran.r-project.org/web/packages/adegenet/vignettes/adegenet-spca.pdf > > Best > Thibaut > ________________________________________ > From: dooshra at gmail.com [dooshra at gmail.com] on behalf of Hanan Sela [ > hans at tauex.tau.ac.il] > Sent: 11 August 2013 12:19 > To: Jombart, Thibaut > Subject: Re: [adegenet-forum] li vs. ls in sPCA analysis > > Hello Thibaut, > Thank you for the response. > In the file I have attached I see that with the $li variable there are no > negative values in the southern sites while with the $ls values there are > negative values in the south. It seems that I see more local spatial > structure with $ls than with $li . When I tested the data with local test I > got significant results. Which variable is better to present in a paper. > Thank you > Hanan > Mr. Hanan Sela Ph.D. > Curator of the Lieberman Cereal Germplasm Bank > The Institute for Cereal Crops Improvement > Tel-Aviv University > P.O. Box 39040 > Tel Aviv 69978 > Israel > > hans at tauex.tau.ac.il > Phone: 972-3-6405773 > Cell: 972-50-5727458 , local U.S 17203600603 > Fax: 972-3-6407857 > > > On Sun, Aug 11, 2013 at 12:37 PM, Jombart, Thibaut < > t.jombart at imperial.ac.uk> wrote: > Hello, > > the lagged vector is the spatially weighted average of the original > vector. That is, the value of the score at a given location is the weighted > average of the neighbouring values. 
This basically smooths the patterns so > that they can be detected / visualized more easily. > > Cheers > Thibaut. > > -- > ###################################### > Dr Thibaut JOMBART > MRC Centre for Outbreak Analysis and Modelling > Department of Infectious Disease Epidemiology > Imperial College - School of Public Health > St Mary?s Campus > Norfolk Place > London W2 1PG > United Kingdom > Tel. : 0044 (0)20 7594 3658 > t.jombart at imperial.ac.uk > http://sites.google.com/site/thibautjombart/ > http://adegenet.r-forge.r-project.org/ > ________________________________________ > From: adegenet-forum-bounces at lists.r-forge.r-project.org adegenet-forum-bounces at lists.r-forge.r-project.org> [ > adegenet-forum-bounces at lists.r-forge.r-project.org adegenet-forum-bounces at lists.r-forge.r-project.org>] on behalf of Hanan > Sela [hans at tauex.tau.ac.il] > Sent: 11 August 2013 06:21 > To: adegenet-forum at lists.r-forge.r-project.org adegenet-forum at lists.r-forge.r-project.org> > Subject: [adegenet-forum] li vs. ls in sPCA analysis > > Hello > I have plotted the first PC of sPCA analysis using s.value once with > z=my.pca$li[,1] > and once with z=my.pca$ls[,1]. The patterns seems to differ (see attached > file). I do not understand what the lagged PC is representing. What is the > meaning of "denoisified" in the practical day presentation (Google does > not know). How do i interpent the difference. Please explain. > Thank you > > Mr. Hanan Sela Ph.D. > Curator of the Lieberman Cereal Germplasm Bank > The Institute for Cereal Crops Improvement > Tel-Aviv University > P.O. 
Box 39040 > Tel Aviv 69978 > Israel > > hans at tauex.tau.ac.il hans at tauex.tau.ac.il> > Phone: 972-3-6405773 > Cell: 972-50-5727458 , local U.S 17203600603 > Fax: 972-3-6407857 > > > On Thu, Aug 1, 2013 at 7:15 PM, < > adegenet-forum-request at lists.r-forge.r-project.org adegenet-forum-request at lists.r-forge.r-project.org> adegenet-forum-request at lists.r-forge.r-project.org adegenet-forum-request at lists.r-forge.r-project.org>>> wrote: > Send adegenet-forum mailing list submissions to > adegenet-forum at lists.r-forge.r-project.org adegenet-forum at lists.r-forge.r-project.org> adegenet-forum at lists.r-forge.r-project.org adegenet-forum at lists.r-forge.r-project.org>> > > To subscribe or unsubscribe via the World Wide Web, visit > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum > > or, via email, send a message with subject or body 'help' to > adegenet-forum-request at lists.r-forge.r-project.org adegenet-forum-request at lists.r-forge.r-project.org> adegenet-forum-request at lists.r-forge.r-project.org adegenet-forum-request at lists.r-forge.r-project.org>> > > You can reach the person managing the list at > adegenet-forum-owner at lists.r-forge.r-project.org adegenet-forum-owner at lists.r-forge.r-project.org> adegenet-forum-owner at lists.r-forge.r-project.org adegenet-forum-owner at lists.r-forge.r-project.org>> > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of adegenet-forum digest..." > > > Today's Topics: > > 1. Fwd: Question about pre-processing of SNP data for machine > learning (Daniel Murrell) > 2. Re: Fwd: Question about pre-processing of SNP data for > machine learning (Jombart, Thibaut) > 3. 
Re: Fwd: Question about pre-processing of SNP data for
> machine learning (Daniel Murrell)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 1 Aug 2013 15:26:00 +0100
> From: Daniel Murrell <dsm38 at cam.ac.uk>
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP
> data for machine learning
> Message-ID:
> Content-Type: text/plain; charset="windows-1252"
>
> Hi All
>
> This is my first time using adegenet. I'm trying to perform some
> pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a
> machine learning task. My data was stored in a format which had to be
> converted to a genlight object. The data was split so that the information
> for the SNPs in each chromosome was in a separate file. I've read each file
> in, converted that to a genlight object and then concatenated the genlight
> objects using cbind. All of that seems to work ok (except the position and
> chromosome data went back to NULL during the concatenation and I had to
> reset it on the combined genlight object).
>
> So, now I want to do my own processing on each SNP and when I try to access
> the information for this SNP over the 800 individuals, it takes ages to
> extract. Is this because the encoding is done row-wise, and so the whole
> object needs to be decoded for me to get out the information I require? Is
> there a way to transpose this genlight object so that I can access the data
> for a single SNP across all individuals quickly?
>
> Thank you
> Daniel
>
> ---------- Forwarded message ----------
> From: Jombart, Thibaut <t.jombart at imperial.ac.uk>
> Date: Fri, Jul 19, 2013 at 4:27 PM
> Subject: RE: Question about pre-processing of SNP data for machine learning
> To: Daniel Murrell <dsm38 at cam.ac.uk>
>
>
> Dear Daniel,
>
> yes, adegenet is designed for that kind of task. Please look at the
> tutorial on adegenet-basics where you'll find examples of dimension
> reduction for SNP data, to be found on:
> http://adegenet.r-forge.r-project.org/
>
> Don't hesitate to use the adegenet-forum for further questions (see
> contacts on the website).
> Best
> Thibaut
>
> --
> ######################################
> Dr Thibaut JOMBART
> MRC Centre for Outbreak Analysis and Modelling
> Department of Infectious Disease Epidemiology
> Imperial College - School of Public Health
> St Mary's Campus
> Norfolk Place
> London W2 1PG
> United Kingdom
> Tel. : 0044 (0)20 7594 3658
> t.jombart at imperial.ac.uk
> http://sites.google.com/site/thibautjombart/
> http://adegenet.r-forge.r-project.org/
> ________________________________________
> From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel
> Murrell [dsm38 at cam.ac.uk]
> Sent: 19 July 2013 16:23
> To: Jombart, Thibaut
> Subject: Question about pre-processing of SNP data for machine learning
>
> Dear Thibaut
>
> I'm trying to build a model that uses SNP data as input. The problem I have
> is that there is too much of it and I need a way to reduce the number or
> the dimensionality of the data points so that I can use them as input to
> machine learning algorithms (genome wide, 1.3 million SNPs, 800
> individuals). I've done some searching and found this paper:
> http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached).
>
> I also found your adegenet package and wondered if it's designed for doing
> something like this? I'm not from this field and I'm having some trouble
> working this out. Can you point me to anything that might help?
>
> I'm not sure whether I should be keeping a subset of SNPs and how to find
> that subset from the 1.3 million, or whether I should be reducing the
> dimensionality.
>
> Thank you
> Daniel
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20130801/a331daec/attachment-0001.html>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 1 Aug 2013 15:22:27 +0000
> From: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
> To: Daniel Murrell <dsm38 at cam.ac.uk>,
> "adegenet-forum at lists.r-forge.r-project.org"
> <adegenet-forum at lists.r-forge.r-project.org>
> Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
> SNP data for machine learning
> Message-ID:
> <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
> Content-Type: text/plain; charset="Windows-1252"
>
>
> Dear Daniel,
>
> the loss of attributes after cbind indeed is a glitch. Would you mind
> creating a ticket about it?
> https://sourceforge.net/p/adegenet/tickets/
>
> You're right about the issue. The encoding is indeed done row-wise so the
> conversion is done many times over. There's no option for transposing the
> data, but one solution would be converting your data to integers by blocks
> so that conversion takes place less often, while still keeping RAM
> requirements reasonable.
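
The block-wise idea in the reply above can be sketched outside adegenet. The Python below is only an illustration under stated assumptions: `PackedGenotypes`, `decode_row`, `snp_column` and `iter_snp_blocks` are hypothetical names, and the 2-bits-per-genotype row-wise packing is a stand-in for a compact per-individual encoding, not the actual genlight layout. It shows why reading one SNP costs a full decode of every individual, and how decoding a block of SNP columns in one pass spreads that cost over the whole block while holding only one block in memory.

```python
class PackedGenotypes:
    """Row-wise store (hypothetical): one bytes object per individual,
    2 bits per SNP, genotypes coded 0, 1 or 2."""

    def __init__(self, rows):
        self.n_snps = len(rows[0])
        self.rows = [self._pack(r) for r in rows]
        self.decode_calls = 0  # counts full-row decodes to make the cost visible

    @staticmethod
    def _pack(genotypes):
        out = bytearray((len(genotypes) + 3) // 4)
        for j, g in enumerate(genotypes):
            out[j // 4] |= (g & 0b11) << (2 * (j % 4))
        return bytes(out)

    def decode_row(self, i):
        """Decode every SNP of individual i back to integers."""
        self.decode_calls += 1
        packed = self.rows[i]
        return [(packed[j // 4] >> (2 * (j % 4))) & 0b11
                for j in range(self.n_snps)]

    def snp_column(self, j):
        """Naive access to one SNP: a full decode of each individual."""
        return [self.decode_row(i)[j] for i in range(len(self.rows))]


def iter_snp_blocks(store, block_size):
    """Yield (start, block), where block[k] is the column for SNP start + k."""
    for start in range(0, store.n_snps, block_size):
        stop = min(start + block_size, store.n_snps)
        # Decode each individual once per block and keep only the slice.
        decoded = [store.decode_row(i)[start:stop]
                   for i in range(len(store.rows))]
        # Transpose the slice so that each entry is one SNP across individuals.
        yield start, [list(col) for col in zip(*decoded)]
```

With n individuals, p SNPs and block size b, per-SNP access triggers n * p row decodes in this sketch, whereas block-wise access triggers n * ceil(p / b); the block size trades decode count against the memory needed for one integer block.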
>
> All the best
>
> Thibaut
>
> ------------------------------
>
> Message: 3
> Date: Thu, 1 Aug 2013 17:14:37 +0100
> From: Daniel Murrell <dsm38 at cam.ac.uk>
> To: "Jombart, Thibaut" <t.jombart at imperial.ac.uk>
> Cc: "adegenet-forum at lists.r-forge.r-project.org"
> <adegenet-forum at lists.r-forge.r-project.org>
> Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of
> SNP data for machine learning
> Message-ID:
> qWD%2BEO5vOBihA at mail.gmail.com>
> Content-Type: text/plain; charset="windows-1252"
>
> Dear Thibaut
>
> Ok, I could try that. I could also try and use the genlight object in a
> transposed manner just for the purposes of holding the data so that I can
> access individual SNPs easily. I mean nothing else would work except the
> containment.
>
> Thanks for the help
> Regards
> Daniel
>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20130801/4373022c/attachment.html>
>
> ------------------------------
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>
> End of adegenet-forum Digest, Vol 60, Issue 2
> *********************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mirainoshojo at gmail.com Thu Sep 5 10:59:43 2013
From: mirainoshojo at gmail.com (Valeria Montano)
Date: Thu, 5 Sep 2013 10:59:43 +0200
Subject: [adegenet-forum] Question about genetic structure in admixed populations
In-Reply-To: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
References: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
Message-ID:

Dear Jutta,

cluster analysis can be tricky when the samples analysed are distributed
along a gradient, and if there is no clear-cut subdivision this can lead to
contradictory results (have a look at this paper
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.192.3029&rep=rep1&type=pdf).
You may want to consider using TESS or BAPS with the admixture model
option. These two programs allow including the geographic coordinates as
prior information, and the admixture model is a way to model spatial
gradients. If you tested the IBD with a Mantel test, just be careful: a
significant Mantel test is not necessarily due to IBD, since the correlation
between geographic and genetic distances can be significant under different
spatial/migratory schemes. I think your DAPC is ok, apart from the fact that
there is no need to use find.clusters with the number of PCs indicated by
optim.a.score. That procedure is used to optimize the discriminant space
among clusters in the DAPC. To assign individuals to clusters you can simply
retain all the variance (even though in your case it is almost the same,
given that you have 98%). The only thing is that I would try a maximum
number of clusters around 20, more than your sampling locations. You can
also give sPCA a try.

Hope this helps

Ciao

Valeria

On 4 September 2013 15:03, Jutta Geismar wrote:

> Dear Mr Jombart and DAPC users,
>
> I used DAPC to analyze genetic structure in a small region with 20
> microsatellite markers. I analyzed 330 individuals (14 sampling sites) and
> found little genetic differences (FST, D Jost), but a significant isolation
> by distance pattern. A cluster analysis in STRUCTURE resulted in four
> clusters (STRUCTURE Harvester) but all individuals had more or less equal
> posterior probability in all of the four inferred clusters. Therefore I
> assume a panmictic population structure. Since STRUCTURE is known for some
> problems analyzing datasets under IBD I analyzed the data with DAPC. DAPC
> resulted in 3 or 4 clusters (and tested up until K=7 to be sure), but in
> both cases these were randomly distributed among all individuals without a
> geographic context. Only 94 individuals were not assigned to one cluster
> with more than 90% and therefore would be counted as "admixed" (example in
> DAPC tutorial). For me the results of STRUCTURE and DAPC are in conflict
> with each other, but I don't know what a panmictic population would look
> like in DAPC. Distances between sites are small and it is very likely that
> gene flow occurs among my sampling points, which might cause problems in
> genetic cluster analyses. I don't know if I made any mistake in my
> thinking, that's why I want to explain my procedure briefly:
>
> 1. I used dapc and chose 1/3 of the sample size as PC (as
> suggested) and counted DAs in the plot (100% of the variability was
> included, 110 PC, 13 DA)
>
> 2. To reduce variability I used optim.a.score (smart FALSE). The
> best a-score was around 0.2 (PC 61)
>
> 3. After that I wanted to estimate the number of clusters by
> find.clusters and used the a-score as number of PCs and repeated the dapc
> (conserved variance was still 98%, 61 PCs, 2 DA)
>
> I chose k in the BIC values after which the decrease was less compared to
> the previous, but not the lowest k.
>
> If I have some mistakes in my procedure I would appreciate some advice.
> But also if the procedure is okay I cannot explain the contrariness of
> these two analyses.
>
> Thanks a lot in advance for some help.
>
> Jutta Geismar
>
> PhD student
>
> Germany
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mirainoshojo at gmail.com Sun Sep 8 20:44:07 2013
From: mirainoshojo at gmail.com (Valeria Montano)
Date: Sun, 8 Sep 2013 20:44:07 +0200
Subject: [adegenet-forum] Question about genetic structure in admixed populations
In-Reply-To: <522C9934020000CB000053D5@snggwia.senckenberg.de>
References: <52274BC7020000CB0000539A@snggwia.senckenberg.de> <522C9934020000CB000053D5@snggwia.senckenberg.de>
Message-ID:

Hi Jutta!

well, ehm...sooo, you already know about the limitations of Structure and,
in general, bayesian approaches to cluster analysis.
For what it's worth, I can give you my opinion/suggestion in brief:

1) Structure and DAPC can give different results in several cases, depending
on the evolutionary processes ongoing among specific inds/pops. I wish they
always agreed - that would make our lives happier. In general, relying on
one method rather than another is a decision that can be made based on
knowledge of the models assumed in the different approaches and their
limitations, and certainly on the feeling you have about your case study
given all the results you already got. Personally, I never take a best k out
of find.clusters unless the BIC shows a very clear cut-off (i.e. the curve
nicely rising up after a certain K), but this is really a personal standard.

2) My understanding of the distribution of continuous populations (as this
seems to be the case for your data) is that there is actually no best
clustering one can do. When the spatial distribution of the allele
frequencies is organized in gradients or clines, clusters are not the best
tool to describe the data. That is why a method such as BAPS is useful.
GENELAND is cool too, but there is no explicit modelling of gradients, plus
the integration of the spatial info has never been totally clear to me. I
find BAPS and TESS more straightforward. In this sense, they are good
approaches to optimize a number of "clusters", although what you find cannot
really be called clusters (in the structure or dapc meaning).

It took me a while to learn how to manage the sense of panic/disorientation
provoked by the absence of best clustering in some genetic datasets, but
afterwards I even developed a preference for gradients, although I admit
clusters are very useful.

Hope this is somehow useful

Best wishes

Valeria

On 8 September 2013 15:35, Jutta Geismar wrote:

> Dear Valeria,
>
> thank you very much for your quick answer.
> I'm aware of the problems STRUCTURE has analyzing genetic data of
> continuous populations (see also
> http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2008.01606.x/pdf).
> That is one reason I don't want to use STRUCTURE as the only cluster
> analysis. I haven't attempted to use BAPS yet, but I gave GENELAND a
> trial to include spatial information. Besides testing for IBD with a Mantel
> test, I also modified the geographic distances using resistance values etc.
> that I inferred from an SDM. A spatial autocorrelation analysis didn't show
> a clear pattern of spatial relation (also in different distance classes).
> A PCA indicates a big cloud around the center point. Each of the first two
> axes explained about 19 % of the variance.
>
> Thanks for confirming that my DAPC script is correct. I set the maximum
> number of clusters to 50 to avoid missing structural shifts.
>
> Nonetheless, I cannot explain the contrary results of STRUCTURE indicating
> a panmictic population (4 parallel stripes) and DAPC assigning most
> individuals to one specific cluster.
>
> Thanks again for your comments. I will have a look at BAPS.
>
> Best wishes,
>
> Jutta
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Frederik.VandenBroeck at bio.kuleuven.be Mon Sep 9 13:33:43 2013
From: Frederik.VandenBroeck at bio.kuleuven.be (Frederik Van den Broeck)
Date: Mon, 9 Sep 2013 11:33:43 +0000
Subject: [adegenet-forum] adegenet-forum Digest, Vol 61, Issue 4
In-Reply-To:
References:
Message-ID: <02E355FCF1052B4B9BBB570769EDF76F10497DB2@ICTS-S-MBX13.luna.kuleuven.be>

Dear Jutta,

Did you already try to use individual-based distance methods (which I prefer
in most cases) such as the inverse proportion of shared alleles or Euclidean
distances? Did you try to do a PCA analysis? All this can be quickly done in
adegenet and will give you major insight into the structure of your data.
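
As a rough illustration of what these individual-based distances measure, here is a minimal Python sketch. It is not adegenet code: the function names are hypothetical, genotypes are assumed to be diploid with no missing data and given as one pair of allele labels per locus, and the distance is taken as one minus the proportion of shared alleles, which is only one common convention.

```python
from collections import Counter
from math import sqrt


def shared_allele_proportion(ind_a, ind_b):
    """Fraction of allele copies shared by two diploid individuals,
    averaged over loci (assumes no missing genotypes)."""
    shared = 0
    total = 0
    for locus_a, locus_b in zip(ind_a, ind_b):
        # Multiset intersection: a homozygote shares one copy with a heterozygote.
        overlap = Counter(locus_a) & Counter(locus_b)
        shared += sum(overlap.values())
        total += 2
    return shared / total


def shared_allele_distance(ind_a, ind_b):
    """One minus the proportion of shared alleles (assumed convention)."""
    return 1 - shared_allele_proportion(ind_a, ind_b)


def euclidean_distance(ind_a, ind_b):
    """Euclidean distance between per-locus allele-count vectors."""
    squared = 0
    for locus_a, locus_b in zip(ind_a, ind_b):
        counts_a, counts_b = Counter(locus_a), Counter(locus_b)
        for allele in set(counts_a) | set(counts_b):
            squared += (counts_a[allele] - counts_b[allele]) ** 2
    return sqrt(squared)


def distance_matrix(individuals, distance):
    """Full pairwise matrix for any of the distance functions above."""
    return [[distance(x, y) for y in individuals] for x in individuals]
```

A matrix built this way can then be fed to an ordination or a tree-building method; with real microsatellite data, missing genotypes would need explicit handling, which this sketch leaves out.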
Another software to study genetic structure I also like a lot is SPAGeDi (http://ebe.ulb.ac.be/ebe/SPAGeDi.html). I know this doesn't answer your questions, but I merely wanted to mention some alternatives to cluster analysis that could also give you insight into population structure. Kind regards Frederik ________________________________________ From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of adegenet-forum-request at lists.r-forge.r-project.org [adegenet-forum-request at lists.r-forge.r-project.org] Sent: Monday, September 09, 2013 12:00 PM To: adegenet-forum at lists.r-forge.r-project.org Subject: adegenet-forum Digest, Vol 61, Issue 4 Send adegenet-forum mailing list submissions to adegenet-forum at lists.r-forge.r-project.org To subscribe or unsubscribe via the World Wide Web, visit https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum or, via email, send a message with subject or body 'help' to adegenet-forum-request at lists.r-forge.r-project.org You can reach the person managing the list at adegenet-forum-owner at lists.r-forge.r-project.org When replying, please edit your Subject line so it is more specific than "Re: Contents of adegenet-forum digest..." Today's Topics: 1. Re: Question about genetic structure in admixed populations (Valeria Montano) ---------------------------------------------------------------------- Message: 1 Date: Sun, 8 Sep 2013 20:44:07 +0200 From: Valeria Montano To: Jutta Geismar Cc: "adegenet-forum at lists.r-forge.r-project.org" Subject: Re: [adegenet-forum] Question about genetic structure in admixed populations Message-ID: Content-Type: text/plain; charset="windows-1252" Hi Jutta! well, ehm...sooo, you already know about the limitations of Structure and, in general, bayesian approaches to cluster analysis. 
For what matters, I can give you my opinion/suggestion in brief: 1) Structure and DAPC can give different results in several cases, depending on the evolutionary processes ongoing among specific inds/pops. I wish they always agreed - that would make our lives happier. In general, relying on a method rather than another is a decision that can be made based on the knowledge of the models assumed in different approaches and their limitations, and certainly the feeling you have about your case study given all the results you already got. Personally, I never take a best k out of the find.clusters unless the BIC shows a very clear cut-off (i.e. the curve nicely rising up after a certain K), but this is really a personal standard. 2) My understanding of the distribution of continuous populations (as this is seems to be the case of your data) is that there is actually no best clustering one can do. When the spatial distribution of the allele frequencies is organized in gradients or clines, the clusters are not the best tool to use to describe the data. That is why a method such as BAPS is useful. GENELAND is cool too, but there is no explicit modelling of gradients, plus the integration of the spatial info has never been totally clear to me. I find BAPS and TESS more straightforward. In this sense, they are good approaches to optimize a number of "clusters" although what you find out cannot be really called clusters (in the structure or dapc meaning). It took me a while to learn how to manage the sense of panic/disorientation provoked by the absence of best clustering in some genetic datasets, but afterwards I even developed a preference for gradients, although I admit clusters are very useful. Hope this is somehow useful Best wishes Valeria On 8 September 2013 15:35, Jutta Geismar wrote: > Dear Valeria,****** > > **** > > thank you very much for your quick answer. 
From Jutta.Geismar at senckenberg.de Sun Sep 8 15:35:16 2013
From: Jutta.Geismar at senckenberg.de (Jutta Geismar)
Date: Sun, 08 Sep 2013 15:35:16 +0200
Subject: [adegenet-forum] Antw: Re: Question about genetic structure in admixed populations
In-Reply-To:
References: <52274BC7020000CB0000539A@snggwia.senckenberg.de>
Message-ID: <522C9934020000CB000053D5@snggwia.senckenberg.de>

Dear Valeria,

thank you very much for your quick answer. I'm aware of the problems STRUCTURE has with genetic data from continuous populations (see also http://onlinelibrary.wiley.com/doi/10.1111/j.1365-2664.2008.01606.x/pdf). That is one reason I don't want to use STRUCTURE as the only cluster analysis. I haven't attempted to use BAPS yet, but I gave GENELAND a trial to include spatial information. Besides testing for IBD with a Mantel test, I also modified the geographic distances using resistance values etc. inferred from an SDM. A spatial autocorrelation analysis didn't show a clear pattern of spatial relation (also in different distance classes). A PCA shows a big cloud around the center point. Each of the first two axes explained about 19 % of the variance.

Thank you for confirming that my DAPC script is correct. I set the maximum number of clusters to 50 to make sure I would not miss any structural shifts.

Nonetheless, I cannot explain the contradictory results of STRUCTURE indicating a panmictic population (4 parallel stripes) and DAPC assigning most individuals to one specific cluster.

Thanks again for your comments. I will have a look at BAPS.
Best wishes,
Jutta

>>> Valeria Montano 9/5/2013 10:59 >>>

Dear Jutta,

cluster analysis can be tricky when the samples analysed are distributed along a gradient, and if there is no clear-cut subdivision this can lead to contradictory results (have a look at this paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.192.3029&rep=rep1&type=pdf). You may want to consider using TESS or BAPS with the admixture model option. These two programs allow the geographic coordinates to be included as prior information, and the admixture model is a way to model spatial gradients. If you tested for IBD with a Mantel test, just be careful: a significant Mantel test is not necessarily due to IBD, since the correlation between geographic and genetic distances can be significant under different spatial/migratory schemes.

I think your DAPC is ok, apart from the fact that there is no need to run find.clusters with the number of PCs indicated by optim.a.score. That procedure is used to optimize the discriminant space among clusters in the DAPC. To assign individuals to clusters you can simply retain all the variance (although in your case it is almost the same, given that you have 98%). One more thing: I would try a maximum number of clusters of around 20, more than your number of sampling locations. You can also give sPCA a try.

Hope this helps

Ciao
Valeria

On 4 September 2013 15:03, Jutta Geismar wrote:

Dear Mr Jombart and DAPC users,

I used DAPC to analyze genetic structure in a small region with 20 microsatellite markers. I analyzed 330 individuals (14 sampling sites) and found little genetic differentiation (FST, Jost's D), but a significant isolation by distance pattern. A cluster analysis in STRUCTURE resulted in four clusters (STRUCTURE Harvester), but all individuals had more or less equal posterior probability in all four inferred clusters. Therefore I assume a panmictic population structure. Since STRUCTURE is known to have some problems analyzing datasets under IBD, I analyzed the data with DAPC.

DAPC resulted in 3 or 4 clusters (I tested up to K=7 to be sure), but in both cases these were randomly distributed among all individuals without a geographic context. Only 94 individuals were not assigned to one cluster with more than 90% and therefore would be counted as 'admixed' (example in DAPC tutorial). To me the results of STRUCTURE and DAPC conflict with each other, but I don't know what a panmictic population would look like in DAPC. Distances between sites are small and it is very likely that gene flow occurs among my sampling points, which might cause problems in genetic cluster analyses. I don't know if I made a mistake in my thinking, so I want to explain my procedure briefly:

1. I used dapc and chose 1/3 of the sample size as the number of PCs (as suggested) and counted DAs in the plot (100% of the variability was included, 110 PC, 13 DA)

2. To reduce variability I used optim.a.score (smart FALSE). The best a-score was around 0.2 (PC 61)

3. After that I wanted to estimate the number of clusters with find.clusters, used the number of PCs indicated by the a-score, and repeated the dapc (conserved variance was still 98%, 61 PCs, 2 DA)

I chose the K after which the BIC decreased less than at the previous step, but not the lowest K.

If there are mistakes in my procedure I would appreciate some advice. But even if the procedure is okay, I cannot explain the contradiction between these two analyses.

Thanks a lot in advance for some help.

Jutta Geismar
PhD student
Germany
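The question of how to read K off the BIC values, raised in the procedure above, can be explored with a minimal base-R sketch. It is only an illustration under stated assumptions: the simulated scores `pcs` are a hypothetical stand-in for retained PCs of individuals sampled along a cline, `bic_curve` is a made-up helper, and the formula BIC = n*log(WSS/n) + K*log(n) is an assumption about how a k-means based BIC curve is typically computed, not a copy of the find.clusters code.

```r
set.seed(1)

# Hypothetical stand-in for retained PCs: 330 individuals spread along a
# continuous gradient, so there are no discrete groups to recover.
n   <- 330
pos <- runif(n)                            # position along the cline
pcs <- cbind(pos + rnorm(n, sd = 0.15),    # axis 1 follows the gradient
             rnorm(n, sd = 0.15))          # axis 2 is pure noise

# BIC for K = 1..max_k from k-means,
# ASSUMING BIC = n * log(WSS / n) + K * log(n)
bic_curve <- function(x, max_k = 10) {
  n <- nrow(x)
  sapply(seq_len(max_k), function(k) {
    wss <- if (k == 1) {
      sum(scale(x, center = TRUE, scale = FALSE)^2)   # total sum of squares
    } else {
      kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
    }
    n * log(wss / n) + k * log(n)
  })
}

bic <- bic_curve(pcs)
print(round(bic, 1))
```

Plotting the result, e.g. `plot(seq_along(bic), bic, type = "b")`, makes it easy to check whether the curve shows a clear cut-off or only a gradual decline before committing to a value of K.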
From Jutta.Geismar at senckenberg.de Wed Sep 11 12:05:58 2013
From: Jutta.Geismar at senckenberg.de (Jutta Geismar)
Date: Wed, 11 Sep 2013 12:05:58 +0200
Subject: [adegenet-forum] Digest, Vol 61, Issue 4
Message-ID: <52305CA6020000CB00005435@snggwia.senckenberg.de>

Dear Frederik,

thank you for your comments. I did a PCA based on individual genetic distances, which showed a big cloud of points; I mentioned it in my last answer. Did you mean I should try it with genetic relatedness? Which other individual-based distance methods do you have in mind? I also worked with SPAGeDi, but the results were more or less the same as those I got with a spatial autocorrelation, which was to be expected. Since I received no clear information from these analyses, I hoped to find more explicit structure information with a cluster approach.

Kind regards
Jutta

From t.jombart at imperial.ac.uk Wed Sep 11 17:58:52 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Wed, 11 Sep 2013 15:58:52 +0000
Subject: [adegenet-forum] $li in sPCA analysis
In-Reply-To:
References: <2CB2DA8E426F3541AB1907F98ABA6570638B5234@icexch-m1.ic.ac.uk>, <2CB2DA8E426F3541AB1907F98ABA6570638B5287@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638DF333@icexch-m1.ic.ac.uk>

Hello,

the values in $li have arbitrary signs. They are simply scores synthesizing the spatial structures in the data (linear combinations of variables optimizing the variance and Moran's I).

Cheers
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel.
: 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Hanan Sela [hans at tauex.tau.ac.il]
Sent: 11 August 2013 06:21
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] li vs. ls in sPCA analysis

Hello

I have plotted the first PC of an sPCA analysis using s.value, once with z=my.pca$li[,1] and once with z=my.pca$ls[,1]. The patterns seem to differ (see attached file). I do not understand what the lagged PC represents. What is the meaning of "denoisified" in the practical day presentation (Google does not know)? How do I interpret the difference? Please explain.

Thank you

Mr. Hanan Sela Ph.D.
Curator of the Lieberman Cereal Germplasm Bank
The Institute for Cereal Crops Improvement
Tel-Aviv University
P.O. Box 39040
Tel Aviv 69978
Israel
hans at tauex.tau.ac.il
Phone: 972-3-6405773
Cell: 972-50-5727458, local U.S 17203600603
Fax: 972-3-6407857

On Thu, Aug 1, 2013 at 7:15 PM, wrote:

Send adegenet-forum mailing list submissions to
adegenet-forum at lists.r-forge.r-project.org

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
or, via email, send a message with subject or body 'help' to
adegenet-forum-request at lists.r-forge.r-project.org

You can reach the person managing the list at
adegenet-forum-owner at lists.r-forge.r-project.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of adegenet-forum digest..."

Today's Topics:

1. Fwd: Question about pre-processing of SNP data for machine learning (Daniel Murrell)
2. Re: Fwd: Question about pre-processing of SNP data for machine learning (Jombart, Thibaut)
3.
Re: Fwd: Question about pre-processing of SNP data for machine learning (Daniel Murrell)

----------------------------------------------------------------------

Message: 1
Date: Thu, 1 Aug 2013 15:26:00 +0100
From: Daniel Murrell
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning
Message-ID:
Content-Type: text/plain; charset="windows-1252"

Hi All

This is my first time using adegenet. I'm trying to perform some pre-processing on 1.3M SNPs (~800 individuals) so that I can use them for a machine learning task.

My data was stored in a format which had to be converted to a genlight object. The data was split so that the information for the SNPs in each chromosome was in a separate file. I've read each file in, converted it to a genlight object and then concatenated the genlight objects using cbind. All of that seems to work ok (except that the position and chromosome data went back to NULL during the concatenation and I had to reset them on the combined genlight object).

So, now I want to do my own processing on each SNP, and when I try to access the information for a single SNP over the 800 individuals, it takes ages to extract. Is this because the encoding is done row-wise, so that the whole object needs to be decoded for me to get out the information I require? Is there a way to transpose this genlight object so that I can access the data for a single SNP across all individuals quickly?

Thank you
Daniel

---------- Forwarded message ----------
From: Jombart, Thibaut
Date: Fri, Jul 19, 2013 at 4:27 PM
Subject: RE: Question about pre-processing of SNP data for machine learning
To: Daniel Murrell

Dear Daniel,

yes, adegenet is designed for that kind of task. Please look at the adegenet-basics tutorial, where you'll find examples of dimension reduction for SNP data; it can be found on:
http://adegenet.r-forge.r-project.org/

Don't hesitate to use the adegenet-forum for further questions (see contacts on the website).

Best
Thibaut

--
######################################
Dr Thibaut JOMBART
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
St Mary's Campus
Norfolk Place
London W2 1PG
United Kingdom
Tel. : 0044 (0)20 7594 3658
t.jombart at imperial.ac.uk
http://sites.google.com/site/thibautjombart/
http://adegenet.r-forge.r-project.org/
________________________________________
From: dsmurrell at gmail.com [dsmurrell at gmail.com] on behalf of Daniel Murrell [dsm38 at cam.ac.uk]
Sent: 19 July 2013 16:23
To: Jombart, Thibaut
Subject: Question about pre-processing of SNP data for machine learning

Dear Thibaut

I'm trying to build a model that uses SNP data as input. The problem I have is that there is too much of it, and I need a way to reduce the number or the dimensionality of the data points so that I can use them as input to machine learning algorithms (genome wide, 1.3 million SNPs, 800 individuals). I've done some searching and found this paper: http://www.ncbi.nlm.nih.gov/pubmed/18076475 (pdf attached). I also found your adegenet package and wondered if it's designed for doing something like this? I'm not from this field and I'm having some trouble working this out. Can you point me to anything that might help? I'm not sure whether I should be keeping a subset of SNPs, and how to find that subset from the 1.3 million, or whether I should be reducing the dimensionality.

Thank you
Daniel
------------------------------

Message: 2
Date: Thu, 1 Aug 2013 15:22:27 +0000
From: "Jombart, Thibaut"
To: Daniel Murrell, "adegenet-forum at lists.r-forge.r-project.org"
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638ABF4F at icexch-m1.ic.ac.uk>
Content-Type: text/plain; charset="Windows-1252"

Dear Daniel,

the loss of attributes after cbind is indeed a glitch. Would you mind creating a ticket about it?
https://sourceforge.net/p/adegenet/tickets/

You're right about the issue. The encoding is indeed done row-wise, so the conversion is done many times over. There's no option for transposing the data, but one solution would be to convert your data to integers by blocks, so that the conversion takes place less often while still keeping RAM requirements reasonable.

All the best
Thibaut

------------------------------

Message: 3
Date: Thu, 1 Aug 2013 17:14:37 +0100
From: Daniel Murrell
To: "Jombart, Thibaut"
Cc: "adegenet-forum at lists.r-forge.r-project.org"
Subject: Re: [adegenet-forum] Fwd: Question about pre-processing of SNP data for machine learning
Message-ID:
Content-Type: text/plain; charset="windows-1252"

Dear Thibaut

Ok, I could try that. I could also try to use the genlight object in a transposed manner, just for the purpose of holding the data, so that I can access individual SNPs easily. I mean nothing else would work except the containment.

Thanks for the help

Regards
Daniel
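The block-wise conversion suggested in this thread could look roughly like the base-R sketch below. It is not adegenet code: `apply_by_block` and its `get_block` argument are hypothetical names introduced for illustration, and a plain integer matrix stands in for the genlight object so that the example runs without any package. With a real genlight object `x`, the extractor would presumably be something like `function(idx) as.matrix(x[, idx])`, which is an assumption to check against the genlight documentation.

```r
# Apply `fun` to every column (SNP), decoding `block_size` columns at a time.
# `get_block` is a hypothetical extractor returning an integer matrix
# (individuals x SNPs) for a vector of column indices.
apply_by_block <- function(n_loci, block_size, get_block, fun) {
  starts <- seq(1, n_loci, by = block_size)
  out <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    idx <- starts[i]:min(starts[i] + block_size - 1, n_loci)
    block <- get_block(idx)              # one decoding step per block
    out[[i]] <- apply(block, 2, fun)     # per-SNP statistic within the block
  }
  unlist(out)
}

# Toy data standing in for a genlight object: 8 individuals x 10 SNPs (0/1/2)
set.seed(42)
geno <- matrix(sample(0:2, 80, replace = TRUE), nrow = 8)

# Frequency of the second allele at each SNP, computed 4 SNPs at a time
alt_freq <- apply_by_block(
  n_loci     = ncol(geno),
  block_size = 4,
  get_block  = function(idx) geno[, idx, drop = FALSE],
  fun        = function(snp) mean(snp, na.rm = TRUE) / 2
)
print(alt_freq)
```

The block size is the trade-off mentioned in the reply: larger blocks mean fewer conversions but more memory held as plain integers at any one time.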
From danica_714 at hotmail.com Wed Sep 18 12:03:30 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Wed, 18 Sep 2013 11:03:30 +0100
Subject: [adegenet-forum] help with scaleGEN
Message-ID:

Hi adegenet users,

I am having some trouble understanding how scaleGen is supposed to be used when plotting a PCA.

I get very different results when running the following two sets of commands (note: "scale=FALSE" is omitted in the second scaleGen call):

A)
    obj <- scaleGen(mosquitoind, scale=FALSE, missing="mean")
    pca.obj <- dudi.pca(obj,cent=FALSE,scale=FALSE,scannf=FALSE,nf=3)

B)
    obj2 <- scaleGen(mosquitoind, missing="mean")
    pca.obj2 <- dudi.pca(obj2,cent=FALSE,scale=FALSE,scannf=FALSE,nf=3)

I guess my question is, what is the appropriate way of using scaleGen if I want to scale my missing data to the mean allele frequency?

Thanks in advance,
Danica

From t.jombart at imperial.ac.uk Wed Sep 18 16:53:53 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Wed, 18 Sep 2013 14:53:53 +0000
Subject: [adegenet-forum] help with scaleGEN
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>

Hello,

I think some clarification should help here.

"Scaling" means transforming a variable so that its variance is 1.
It is usually used to remove the effects of variances that differ inherently across a set of variables (typically because of different units). In genetics, most of the time, I think scaling is a bad idea: all variables have the same unit, and differences in variances are probably meaningful.

missing="mean" refers to the procedure for replacing missing data. Missing values are set to the origin, which is the mean of the corresponding allele frequencies (typically the 'non-informative' point in PCA).

Best
Thibaut

From danica_714 at hotmail.com Thu Sep 19 10:57:46 2013
From: danica_714 at hotmail.com (Danica Fabrigar)
Date: Thu, 19 Sep 2013 09:57:46 +0100
Subject: [adegenet-forum] help with scaleGEN
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>
References: , <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>
Message-ID:

Hi Thibaut,

Thank you for the clarification. I got confused myself there.
What you've said makes a lot of sense. Are there cases in genetics in which scaling would be a good idea?

Regards,
Danica

From t.jombart at imperial.ac.uk Thu Sep 19 13:41:06 2013
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Thu, 19 Sep 2013 11:41:06 +0000
Subject: [adegenet-forum] help with scaleGEN
In-Reply-To:
References: , <2CB2DA8E426F3541AB1907F98ABA6570638EF17F@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA6570638EF5A0@icexch-m1.ic.ac.uk>

I haven't seen many, but one can think of a few cases, yes. With multiallelic markers such as microsatellites, one may want to give the same 'weight' to each marker, and thus scale so that the total variance (i.e. summed over alleles) is the same for all markers. But this is already a bit different from standardizing alleles, at least in practice (on a theoretical level, the procedure is nearly identical: we divide vectors/matrices by their norm). The same idea could apply to SNPs of different genes.

Cheers
Thibaut
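
Editorial illustration of the thread above. The points made in the two replies (mean replacement of missing data, centring versus scaling, and per-marker weighting) can be sketched in base R. This is only a rough sketch under stated assumptions: it does not call adegenet or ade4, the toy frequency table is made up, the helper name impute_mean and the "L1.a"-style column labels are hypothetical, and scale() with its n-1 variance is a stand-in that may differ in detail from what scaleGen computes.

```r
# Toy allele-frequency table (made-up numbers): 5 individuals x 3 alleles,
# two alleles at marker L1 and one at marker L2, with one missing value.
freq <- matrix(c(0.0, 0.5, 1.0, 0.5, 1.0,
                 1.0, 0.5, 0.0, NA,  0.0,
                 0.0, 0.0, 0.5, 0.0, 0.0),
               ncol = 3,
               dimnames = list(paste0("ind", 1:5),
                               c("L1.a", "L1.b", "L2.a")))

# Hypothetical helper: replace each missing value by the mean frequency
# of its allele (i.e. of its column).
impute_mean <- function(x) {
  col_means <- colMeans(x, na.rm = TRUE)
  idx <- which(is.na(x), arr.ind = TRUE)
  x[idx] <- col_means[idx[, "col"]]
  x
}

filled <- impute_mean(freq)

# Centring only: variances still differ between alleles.
centred <- scale(filled, center = TRUE, scale = FALSE)

# Centring and scaling: every allele ends up with variance 1.
standardised <- scale(filled, center = TRUE, scale = TRUE)

# The imputed cell sits at the origin once the data are centred.
centred["ind4", "L1.b"]

apply(centred, 2, var)       # unequal variances
apply(standardised, 2, var)  # all equal to 1

# The two tables give different PCAs, which is the kind of difference
# reported between command sets A and B.
prcomp(centred, center = FALSE)$sdev
prcomp(standardised, center = FALSE)$sdev

# Per-marker weighting: divide all alleles of a marker by the square root
# of that marker's total variance (summed over its alleles).
marker <- sub("\\..*$", "", colnames(centred))
marker_var <- tapply(apply(centred, 2, var), marker, sum)
weighted <- sweep(centred, 2, as.numeric(sqrt(marker_var[marker])), "/")

# Total variance per marker after weighting.
marker_totals <- tapply(apply(weighted, 2, var), marker, sum)
marker_totals
```

In this sketch the per-marker weighting leaves the relative variances of alleles within a marker untouched, whereas full standardisation flattens them; that is the practical difference alluded to in the last reply.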