From alangarcia87 at hotmail.com Sat Oct 1 23:30:50 2016
From: alangarcia87 at hotmail.com (Alan Garcia-Elfring)
Date: Sat, 1 Oct 2016 21:30:50 +0000
Subject: [adegenet-forum] find.clusters() freezes on DAPC
Message-ID:

Hi everyone,

I'm wondering if anyone has gotten stuck on the find.clusters function?

I did a DAPC on this exact dataset some months back, and now I want to redo it to check a different K value and change the colours, but it keeps getting stuck on find.clusters.

Any idea what may be causing this? If I remember correctly, this step doesn't take long, and definitely not more than a day.

I've tried on a new Mac and also on a PC using the parallel = FALSE argument.

Any help is appreciated.

> pldata = read.PLINK("batch_1_recode.raw")
> grp = find.clusters(pldata, max.n.clust = 15)  ## GETS STUCK ON THIS

/// GENLIGHT OBJECT /////////

// 229 genotypes, 62,236 binary SNPs, size: 15.3 Mb

// Basic content
@gen: list of 229 SNPbin
@ploidy: ploidy of each individual (range: 2-2)

// Optional content
@ind.names: 229 individual labels
@loc.names: 62236 locus labels
@pop: population of each individual (group size range: 8-20)
@other: a list containing: sex phenotype pat mat

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From alangarcia87 at hotmail.com Sun Oct 2 00:02:11 2016
From: alangarcia87 at hotmail.com (Alan Garcia-Elfring)
Date: Sat, 1 Oct 2016 22:02:11 +0000
Subject: [adegenet-forum] find.clusters() freezes on DAPC
In-Reply-To: References: Message-ID:

Never mind! I finally got it to work on the Mac, although the optimal K changed when assessing different ranges of K. (Hopefully when I redo it with the original range of K I will get the same optimal K, as I'm getting ready to publish!)
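[Editor's note: a side remark on the changing optimal K. The k-means search behind find.clusters is stochastic, so the selected K can legitimately vary between reruns. A minimal sketch, assuming the adegenet package and the `pldata` genlight object from the message above; fixing the RNG seed before the call makes reruns reproducible:]

```r
library(adegenet)

# find.clusters uses random k-means starts; fixing the seed gives the
# same BIC values, and hence the same optimal K, on every rerun.
set.seed(2016)
grp <- find.clusters(pldata, max.n.clust = 15)
```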
Cheers,
Alan

________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org on behalf of Alan Garcia-Elfring
Sent: 01 October 2016 17:30
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] find.clusters() freezes on DAPC

> [quoted message trimmed]

From thibautjombart at gmail.com Mon Oct 3 12:59:33 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 3 Oct 2016 11:59:33 +0100
Subject: [adegenet-forum] xval/optim.a.score consistency
In-Reply-To: References: Message-ID:

Hi Alexandre,

I would not trust the automatic selection of the optimal space dimension unless you are looking at simulated data and need to run the analysis hundreds of times. There are two questions here:

# Stability of xvalDapc output
As this is a stochastic process, changing results are to be expected. It may be that you need to increase the number of replicates for the results to stabilise a bit.
If you haven't yet, check the tutorials for some guidelines on this, but basically you want to select the smallest number of dimensions that gives the best classification outcome (i.e. the 'elbow' in the curve). If there is no elbow, there may be no structure in the data - check that the % successful re-assignment is better than expected at random. If the % successful re-assignment plateaus, various numbers of PCs might lead to equivalent solutions, but at the very least the structures should remain stable.

# Cross-validation vs optim.a.score
Simple: go with cross-validation. The 'a-score' was meant as a crude measure of goodness of fit of DAPC results, but cross-validation makes more sense.

Hope this helps,
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: https://repidemicsconsortium.org
https://sites.google.com/site/thibautjombart/
https://github.com/thibautjombart
Twitter: @TeebzR

On 29 September 2016 at 10:02, Alexandre Lemo <alexandros.lemopoulos at gmail.com> wrote:
> Dear Dr. Jombart and adegenet users,
>
> I am trying to run a DAPC on a dataset of 3975 SNPs obtained through RAD sequencing. There are 11 populations and 306 individuals examined here (minimum 16 ind/pop). Note that I am not using the find.clusters function.
>
> My problem is that I can't get any consistency in the number of PCs that I should use for the DAPC. Actually, every time I run optim.a.score or xval, I get different results. I tried changing the training set (tried 0.7, 0.8 and 0.9), but the optimal number of PCs retained still changes in each run.
>
> Here is an example of my script:
>
> # str is a genind object
>
> optim_PC <- xvalDapc(tab(str, NA.method = "mean"), pop(str), training.set = 0.9,
>                      n.pca = 5:100, n.rep = 1000, parallel = "snow", ncpus = 4L)
>
> optim_PC_2 <- xvalDapc(tab(str, NA.method = "mean"), pop(str), training.set = 0.9,
>                        n.pca = 5:100, n.rep = 1000, parallel = "snow", ncpus = 4L)
>
> What happens here is that optim_PC will give me an optimal number of PCs of (e.g.) 76 while optim_PC_2 will give me 16. I tried running this several times, and every time the results are different.
>
> I also tried using optim.a.score():
>
> dapc.str <- dapc(str, var.contrib = TRUE, scale = FALSE, n.pca = 100, n.da = NULL)
> optim.a.score(dapc.str)
>
> Here, the number of PCs changes every time I run the function.
>
> Does anyone have an idea of why this is happening, or has had similar issues? I am quite confused, as results obviously change a lot depending on how many PCs are used...
>
> Thanks for your help and for this great adegenet package!
>
> Best,
>
> Alexandre
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

From thibautjombart at gmail.com Mon Oct 3 13:12:58 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 3 Oct 2016 12:12:58 +0100
Subject: [adegenet-forum] find.clusters() freezes on DAPC
In-Reply-To: References: Message-ID:

Hi,

It is probably not stuck; it is asking you for a number of PCs to retain.
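[Editor's note: the pause happens because find.clusters is waiting at an interactive prompt. A minimal sketch, assuming the adegenet package and the `pldata` genlight object from the original post; the `n.pca` and `n.clust` values shown are illustrative, and supplying them skips the interactive prompts:]

```r
library(adegenet)

# With n.pca set, find.clusters does not prompt for the number of PCs
# to retain; with n.clust set, it does not prompt for the number of
# clusters either, so the call runs without pausing.
grp <- find.clusters(pldata,
                     n.pca   = 200,  # PCs kept for the k-means search
                     n.clust = 4)    # illustrative K; omit to pick K from the BIC plot
```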
Cheers,
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: https://repidemicsconsortium.org
https://sites.google.com/site/thibautjombart/
https://github.com/thibautjombart
Twitter: @TeebzR

On 1 October 2016 at 22:30, Alan Garcia-Elfring wrote:
> [quoted message trimmed]

From thibautjombart at gmail.com Mon Oct 3 13:20:20 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 3 Oct 2016 12:20:20 +0100
Subject: [adegenet-forum] xval/optim.a.score consistency
In-Reply-To: References: Message-ID:

Hi Alexandre,

Thanks for the figure, it is very useful. Yes, around 15.
To fine-tune it, I would run the analysis for all numbers of PCA axes between 1 and 20, and increase the number of replicates (30 or more if you can, maybe up to 50).

Best,
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: https://repidemicsconsortium.org
https://sites.google.com/site/thibautjombart/
https://github.com/thibautjombart
Twitter: @TeebzR

On 3 October 2016 at 12:14, Alexandre Lemo wrote:
> Dear Dr Jombart,
>
> Thanks a lot for your answer! It does help, as I now understand better why there is fluctuation in xval's results.
>
> When I run xval, my curve will often look like this, with the Mean Successful Assignment varying between 0.81 and 0.82 when reaching the plateau.
>
> [image: Images intégrées 1]
>
> If I understand correctly, all solutions in the plateau are more or less equivalent. The structure of the DAPC should not change drastically whether I select 79 or 20 PCs in that case. However, from what I understand, it is still better to select the fewest PCs possible. That would mean that the optimal number of PCs would be at the beginning of the plateau (in this example, roughly around 15). Is this correct?
>
> Thanks a lot again,
>
> Best,
>
> Alexandre
>
> 2016-10-03 13:59 GMT+03:00 Thibaut Jombart:
>> [quoted message trimmed]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dapcxval.png
Type: image/png
Size: 35757 bytes
Desc: not available
URL:

From thibautjombart at gmail.com Mon Oct 3 14:25:34 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 3 Oct 2016 13:25:34 +0100
Subject: [adegenet-forum] xval/optim.a.score consistency
In-Reply-To: References: Message-ID:

Hi,

Please keep the forum in CC when replying. Yes, I would keep all of the DA axes for the cross-validation.
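[Editor's note: a sketch of how the advice in this thread fits together. It assumes the adegenet package and Alexandre's genind object `str`; the seed, replicate count and axis counts are illustrative, not prescriptive:]

```r
library(adegenet)

set.seed(1)  # xvalDapc draws random training sets; fix the seed to reproduce a run
optim_PC <- xvalDapc(tab(str, NA.method = "mean"), pop(str),
                     training.set = 0.9,
                     n.da  = 100,   # keep all discriminant axes, as discussed above
                     n.pca = 1:20,  # scan around the start of the plateau
                     n.rep = 50)    # more replicates stabilise the chosen optimum
```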
As for the DAPC itself, keep as many axes as you want / need to look at.

Best,
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: https://repidemicsconsortium.org
https://sites.google.com/site/thibautjombart/
https://github.com/thibautjombart
Twitter: @TeebzR

On 3 October 2016 at 12:31, Alexandre Lemo wrote:
> Hi,
>
> Thanks a lot for the help, it has been very useful. I will try exactly that.
>
> One last question, if I may ask:
>
> How does n.da influence xval? I ran several replicates, and I think there was more stability when using a really high number of DAs. For example, I felt xval was more consistent when using n.da = 100 (meaning all of them).
>
> Also, if I use a given number of DAs for xval, I should logically use the same number for the actual DAPC. Am I right on this one?
>
> Here is the code for more clarity:
>
> optim_PC <- xvalDapc(tab(str, NA.method = "mean"), pop(str), training.set = 0.9,
>                      n.da = 100, n.pca = 5:100, n.rep = 1000,
>                      parallel = "snow", ncpus = 4L)
>
> # So, if 15 is the best number of PCs obtained, then:
>
> dapc.str <- dapc(str, var.contrib = TRUE, scale = FALSE, n.pca = 15, n.da = 100)
>
> But could I run it with n.da = 3, or would I need to perform xval again with n.da = 3 and then perform the DAPC again?
>
> I hope I am clear enough with my question.
>
> Thanks again,
>
> Best,
>
> Alexandre
>
> 2016-10-03 14:20 GMT+03:00 Thibaut Jombart:
>> [quoted message trimmed]
From mstagliamonte at ufl.edu Tue Oct 4 16:33:31 2016
From: mstagliamonte at ufl.edu (Tagliamonte, Massimiliano S)
Date: Tue, 4 Oct 2016 14:33:31 +0000
Subject: [adegenet-forum] SNP data and PCA
Message-ID: <1475591607954.34987@ufl.edu>

Dear adegenet users,

I am trying to perform a PCA on whole-genome SNP data, and the results I have so far seem to make sense. I do have a few doubts, though, and I may need some help to resolve them.

Following variant calling, I converted my data to a data frame, e.g.:

mysnps <- data.frame('ind_names' = c('s1', 's2', 's3'),
                     'locus1' = c('A/A', 'N', 'A/G'),
                     'locus2' = c('G/G', 'A/G', 'A/A'))

Then I used df2genind on the data frame and performed the principal component analysis. Was that right? I am still not sure whether I should have used a genlight object instead. Should I change it if I decide to do a DAPC or PCoA?

Thanks for your kind help,
Max

Massimiliano S. Tagliamonte
Graduate Student
University of Florida College of Veterinary Medicine
Department of Infectious Diseases and Pathology

From fhernandeu at uc.cl Sat Oct 15 19:42:46 2016
From: fhernandeu at uc.cl (Felipe Hernández)
Date: Sat, 15 Oct 2016 13:42:46 -0400
Subject: [adegenet-forum] Genind object inquiry
Message-ID:

Good afternoon,

Perhaps this is a pretty basic question, but I have been struggling to set the populations to which my individuals belong in my microsatellite dataset. In other words, I don't understand how to set the slot 'pop' when I create a genind object from a csv file that contains all my genetic data. For example, should I include in my matrix a column containing the population codes to which my individuals belong (besides all my locus columns)? Or should I just include my loci info?
I realised that if I include the column with the population codes in my genind object, it is also treated as a locus, and surely this is not correct. I assume that once this issue is solved, I can appropriately transform my genind object into a genpop object, which is my next goal before continuing with my analyses.

Thank you in advance for your valuable input, thanks!

Regards,
Felipe

--
Felipe Hernández
Médico Veterinario (DVM), MSc.
PhD Candidate, Interdisciplinary Ecology Program
School of Natural Resources and Environment
Wildlife Ecology and Conservation Department
University of Florida

From roman.lustrik at biolitika.si Sun Oct 16 01:04:42 2016
From: roman.lustrik at biolitika.si (Roman Luštrik)
Date: Sun, 16 Oct 2016 01:04:42 +0200 (CEST)
Subject: [adegenet-forum] Genind object inquiry
In-Reply-To: References: Message-ID: <383288461.288311.1476572682331.JavaMail.zimbra@biolitika.si>

If I understand correctly, you can use the replacement function pop<-, e.g.

data(H3N2)
nInd(H3N2)
# [1] 1903
pop(H3N2)
# NULL
# will put each individual in its own population
pop(H3N2) <- 1:1903
# see
pop(H3N2)

Cheers,
Roman

----
In god we trust, all others bring data.

From: "Felipe Hernández"
To: adegenet-forum at lists.r-forge.r-project.org
Sent: Saturday, October 15, 2016 7:42:46 PM
Subject: [adegenet-forum] Genind object inquiry

[quoted message trimmed]

From mneel at umd.edu Sat Oct 22 20:29:03 2016
From: mneel at umd.edu (Maile C Neel)
Date: Sat, 22 Oct 2016 14:29:03 -0400
Subject: [adegenet-forum] Dealing with "Duplicate" Locations
Message-ID:

I am having the same problem described in the link below, in which duplicate locations are erroneously identified when I try to create a CN object using chooseCN in adegenet 2.0.1 to implement Monmonier's algorithm.

http://lists.r-forge.r-project.org/pipermail/adegenet-forum/2011-October/000427.html

I could not find how the previous thread was resolved despite extensive searching. Apologies if it was in the archives and I missed it.

I call chooseCN using the following command, for which my genind object holds genetic and spatial data for 374 unique genotypes at unique locations. The minimum distance between nearest neighbors is 5 m.
The (admittedly non-reproducible) code I use is:

hudnet <- chooseCN(hudson_noreps.ind$other$latlong, ask = FALSE, type = 1)

Although all my locations are in fact unique, checking my data with tableXY and sunflowerplot as suggested in the old thread shows me that many unique locations are being pooled due to rounding. Even with jittering with default settings (which I don't really want to do), duplicates remain. I do not want to delete the close individuals, because they contribute to the local estimates of genetic diversity.

The thread indicates it is possible to alter the function chooseCN to force it to pass the proper argument to tableXY. The minimum distances appear to be adjusted in lines 91 & 92 of the code for the chooseCN function on GitHub:

91 d2min <- max(apply(tempmat, 1, function(r) min(r[r>1e-12])))
92 d2min <- d2min * 1.0001 # to avoid exact number problem

This rounding is collapsing samples that are 7-10 m apart given my latitude, which makes many of my samples ~5 m apart appear to be identical. Is there some way for me to modify this code in chooseCN to prevent the rounding of my spatial coordinates? I think it is not something I can modify on my own. Are there other workarounds or solutions?

I understand the problem with duplicate locations, but is there a minimum distance that is acceptable for the algorithms that are based on connection networks?

Thanks in advance,

Maile Neel
Associate Professor; Director of the Norton-Brown Herbarium
University of Maryland
Department of Plant Science and Landscape Architecture &
Department of Entomology

From thibautjombart at gmail.com Mon Oct 24 12:08:54 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Mon, 24 Oct 2016 11:08:54 +0100
Subject: [adegenet-forum] Dealing with "Duplicate" Locations
In-Reply-To: References: Message-ID:

Dear Maile,

Can you file an issue on GitHub about this? The fix should be quick to make.
https://github.com/thibautjombart/adegenet/issues

Best,
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
sites.google.com/site/thibautjombart/
github.com/thibautjombart
Twitter: @TeebzR

On 22 October 2016 at 19:29, Maile C Neel wrote:
> [quoted message trimmed]

From bernadette.julier at inra.fr Tue Oct 25 15:10:33 2016
From: bernadette.julier at inra.fr (Bernadette Julier)
Date: Tue, 25 Oct 2016 13:10:33 +0000
Subject: [adegenet-forum] allelic dosage for a polyploid
Message-ID: <0566cfe879c44cd2a76485a45ab199a8@TLSDCPRIPEXMU03.inra.local>

Hello,

I have SNP data on an autotetraploid species. They have been coded as dominant markers, but with the dose. It means that for each SNP in each individual, I have:

0: absence
1: presence in 1 dose
2: presence in 2 doses
3: presence in 3 doses
4: presence in 4 doses

Is it correct if I write:

df2genind(file[,-c(1)], ploidy=4, sep="", type="PA", pop=X$pop)

Can the dapc procedure be used on this dataset?
I could transform the dataset into 0 and 1 (presence in any dose), but that would be a loss of information.

Thanks for any advice

Bernadette
INRA Lusignan, France

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bernadette.julier at inra.fr Tue Oct 25 15:18:19 2016
From: bernadette.julier at inra.fr (Bernadette Julier)
Date: Tue, 25 Oct 2016 13:18:19 +0000
Subject: [adegenet-forum] allele frequency
Message-ID: <165d494677b94761a1e517dc671dcd39@TLSDCPRIPEXMU03.inra.local>

Hello,

I have used a genotyping method (GBS) on pools of heterozygous individuals, so I get a frequency of alleles for each population and each marker, with several replications of the populations. In addition, I am working on an autotetraploid species.

I would like to use the dapc procedure of the adegenet package. How can I enter the data in a file with the genepop format?

Thanks
Bernadette
INRA Lusignan

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From thibautjombart at gmail.com Tue Oct 25 18:19:05 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Tue, 25 Oct 2016 17:19:05 +0100
Subject: [adegenet-forum] allelic dosage for a polyploid
In-Reply-To: <0566cfe879c44cd2a76485a45ab199a8@TLSDCPRIPEXMU03.inra.local>
References: <0566cfe879c44cd2a76485a45ab199a8@TLSDCPRIPEXMU03.inra.local>
Message-ID:

Hi there,

If I understand correctly, your data are binary, so you could go directly for a matrix of binary data. Let 'dat' be your matrix of codes (0-4):

x <- 1 * (dat > 0)  # parentheses matter: `*` binds tighter than `>` in R
dapc1 <- dapc(x, ...) # make sure you pass a group here

The genind class would only add extra complexity here.

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
sites.google.com/site/thibautjombart/
github.com/thibautjombart
Twitter: @TeebzR

On 25 October 2016 at 14:10, Bernadette Julier wrote:

> Hello,
>
> I have SNP data on an autotetraploid species.
> They have been coded as dominant markers but with the dose, meaning that for each SNP in each individual I have:
>
> 0: absence
> 1: presence in 1 dose
> 2: presence in 2 doses
> 3: presence in 3 doses
> 4: presence in 4 doses
>
> Is it correct if I write:
>
> df2genind(file[,-c(1)], ploidy=4, sep="", type="PA", pop=X$pop)
>
> Can the dapc procedure be used on this dataset? I could transform the dataset into 0 and 1 (presence in any dose), but that would be a loss of information.
>
> Thanks for any advice
>
> Bernadette
> INRA Lusignan, France
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From thibautjombart at gmail.com Tue Oct 25 18:21:33 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Tue, 25 Oct 2016 17:21:33 +0100
Subject: [adegenet-forum] allele frequency
In-Reply-To: <165d494677b94761a1e517dc671dcd39@TLSDCPRIPEXMU03.inra.local>
References: <165d494677b94761a1e517dc671dcd39@TLSDCPRIPEXMU03.inra.local>
Message-ID:

If you already have allele frequencies, you can use the DAPC directly on this data matrix. Otherwise, if you have allele count data pooled by groups of individuals, you can use the genpop constructor to turn this matrix into a genpop object (see ?genpop).

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
sites.google.com/site/thibautjombart/
github.com/thibautjombart
Twitter: @TeebzR

On 25 October 2016 at 14:18, Bernadette Julier wrote:

> Hello,
>
> I have used a genotyping method (GBS) on pools of heterozygous individuals, so I get a frequency of alleles for each population and each marker, with several replications of the populations.
> In addition, I am working on an autotetraploid species.
>
> I would like to use the dapc procedure of the adegenet package. How can I enter the data in a file with the genepop format?
>
> Thanks
> Bernadette
> INRA Lusignan
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Laura.Taillebois at cdu.edu.au Fri Oct 28 04:18:24 2016
From: Laura.Taillebois at cdu.edu.au (Laura Taillebois)
Date: Fri, 28 Oct 2016 02:18:24 +0000
Subject: [adegenet-forum] df2genind returns wrong number of alleles - genotype code 0(homo), 1(homo), 2(hetero)
Message-ID:

Hi all adegenet gurus,

I am having trouble getting the df2genind function to find the correct number of alleles in my dataset.

My data are SNP data (2 alleles at each locus). The genotypes are encoded in a single column such that 0=reference homozygote, 1=SNP homozygote and 2=heterozygote. I import them as a data frame from a comma-separated .csv file.

When I apply the function df2genind,

genind <- df2genind(locus, sep=",", ncode=1, NA.char="NA", ploidy=2)

the genind object returned is as follows:

/// GENIND OBJECT /////////

// 1 individual; 2,078 loci; 5,752 alleles; size: 1.2 Mb

// Basic content
@tab: 1 x 5752 matrix of allele counts
@loc.n.all: number of alleles per locus (range: 2-3)
@loc.fac: locus factor for the 5752 columns of @tab
@all.names: list of allele names for each locus
@ploidy: ploidy of each individual (range: 2-2)
@type: codom
@call: .local(x = x, i = i, j = j, drop = drop)

// Optional content
- empty -

There should be only 4,158 alleles in the object and not 5,752. Is there a problem with using this type of 0,1,2 code for the genotypes? Should my input have 2 columns for each genotype?

Thanks for your support!
Cheers, Laura

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From thibautjombart at gmail.com Fri Oct 28 12:13:06 2016
From: thibautjombart at gmail.com (Thibaut Jombart)
Date: Fri, 28 Oct 2016 11:13:06 +0100
Subject: [adegenet-forum] df2genind returns wrong number of alleles - genotype code 0(homo), 1(homo), 2(hetero)
In-Reply-To:
References:
Message-ID:

Dear Laura,

This coding is indeed not compatible with the expected input for df2genind. The function takes in characters coding alleles, not genotypes. Imagine a locus is heterozygote A/T, with A as the reference. The input for df2genind would be "A/T" while yours is "2".

In fact, the coding you describe is the one used in the genlight class, which is a lot more compact. You might want to use it; for instance:

> set.seed(1)
> m <- matrix(sample(0:2, 30, replace=TRUE), nrow=5)
> m
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    2    0    1    2    1
[2,]    1    2    0    2    0    0
[3,]    1    1    2    2    1    1
[4,]    2    1    1    1    0    2
[5,]    0    0    2    2    0    1
> x <- new("genlight", m)
> x

/// GENLIGHT OBJECT /////////

// 5 genotypes, 6 binary SNPs, size: 9.2 Kb
0 (0 %) missing data

// Basic content
@gen: list of 5 SNPbin

// Optional content
@other: a list containing: elements without names

> plot(x)

Note that if ploidy is constant across individuals, you can also do without it - your data format is already compatible with most methods.

Best
Thibaut

--
Dr Thibaut Jombart
Lecturer, Department of Infectious Disease Epidemiology, Imperial College London
Head of RECON: repidemicsconsortium.org
sites.google.com/site/thibautjombart/
github.com/thibautjombart
Twitter: @TeebzR

On 28 October 2016 at 03:18, Laura Taillebois wrote:

> Hi all adegenet gurus,
>
> I am having trouble getting the *df2genind* function to find the correct number of alleles in my dataset.
>
> My data are SNP data (2 alleles at each locus). The genotypes are encoded in a single column such that 0=reference homozygote, 1=SNP homozygote and 2=heterozygote.
> I import them as a data frame from a comma-separated .csv file.
>
> When I apply the function df2genind,
>
> genind <- df2genind(locus, sep=",", ncode=1, NA.char="NA", ploidy=2)
>
> the genind object returned is as follows:
>
> /// GENIND OBJECT /////////
>
> // 1 individual; 2,078 loci; 5,752 alleles; size: 1.2 Mb
>
> // Basic content
> @tab: 1 x 5752 matrix of allele counts
> @loc.n.all: number of alleles per locus (range: 2-3)
> @loc.fac: locus factor for the 5752 columns of @tab
> @all.names: list of allele names for each locus
> @ploidy: ploidy of each individual (range: 2-2)
> @type: codom
> @call: .local(x = x, i = i, j = j, drop = drop)
>
> // Optional content
> - empty -
>
> There should be only 4,158 alleles in the object and not 5,752. Is there a problem with using this type of 0,1,2 code for the genotypes? Should my input have 2 columns for each genotype?
>
> Thanks for your support!
>
> Cheers, Laura
>
> _______________________________________________
> adegenet-forum mailing list
> adegenet-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
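[Editorial note: the dosage-to-binary recoding suggested in the allelic-dosage thread above hides an R operator-precedence pitfall worth spelling out: `1 * dat > 0` parses as `(1 * dat) > 0` and yields a logical matrix, not a numeric 0/1 matrix. The sketch below uses base R only; the matrix `dat` is a made-up toy stand-in for real dosage data (an assumption), and the dapc() call from adegenet is mentioned in a comment but not run.]

```r
# Toy dosage matrix: 0-4 copies of the alternate allele per SNP.
# Values are fabricated purely for illustration.
set.seed(42)
dat <- matrix(sample(0:4, 20, replace = TRUE), nrow = 4)

# Precedence pitfall: `*` binds tighter than `>`, so this is (1 * dat) > 0,
# which returns a logical (TRUE/FALSE) matrix.
x_logical <- 1 * dat > 0
stopifnot(is.logical(x_logical))

# Parenthesise to get the intended numeric 0/1 presence/absence matrix:
x <- 1 * (dat > 0)
stopifnot(is.numeric(x), all(x %in% c(0, 1)))

# x could then be passed to adegenet, e.g. dapc(x, grp) with grp a factor
# of group memberships (not run here).
```

Many downstream functions coerce logical input anyway, but recoding explicitly with `1 * (dat > 0)` keeps the matrix numeric and makes the presence/absence intent visible in the script.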