From t.jombart at imperial.ac.uk Mon Jun 2 18:20:53 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 2 Jun 2014 16:20:53 +0000
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>
Hi Manuela,
thanks for re-posting on the forum. In this case, it seems that locations are very aggregated - a lot of genotypes were sampled roughly at the same place. Monmonier is unlikely to do well under such circumstances. The algorithm is very sensitive to local differences, and these are unstable for this kind of spatial distribution. I would recommend other approaches. For instance, if you want to define spatial clusters, you could use a basic clustering algorithm based on the principal components of a PCA (if spatial structure is obvious) or sPCA (if not, but there is still a spatial structure). Assuming 'foo' is your analysis (PCA or sPCA), one example would be using something along the lines of:
h1 <- hclust(dist(foo$li)^2)
plot(h1)
cutree(h1, k = 4)  # cutree() needs k (number of groups) or h (cut height); 4 is only an example
Etc.
Check ?hclust for different clustering methods.
Cheers
Thibaut
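A minimal sketch of the clustering approach suggested above, for readers who want something runnable. It uses simulated genotypes and base R's prcomp() as stand-ins (both are assumptions, not part of the original exchange); with an ade4/adegenet analysis, foo$li would take the place of pca$x. The number of retained axes and k = 4 are arbitrary choices for illustration.

```r
# Simulated stand-in for real data: 170 individuals x 19 biallelic SNPs,
# coded as allele counts (0/1/2). Group structure is invented for the example.
set.seed(1)
n_ind <- 170
n_snp <- 19
grp_true <- rep(1:4, length.out = n_ind)
freq <- matrix(runif(4 * n_snp, 0.1, 0.9), nrow = 4)   # allele frequency per group and SNP
geno <- matrix(rbinom(n_ind * n_snp, size = 2, prob = freq[grp_true, ]),
               nrow = n_ind)

# Unscaled PCA; pca$x holds the individual scores (the analogue of foo$li)
pca <- prcomp(geno, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:3]

# Hierarchical clustering on squared Euclidean distances between scores
h1 <- hclust(dist(scores)^2)
plot(h1, labels = FALSE)

# Cut the tree into k groups; k has to be chosen by the analyst
clusters <- cutree(h1, k = 4)
table(clusters)
```

cutree() returns one group label per individual, so the result can be plotted directly against the sampling coordinates.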
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Manuela [manuelacorreia2 at gmail.com]
Sent: 31 May 2014 21:46
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Monmonier algorithm and individual scores
Dear colleagues of Adegenet forum,
First of all I must congratulate Doctor Thibaut for the wonderful work he has developed so far. Following his own suggestion, I'm sharing with you a specific issue raised by the output of the Monmonier algorithm used for boundary detection.
I have a sample of 170 individuals, collected at 9 different places and genotyped for 19 SNPs by real-time PCR.
Before showing the line I ran in my R script, I should explain each of its arguments:
mon1 <- monmonier(xy, D, gab)
xy - spatial coordinates (UTM, km);
D - pairwise allele sharing distance ('prabclus' package);
gab <- chooseCN(xy, ask=FALSE, type=1) (Delaunay triangulation)
plot(mon1, 1:170, method="greylevel", add.arr=FALSE, bwd=6, col="red")
From the output produced, it can be clearly seen that there are 4 clusters of individuals having four scores (50, 100, 150, 200). But I can't find a way to access the individual scores. As a matter of fact, I consulted in detail all the arguments of the plot function, but none of them seemed to offer a way to extract the individual scores (IS).
I'm wondering if you could give me a hint about it. Any help will be appreciated.
Kind regards,
Manuela (Biochemist)
From manuelacorreia2 at gmail.com Tue Jun 3 11:01:18 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Tue, 3 Jun 2014 10:01:18 +0100
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>
References:
<2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>
Message-ID:
Doctor Thibaut and dear colleagues,
Thank you for your valuable criticism of this output. The idea behind the
IS was solely to obtain a first draft of the georeferenced clusters. I am
well aware that, in species with very low or no mobility, several
different genotypes at the same coordinates could be a strong indication
that the genetic variability is due only to the environment, whereas high
genetic diversity nearby may result from short-range dispersal that is
highly spatially correlated. This needs further confirmation by sPCA
and/or clustering techniques.
The identification of spatial clusters by PCA, and particularly by sPCA,
is no doubt more reliable than with the Monmonier algorithm in this case.
But I'd rather study more deeply each of the 3 methods (distance-based
methods, parsimony and maximum likelihood) proposed in your "Trees"
tutorial: first, to check whether they might be appropriate for this
dataset; secondly, to see whether they give different information,
perhaps with higher resolution, compared to a classic NJ tree after
validation by bootstrap. If none is appropriate, I can always rely on the
clustering techniques better suited to qualitative data available in the
"cluster" package, and validate them with "clValid" following several
criteria.
From a very simplistic point of view, PCA (not scaled) might provide us
with information about genetic variability, whereas sPCA informs about
the significance of local and global structures. On the whole, the
information provided by these two analyses (Moran's index, variance and
allele loadings) enables us to discriminate the loci that are informative
about genetic variability but not spatially structured from those whose
variability is spatially structured. This is to be confirmed through
biplots.
Another challenge lies ahead: figuring out how to select the PCs that
have biological meaning, which are most probably not associated with the
highest eigenvalues, particularly in the absence of trait or phenotype
information.
Please feel free to make more comments or give other suggestions.
Cheers,
Manuela
From t.jombart at imperial.ac.uk Tue Jun 3 11:26:20 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 3 Jun 2014 09:26:20 +0000
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEE2E3@icexch-m1.ic.ac.uk>
Hi there,
I would not recommend using all three phylogenetic reconstruction methods, even if with 19 SNPs there shouldn't be major differences. I covered maximum parsimony for historical reasons, but I can't see it being useful here.
Other clustering approaches sound like a good idea. If you ever fancy documenting how to use them on genetic data in a small tutorial, I think that would be very handy to others ;)
As for your last question, it makes a lot of sense, but you will need external information for this. Eigenvalue selection procedures based on inertia will basically fail to detect the structures you talk about. So you will need to be able to test e.g. the correlation of your PCs to a set of traits, or their spatial distribution, etc.
Cheers
Thibaut
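One way to act on the last point is to test each PC for spatial autocorrelation. The sketch below computes Moran's I with a permutation test in base R. The coordinates, the two PCs and the 0.15 neighbourhood radius are all made up for illustration, and moran_i and perm_p are hypothetical helper names; in practice the weights would more likely come from a connection network such as the one built with chooseCN.

```r
set.seed(42)
n <- 100
xy <- cbind(x = runif(n), y = runif(n))          # invented sampling coordinates

# Two invented PCs: PC1 is pure noise, PC2 follows an east-west gradient
pcs <- cbind(PC1 = rnorm(n),
             PC2 = xy[, "x"] + rnorm(n, sd = 0.2))

# Row-standardised binary weights: neighbours are points closer than 0.15
d <- as.matrix(dist(xy))
w <- (d > 0 & d < 0.15) * 1
w <- w / pmax(rowSums(w), 1)                     # rows without neighbours stay at zero

# Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2, with z centred
moran_i <- function(z, w) {
  z <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# One-sided permutation p-value (positive autocorrelation)
perm_p <- function(z, w, nperm = 999) {
  obs <- moran_i(z, w)
  sim <- replicate(nperm, moran_i(sample(z), w))
  (sum(sim >= obs) + 1) / (nperm + 1)
}

res <- t(sapply(colnames(pcs), function(k) {
  c(I = moran_i(pcs[, k], w), p = perm_p(pcs[, k], w))
}))
res
```

A PC with a small eigenvalue but a clear spatial pattern would stand out in such a table even though an inertia-based selection would discard it.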
From manuelacorreia2 at gmail.com Tue Jun 3 15:27:32 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Tue, 3 Jun 2014 14:27:32 +0100
Subject: [adegenet-forum] Monmonier algorithm and individual scores
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA657087BEDFC7@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA657087BEE2E3@icexch-m1.ic.ac.uk>
Message-ID:
Doctor Thibaut and dear colleagues,
Deal :). I'll do my best.
About the PCs with biological meaning in the absence of trait/phenotypic
information: later on, I'll explain why I think this "crazy" idea might
be feasible in this case.
Thank you once more for the helpful suggestions.
Cheers,
Manuela
From apcar at deakin.edu.au Fri Jun 6 04:44:07 2014
From: apcar at deakin.edu.au (ADAM PETER CARDILINI)
Date: Fri, 6 Jun 2014 02:44:07 +0000
Subject: [adegenet-forum] read.PLINK error
Message-ID:
G'day Everyone,
I have recently produced a .vcf file for a set of SNPs obtained using genotyping-by-sequencing. The .vcf file is the final output from the TASSEL pipeline, which takes in fastq sequence files. I converted my .vcf file to .ped and .map files using vcftools and then converted the .ped file to .raw so that I could load it into R using the 'adegenet' function 'read.PLINK'. The Linux vcftools and plink commands were as follows:
vcftools --vcf myfile.vcf --out myfile.plink --plink
plink --file myfile.plink --out myfile.plink --recodeA
I successfully loaded my unaltered file into R using 'adegenet'; however, it has far too many SNPs that I am not interested in (because they have only been sequenced for a couple of individuals), so I thought I would filter my .vcf SNP file using vcftools. I filtered my original file so that only SNPs that were sequenced in >90% of samples remained. This significantly reduced the number of SNPs I had and produced a new .vcf file. I then converted this file to .ped and .map, and then .ped to .raw, so I could bring it into R and have a quick look.
When I tried to import the new, filtered .raw file using 'read.PLINK' I got the following error:
Reading PLINK raw format into a genlight object...
Reading loci information...
Reading and converting genotypes...
.Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'nLoc' for signature '"try-error"'
In addition: Warning message:
In mclapply(txt, function(e) new("SNPbin", snp = e, ploidy = 2), :
9 function calls resulted in an error
It seems as if something has gone wrong when I have produced the new .vcf file during filtering. I was wondering if anyone might know what I have done wrong, what these error messages mean and whether there is a fix I can try?
Thanks in advance for your time and help, I appreciate it.
Kind regards,
Adam Cardilini
PhD Candidate
Schools of Life and Environmental Sciences,
Deakin University, 75 Pigdons Rd,
Waurn Ponds, Vic, Australia, 3217
Mob: 0431 566 340
Email: apcar at deakin.edu.au
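The cause of the error cannot be read off the message alone, but one cheap check is whether the filtered .raw file is still well formed. The sketch below is a hypothetical helper (check_raw is not part of adegenet) run on a made-up three-sample file. It assumes the usual PLINK additive .raw layout: six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) followed by one column per SNP holding 0, 1, 2 or NA.

```r
# Write a tiny, made-up .raw file so that the sketch runs on its own
raw_file <- tempfile(fileext = ".raw")
writeLines(c(
  "FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G snp3_T",
  "f1 i1 0 0 0 -9 0 1 2",
  "f2 i2 0 0 0 -9 NA 2 1",
  "f3 i3 0 0 0 -9 1 0"          # deliberately truncated row
), raw_file)

check_raw <- function(path) {
  lines <- readLines(path)
  fields <- strsplit(trimws(lines), "[[:space:]]+")
  n_fields <- lengths(fields)

  # Genotype fields of each sample row (everything after the 6 leading columns)
  geno <- lapply(fields[-1], function(f) f[-(1:6)])
  bad_values <- vapply(geno,
                       function(g) any(!g %in% c("0", "1", "2", "NA")),
                       logical(1))

  list(
    n_loci_header     = n_fields[1] - 6,                       # SNPs named in the header
    rows_wrong_length = which(n_fields[-1] != n_fields[1]),    # sample rows with a field count mismatch
    rows_bad_values   = which(bad_values)                      # sample rows with unexpected codes
  )
}

chk <- check_raw(raw_file)
str(chk)
```

Running the same helper on the unfiltered and the filtered file, and comparing the two results, would show whether the filtering step changed the file structure.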
From guillaumelouvel at hotmail.fr Sun Jun 8 14:57:39 2014
From: guillaumelouvel at hotmail.fr (Guillaume Louvel)
Date: Sun, 8 Jun 2014 14:57:39 +0200
Subject: [adegenet-forum] relevant way to compare posterior probabilities between DAPC with the same prior groups and the same individuals
Message-ID:
Hi everyone,
I have performed DAPC on a set of 934 individuals, using 10 predefined
groups.
I did this with different sets of SNPs (coming from epigenetics assays
in different tissues);
now I would like to compare the posterior assignments, to know if the
tissue has an effect, and I don't know what would be the best way.
I have thought about the following:
1- compare the assign.per.pop element of summary(dapc), which is the
percentage of individuals a posteriori assigned to their original prior
group, for each group; for me, a vector of 10 values.
To make it clearer, what I want to compare is something like this:
            prior 1   prior 2   ...   prior j   ...   prior 10
tissue 1    p1,1      p1,2      ...             ...   p1,10
...         ...
tissue i                              pi,j
where pi,j is the proportion of individuals from prior j correctly
assigned to j, using tissue i.
I cannot really use anova, because I have only one value per group per
tissue.
I think it is useless to repeat the dapc in order to get several values
for each category to be able to do an anova, because if the results
come from multiple simulations, they would be really close I suppose.
So I don't know what the error on this proportion of correct
reassignment would be. Maybe if I knew the error associated with
these proportions I could conclude.
I started doing chi-squared tests on the posterior group sizes, but this
is not really relevant because the posterior groups are a mix of the
correct and the wrong assignments.
2- compare at the level of the individual the probabilities of assignment.
That is, create a table with those fields :
individual - priorgrp - post proba of assignment to prior grp - tissue
And then do something like a glm( post proba ~ priorgrp + tissue ).
I cannot do an anova because for one cluster and for one tissue the
proba doesn't have a normal distribution, so I assume it is better with
the generalized linear model.
Or, use a manova: same as the glm, except that instead of taking only
the posterior proba of assignment to the prior grp, I take the vector of
proba of assignment to every group. For now I haven't clearly found the
conditions for applying a manova, so I am not sure if I can apply it with
the distribution I have.
How would you compare posterior probabilities of DAPC?
Hope this is not too unclear.
Thank you in advance,
Guillaume
PS: I have not been able to find the information, but how are the
posterior probabilities of assignment established? By simulation or
analytically? If by simulation, how many iterations are performed?
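Option 2 above could be set up roughly as follows. This is only a sketch on made-up posterior probabilities (drawn from beta distributions), with hypothetical column names. A quasibinomial family is used because the response is a proportion in (0, 1) rather than a count. Note that the same individuals appear in every tissue, so the rows are not independent, and a mixed model with an individual-level random effect may be more appropriate.

```r
set.seed(3)
n_ind <- 60

# Long table: one row per individual and tissue (individual varies fastest)
long <- expand.grid(individual = factor(seq_len(n_ind)),
                    tissue     = factor(c("t1", "t2", "t3")))
long$priorgrp <- factor(rep(rep(c("g1", "g2", "g3"), each = n_ind / 3),
                            times = 3))

# Invented posterior probabilities of assignment to the prior group:
# higher on average in tissue t1 than in t2 and t3
long$post <- rbeta(nrow(long),
                   shape1 = ifelse(long$tissue == "t1", 8, 3),
                   shape2 = 2)

# Proportion response modelled with a logit link and free dispersion
fit <- glm(post ~ priorgrp + tissue, family = quasibinomial, data = long)

# Sequential F tests for the group and tissue terms
tab <- anova(fit, test = "F")
tab
```

The tissue row of the resulting table is the one that addresses the question of a tissue effect, subject to the independence caveat above.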
From t.jombart at imperial.ac.uk Sun Jun 8 19:41:26 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 8 Jun 2014 17:41:26 +0000
Subject: [adegenet-forum] read.PLINK error
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEF8DB@icexch-m1.ic.ac.uk>
Hello,
what command line did you use to read the data?
Cheers
Thibaut
From t.jombart at imperial.ac.uk Sun Jun 8 19:44:09 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 8 Jun 2014 17:44:09 +0000
Subject: [adegenet-forum] relevant way to compare posterior probabilities between DAPC with the same prior groups and the same individuals
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEF8EB@icexch-m1.ic.ac.uk>
Hello,
I don't have time for a long answer now and had to go through the question quickly, but it will probably be useful to have a look at the DAPC tutorial, and the following functions for dapc objects:
summary, predict, a.score, xvalDapc
Cheers
Thibaut
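To illustrate the kind of comparison being asked about, here is a sketch that uses MASS::lda directly as a stand-in for dapc. That substitution is an assumption made so that the example runs without adegenet; as far as I know, dapc performs a discriminant analysis on retained principal components, so its posterior probabilities come out of the fitted model rather than from simulation. The data, the two "tissues" and the group sizes are made up. The sketch builds the table of correct reassignment, attaches a binomial confidence interval to one cell, and compares the two tissues on the same individuals with McNemar's test.

```r
library(MASS)  # provides lda(); shipped with R as a recommended package

set.seed(7)
n_grp <- 3
n_per_grp <- 50
grp <- factor(rep(seq_len(n_grp), each = n_per_grp))

# Made-up data: the same individuals measured in two hypothetical tissues,
# with group centres further apart when `sep` is larger
make_data <- function(sep, n_var = 4) {
  centres <- matrix(rnorm(n_grp * n_var, sd = sep), nrow = n_grp)
  centres[as.integer(grp), ] + matrix(rnorm(length(grp) * n_var), ncol = n_var)
}
tissues <- list(tissueA = make_data(1.5), tissueB = make_data(0.7))

# For each tissue: is each individual reassigned to its prior group?
correct <- sapply(tissues, function(x) {
  fit <- lda(x, grouping = grp)
  post <- predict(fit, x)$posterior            # posterior membership probabilities
  levels(grp)[max.col(post, ties.method = "first")] == as.character(grp)
})

# Table of proportions: one row per tissue, one column per prior group
prop_tab <- t(apply(correct, 2, function(ok) tapply(ok, grp, mean)))

# Exact binomial confidence interval for one cell (group 1, tissueA)
ci <- binom.test(sum(correct[grp == "1", "tissueA"]), n_per_grp)$conf.int

# Paired comparison of the two tissues; fixed levels keep the table 2 x 2
as_ok <- function(v) factor(v, levels = c(FALSE, TRUE))
mc <- mcnemar.test(table(as_ok(correct[, "tissueA"]),
                         as_ok(correct[, "tissueB"])))

prop_tab
ci
mc$p.value
```

The binomial interval gives an error associated with each proportion, and the paired test respects the fact that every tissue was assayed on the same individuals.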
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Guillaume Louvel [guillaumelouvel at hotmail.fr]
Sent: 08 June 2014 13:57
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] relevant way to compare posterior probabilities between DAPC with the same prior groups and the same individuals
Hi everyone,
I have performed DAPC on a set of 934 individuals, using 10 predefined
groups.
I did this with different sets of SNPs (coming from epigenetics assays
in different tissues);
now I would like to compare the posterior assignments, to know if the
tissue has an effect, and I don't know what would be the best way.
I have thought about the following:
1- compare the slot assign.per.pop of the summary(dapc), which is the
percentage of individuals a posteriori assigned to their original prior
group, for each group. for me a vector of 10 values.
To make it clearer, what I want to compare is sthg like that:
prior1 prior2 ... prior j ... prior10
tissue 1 p1,1 p1,2 ... p1,10
... ...
tissue i pi,j
where pi,j is the proportion of individuals from prior j correctly
assigned to j, using tissue i.
I cannot really use anova, because I have only one value per group per
tissue.
I think it is useless to repeat the dapc in order to get several value
for each categorie to be able to do an anova, because if the results
come from multiple simulations, they would be really close I suppose.
So I don't know what would be the error values of this proportion of
correct reassignment. Maybe if I knew what is the error associated with
these proportions I could conclude.
I started doing chi-squared tests on the posterior group sizes, but this
is not really relevant because the posterior groups are a mix of the
correct and the wrong assignments.
2- compare at the level of the individual the probabilities of assignment.
That is, create a table with those fields :
individual - priorgrp - post proba of assignment to prior grp - tissue
And then do something like a glm( post proba ~ priorgrp + tissue ).
I cannot do an anova because for one cluster and for one tissue the
proba doesn't have a normal distribution, so I assume it is better with
the generalized linear model.
Or, use a manova: same as the glm, except that instead of taking only
the posterior proba of assignment to the prior grp, I take the vector of
proba of assignment to every group. So far I haven't clearly found the
conditions for applying a manova, so I am not sure if I can apply it with
the distribution I have.
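A minimal sketch of the two tables described above, written in base R so that it runs without adegenet. It assumes each tissue's DAPC yields a posterior matrix shaped like the `posterior` element of a dapc object (rows = individuals, columns = groups). The helper name `posterior_long` and the toy matrices are hypothetical, for illustration only.

```r
# Sketch only: 'posterior_long' is a hypothetical helper, not an adegenet function.
# post_list: named list of posterior matrices, one per tissue
#            (rows = individuals, columns = groups, column names = group labels)
# prior:     prior group of each individual, in the same row order
posterior_long <- function(post_list, prior) {
  prior <- as.factor(prior)
  rows <- lapply(names(post_list), function(tissue) {
    post <- post_list[[tissue]]
    stopifnot(nrow(post) == length(prior))
    # column holding each individual's own prior group
    col <- match(as.character(prior), colnames(post))
    data.frame(individual = seq_len(nrow(post)),
               priorgrp   = prior,
               tissue     = tissue,
               # posterior probability of assignment to the prior group
               post       = post[cbind(seq_len(nrow(post)), col)],
               # group with the highest posterior probability
               assigned   = colnames(post)[max.col(post, ties.method = "first")],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy data: 4 individuals, 2 groups, 2 tissues (made-up numbers)
grp <- factor(c("a", "a", "b", "b"))
p1 <- matrix(c(0.9, 0.1,  0.6, 0.4,  0.2, 0.8,  0.3, 0.7),
             ncol = 2, byrow = TRUE, dimnames = list(NULL, c("a", "b")))
p2 <- matrix(c(0.8, 0.2,  0.4, 0.6,  0.1, 0.9,  0.5, 0.5),
             ncol = 2, byrow = TRUE, dimnames = list(NULL, c("a", "b")))

tab <- posterior_long(list(tissue1 = p1, tissue2 = p2), grp)

# Option 1: proportion of correct reassignment, tissue x prior group
prop <- with(tab, tapply(assigned == as.character(priorgrp),
                         list(tissue, priorgrp), mean))
print(prop)
```

With real data, the list could be built as `list(tissue1 = dapc1$posterior, tissue2 = dapc2$posterior)`, where `dapc1` and `dapc2` are assumed object names. The long table `tab` has the fields listed in point 2 and could then be passed to a model such as the glm mentioned above; which family is appropriate for probabilities in [0, 1] remains an open question here.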
How would you compare posterior probabilities of DAPC?
I hope this is not too unclear.
Thank you in advance,
Guillaume
PS: I have not been able to find this information: how are the
posterior probabilities of assignment established? By simulation or
analytically? If by simulation, how many iterations are performed?
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
From apcar at deakin.edu.au Mon Jun 9 00:24:53 2014
From: apcar at deakin.edu.au (ADAM PETER CARDILINI)
Date: Sun, 8 Jun 2014 22:24:53 +0000
Subject: [adegenet-forum] read.PLINK error
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657087BEF8DB@icexch-m1.ic.ac.uk>
References: ,
<2CB2DA8E426F3541AB1907F98ABA657087BEF8DB@icexch-m1.ic.ac.uk>
Message-ID:
G'day Thibaut,
Sorry I should have included that in the original email.
The code I use to read the data was:
dat <- read.PLINK('myfiltered_plinkconvertedfile.raw', map.file = 'myfiltered_plinkconvertedfile.map')
This command line worked on the unfiltered data files, just not the ones I got after filtering in vcftools.
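One possible check before re-running the import, sketched in base R under the assumption (not confirmed in this thread) that the filtered .raw file contains lines whose number of fields differs from the header. The helper name `check_raw_fields` and the toy file are hypothetical.

```r
# Sketch only: report which non-blank lines of a whitespace-delimited file
# have a different number of fields than the first (header) line.
check_raw_fields <- function(path) {
  n <- utils::count.fields(path, sep = "", quote = "", comment.char = "")
  which(n != n[1])   # indices among non-blank lines; empty if all lines agree
}

# Toy .raw-like file: the third line is one genotype short
tmp <- tempfile(fileext = ".raw")
writeLines(c("FID IID PAT MAT SEX PHENOTYPE snp1_A snp2_G",
             "f1 i1 0 0 0 -9 0 1",
             "f2 i2 0 0 0 -9 2"), tmp)
bad <- check_raw_fields(tmp)
print(bad)
```

If no line is flagged, the cause lies elsewhere. Re-running read.PLINK with parallel = FALSE may then surface the underlying error instead of the "try-error" summary produced by mclapply.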
Cheers,
Adam
> On 9 Jun 2014, at 3:42 am, "Jombart, Thibaut" wrote:
>
>
> Hello,
>
> what command line did you use to read the data?
>
> Cheers
> Thibaut
> ________________________________________
> From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of ADAM PETER CARDILINI [apcar at deakin.edu.au]
> Sent: 06 June 2014 03:44
> To: adegenet-forum at lists.r-forge.r-project.org
> Subject: [adegenet-forum] read.PLINK error
>
> G'day Everyone,
>
> I have recently produced a .vcf file for a set of SNPs obtained using Genotype-by-sequencing. The .vcf file is the final output from the TASSEL pipeline, which takes in fastq sequence files. I converted my .vcf file to .ped and .map files using vcftools and then converted the .ped file to .raw so that I could load it into R using the 'adegenet' function 'read.PLINK'. The Linux vcftools and plink code was as follows:
>
> vcftools --vcf myfile.vcf --out myfile.plink --plink
> plink --file myfile.plink --out myfile.plink --recodeA
>
> I successfully loaded my unaltered file into R using 'adegenet'; however, it has far too many SNPs that I am not interested in (because they have only been sequenced for a couple of individuals), so I thought I would filter my .vcf SNP file using vcftools. I filtered my original file so that only SNPs that were sequenced in >90% of samples remained. This significantly reduced the number of SNPs I had and produced a new .vcf file. I then converted this file to .ped and .map, and then .ped to .raw so I could bring it into R and have a quick look.
>
> When I tried to import the new, filtered .raw file using 'read.PLINK' I got the following error:
>
>
> Reading PLINK raw format into a genlight object...
>
> Reading loci information...
>
> Reading and converting genotypes...
> .Error in (function (classes, fdef, mtable) :
> unable to find an inherited method for function 'nLoc' for signature '"try-error"'
> In addition: Warning message:
> In mclapply(txt, function(e) new("SNPbin", snp = e, ploidy = 2), :
> 9 function calls resulted in an error
>
>
>
> It seems as if something has gone wrong when I have produced the new .vcf file during filtering. I was wondering if anyone might know what I have done wrong, what these error messages mean and whether there is a fix I can try?
>
> Thanks in advance for your time and help, I appreciate it.
>
> Kind regards,
>
> Adam Cardilini
> PhD Candidate
> Schools of Life and Environmental Sciences,
> Deakin University, 75 Pigdons Rd,
> Waurn Ponds, Vic, Australia, 3217
> Mob: 0431 566 340
> Email: apcar at deakin.edu.au
>
From emmanuel.wicker at cirad.fr Mon Jun 9 17:23:47 2014
From: emmanuel.wicker at cirad.fr (Emmanuel WICKER)
Date: Mon, 9 Jun 2014 19:23:47 +0400 (RET)
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement to
Genlight
In-Reply-To: <1464298593.12352.1402326964959.JavaMail.root@cirad.fr>
Message-ID: <1340384164.12422.1402327427518.JavaMail.root@cirad.fr>
Hi all
I tried to convert a fasta alignment to a genlight object, and I got the following message:
> toto=fasta2genlight("EGL_ARB_originaux_160913_TRIM.fas")#my command
Converting FASTA alignment into a genlight object...
Loading required package: parallel
Looking for polymorphic positions...
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Error in mclapply(txt, function(e) strsplit(paste(e[-1], collapse = ""), :
'mc.cores' > 1 is not supported on Windows
Any help?
I run R under Windows 7, adegenet version 1.4.2
Thank you
Manu
From t.jombart at imperial.ac.uk Mon Jun 9 17:35:55 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Mon, 9 Jun 2014 15:35:55 +0000
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement
to Genlight
In-Reply-To: <1340384164.12422.1402327427518.JavaMail.root@cirad.fr>
References: <1464298593.12352.1402326964959.JavaMail.root@cirad.fr>,
<1340384164.12422.1402327427518.JavaMail.root@cirad.fr>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BEFDB0@icexch-m1.ic.ac.uk>
Hi
can you try
parallel = FALSE
as argument?
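A short sketch of how the flag could be set from the platform, so the same script runs on Windows and elsewhere. The variable name `use_parallel` is hypothetical; the adegenet call is left commented out because it needs the package and the user's alignment file.

```r
# The error above comes from parallel::mclapply, which relies on forking
# and therefore cannot use more than one core on Windows.
use_parallel <- .Platform$OS.type != "windows"
print(use_parallel)

# With adegenet loaded, the flag would be passed as the 'parallel' argument:
# toto <- fasta2genlight("EGL_ARB_originaux_160913_TRIM.fas",
#                        parallel = use_parallel)
```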
Cheers
Thibaut
From emmanuel.wicker at cirad.fr Mon Jun 9 18:02:40 2014
From: emmanuel.wicker at cirad.fr (Emmanuel WICKER)
Date: Mon, 9 Jun 2014 20:02:40 +0400 (RET)
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement
to Genlight
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA657087BEFDB0@icexch-m1.ic.ac.uk>
Message-ID: <636378527.12663.1402329760033.JavaMail.root@cirad.fr>
Hi Thibaut
I already tested that, but still it doesn't work.
For that command, and also for read.snp of a DNAbin object (same error message)
Cheers
Manu
From caitiecollins at gmail.com Mon Jun 9 19:32:19 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Mon, 9 Jun 2014 18:32:19 +0100
Subject: [adegenet-forum] Help: pbm conversion of a fasta alignement to
Genlight
In-Reply-To: <636378527.12663.1402329760033.JavaMail.root@cirad.fr>
References: <2CB2DA8E426F3541AB1907F98ABA657087BEFDB0@icexch-m1.ic.ac.uk>
<636378527.12663.1402329760033.JavaMail.root@cirad.fr>
Message-ID:
Hi Emmanuel,
I'm running adegenet on a Windows computer, and I've previously had the
same error message that you're currently experiencing (see below, first
example). For all the instances you have mentioned, however, I usually find
that adding the argument parallel=FALSE does the trick. Would you be able
to copy and paste the following example (the line below starting with
myPath, and then the line from the second example starting with obj) and
then report back with the outcome? Thanks very much.
myPath <- system.file("files/usflu.fasta",package="adegenet")
# without the parallel argument --> same error message you are getting:
> obj <- fasta2genlight(myPath, chunk=10) # process 10 sequences at a time
Converting FASTA alignment into a genlight object...
Loading required package: parallel
Looking for polymorphic positions...
..........
Error in mclapply(txt, function(e) strsplit(paste(e[-1], collapse = ""), :
'mc.cores' > 1 is not supported on Windows
# WITH the parallel=FALSE argument:
obj <- fasta2genlight(myPath, chunk=10, parallel=FALSE) # process 10 sequences at a time
Converting FASTA alignment into a genlight object...
Looking for polymorphic positions...
........................................................................................................................................................................................................................................................................................................................................................................
Extracting SNPs from the alignment...
........................................................................................................................................................................................................................................................................................................................................................................
Building final object...
...done.
Cheers,
Caitlin.
From patriciasalerno at gmail.com Fri Jun 13 23:27:10 2014
From: patriciasalerno at gmail.com (Patricia Salerno)
Date: Fri, 13 Jun 2014 16:27:10 -0500
Subject: [adegenet-forum] DAPC: loadings of original variables as table?
Message-ID:
Hi everyone,
I'm using PCA/DAPC as well as STRUCTURE with a SNP matrix. I'm getting
different results with the two approaches, and the DAPC results are much
more logical, biologically speaking (some individuals of a very
well-supported cluster in DAPC are being assigned to the other cluster,
even though the separation in PC1 is enormous!). Thus, I want to see if the
discrepancies of population assignment in STRUCTURE are due to the fact
that the DAPC initially transforms the data into vectors that maximize
variation, thus effectively weighing my variables differently, while
STRUCTURE weighs all SNPs equally. The only strategy I've come up with to
investigate this issue further is to generate a table of the loadings of
the SNP variables (the original, not the transformed variables after PCA),
and prune my matrix to only keep the SNPs with sufficient contributions
(setting some post-hoc cutoff). However, I cannot figure out how to print a
table of the SNP loadings after the DAPC, or if it's even possible. What I
would want is a printed matrix of two columns, one with the SNP names, and
another with the contributions/loadings. Could anyone help me with this?
Or, does anyone have another suggestion for approaching this issue?
Thank you!!
~patricia.
--
Patricia Salerno
PhD Candidate
Ecology Evolution and Behavior
Section of Integrative Biology
University of Texas at Austin
From goatsrunfaster at gmail.com Sat Jun 14 15:11:55 2014
From: goatsrunfaster at gmail.com (Spencer Bruce)
Date: Sat, 14 Jun 2014 09:11:55 -0400
Subject: [adegenet-forum] Identifying clusters / Error in row names
Message-ID:
Hello All,
I am trying to run a DAPC on some microsatellite data, and have had no
problems going through the tutorial using the tutorial data, but I am
immediately running into problems after converting my STRUCTURE file to a
genind object. Given that as a first step I would like to identify clusters
using my entire data set, I do the following, and receive the following
error message:
> x <- obj1
> x
#####################
### Genind object ###
#####################
- genotypes of individuals -
S4 class: genind
@call: read.structure(file = file, missing = missing, quiet = quiet)
@tab: 990 x 118 matrix of genotypes
@ind.names: vector of 990 individual names
@loc.names: vector of 11 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the 118 columns of @tab
@all.names: list of 11 components yielding allele names for each locus
@ploidy: 2
@type: codom
Optional contents:
@pop: - empty -
@pop.names: - empty -
@other: - empty -
> grp <- find.clusters(x, max.n.clust=41)
Error in `row.names<-.data.frame`(`*tmp*`, value = c("001", "003", "005", :
duplicate 'row.names' are not allowed
In addition: Warning messages:
1: In data.row.names(row.names, rowsi, i) :
some row.names duplicated:
497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,7
[... truncated]
2: non-unique values when setting 'row.names':
This is what my original data set looks like in the STRUCTURE file (a first
row of loci names, and then 2 rows of fragment lengths for each individual
with no labels):
SfoB52 SfoC24 SfoC28 SfoC38 SfoC86 SfoC88 SfoC113 SfoC129 SfoD75 SfoD91 SfoD100
203    113    179    143    101    181    133     221     188    228    230
225    113    191    143    116    184    139     230     208    236    238
215    113    183    143    110    184    133     230     180    212    214
219    122    191    143    116    184    139     230     188    220    214
211    113    179    143    101    184    142     230     180    212    214
219    113    191    143    110    190    151     230     204    228    214
etc.
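The error message suggests that some individual labels in the genind object are repeated (the input file has no labels, so they were generated on import). A minimal base R sketch of how duplicated labels can be detected and made unique; the label vector is a toy example, and how the fixed labels are written back into the genind object depends on the adegenet version, so that step is left out.

```r
# Toy labels mimicking the start of the reported names, with repeats
labels <- c("001", "003", "005", "001", "003")

# Which labels occur more than once?
dup <- unique(labels[duplicated(labels)])
print(dup)

# Append a suffix to repeated labels so that every label is unique
unique_labels <- make.unique(labels, sep = "_")
print(unique_labels)
```

On the real object, the labels would come from the @ind.names slot shown in the printout above.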
Any help would be very greatly appreciated, as I'm new to using R, but am
excited about the possibilities!
Best,
Spencer
--
Spencer A Bruce
200 Washington St.
Troy, NY 12180
518 225 0787
From manuelacorreia2 at gmail.com Sat Jun 14 17:56:52 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Sat, 14 Jun 2014 16:56:52 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:
Patrícia,
I ran a small test with the example suggested in the sPCA tutorial (
http://adegenet.r-forge.r-project.org/) and it seems that you
can get the SNP loadings after modelling your dataset with DAPC. The values
you want are stored in the slot pca.loadings.
Just try these two command lines:
A <- dapc1$pca.loadings
write.table(A, file="A")
Afterwards, open the file in Excel. The file named "A" is saved in R's
current working directory (check with getwd(); on Windows this is often the
My Documents folder). If you have any trouble opening it, please let me
know directly by e-mail. Anyway, I'm sure Dr. Thibaut will soon confirm
this information.
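A sketch of the two-column table requested in the original question, in base R. It assumes a contribution matrix shaped like the var.contr element of a dapc object (rows = alleles/SNPs, columns = discriminant functions), which holds contributions to the discriminant functions rather than the loadings of the PCA step. The helper name `contrib_table`, the toy matrix and the cutoff value are hypothetical.

```r
# Sketch only: build a "SNP name / contribution" table for one discriminant
# function, keep rows above a cutoff, and sort by decreasing contribution.
contrib_table <- function(contr, axis = 1, threshold = 0) {
  out <- data.frame(snp = rownames(contr),
                    contribution = unname(contr[, axis]),
                    stringsAsFactors = FALSE)
  out <- out[out$contribution >= threshold, , drop = FALSE]
  out <- out[order(out$contribution, decreasing = TRUE), , drop = FALSE]
  rownames(out) <- NULL
  out
}

# Toy matrix standing in for dapc1$var.contr (made-up numbers)
toy <- matrix(c(0.5, 0.3, 0.2,
                0.1, 0.1, 0.8),
              ncol = 2,
              dimnames = list(c("snp1.A", "snp2.G", "snp3.T"), c("LD1", "LD2")))

tab <- contrib_table(toy, axis = 1, threshold = 0.25)
print(tab)

# Write a tab-separated file that opens directly in a spreadsheet
out_file <- file.path(tempdir(), "snp_contributions.txt")
write.table(tab, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```

With a real analysis, `toy` would be replaced by `dapc1$var.contr`, assuming the DAPC was run with contributions stored.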
Hoping to be helpful,
M.
From manuelacorreia2 at gmail.com Sat Jun 14 18:00:03 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Sat, 14 Jun 2014 17:00:03 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:
Sorry, I meant DAPC tutorial (March 24,2014).
Cheers,
M.
From manuelacorreia2 at gmail.com Sat Jun 14 18:09:28 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Sat, 14 Jun 2014 17:09:28 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:
Patrícia,
On the same subject, I would like to recommend an article I read
some time ago.
Reference:
Kalinowski, ST (2011) "The computer program STRUCTURE does not reliably
identify the main genetic clusters within species: simulations and
implications for human population structure", Heredity, 106 :625-632
Cheers,
M.
2014-06-14 17:00 GMT+01:00 Manuela :
> Sorry, I meant DAPC tutorial (March 24,2014).
>
> Cheers,
> M.
>
>
> 2014-06-14 16:56 GMT+01:00 Manuela :
>
> Patr?cia,
>>
>>
>>
>> I made a small test with example suggested on sPCA tutorial (
>> http://adegenet.r-forge.r-project.org/) and apparently it seems that you
>> can get the SNP loadings after modelling yout dataset by DAPC. The values
>> you want are stored in the slot pca.loadings.
>>
>>
>> Just try these two command lines:
>>
>> A<-dapc1$pca.loadings
>>
>> write.table(A,file=?A?)
>>
>>
>> And afterwards open it in Excel. By default a file named ?A? is saved on
>> MyDocuments folder. But if you have any trouble on open it please let me
>> now directly to my e-mail. Anyway, I?m sure Dr. Thimbault will soon confirm
>> this information.
>>
>>
>> Hoping to be helpful,
>>
>> M.
>>
>>
>> 2014-06-14 11:00 GMT+01:00 <
>> adegenet-forum-request at lists.r-forge.r-project.org>:
>>
>>> Message: 1
>>> Date: Fri, 13 Jun 2014 16:27:10 -0500
>>> From: Patricia Salerno
>>> To: adegenet-forum at lists.r-forge.r-project.org
>>> Subject: [adegenet-forum] DAPC: loadings of original variables as
>>> table?
>>>
>>> Hi everyone,
>>>
>>> I'm using PCA/DAPC as well as STRUCTURE with a SNP matrix. I'm getting
>>> different results with the two approaches, and the DAPC results are much
>>> more logical, biologically speaking (STRUCTURE assigns some individuals
>>> of a cluster that is very well supported in DAPC to the other cluster,
>>> even though the separation on PC1 is enormous!). Thus, I want to see
>>> whether the discrepancies in population assignment arise because DAPC
>>> first transforms the data into vectors that maximize variation, thus
>>> effectively weighting my variables differently, while STRUCTURE weights
>>> all SNPs equally. The only strategy I've come up with to investigate
>>> this further is to generate a table of the loadings of the SNP variables
>>> (the original variables, not the transformed ones after PCA), and prune
>>> my matrix to keep only the SNPs with sufficient contributions (setting
>>> some post-hoc cutoff). However, I cannot figure out how to print a table
>>> of the SNP loadings after the DAPC, or whether it's even possible. What
>>> I would want is a printed matrix of two columns, one with the SNP names
>>> and another with the contributions/loadings. Could anyone help me with
>>> this? Or does anyone have another suggestion for approaching this issue?
>>>
>>> Thank you!!
>>>
>>> ~patricia.
>>>
>>>
>>> --
>>> Patricia Salerno
>>> PhD Candidate
>>> Ecology Evolution and Behavior
>>> Section of Integrative Biology
>>> University of Texas at Austin
From patriciasalerno at gmail.com Sat Jun 14 20:10:04 2014
From: patriciasalerno at gmail.com (Patricia Salerno)
Date: Sat, 14 Jun 2014 13:10:04 -0500
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 7
In-Reply-To:
References:
Message-ID:
Thank you so much, Manuela, for the tip and for the reference!! very
helpful... worked just fine with my data.
Cheers!!
~patricia.
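For the two-column table asked about in this thread (SNP names in one column, contributions in the other), something along the following lines may help. This is only a sketch: the small matrix is made up and stands in for an element such as dapc1$pca.loadings (or dapc1$var.contr, which, as far as I know, dapc() returns by default and holds the contributions of the original alleles to the discriminant functions). The object names, the 0.2 cutoff and the output file name are assumptions for illustration.

```r
# Stand-in for dapc1$var.contr or dapc1$pca.loadings: rows are alleles,
# columns are axes. The values are made up for illustration.
contrib <- matrix(c(0.40, 0.05, 0.30, 0.25,
                    0.10, 0.60, 0.10, 0.20),
                  ncol = 2,
                  dimnames = list(c("snp1.A", "snp1.G", "snp2.C", "snp2.T"),
                                  c("LD1", "LD2")))

# Two-column table: allele name and its contribution to the first axis
tab <- data.frame(snp = rownames(contrib),
                  contribution = contrib[, "LD1"],
                  row.names = NULL)

# Sort by decreasing contribution and keep those above a post-hoc cutoff
tab <- tab[order(tab$contribution, decreasing = TRUE), ]
kept <- tab[tab$contribution > 0.2, ]   # 0.2 is an arbitrary example cutoff

# Write a CSV that opens directly in a spreadsheet; tempdir() is used so
# the example does not touch the working directory
out_file <- file.path(tempdir(), "snp_contributions.csv")
write.csv(tab, out_file, row.names = FALSE)
print(kept)
```

With real results, the first line would be replaced by the relevant element of the dapc object, and the column name "LD1" by whichever axis is of interest.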
--
Patricia Salerno
PhD Candidate
Ecology Evolution and Behavior
Section of Integrative Biology
University of Texas at Austin
From t.jombart at imperial.ac.uk Sat Jun 14 22:15:06 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sat, 14 Jun 2014 20:15:06 +0000
Subject: [adegenet-forum] Identifying clusters / Error in row names
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA657087BF13E2@icexch-m1.ic.ac.uk>
Hi there,
can you try replacing the individual labels? Duplicated labels would cause exactly this kind of error.
E.g.:
indNames(x) <- 1:nInd(x)
Cheers
Thibaut
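To see which labels are duplicated before overwriting them, a check along these lines can be run in base R. It is only a sketch: the vector 'labs' is a hypothetical stand-in for what indNames(x) would return, and the "_" separator is an arbitrary choice.

```r
# Hypothetical labels standing in for indNames(x)
labs <- c("001", "003", "005", "001", "003")

# Which labels occur more than once?
dup <- unique(labs[duplicated(labs)])
print(dup)

# Option 1: replace all labels by a running index (as suggested above)
new_labs_index <- as.character(seq_along(labs))

# Option 2: keep the original labels but make the repeats unique
new_labs_unique <- make.unique(labs, sep = "_")
print(new_labs_unique)
```

With a genind object, either result would then be assigned back with indNames(x) <- new_labs_unique (or new_labs_index). Option 2 keeps the link to the original sample codes, which can be handy later.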
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Spencer Bruce [goatsrunfaster at gmail.com]
Sent: 14 June 2014 14:11
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] Identifying clusters / Error in row names
Hello All,
I am trying to run a DAPC on some microsatellite data, and have had no problems going through the tutorial using the tutorial data, but I am immediately running into problems after converting my STRUCTURE file to a genind object. Given that as a first step I would like to identify clusters using my entire data set, I do the following, and receive the following error message:
> x <- obj1
> x
#####################
### Genind object ###
#####################
- genotypes of individuals -
S4 class: genind
@call: read.structure(file = file, missing = missing, quiet = quiet)
@tab: 990 x 118 matrix of genotypes
@ind.names: vector of 990 individual names
@loc.names: vector of 11 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the 118 columns of @tab
@all.names: list of 11 components yielding allele names for each locus
@ploidy: 2
@type: codom
Optional contents:
@pop: - empty -
@pop.names: - empty -
@other: - empty -
> grp <- find.clusters(x, max.n.clust=41)
Error in `row.names<-.data.frame`(`*tmp*`, value = c("001", "003", "005", :
duplicate 'row.names' are not allowed
In addition: Warning messages:
1: In data.row.names(row.names, rowsi, i) :
some row.names duplicated: 497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,7 [... truncated]
2: non-unique values when setting 'row.names':
This is what my original data set looks like in the STRUCTURE file (a first row of loci names, and then 2 rows of fragment lengths for each individual with no labels):
SfoB52 SfoC24 SfoC28 SfoC38 SfoC86 SfoC88 SfoC113 SfoC129 SfoD75 SfoD91 SfoD100
203 113 179 143 101 181 133 221 188 228 230
225 113 191 143 116 184 139 230 208 236 238
215 113 183 143 110 184 133 230 180 212 214
219 122 191 143 116 184 139 230 188 220 214
211 113 179 143 101 184 142 230 180 212 214
219 113 191 143 110 190 151 230 204 228 214
etc.
Any help would be very greatly appreciated, as I'm new to using R, but am excited about the possibilities!
Best,
Spencer
--
Spencer A Bruce
200 Washington St.
Troy, NY 12180
518 225 0787
From neagef at gmail.com Tue Jun 17 11:12:40 2014
From: neagef at gmail.com (Andrea Garavito)
Date: Tue, 17 Jun 2014 11:12:40 +0200
Subject: [adegenet-forum] SNP alleles
Message-ID:
Hi everybody!
I'm currently trying to do a PCA using a SNP matrix from a diploid
organism; most of the SNPs are bi-allelic.
Although the results I obtain make sense in light of previous knowledge
of the groups, I'm confused by the genind object that I obtain, and I
want to be sure about what's going on with the analysis.
My data file is formatted using the nucleotides as alleles and a "/"
separator, with missing data coded as "NA".
ind mk1 mk2
ind1 G/A C/T
ind2 G/G C/T
After loading my data matrix with the df2genind function, my data is
stored as a matrix with four times the number of columns of the original
file:
ind  mk1.A mk1.G mk1.A mk1.G mk2.C mk2.T mk2.C mk2.T
ind1 0.5   0.0   0     0.5   0.0   0.5   0.5   0
ind2 0.0   0.5   0     0.5   0.0   0.5   0.5   0
Is that correct? I thought I would get two columns per marker locus
instead of four.
From there I obtain doubled statistics for each one of the alleles. Since I
don't know the phase, an A/G is the same as a G/A, so how can I get
unified stats for each allele?
Thank you for your answer
Best regards
Andrea
From caitiecollins at gmail.com Tue Jun 17 13:36:18 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Tue, 17 Jun 2014 12:36:18 +0100
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
Message-ID:
Hi Andrea,
I'm afraid that without seeing the exact code you used to generate the
results you have presented, it is a bit difficult to say for certain what
the origin of your problem is. So please forgive me if the following
suggestion misses the mark. (If so, can I ask you to reply with the
functions and arguments you used to generate that output?)
I notice you've stated that your original data file is formatted using a
"/" separator. One way of getting the df2genind output format you are
experiencing is by neglecting to inform the df2genind function that you are
using that separator. If you have not done so already, try adding the
argument sep="/" to the list of arguments taken by df2genind. Let me know
if that does the trick. If not, please post back with the code you are
using and we can go from there.
Best,
Caitlin.
From t.jombart at imperial.ac.uk Tue Jun 17 13:59:04 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 11:59:04 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References: ,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
Hi there,
yes, as Caitlin said, something probably went wrong in the conversion. With the separator specified, I get:
> dat=data.frame(mk1=c("G/A","G/G"), km2=c("C/T","C/T"))
> dat
mk1 km2
1 G/A C/T
2 G/G C/T
> x=df2genind(dat,sep="/",ploidy=2)
> truenames(x)
mk1.A mk1.G km2.C km2.T
1 0.5 0.5 0.5 0.5
2 0.0 1.0 0.5 0.5
>
Cheers
Thibaut
From neagef at gmail.com Tue Jun 17 14:47:13 2014
From: neagef at gmail.com (Andrea Garavito)
Date: Tue, 17 Jun 2014 14:47:13 +0200
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
Message-ID:
Hi Caitlin and Thibaut,
Thanks for your answers.
I did use the sep argument. My code to generate the genind object is:
>myData_genid <- df2genind(myData, sep="/")
The weird thing is that when I try the same code with a test object that I
created:
>dat = data.frame(loc1=c("A/A","T/A","T/A","T/T","T/A","A/T"),
loc2=c("C/G","G/C","C/C","G/G","C/G","G/C"))
>x=df2genind(dat, sep="/")
I get the two columns per locus (as Thibaut does):
>truenames(x)
loc1.A loc1.T loc2.C loc2.G
1 1.0 0.0 0.5 0.5
2 0.5 0.5 0.5 0.5
3 0.5 0.5 1.0 0.0
4 0.0 1.0 0.0 1.0
5 0.5 0.5 0.5 0.5
6 0.5 0.5 0.5 0.5
But when I test a subset of my data
>test<-myData[1:10,1:10]
>test
loc_29 loc_7 loc_43 etc...
1 "G / A" "C / T" "T / T"
2 "G / G" "C / T" "T/ T"
etc...
> test_genid <- df2genind(test,sep="/")
I again get three or four columns per locus:
>truenames(test_genid)
  loc_29.A loc_29.G loc_29.G loc_7.C loc_7.T loc_7.C loc_43.C loc_43.T loc_43.C loc_43.T etc..
1 0.5      0.0      0.5      0.0     0.5     0.5     0.0      0.5      0.0      0.5
2 0.0      0.5      0.5      0.0     0.5     0.5     0.0      0.5      0.0      0.5
etc...
When I carry out my PCA with all my data:
>X <- scaleGen(myData_genid, scale=F, missing="mean")
>pca_myData<-dudi.pca(X,center=F,scale=F)
I get the following message:
In data.row.names(row.names, rowsi, i) :
some row.names duplicated: 3,4,...
I really don't understand what is causing that. Is there a hidden character
in my data file that makes df2genind split my columns? Does that
affect the results I get thereafter?
By the way, I tried both scale=F and scale=T in the scaleGen function, and I
get two radically different results. With scale=T my individuals are
separated into only two groups, while with scale=F they are more
"harmoniously" distributed over the two axes. Which one would be more
appropriate for my data type? Because both seem to agree
with the origin of the individuals, I'm not sure which one represents the
"real picture".
Thanks for your comments
Andrea
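On the scale=F versus scale=T question, a small base R illustration may help show why the two pictures can differ. This is a sketch under assumptions: it uses scale() from base R on made-up allele frequencies as a stand-in, and scaleGen may well use a different scaling formula internally. The point is only that scaling gives every allele the same variance, whereas centring alone lets the most variable alleles dominate the PCA.

```r
# Made-up allele frequencies for 6 individuals at two alleles:
# 'common' varies a lot between individuals, 'rare' varies little
freq <- cbind(common = c(0, 0.5, 1, 1, 0.5, 0),
              rare   = c(0, 0, 0, 0, 0, 0.5))

centred <- scale(freq, center = TRUE, scale = FALSE)  # analogous to scale=F
scaled  <- scale(freq, center = TRUE, scale = TRUE)   # analogous to scale=T

# Variance of each column: unequal when only centred, all equal to 1 when scaled
print(apply(centred, 2, var))
print(apply(scaled, 2, var))
```

Under this reading, neither result is wrong in itself. The choice depends on whether alleles with little variance (often rare ones) should weigh as much as highly variable ones in the analysis.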
From t.jombart at imperial.ac.uk Tue Jun 17 14:57:33 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 12:57:33 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A57D@icexch-m1.ic.ac.uk>
What is "myData"?
BTW it is safer to specify the ploidy when constructing a genind.
Try:
alleles(test_genid) # btw the name is 'genind' - genotype of individuals
to see if it is a problem of empty characters.
Cheers
Thibaut
From t.jombart at imperial.ac.uk Tue Jun 17 15:24:07 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 13:24:07 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
<2CB2DA8E426F3541AB1907F98ABA65709A12A57D@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A5ED@icexch-m1.ic.ac.uk>
Me neither. But so far:
- instructions we can verify all behave normally
- we don't have reproducible code for the stated problem
If you can send a small subset of the data, the command line used to create *myData*, and the commands that show the problem for this dataset, we can try to figure it out.
Best
Thibaut
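One common way to share such a subset by e-mail is dput(), which prints R code that recreates an object exactly, including details that a normal print can hide. The sketch below uses a hypothetical two-locus data frame as a stand-in for the real 'myData'; the values are invented for illustration.

```r
# Hypothetical stand-in for the real 'myData'
myData <- data.frame(loc_29 = c("G / A", "G / G"),
                     loc_7  = c("C / T", "T/ T"),
                     stringsAsFactors = FALSE)

# dput() prints R code that recreates the object exactly, including any
# stray characters, so the output can be pasted into an e-mail
small <- myData[1:2, 1:2]
dput(small)

# Character counts are another quick check: a clean "G/A" has 3 characters
print(nchar(small$loc_29))
```

Anyone on the list can then paste the dput() output into R and run the same commands on the same data.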
________________________________________
From: Andrea Garavito [neagef at gmail.com]
Sent: 17 June 2014 14:15
To: Jombart, Thibaut
Subject: Re: [adegenet-forum] SNP alleles
Hi Thibaut,
myData is a matrix of 162 individuals with 10806 biallelic SNPs, coded as I already described.
I've run df2genind with both ploidy=as.integer(2) and ploidy=2, and I get exactly the same result.
It doesn't seem to be an empty-character problem. I really don't understand.
> alleles(test_genid)
$L01
1 2 3
"A" "G" "G"
$L02
1 2 3
"C" "T" "C"
$L03
1 2
"G" "C"
$L04
1 2 3
"A" "C" "A"
$L05
1 2
"G" "A"
$L06
1 2
"G" "C"
$L07
1 2 3 4
"C" "T" "C" "T"
$L08
1 2 3
"C" "C" "T"
$L09
1 2 3 4
"G" "T" "G" "T"
$L10
1 2 3
"C" "T" "T"
Thanks again
Andrea
From m.navascues at gmail.com Tue Jun 17 16:01:02 2014
From: m.navascues at gmail.com (Miguel Navascués)
Date: Tue, 17 Jun 2014 16:01:02 +0200
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
Message-ID: <53A04A1E.5040409@gmail.com>
In one of your messages (below) there seem to be spaces in addition to
"/" separating the alleles. It may be worth checking whether that causes
the problem.
Best
Miguel
On 17/06/14 14:47, Andrea Garavito wrote:
> >test<-myData[1:10,1:10]
> >test
> loc_29 loc_7 loc_43 etc...
> 1 "G / A" "C / T" "T / T"
> 2 "G / G" "C / T" "T/ T"
> etc...
--
Miguel NAVASCUÉS, PhD
Chargé de Recherche (CR2) INRA
UMR CBGP Centre de Biologie pour la Gestion des Populations
Institut National de la Recherche Agronomique
Campus International de Baillarguet, CS 30016
34988 Montferrier-sur-Lez (France)
phone: +33(0)4.99.62.33.70
fax: +33(0)4.99.62.33.45
e-mail: miguel.navascues AT supagro.inra.fr
e-mail: m.navascues AT gmail.com
Skype: m.navascues
web: http://www1.montpellier.inra.fr/cbgp/
web: http://sites.google.com/site/navascuesresearch/
From t.jombart at imperial.ac.uk Tue Jun 17 16:08:16 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 14:08:16 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <53A04A1E.5040409@gmail.com>
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
,
<53A04A1E.5040409@gmail.com>
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
Ahah, well spotted! I totally missed it.
Yep, open your file, remove all white spaces, and it should fly.
Cheers
Thibaut
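Instead of editing the file by hand, the white spaces can also be removed inside R before the conversion. A minimal sketch, assuming the genotypes are stored as character columns of a data frame; the small 'myData' below is a hypothetical stand-in that mimics the spacing seen in the printed subset:

```r
# Hypothetical subset mimicking the spacing seen in the printed data
myData <- data.frame(loc_29 = c("G / A", "G / G", "G/A"),
                     loc_7  = c("C / T", "C / T", "T/ T"),
                     stringsAsFactors = FALSE)

# Before cleaning: "G / A" and "G/A" count as different genotypes
print(unique(myData$loc_29))

# Remove every whitespace character from every column
clean <- as.data.frame(lapply(myData, function(col) gsub("\\s+", "", col)),
                       stringsAsFactors = FALSE)
print(clean)
```

The cleaned data frame could then be converted as before, e.g. df2genind(clean, sep="/", ploidy=2). Note that this way of rebuilding the data frame drops custom row names, so they would need to be copied over with rownames(clean) <- rownames(myData) if they matter.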
From caitiecollins at gmail.com Tue Jun 17 16:28:17 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Tue, 17 Jun 2014 15:28:17 +0100
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
<53A04A1E.5040409@gmail.com>
<2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
Message-ID:
For this purpose, it would also be adequate to just change sep from "/" to
" / ", but I suppose there may be other reasons to want to remove the white
spaces.
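A minimal base R sketch of the whitespace clean-up, using the genotype strings quoted in this thread (the names geno, clean and alleles are only for illustration). Stripping every space also handles the irregular "T/ T" entry, which a fixed separator of " / " would not match:

```r
# Genotype strings as they appeared in the thread, including the irregular "T/ T"
geno <- c("G / A", "C / T", "T / T", "T/ T")

# Remove every whitespace character so that "/" is the only separator left
clean <- gsub("[[:space:]]+", "", geno)
print(clean)

# Each genotype now splits cleanly into its two alleles
alleles <- strsplit(clean, "/", fixed = TRUE)
print(lengths(alleles))
```

The same gsub() call could be applied to a whole character matrix of genotypes before converting it, since gsub() works element by element.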
Cheers,
Caitlin.
From t.jombart at imperial.ac.uk Tue Jun 17 16:36:33 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 14:36:33 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
<53A04A1E.5040409@gmail.com>
<2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A69E@icexch-m1.ic.ac.uk>
Yup.
From neagef at gmail.com Tue Jun 17 17:24:57 2014
From: neagef at gmail.com (Andrea Garavito)
Date: Tue, 17 Jun 2014 17:24:57 +0200
Subject: [adegenet-forum] SNP alleles
In-Reply-To: <2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
<53A04A1E.5040409@gmail.com>
<2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>
Message-ID:
Thanks Miguel,
You found the problem! I searched for and replaced the space characters,
redid the analysis, et voilà! I have my two columns per marker.
With all the reformatting needed to obtain the A/T format from the original
Excel file, it's no wonder those spaces got into the data!
This allows me to rephrase my other question, which got lost in the
discussion:
I tried scale=F and scale=T in the scaleGen function but I get two
radically different results. With scale=T my individuals get separated into
only two groups, while with scale=F individuals are more "harmoniously"
distributed over the first two PC axes. Which one would be more appropriate
for my data type? Because both seem to agree with the origin of the
individuals, I'm not sure which one better represents the "real picture".
Thank you all for the help
Andrea
From t.jombart at imperial.ac.uk Tue Jun 17 17:38:41 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Tue, 17 Jun 2014 15:38:41 +0000
Subject: [adegenet-forum] SNP alleles
In-Reply-To:
References:
<2CB2DA8E426F3541AB1907F98ABA65709A12A51C@icexch-m1.ic.ac.uk>
<53A04A1E.5040409@gmail.com>
<2CB2DA8E426F3541AB1907F98ABA65709A12A652@icexch-m1.ic.ac.uk>,
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12A754@icexch-m1.ic.ac.uk>
scale=TRUE will give a lot more weight to rare alleles. So it depends on how much you want to trust these.
I usually go for no scaling (scale=FALSE), so that alleles with low variability are not given an exaggerated weight.
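To illustrate the point, here is a small base R sketch with made-up allele frequencies. It uses scale() rather than scaleGen() itself, on the assumption that the scaling step amounts to dividing each allele by its standard deviation; the data and variable names are hypothetical:

```r
# Hypothetical allele frequencies (0, 0.5 or 1) for 100 diploid individuals
common <- rep(c(0, 0.5, 1, 0.5), times = 25)   # allele at intermediate frequency
rare   <- c(rep(0.5, 3), rep(0, 97))           # allele carried by only 3 individuals
X <- cbind(common, rare)

# Centring only: each allele keeps its own variance
v_raw <- apply(scale(X, center = TRUE, scale = FALSE), 2, var)

# Centring and scaling: every allele is forced to unit variance
v_scaled <- apply(scale(X, center = TRUE, scale = TRUE), 2, var)

print(v_raw)
print(v_scaled)
```

Without scaling, the rare allele contributes very little variance to the PCA; after scaling, it counts exactly as much as the common one.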
Cheers
Thibaut
From manuelacorreia2 at gmail.com Wed Jun 18 19:51:53 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Wed, 18 Jun 2014 18:51:53 +0100
Subject: [adegenet-forum] set.seeds in DAPC
Message-ID:
Hi there,
I'd like to understand the role of set.seed() and the criteria behind the
seeds chosen in the two examples presented in the latest version of the
DAPC tutorial.
I am used to seeing set.seed(N) in the context of significance tests and
bootstrap or Monte Carlo procedures, but not within multivariate techniques
or even with datasets.
On page 20 of the DAPC tutorial there is a set.seed(4) before the
loadingplot. There is another example on page 39, before splitting the
microbov dataset into two parts. And by the way, what is 20 in the
sample(e,20....)? 20 individuals picked at random from all microbov
populations?
So, I have two questions.
The first is: why use them in these particular examples?
The second is: what criteria were behind the choice of the number 4 in the
former case, and the number 2 in the latter?
How do I know which seed will be the best one for my dataset in case I need
a loadingplot?
Thanks in advance,
M.
From caitiecollins at gmail.com Wed Jun 18 20:48:33 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Wed, 18 Jun 2014 19:48:33 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:
Hi,
Glad to see you've been reading the tutorial in such detail!
These are great questions, and the way you have asked them actually hints
at the answer: set.seed() is not inherently linked to multivariate
techniques or datasets, but rather with random number generation (more
specifically, with getting *reproducible* results from "random" processes).
This is probably why you have seen set.seed come up in the context of
bootstrap Monte Carlo procedures!
Essentially, when R is asked to generate a "random" number, it actually
generates a pseudo-random number by taking some input (the seed) and
generating an output that seems random. Without being given a seed, R
creates one from your computer's clock (the current time, combined with the
process ID) the first time a session needs one, and every later "random"
number follows deterministically from that starting point. You would not
get the same seed at a different time, so we find this adequate to call the
process "random" number generation, BUT if two sessions did start from
exactly the same seed, they would produce exactly the same "random"
numbers. (Note: I have glossed over a lot of really interesting things
about this process, so if you want to know more about random number
generation, please read on here:
http://cran.r-project.org/web/packages/randtoolbox/vignettes/fullpres.pdf
).
This potential problem with random number generation can occasionally be
quite useful in cases where we want to run something that requires random
number generation but where we would also like to get the same result each
time.
set.seed() is the way we control this. With set.seed(), the "seed" is used
as the input to our random number generation (instead of the clock), which
allows you to get *reproducible* "random" numbers.
Try this example:
rnorm(3)
rnorm(3)
set.seed(1)
rnorm(3)
# note: to get the same numbers again, reset the seed before each call
set.seed(1)
rnorm(3)
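
Rather than comparing the printed numbers by eye, the same behaviour can be checked with identical(). This is only a sketch in base R; the names x1, x2 and x3 are illustrative:

```r
set.seed(1)
x1 <- rnorm(3)   # first draw after seeding
x2 <- rnorm(3)   # continues the same stream, so it differs from x1

set.seed(1)      # reset the generator to the same starting point
x3 <- rnorm(3)   # repeats x1 exactly

print(identical(x1, x3))   # TRUE: same seed, same numbers
print(identical(x1, x2))   # FALSE: the stream had moved on
```
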
Neat! Having established this, we can now answer your questions about why
we use set.seed() where we do in the DAPC tutorial.
On page 20, we use it before creating a loading plot. This is just because
we use the argument lab.jitter to move the labels around a bit. Jitter
works by adding random noise, so we can control it with set.seed(). We have
chosen to use set.seed(4) simply because it "randomly" put the labels in a
nice enough place. Arguably, set.seed(6) would have done a better job (next
time!), but it's a good thing we didn't use set.seed(2).
If you would like, you can see for yourself:
library(adegenet)
data(H3N2)
pop(H3N2) <- factor(H3N2$other$epid)
dapc.flu <- dapc(H3N2, n.pca=30, n.da=10)
set.seed(4)
contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, lab.jitter=1)
set.seed(6)
contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, lab.jitter=1)
set.seed(2)
contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07, lab.jitter=1)
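
If you do not have the H3N2 data to hand, the same effect can be sketched with base R's jitter(), assuming the label jitter in loadingplot behaves in a similar way (random noise added to the positions). The positions below are hypothetical:

```r
x <- c(1, 2, 3)   # hypothetical label positions

set.seed(4)
j_a <- jitter(x)

set.seed(4)
j_b <- jitter(x)   # same seed: exactly the same noise

set.seed(6)
j_c <- jitter(x)   # different seed: different noise

print(identical(j_a, j_b))
print(identical(j_a, j_c))
```

So trying a few seeds and keeping the one that gives the most readable plot is purely cosmetic; it does not change the loadings themselves.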
Finally, we use set.seed(2) on page 39 to get a "random" sample of 20
individuals (you were right about that) to serve as our "supplementary
individuals" for that exercise. Here, the use of set.seed(2) just ensures
that no matter how many times we edit and re-build that tutorial, we will
always get the same set of 20 individuals, which is useful for
consistency's sake.
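
As a rough base R illustration of that last point (the indices below are hypothetical and this is not the tutorial's actual code), two "builds" that set the same seed draw exactly the same 20 individuals:

```r
ids <- seq_len(100)   # hypothetical individual indices

set.seed(2)
first_build <- sample(ids, 20, replace = FALSE)

set.seed(2)
second_build <- sample(ids, 20, replace = FALSE)

print(identical(first_build, second_build))
print(sort(first_build))
```
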
All in all, I apologise for the long response that was possibly less
related to DAPC than you might have expected, but I hope that helped answer
your question!
Best,
Caitlin.
From manuelacorreia2 at gmail.com Thu Jun 19 01:17:42 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Thu, 19 Jun 2014 00:17:42 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:
Dear Caitlin,
Thank you for such a clear and knowledgeable response. It was quite
interesting to get a glimpse of how the adegenet team decided to use
set.seed() to obtain consistent results, as well as (that was just
brilliant!) to control lab.jitter.
As you point out with the 3 examples, it's better to try several seeds in
order to find the best label positions for our dataset. And when we
reach the final stage of cross-validation we ought to choose one seed to
ensure that the training set of supplementary individuals (no matter the
number (10%, 20%)) will always be made up of the same set of individuals.
Thank you. I've learnt so much from this long response.
Cheers,
M.
From caitiecollins at gmail.com Thu Jun 19 02:32:04 2014
From: caitiecollins at gmail.com (Caitlin Collins)
Date: Thu, 19 Jun 2014 01:32:04 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:
Hi Manuela,
Glad to hear I could help a bit!
I should stress that our use of set.seed() in the tutorial has been mainly
for the purpose of making the tutorial, as a document, consistent and
identically reproducible. In an experimental context, however, e.g. in the
case of selecting supplementary individuals, if you are truly attempting to
test a concept (for example, in validating a model), you would actually
*want* random behaviour (i.e. an effectively random sample). This is
particularly the case if you are performing repeated sampling, as one often
does with supplementary individuals. So be careful to only set the seed
when you do NOT want a random sample; otherwise, just leave out set.seed()
from the process and let the computer pick a sample at random.
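
A small base R sketch of why this matters for repeated sampling (hypothetical indices and sample sizes): resetting the seed inside the loop gives the same "sample" every time, which defeats the purpose of replication.

```r
ids <- seq_len(100)   # hypothetical individual indices

# Resetting the seed inside the loop: every "replicate" is the same sample
fixed <- replicate(5, { set.seed(2); sample(ids, 10) })

# No seed inside the loop: each replicate is a fresh draw
fresh <- replicate(5, sample(ids, 10))

# Count the distinct samples (columns) obtained in each case
print(ncol(unique(fixed, MARGIN = 2)))
print(ncol(unique(fresh, MARGIN = 2)))
```
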
Best,
Caitlin.
From mcialdini at gmail.com Thu Jun 19 14:27:39 2014
From: mcialdini at gmail.com (Manuela)
Date: Thu, 19 Jun 2014 13:27:39 +0100
Subject: [adegenet-forum] adegenet-forum Digest, Vol 70, Issue 16
In-Reply-To:
References:
Message-ID:
Hi Caitlin.
Good point!
In fact, I didn't notice this tiny nuance in the rationale behind
cross-validation, which uses a stratified sample of 10% of individuals
(the validation set) in the well-exemplified nancycats dataset, through
the cyclic process of PC retention, sampling and DAPC for each number of
PCs retained, BUT not the same set of individuals in each round.
This differs from the second case, based on supplementary individuals used
for predicting results. The way they were selected was also different. They
result from a split of the original sample into a stratified "testing
sample" of X individuals, BUT using a non-random sample as defined by the
set.seed() function.
Later, I'll send you a new set of questions about clines and how to
evaluate them thoroughly when modelling with DAPC.
Cheers,
M.
2014-06-19 11:00 GMT+01:00 <
adegenet-forum-request at lists.r-forge.r-project.org>:
> Send adegenet-forum mailing list submissions to
> adegenet-forum at lists.r-forge.r-project.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
>
> or, via email, send a message with subject or body 'help' to
> adegenet-forum-request at lists.r-forge.r-project.org
>
> You can reach the person managing the list at
> adegenet-forum-owner at lists.r-forge.r-project.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of adegenet-forum digest..."
>
>
> Today's Topics:
>
> 1. Re: set.seeds in DAPC (Caitlin Collins)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 Jun 2014 01:32:04 +0100
> From: Caitlin Collins
> To: Manuela
> Cc: "adegenet-forum at lists.r-forge.r-project.org"
>
> Subject: Re: [adegenet-forum] set.seeds in DAPC
> Message-ID:
> <
> CAMon0MDGDDZmFji6_T2McFtsqTzNmr7ENTE0Fj1rXiFYP_P_9g at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Manuela,
>
> Glad to hear I could help a bit!
>
> I should stress that our use of set.seed() in the tutorial has been mainly
> for the purpose of making the tutorial, as a document, consistent and
> identically reproducible. In an experimental context, however, eg. in the
> case of selecting supplementary individuals, if you are truly attempting to
> test a concept (for example, in validating a model), you would actually
> *want* random behaviour (ie. an effectively random sample). This is
> particularly the case if you are performing repeated sampling, as one often
> does with supplementary individuals. So be careful to only set the seed
> when you do NOT want a random sample; otherwise, just leave out set.seed()
> from the process and let the computer pick a sample at random.
>
> Best,
> Caitlin.
>
>
> On Thu, Jun 19, 2014 at 12:17 AM, Manuela
> wrote:
>
> > Dear Caitlin,
> >
> >
> > Thank you for such a clear response and at same time for being so
> > knowledgeable. It was quiet interesting to have a glimpse on the way how
> > the Adegenet team decided to use the set.seeds to obtain consistent
> > results, as well as (that was just brilliant!) to control the lab.
> jitter.
> >
> > As you point up with the 3 examples its better to try several set.seeds
> in
> > order to find out the best labels position with our dataset. And when we
> > reach the final stage of cross-validation we ought to choose one seed to
> > ensure that the training set of supplementary individuals (no matter the
> > number (10%, 20%)) will always made up of the same set of individuals.
> >
> > Thank you. I've learnt so much with this long response.
> >
> > Cheers,
> > M.
> >
> >
> > 2014-06-18 19:48 GMT+01:00 Caitlin Collins :
> >
> > Hi,
> >>
> >> Glad to see you've been reading the tutorial in such detail!
> >>
> >> These are great questions, and the way you have asked them actually
> hints
> >> at the answer: set.seed() is not inherently linked to multivariate
> >> techniques or datasets, but rather with random number generation (more
> >> specifically, with getting *reproducible* results from "random"
> >> processes). This is probably why you have seen set.seed come up in the
> >> context of bootstrap Monte Carlo procedures!
> >>
> >> Essentially, when R is asked to generate a "random" number, it actually
> >> generates a pseudo-random number by taking some input (the "seed") and
> >> generating an output that seems random. Without being given a seed, R
> >> creates one from the current time (and the process ID) the first time a
> >> random number is needed in a session, and each later draw follows on
> >> from that starting point. You would not get the same starting point at
> >> a different time, so we find this adequate to call the process "random"
> >> number generation, BUT if in fact two sessions started from exactly the
> >> same seed, they would produce exactly the same "random" numbers. (Note:
> >> I have glossed over a lot of really interesting things about this
> >> process, so if you want to know more about random number generation,
> >> please read on here:
> >> http://cran.r-project.org/web/packages/randtoolbox/vignettes/fullpres.pdf
> >> ).
> >>
> >> This potential problem with random number generation can occasionally be
> >> quite useful in cases where we want to run something that requires
> >> random number generation but where we would also like to get the same
> >> result each time.
> >> set.seed() is the way we control this. With set.seed(), the "seed" is
> >> used as the input to our random number generation (instead of the
> >> clock), which allows you to get *reproducible* "random" numbers.
> >>
> >> Try this example:
> >>
> >> rnorm(3)
> >> rnorm(3)
> >>
> >> set.seed(1)
> >> rnorm(3)
> >>
> >> set.seed(1) # note: to repeat a result, call set.seed() again right
> >> # before each run of the random code you want to reproduce.
> >> rnorm(3)
> >>
> >> Neat! Having established this, we can now answer your questions about
> >> why we use set.seed() where we do in the DAPC tutorial.
> >>
> >> On page 20, we use it before creating a loading plot. This is just
> >> because we use the argument lab.jitter to move the labels around a bit.
> >> Jitter works by adding random noise, so we can control it with
> >> set.seed(). We have chosen to use set.seed(4) simply because it
> >> "randomly" put the labels in a nice enough place. Arguably, set.seed(6)
> >> would have done a better job (next time!), but it's a good thing we
> >> didn't use set.seed(2).
> >>
> >> If you would like, you can see for yourself:
> >>
> >> data(H3N2)
> >> pop(H3N2) <- factor(H3N2$other$epid)
> >> dapc.flu <- dapc(H3N2, n.pca=30,n.da=10)
> >>
> >> set.seed(4)
> >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07,
> >> lab.jitter=1)
> >>
> >> set.seed(6)
> >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07,
> >> lab.jitter=1)
> >>
> >> set.seed(2)
> >> contrib <- loadingplot(dapc.flu$var.contr, axis=2, thres=.07,
> >> lab.jitter=1)
> >>
> >> Finally, we use set.seed(2) on page 39 to get a "random" sample of 20
> >> individuals (you were right about that) to serve as our "supplementary
> >> individuals" for that exercise. Here, the use of set.seed(2) just
> >> ensures that no matter how many times we edit and re-build that
> >> tutorial, we will always get the same set of 20 individuals, which is
> >> useful for consistency's sake.
> >>
> >> All in all, I apologise for the long response that was possibly less
> >> related to DAPC than you might have expected, but I hope that helped
> >> answer your question!
> >>
> >> Best,
> >> Caitlin.
> >>
> >>
> >>
> >>
> >> On Wed, Jun 18, 2014 at 6:51 PM, Manuela
> >> wrote:
> >>
> >>> Hi there,
> >>>
> >>>
> >>> I'd like to understand the role of set.seed() and the criteria behind
> >>> the seeds chosen in the two examples presented in the latest version
> >>> of the DAPC tutorial.
> >>>
> >>> I am used to seeing set.seed(N) in the context of significance tests as
> >>> well as bootstrap and Monte Carlo procedures, but not within
> >>> multivariate techniques or even with datasets.
> >>>
> >>> On page 20 of the DAPC tutorial there is a set.seed(4) before getting
> >>> the loadingplot. There is another example on page 39, before splitting
> >>> the microbov dataset in two parts. And by the way, what is 20 in
> >>> sample(e,20....)? 20 individuals picked at random from all microbov
> >>> populations?
> >>>
> >>>
> >>> So, I do have two questions.
> >>> One is: why use them here, in these particular examples?
> >>> The second one is: what criteria were behind the choice of the number 4
> >>> in the former case, and the number 2 in the latter?
> >>>
> >>> How do I know which seed will be the best one for my dataset in case I
> >>> need to have the loadingplot?
> >>>
> >>> Thanks in advance,
> >>> M.
> >>>
From manuelacorreia2 at gmail.com Thu Jun 19 14:48:54 2014
From: manuelacorreia2 at gmail.com (Manuela)
Date: Thu, 19 Jun 2014 13:48:54 +0100
Subject: [adegenet-forum] set.seeds in DAPC
In-Reply-To:
References:
Message-ID:
Hi Caitlin.
Good point!
In fact, I didn't notice this subtle nuance in the rationale behind
cross-validation. In the well-exemplified nancycats dataset, a stratified
sample of 10% of the individuals (the validation set) is drawn in each
round of the cyclic process of PC retention, sampling and DAPC, for each
number of PCs retained, BUT it is not the same set of individuals in each
round.
This differs from the second example, which is based on supplementary
individuals used for predicting results. The way they were selected was
also different: they result from a split of the original sample into a
stratified "testing sample" of X individuals, BUT using a non-random
(fixed) sample as defined by the set.seed() function.
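
To make the contrast concrete, here is a minimal base-R sketch of the two situations. It is not the tutorial's code: the population factor, the helper stratified_sample() and the 10% proportion are assumptions made up for illustration.

```r
# Hypothetical data: 100 individuals in 4 populations (not from the tutorial)
pop <- factor(rep(c("A", "B", "C", "D"), times = c(40, 30, 20, 10)))

# Draw about 10% of the individuals from every population (at least one each)
stratified_sample <- function(groups, prop = 0.1) {
  idx_by_group <- split(seq_along(groups), groups)
  picked <- lapply(idx_by_group, function(idx) {
    n <- max(1, round(length(idx) * prop))
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Fixed seed: the same individuals every time (handy for a tutorial document)
set.seed(2)
fixed1 <- stratified_sample(pop)
set.seed(2)
fixed2 <- stratified_sample(pop)
identical(fixed1, fixed2)

# No seed reset between rounds: a fresh validation set on every round,
# which is what repeated cross-validation needs
rounds <- replicate(5, stratified_sample(pop), simplify = FALSE)
```

Resetting the seed before each draw is what freezes the sample; leaving it alone between rounds lets every round pick different individuals while keeping the 10% per population.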
Later, I'll send a new set of questions about clines and how thoroughly
they can be evaluated when modelling with DAPC.
Cheers,
M.
2014-06-19 1:32 GMT+01:00 Caitlin Collins :
> Hi Manuela,
>
> Glad to hear I could help a bit!
>
> I should stress that our use of set.seed() in the tutorial has been mainly
> for the purpose of making the tutorial, as a document, consistent and
> identically reproducible. In an experimental context, however, eg. in the
> case of selecting supplementary individuals, if you are truly attempting to
> test a concept (for example, in validating a model), you would actually
> *want* random behaviour (ie. an effectively random sample). This is
> particularly the case if you are performing repeated sampling, as one often
> does with supplementary individuals. So be careful to only set the seed
> when you do NOT want a random sample; otherwise, just leave out set.seed()
> from the process and let the computer pick a sample at random.
>
> Best,
> Caitlin.
From kelly.bennett at manchester.ac.uk Wed Jun 18 13:45:00 2014
From: kelly.bennett at manchester.ac.uk (Kelly Bennett)
Date: Wed, 18 Jun 2014 11:45:00 +0000
Subject: [adegenet-forum] confusing p value in mantel test
Message-ID:
Hello,
I have run a mantel test with the following code
dna <- read.dna(file = "dna_manteltest.fasta", format = "fasta")
dna.dists <- dist(dna, method = "euclidean")
as.matrix(dna.dists)[1:5, 1:5]
geo <- read.csv(file = "geo_matrix.csv")
geo[1:2, 1:2]
geo.dists <- dist(geo, method = "euclidean")
as.matrix(geo.dists)[1:5, 1:5]
mantelresult<-mantel.rtest(dna.dists, geo.dists, nrepet = 9999)
cor.test(geo.dists, dna.dists)
plot(mantelresult <- mantel.rtest(dna.dists, geo.dists), main = "Mantel's test")
mantelresult
From my plot it looks like there should be isolation by distance, and a correlation test shows a significant association, but my p-value for the Monte Carlo test is 1.
Does anyone have any ideas about this contradiction? I have attached the plot to this email.
Thank you very much,
Kelly
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Mantelplot.pdf
Type: application/pdf
Size: 4508 bytes
Desc: Mantelplot.pdf
URL:
From vojta at trapa.cz Mon Jun 23 14:58:17 2014
From: vojta at trapa.cz (=?utf-8?B?Vm9qdMSbY2g=?= Zeisek)
Date: Mon, 23 Jun 2014 14:58:17 +0200
Subject: [adegenet-forum] confusing p value in mantel test
In-Reply-To:
References:
Message-ID: <15094289.eak1km79LX@veles.site>
Hello
On Wed, 18 June 2014 11:45:00, Kelly Bennett wrote:
> Hello,
>
> I have run a mantel test with the following code
>
> dna <- read.dna(file = "dna_manteltest.fasta", format = "fasta")
> dna.dists <- dist(dna, method = "euclidean")
Why do you use the function dist() and not dist.dna() (package ape), which
offers various substitution models? IMHO, Euclidean distance is not the best
choice for nucleotide data; I'd use it for fragment data, but not here.
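
A minimal sketch of what that could look like, assuming the ape package is installed. The woodmouse alignment shipped with ape stands in for the poster's FASTA file, and the choice of the K80 model is only an example, not a recommendation for this dataset.

```r
# Sketch only: needs ape; woodmouse is an example alignment shipped with ape
if (requireNamespace("ape", quietly = TRUE)) {
  data(woodmouse, package = "ape")
  # Kimura 2-parameter distances; other models via the 'model' argument
  dna.dists <- ape::dist.dna(woodmouse, model = "K80")
  # Raw proportion of differing sites, for comparison
  p.dists <- ape::dist.dna(woodmouse, model = "raw")
  print(round(as.matrix(dna.dists)[1:3, 1:3], 4))
}
```

With real data, the object returned by read.dna() would take the place of woodmouse, and the resulting "dist" object can be passed on to the Mantel test as before.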
> as.matrix(dna.dists)[1:5, 1:5]
> geo <- read.csv(file = "geo_matrix.csv")
> geo[1:2, 1:2]
> geo.dists <- dist(geo, method = "euclidean")
> as.matrix(geo.dists)[1:5, 1:5]
> mantelresult<-mantel.rtest(dna.dists, geo.dists, nrepet = 9999)
> cor.test(geo.dists, dna.dists)
> plot(mantelresult <- mantel.rtest(dna.dists, geo.dists), main = "Mantel's test")
> mantelresult
>
> From my plot it looks like there should be isolation by distance and a
> correlation test shows a significant association but my p value for the
> Monte Carlo test = 1
>
> Does anyone have any ideas about this contradiction? I have attached the
> plot to this email
Well, it will produce some result whenever you give it some data, even if
the data are used wrongly. That might be the case here.
> Thank you very much,
>
> Kelly
Sincerely,
Vojtěch
--
Vojtěch Zeisek
http://trapa.cz/en/
Department of Botany, Faculty of Science
Charles University in Prague
Benátská 2, Prague, 12801, CZ
http://botany.natur.cuni.cz/en/
Institute of Botany, Academy of Science
Zámek 1, Průhonice, 25243, CZ
http://www.ibot.cas.cz/en/
Czech Republic
From m.navascues at gmail.com Mon Jun 23 15:06:35 2014
From: m.navascues at gmail.com (Miguel Navascues)
Date: Mon, 23 Jun 2014 15:06:35 +0200
Subject: [adegenet-forum] confusing p value in mantel test
In-Reply-To:
References:
Message-ID: <53A8265B.7050005@supagro.inra.fr>
Hello Kelly,
It looks like there is a NEGATIVE correlation between genetic and
geographical distance, no isolation by distance...
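
A small base-R sketch may show why a negative correlation gives a p-value near 1. It assumes, as I understand it, that mantel.rtest reports a one-sided p-value (observed correlation greater than the permuted ones); the simulated data and the helper mantel_greater() are made up for illustration and are not the ade4 implementation.

```r
set.seed(1)  # fixed seed only so that the illustration is repeatable

# Hypothetical data: 20 sites on a line; genetic distance shrinks as
# geographic distance grows (a negative association), plus a little noise
xy <- cbind(x = 1:20, y = 0)
geo <- dist(xy)
gen <- geo
gen[] <- max(geo) - as.vector(geo) + rnorm(length(geo), sd = 0.5)

# Minimal one-sided Mantel permutation test (alternative: "greater")
mantel_greater <- function(d1, d2, nrepet = 999) {
  m1 <- as.matrix(d1)
  v2 <- as.vector(d2)
  obs <- cor(as.vector(d1), v2)
  sim <- replicate(nrepet, {
    p <- sample(nrow(m1))                  # permute rows and columns together
    cor(as.vector(as.dist(m1[p, p])), v2)
  })
  list(obs = obs, pvalue = (sum(sim >= obs) + 1) / (nrepet + 1))
}

res <- mantel_greater(gen, geo)
res$obs      # strongly negative
res$pvalue   # close to 1: almost every permutation beats the observed value
cor.test(as.vector(gen), as.vector(geo))$p.value  # two-sided, so very small
```

The two-sided cor.test() flags any strong association, positive or negative, whereas the one-sided permutation test only counts evidence for a positive one, so the two results are not in contradiction.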
Best,
Miguel
On 18/06/14 13:45, Kelly Bennett wrote:
>
>
> Hello,
>
> I have run a mantel test with the following code
>
> dna <- read.dna(file = "dna_manteltest.fasta", format = "fasta")
> dna.dists <- dist(dna, method = "euclidean")
> as.matrix(dna.dists)[1:5, 1:5]
> geo <- read.csv(file = "geo_matrix.csv")
> geo[1:2, 1:2]
> geo.dists <- dist(geo, method = "euclidean")
> as.matrix(geo.dists)[1:5, 1:5]
> mantelresult<-mantel.rtest(dna.dists, geo.dists, nrepet = 9999)
> cor.test(geo.dists, dna.dists)
> plot(mantelresult <- mantel.rtest(dna.dists, geo.dists), main =
> "Mantel's test")
> mantelresult
>
> From my plot it looks like there should be isolation by distance and a
> correlation test shows a significant association but my p value for the
> Monte Carlo test = 1
>
> Does anyone have any ideas about this contradiction? I have attached the
> plot to this email
>
> Thank you very much,
>
> Kelly
>
--
Miguel NAVASCUÉS, PhD
Chargé de Recherche (CR2) INRA
UMR CBGP Centre de Biologie pour la Gestion des Populations
Institut National de la Recherche Agronomique
Campus International de Baillarguet, CS 30016
34988 Montferrier-sur-Lez (France)
phone: +33(0)4.99.62.33.70
fax: +33(0)4.99.62.33.45
e-mail: miguel.navascues AT supagro.inra.fr
e-mail: m.navascues AT gmail.com
Skype: m.navascues
web: http://www1.montpellier.inra.fr/cbgp/
web: http://sites.google.com/site/navascuesresearch/
From schwarcz.kaiser at gmail.com Wed Jun 25 21:27:41 2014
From: schwarcz.kaiser at gmail.com (Kaiser Schwarcz)
Date: Wed, 25 Jun 2014 16:27:41 -0300
Subject: [adegenet-forum] adegenet with chloroplast
Message-ID:
Is there a way to analyse chloroplast microsatellite data with adegenet?
I have a .str file with my data for STRUCTURE, but I don't know how to import
it into a genind object because my data are neither "codom" nor "PA".
Is there a way to do it?
*Kaiser Dias Schwarcz*
Me. Biologia Molecular e Evolução
Unicamp - Brasil
From sonofvin at gmail.com Wed Jun 25 23:17:09 2014
From: sonofvin at gmail.com (Vinson Doyle)
Date: Wed, 25 Jun 2014 17:17:09 -0400
Subject: [adegenet-forum] adegenet with chloroplast
In-Reply-To:
References:
Message-ID:
Treat it as codom and import using read.table. Then convert to genind with
df2genind.
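
A minimal sketch of that route, assuming adegenet is installed. The locus names, allele sizes and population labels are invented, and the small data frame stands in for whatever read.table() would return from the real file; ploidy = 1 is my assumption for haploid chloroplast markers.

```r
# Sketch only: needs adegenet; locus names and allele sizes are made up
if (requireNamespace("adegenet", quietly = TRUE)) {
  # Stand-in for something like:
  #   dat <- read.table("my_cp_data.txt", header = TRUE, row.names = 1)
  dat <- data.frame(
    ccmp2 = c("189", "189", "190", "191"),
    ccmp5 = c("103", "104", "104", "103"),
    row.names = c("ind1", "ind2", "ind3", "ind4")
  )
  pops <- factor(c("popA", "popA", "popB", "popB"))

  # Chloroplast markers are haploid: one allele per locus, so ploidy = 1
  cp <- adegenet::df2genind(dat, ploidy = 1, pop = pops, type = "codom")
  print(cp)
}
```

A STRUCTURE file usually carries extra columns (labels, population codes), so those would need to be split off before the genotype columns are passed to df2genind.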
-Vinson
On Wed, Jun 25, 2014 at 3:27 PM, Kaiser Schwarcz
wrote:
> Is there a way to analyse chloroplast microsatellite data with adegenet?
> I have a .str file with my data for STRUCTURE, but I don't know how to
> import it into a genind object because my data are neither "codom" nor "PA".
>
> Is there a way to do it?
>
> *Kaiser Dias Schwarcz*
> Me. Biologia Molecular e Evolução
From bonanomi.sara85 at gmail.com Thu Jun 26 15:52:20 2014
From: bonanomi.sara85 at gmail.com (Sara Bonanomi)
Date: Thu, 26 Jun 2014 15:52:20 +0200
Subject: [adegenet-forum] convert genind object in data frame (A:C A:T
T:G...)
Message-ID:
Dear Thibaut,
I don't understand how to convert a genepop file or a genind object into a
data frame, so that I could get, for instance, a CSV table with my genotypes
in bases (e.g. A:G, A:C...).
Thank you,
Best regards
Sara
From vojta at trapa.cz Thu Jun 26 15:59:24 2014
From: vojta at trapa.cz (=?utf-8?B?Vm9qdMSbY2g=?= Zeisek)
Date: Thu, 26 Jun 2014 15:59:24 +0200
Subject: [adegenet-forum] convert genind object in data frame (A:C A:T
T:G...)
In-Reply-To:
References:
Message-ID: <2448216.6dLDrNeryg@veles.site>
Hello
On Thu, 26 June 2014 15:52:20, Sara Bonanomi wrote:
> Dear Thibaut,
>
> I don't understand how to convert a genepop file or a genind object into a
> data frame, so that I could get, for instance, a CSV table with my
> genotypes in bases (e.g. A:G, A:C...).
Maybe I am missing something, but I'd guess you converted your data from a
data frame to a genind object, right? Then I'd just use those original data.
If this is not your case, check the functions genind2genotype and genind2df.
I don't think there is a way to reconstruct a genind object from a genpop
object, as genpop (as far as I know) doesn't store all the information needed
to correctly assign alleles to the original individuals.
> Thank you,
>
> Best regards
>
> Sara
All the best,
Vojtěch
--
Vojtěch Zeisek
http://trapa.cz/en/
Department of Botany, Faculty of Science
Charles University in Prague
Benátská 2, Prague, 12801, CZ
http://botany.natur.cuni.cz/en/
Institute of Botany, Academy of Science
Zámek 1, Průhonice, 25243, CZ
http://www.ibot.cas.cz/en/
Czech Republic
From t.jombart at imperial.ac.uk Sun Jun 29 19:30:51 2014
From: t.jombart at imperial.ac.uk (Jombart, Thibaut)
Date: Sun, 29 Jun 2014 17:30:51 +0000
Subject: [adegenet-forum] convert genind object in data frame (A:C
A:T T:G...)
In-Reply-To:
References:
Message-ID: <2CB2DA8E426F3541AB1907F98ABA65709A12E30F@icexch-m1.ic.ac.uk>
Hello,
Check out genind2df; it is all covered in the basics tutorial.
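
A minimal sketch of the round trip, assuming adegenet is installed. The three SNP genotypes and the output file name are invented for illustration; with real data, the existing genind object would replace x.

```r
# Sketch only: needs adegenet; the genotypes below are made up
if (requireNamespace("adegenet", quietly = TRUE)) {
  # Hypothetical diploid SNP genotypes
  snp <- data.frame(
    snp1 = c("A:G", "A:A", "G:G"),
    snp2 = c("C:C", "A:C", "A:A"),
    row.names = c("ind1", "ind2", "ind3")
  )
  x <- adegenet::df2genind(snp, sep = ":", ploidy = 2)

  # Back to a data frame: one column per locus, alleles separated by ":"
  tab <- adegenet::genind2df(x, sep = ":")
  print(tab)

  # Write the table to CSV (the file name is arbitrary)
  write.csv(tab, file = file.path(tempdir(), "genotypes.csv"))
}
```

The alleles come out as whatever labels the genind object stores, so bases are only recovered if the object was built from base-coded genotypes in the first place.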
Cheers
Thibaut
________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Sara Bonanomi [bonanomi.sara85 at gmail.com]
Sent: 26 June 2014 14:52
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] convert genind object in data frame (A:C A:T T:G...)
Dear Thibaut,
I don't understand how to convert a genepop file or a genind object into a data frame, so that I could get, for instance, a CSV table with my genotypes in bases (e.g. A:G, A:C...).
Thank you,
Best regards
Sara