[adegenet-forum] problems with na.replace and values for n.pca & n.da
Jombart, Thibaut
t.jombart at imperial.ac.uk
Tue Oct 20 15:34:40 CEST 2015
Hi there,
you should not edit manually a genepop file before reading it using read.genepop. If alleles are coded using two characters, the proper format for NAs in a genepop file should be "0000". read.genepop has been working fine so far, but if you have an example of NAs being mis-read, we will fix the issue quickly.
tab(...) is the way to go to extract allele counts / frequencies from a genind and replace missing data (see ?tab), but you don't need it for dapc as the function handles NAs for you. This is illustrated in multiple examples in ?dapc, and in more details in the tutorial cited before.
Looking at your graphs, it does not seem that clusters are useful summaries of your dataset. You can increase n.start in find.clusters if you want to smooth the curve a bit, but I doubt there is any relevant 'K' here. A simple centred (not scaled) PCA may be more useful?
Cheers
Thibaut
==============================
Dr Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
Norfolk Place, London W2 1PG, UK
Tel. : 0044 (0)20 7594 3658
http://sites.google.com/site/thibautjombart/
http://sites.google.com/site/therepiproject/
http://adegenet.r-forge.r-project.org/
Twitter: @thibautjombart
________________________________
From: D. Magdalena Sorger [dm.sorger at gmail.com]
Sent: 19 October 2015 21:55
To: Roman Lustrik
Cc: Jombart, Thibaut; adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] problems with na.replace and values for n.pca & n.da
Hi Roman (and everyone else),
Thank you, it looks like this is the same tutorial I had been using and that my questions refer to (July 29, 2015).
Please, let me know if anyone can help me with my questions, both regarding replacing missing allele data but also (and more importantly) regarding the selection of the # of PCs and the # of discriminant functions (see below).
Best,
Magdalena
On Fri, Oct 16, 2015 at 10:32 AM, Roman Lustrik <roman.lustrik at biolitika.si<mailto:roman.lustrik at biolitika.si>> wrote:
Hello Magdalena,
you can find a DAPC tutorial on GitHub: https://github.com/thibautjombart/adegenet/wiki/Tutorials
HTH,
Roman
----
In god we trust, all others bring data.
________________________________
From: "D. Magdalena Sorger" <dm.sorger at gmail.com<mailto:dm.sorger at gmail.com>>
To: "Thibaut Jombart" <t.jombart at imperial.ac.uk<mailto:t.jombart at imperial.ac.uk>>
Cc: adegenet-forum at lists.r-forge.r-project.org<mailto:adegenet-forum at lists.r-forge.r-project.org>
Sent: Thursday, October 15, 2015 7:45:18 PM
Subject: Re: [adegenet-forum] problems with na.replace and values for n.pca & n.da
Dear Thibaut,
I found the information on scaleGen() in a tutorial from July 29, 2015 on the adegenet website: http://adegenet.r-forge.r-project.org/files/tutorial-basics.pdf
Other than this function, I didn't find any instructions on how to treat missing values in that tutorial. I also didn't find instructions for how to use tab() to replace missing values.
If I just read my raw data in (missing allele values denoted as "NA" in genepop file), it informs me that there is 0% missing data which is what concerns me:
> msts_m2<-read.genepop("BOR-Od_m2_301w.gen")
> summary(msts_m2)
// Number of individuals: 301
// Group sizes: 12 9 18 12 20 11 18 18 12 12 17 18 20 28 12 12 12 11 12 17
// Number of alleles per locus: 4 4 13 19 20 10 20 7
// Number of alleles per group: 16 17 18 18 26 18 23 21 18 19 22 17 17 31 19 16 21 17 18 16
// Percentage of missing data: 0 %
// Observed heterozygosity: 0.41 0.55 0.81 0.66 0.79 0.69 0.78 0.74
// Expected heterozygosity: 0.39 0.6 0.88 0.87 0.92 0.75 0.92 0.76
>
When I call tab, it lists the missing data at e.g., locus loc21 as loc21.NA:
> msts_m2[loc="loc21"]@tab
loc21.02 loc21.03 loc21.NA loc21.04 loc21.05 loc21.06 loc21.01
1 1 1 0 0 0 0 0
2 2 0 0 0 0 0 0
3 2 0 0 0 0 0 0
4 1 1 0 0 0 0 0
5 1 1 0 0 0 0 0
I also tried this, but the output remains the same as above:
tab(msts_m2, missing="mean")
Best,
Magdalena
On Thu, Oct 15, 2015 at 9:10 AM, Jombart, Thibaut <t.jombart at imperial.ac.uk<mailto:t.jombart at imperial.ac.uk>> wrote:
Hi there,
please check the current tutorials. scaleGen is no longer used, and has been replaced by tab. Then you should be able to feed the resulting object to dapc without problem.
Cheers
Thibaut
________________________________
From: D. Magdalena Sorger [dm.sorger at gmail.com<mailto:dm.sorger at gmail.com>]
Sent: 15 October 2015 02:36
To: Jombart, Thibaut
Cc: adegenet-forum at lists.r-forge.r-project.org<mailto:adegenet-forum at lists.r-forge.r-project.org>
Subject: Re: [adegenet-forum] problems with na.replace and values for n.pca & n.da
Dear all,
I updated my adegenet package, however, I still have issues with confirming that it is reading my missing values in correctly.
I now use this code to read in my genepop file:
msts_m2<-read.genepop("BOR-Od_m2_301w.gen")
I tried this line of code for replacing NA's with means:
msts_m2 <- scaleGen(msts_m2, NA.method="mean")
....but this just seems to be appropriate for the regular PCA and my DAPC code (see below) won't work and give me this error: Error in (function (classes, fdef, mtable): unable to find an inherited method for function ‘pop’ for signature ‘"matrix"’
Therefore: What is the proper way to treat missing values (denoted as NA in my genepop file) for a DAPC - do they need to be replaced with means (which I assume) and if so what is the proper line of code to do that?
If I don't run the scaleGen line of code and continue with the DAPC code (see below), I still get the same output I had questions about earlier (see below).
Best,
Magdalena
On Mon, Oct 12, 2015 at 6:21 AM, Jombart, Thibaut <t.jombart at imperial.ac.uk<mailto:t.jombart at imperial.ac.uk>> wrote:
Hi there,
your post suggests you are using an outdated version of adegenet - na.replace and some arguments you are using have been removed from the package since version 2.0.0. This means you have been consulting an outdated version of the tutorials.. where did you find it? If there is outdated doc around I need to get rid of it.
Please update to the devel version (2.0.1) from github:
https://github.com/thibautjombart/adegenet
Current tutorials should answer your first question. Missing data are detected automatically if the right format is used. Let's wait to make sure your data were imported fine for the others.
Cheers
Thibaut
==============================
Dr Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling
Department of Infectious Disease Epidemiology
Imperial College - School of Public Health
Norfolk Place, London W2 1PG, UK
Tel. : 0044 (0)20 7594 3658
http://sites.google.com/site/thibautjombart/
http://sites.google.com/site/therepiproject/
http://adegenet.r-forge.r-project.org/
Twitter: @thibautjombart
________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org<mailto:adegenet-forum-bounces at lists.r-forge.r-project.org> [adegenet-forum-bounces at lists.r-forge.r-project.org<mailto:adegenet-forum-bounces at lists.r-forge.r-project.org>] on behalf of D. Magdalena Sorger [dm.sorger at gmail.com<mailto:dm.sorger at gmail.com>]
Sent: 09 October 2015 20:00
To: adegenet-forum at lists.r-forge.r-project.org<mailto:adegenet-forum at lists.r-forge.r-project.org>
Subject: [adegenet-forum] problems with na.replace and values for n.pca & n.da
Dear all,
I am trying to constuct a DAPC scatterplot with adegenet and have three questions that after consulting online resources, tutorials, etc. still haven't been answered definitively.
My data set consists of (diploid) microsatellite data (8 markers) for 20 ant colonies of 9-28 workers each (mean=15), 301 workers total. I have missing data in 7 spots (i.e. individuals with missing data at one or more loci). My first question is about reading in the genepop file:
1) What is the proper command for reading in my file in regards to missing data?
I had replaced all missing data ("0000") with "NA" in the genepop file and used the below code assuming that it would recognize my missing data as NA (first line of code) and replace missing values with means (second line):
msts_m2<-read.genepop("BOR-Od_m2_301w.gen",missing="NA")
na.replace(msts_m2,"mean", quiet=FALSE)
However, when I run this code, it informs me that it replaced 119 missing values. This obviously seems too much as it should have only replaced 7. I'm not sure why this isn't working, see output below
OUTPUT:
> na.replace(msts_m2,"mean", quiet=FALSE)
Replaced 119 missing values
#####################
### Genind object ###
#####################
- genotypes of individuals -
S4 class: genind
@call: read.genepop(file = "BOR-Od_m2_301w.gen", missing = "NA")
@tab: 301 x 93 matrix of genotypes
@ind.names: vector of 301 individual names
@loc.names: vector of 8 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the 93 columns of @tab
@all.names: list of 8 components yielding allele names for each locus
@ploidy: 2
@type: codom
Optionnal contents:
@pop: factor giving the population of each individual
@pop.names: factor giving the population of each individual
@other: - empty -
My second and third questions relate to the selection of the # of PCs and the # of discriminant functions to retain. It seems that each time I make slight changes to these numbers, the output changes vastly and so I want to make sure I input the proper numbers.
First I use the find.clusters function, I designate 20 for n.pca here since I have 20 colonies:
clusters<-find.clusters(msts_m2,max.n.pca=20)
When asked for the number of PCs to retain I select 30 which is the level at which the points seem to level off. The BIC graph looks nothing like the graph in the vignette (it does not level off but below a certain level shows a downward zigzag pattern) so I choose 20 given that I have 20 colonies.
dapc_m2<-dapc(msts_m2,clusters$grp)
[Inline image 1][Inline image 2]
Next, when asked for PCs to retain I again choose the level at which the points start to level off (30) but when asked for discriminant functions to retain, I'm at a loss. I have about 18 relatively constantly decreasing bars. I usually choose the point between the first few ones and the rest where there is some kind of bigger (sometimes arbitrary) break and use that number (in this case: 4):
2) How to choose the appropriate number for PCs to retain and discr. functions to retain?
[Inline image 3][Inline image 4]
best.n.pca<-a.score(dapc_m2)
temp<-optim.a.score(dapc_m2)
After running the optim.a.score function I receive a graph that tells me at the top "optimal number of PCs". This is the number I put into my last line of code (below) for n.pca, and for n.da I choose the number I used when prompted earlier:
3) Are these the correct numbers to use for this last line of code?
dapc_m2<-dapc(msts_m2,n.pca=7,n.da=4)
--
Magdalena Sorger
____________
Department of Applied Ecology
North Carolina State University
127 David Clark Labs, Box 7617
100 Eugene Brooks Ave.
Raleigh, NC 27695, USA
# 919-513-7464<tel:919-513-7464>
dmsorger at ncsu.edu<mailto:dmsorger at ncsu.edu>
www.theantlife.com<http://www.theantlife.com>
--
Magdalena Sorger
____________
Department of Applied Ecology
North Carolina State University
127 David Clark Labs, Box 7617
100 Eugene Brooks Ave.
Raleigh, NC 27695, USA
# 919-513-7464<tel:919-513-7464>
dmsorger at ncsu.edu<mailto:dmsorger at ncsu.edu>
www.theantlife.com<http://www.theantlife.com>
--
Magdalena Sorger
____________
Department of Applied Ecology
North Carolina State University
127 David Clark Labs, Box 7617
100 Eugene Brooks Ave.
Raleigh, NC 27695, USA
# 919-513-7464<tel:919-513-7464>
dmsorger at ncsu.edu<mailto:dmsorger at ncsu.edu>
www.theantlife.com<http://www.theantlife.com>
_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org<mailto:adegenet-forum at lists.r-forge.r-project.org>
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum
--
Magdalena Sorger
____________
Department of Applied Ecology
North Carolina State University
127 David Clark Labs, Box 7617
100 Eugene Brooks Ave.
Raleigh, NC 27695, USA
# 919-513-7464
dmsorger at ncsu.edu<mailto:dmsorger at ncsu.edu>
www.theantlife.com<http://www.theantlife.com>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151020/52b16738/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 19074 bytes
Desc: image.png
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151020/52b16738/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 15234 bytes
Desc: image.png
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151020/52b16738/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 19507 bytes
Desc: image.png
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151020/52b16738/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 18215 bytes
Desc: image.png
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20151020/52b16738/attachment-0007.png>
More information about the adegenet-forum
mailing list