[adegenet-forum] Discrepancy in NA counts
Roman Luštrik
roman.lustrik at biolitika.si
Mon Nov 28 13:40:24 CET 2016
Hi Elizabeth,
it would appear that something odd is happening in the code because the locus names are numeric. This has happened before in another function. Until we fix this, you can rename your loci so that the names start with a letter.
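A minimal sketch of that renaming workaround, using base R only. The helper name prefix_numeric_names and the prefix "L" are hypothetical, and the file step assumes the marker names sit tab-separated on the first line of the file, which may not match your layout.

```r
# Hypothetical helper (not part of adegenet): prefix names that start with a digit.
prefix_numeric_names <- function(x, prefix = "L") {
  starts_with_digit <- grepl("^[0-9]", x)
  x[starts_with_digit] <- paste0(prefix, x[starts_with_digit])
  x
}

# Toy marker names in the style used in this thread.
markers <- c("1401_25", "1403_13", "1404_17")
renamed <- prefix_numeric_names(markers)
print(renamed)

# Applying it to a file, assuming the marker names are on the first line,
# tab-separated. A temporary toy file stands in for the real .stru file.
infile <- tempfile(fileext = ".stru")
writeLines(c("1401_25\t1403_13",
             "C_KH1059\tpop1\t3\t1",
             "C_KH1059\tpop1\t3\t1"), infile)

lines <- readLines(infile)
header <- strsplit(lines[1], "\t", fixed = TRUE)[[1]]
lines[1] <- paste(prefix_numeric_names(header), collapse = "\t")

outfile <- tempfile(fileext = ".stru")
writeLines(lines, outfile)
print(readLines(outfile)[1])
```

The renamed file, rather than the original, would then be passed to read.structure. Names that already start with a letter are left untouched.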
Here is the excerpt from the genind object indicating that these two samples have alleles 33:
         X1401_25.13 X1401_25.33 X1403_13.11 X1403_13.13 X1403_13.33 X1404_17.13 X1404_17.33 X1404_17.11
C_KH1059           0           1           1           0           0           0           1           0
M_KH1834           0           1           1           0           0           1           0           0
Cheers,
Roman
----
In god we trust, all others bring data.
From: "Biz Sheedy" <biz.sheedy at gmail.com>
To: "Roman Luštrik" <roman.lustrik at biolitika.si>
Cc: adegenet-forum at lists.r-forge.r-project.org
Sent: Monday, November 28, 2016 11:00:53 AM
Subject: Re: [adegenet-forum] Discrepancy in NA counts
Thanks for looking into this.
One thing I did differently from the code you provided was that I only answered the prompts for the read.structure function. This meant I did not use sep="\t", and the number of alleles was 62 instead of 72, which I think should be comparable to the Excel count. Following the code you provided, is.na finds 23 NAs (instead of 20 NAs at 62 alleles and 16 zeroes in Excel).
Your explanation makes sense to me for the additional three NAs in adegenet, but I still don't understand how in locus 1401_25 the data for two individuals (C_KH1059 and M_KH1834) changed from being homozygous for "3" to being "NA"?
I would really appreciate any further help on this.
Thanks again,
Elizabeth
On 28 November 2016 at 18:03, Roman Luštrik < roman.lustrik at biolitika.si > wrote:
Hi,
I think the problem is that adegenet, for consistency, adds NAs to accommodate the extra alleles present for a particular locus. Take, for example, C_KH1238 (bottom row in the example pasted below).
In the raw file, it has missing values for locus 1378_53, but this locus has three alleles, so it gets 3 NAs and not 2. I can't go through all the NAs right now, but I think there's a pretty good chance this is what is causing the discrepancy between what you see in Excel and in adegenet.
         1369_41.11 1372_14.22 1372_14.24  1373_9.44  1373_9.24 1377_42.44 1377_42.24 1378_53.22 1378_53.24 1378_53.44 1379_10.33 1379_10.13 1382_37.33
...
C_KH1238          0          1          0          1          0          1          0         NA         NA         NA          1          0          1
# Notice the 3 NAs, one for each available allele at 1378_53, not just two (as expected for a diploid).
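The arithmetic behind this can be sketched in base R. The toy table below only mimics the layout of the allele-count table (column names of the form locus.allele); its values are invented and tab_toy is a hypothetical stand-in, not output from adegenet.

```r
ploidy <- 2

# Toy allele-count table: ind2 has a missing genotype at 1378_53,
# a locus with three alleles, so all three of its columns are NA.
tab_toy <- matrix(
  c( 1,  1,  0, 2, 0,
    NA, NA, NA, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("ind1", "ind2"),
    c("1378_53.22", "1378_53.24", "1378_53.44", "1379_10.33", "1379_10.13")
  )
)

# Recover the locus of each column by dropping the ".allele" suffix.
locus <- sub("\\.[^.]+$", "", colnames(tab_toy))
cols_by_locus <- split(seq_along(locus), locus)

# A genotype counts as missing when every allele column of that locus is NA.
missing_genotypes <- vapply(cols_by_locus, function(cols) {
  sum(rowSums(is.na(tab_toy[, cols, drop = FALSE])) == length(cols))
}, numeric(1))

na_in_tab    <- sum(is.na(tab_toy))             # one NA per allele column
zeros_in_raw <- ploidy * sum(missing_genotypes) # one zero per chromosome copy

print(missing_genotypes)
print(c(na_in_tab = na_in_tab, zeros_in_raw = zeros_in_raw))
```

Here one missing diploid genotype corresponds to 2 zeros in the raw file but 3 NAs in the table, so comparing missing genotypes, rather than raw cell counts, is the like-for-like check.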
Here is the code I used to explore this:
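library(adegenet)

# Read the raw file as a plain table and drop the first two columns (ind and pop).
xy <- read.table("Sub_batch_1.stru", header = TRUE, sep = "\t")
xy <- xy[, c(-1, -2)]

# Tabulate the allele codes; 0 is the missing-data code.
table(as.matrix(xy))
#   0   1   2   3   4
#  16 467 618 760 867

# Read the same file with adegenet, treating "0" as missing.
xy <- read.structure("Sub_batch_1.stru", NA.char = "0",
                     n.ind = 44, n.loc = 31, onerowperind = FALSE,
                     col.lab = 1, col.pop = 2, row.marknames = 1,
                     sep = "\t", col.others = 0)

# Extract the allele-count table and inspect one individual at one locus.
xy <- tab(xy)
xy[grepl("C_KH1238", rownames(xy)), grepl("1378_53", colnames(xy))]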
Cheers,
Roman
From: "Biz Sheedy" < biz.sheedy at gmail.com >
To: "Roman Luštrik" < roman.lustrik at biolitika.si >
Sent: Monday, November 28, 2016 9:11:39 AM
Subject: Re: [adegenet-forum] Discrepancy in NA counts
My apologies. First time posting to a forum so I am a little unsure of things. I have attached a subset of the data, which includes the locus that I saw had problems.
In this case there are 31 loci with 16 zeroes counted (excel), and 20 NAs counted (adegenet). The additional NAs occur in locus 1401_25.
Thanks so much,
Elizabeth
On 28 November 2016 at 16:31, Roman Luštrik < roman.lustrik at biolitika.si > wrote:
Hi,
can you share (a subset of) the dataset? It's hard to pinpoint where things might be going wrong without some data in hand.
Cheers,
Roman
From: "Biz Sheedy" < biz.sheedy at gmail.com >
To: adegenet-forum at lists.r-forge.r-project.org
Sent: Friday, November 25, 2016 10:44:16 AM
Subject: [adegenet-forum] Discrepancy in NA counts
Dear All,
I am trying to read SNP data from Stacks into adegenet. I have tried read.structure and read.genepop but they both give (the same) NA counts that are higher than expected. Using read.table on the structure-formatted file (with "ind" and "pop" inserted into the first two columns of row one) gave the expected number of missing data.
I looked at a single population subset (both the original and the converted data) in excel and found a locus where in the original data, all nine individuals were "3", but in the converted data one individual was "NA". The loci before and after this one both matched/were correct.
I am not sure what I have missed for this to happen; my R skills are beginner at best. Any help with reading the data in correctly would be greatly appreciated!
Thank you,
Elizabeth
R version 3.3.2
adegenet version 2.0.1
Data: 44 individuals, diploid, 4279 loci.
all <- read.structure("all_batch_1.stru", NA.char = "0")
Total cells in excel: 376552
After read.structure/genepop: 44*8558=376552
0s in excel: 3952
0s after read.table; length(which(X==0)): 3952
NA after read.structure/genepop; sum(is.na(all$tab)): 4008
Difference: 56
Subset Chichi
Total cells: 77022
After read.structure/genepop: 9*8558=77022
0s in excel: 742
NA after read.structure/genepop; sum(is.na(chi$tab)): 756
Difference: 14
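To see where the raw zeros sit, the counts can be broken down by locus and by individual. This is a rough sketch on invented data, assuming two rows per diploid individual with "ind" and "pop" in the first two columns; the object names are hypothetical.

```r
# Toy raw STRUCTURE-style table: two rows per individual, 0 = missing.
raw <- data.frame(
  ind      = rep(c("ind1", "ind2", "ind3"), each = 2),
  pop      = "popA",
  X1401_25 = c(3, 3, 3, 3, 0, 0),
  X1403_13 = c(1, 1, 1, 3, 1, 1),
  stringsAsFactors = FALSE
)

# Keep only the genotype columns.
geno <- as.matrix(raw[, -(1:2)])

# Number of zero cells per locus.
zeros_per_locus <- colSums(geno == 0)

# Number of zero cells per individual and locus (2 = both copies missing).
zero_by_ind <- rowsum((geno == 0) + 0L, group = raw$ind)

print(zeros_per_locus)
print(zero_by_ind)
```

A locus that shows no zeros here but does show NAs after read.structure would be one where the NAs were introduced during import rather than by missing data in the file.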
--
4-1-1 Amakubo
Department of Botany
National Museum of Nature and Science
Tsukuba, Ibaraki 305-0005
Japan
biz.sheedy at gmail.com