[adegenet-forum] Nucleotide substitution model used by haploGen()/SeqTrack()
Jarrett Phillips
jphill01 at uoguelph.ca
Wed Sep 26 17:46:07 CEST 2018
Dear Thibaut,
This question stems from ones I asked in August.
I've been experimenting with the R code from haploGen() and include it below (with my own slight modifications for my specific problem). Hopefully you can lend some advice as to the issue I am facing (see below).
Note here I refer to a haplotype as a unique DNA sequence.
library(pegas)
set.seed(17)
sim.seqs <- TRUE
length.seqs <- 500
num.seqs <- 100 # number of DNA sequences
if (sim.seqs == TRUE) {
nucl <- as.DNAbin(c('a','c','g','t'))
res <- sample(nucl, size = length.seqs, replace = TRUE, prob = rep(0.25, 4))
if (subst.model == "K80") {
transi.set <- list('a' = as.DNAbin('g'),
'c' = as.DNAbin('t'),
'g' = as.DNAbin('a'),
't' = as.DNAbin('c'))
transv.set <- list('a' = as.DNAbin(c('c', 't')),
'c' = as.DNAbin(c('a', 'g')),
'g' = as.DNAbin(c('c', 't')),
't' = as.DNAbin(c('a', 'g')))
transi <- function(res) {
unlist(transi.set[as.character(res)])
}
transv <- function(res) {
sapply(transv.set[as.character(res)], sample, 1)
}
duplicate.seq <- function(res) {
num.transi <- rbinom(n = 1, size = length.seqs, prob = transi.rate) # total number of transitions
if (num.transi > 0) {
idx <- sample(length.seqs, size = num.transi, replace = FALSE)
res[idx] <- transi(res[idx])
}
num.transv <- rbinom(n = 1, size = length.seqs, prob = transv.rate) # total number of transversions
if (num.transv > 0) {
idx <- sample(length.seqs, size = num.transv, replace = FALSE)
res[idx] <- transv(res[idx])
}
res
}
}
res <- matrix(replicate(num.seqs, duplicate.seq(res)), byrow = TRUE, nrow = num.seqs)
class(res) <- "DNAbin"
# write.dna(res, file = "seqs.fas", format = "fasta")
h <- sort(haplotype(res), decreasing = TRUE, what = "frequencies")
rownames(h) <- 1:nrow(h)
}
## Output
h
Haplotypes extracted from: res
Number of haplotypes: 5
Sequence length: 500
Haplotype labels and frequencies:
1 2 3 4 5
96 1 1 1 1
If the code is run multiple times (without the seed), the same pattern emerges: h is always skewed toward the most dominant haplotype.
Is there a way that you've implemented in haploGen() (and that I have potentially glossed over) that gives different trends each time the code is run?
For example, I want to be able to run the code (without seed) and get something like
h
Haplotypes extracted from: res
Number of haplotypes: 5
Sequence length: 500
Haplotype labels and frequencies:
1 2 3 4 5
35 15 25 20 5
haploGen() does exactly this. Each time haploGen() is run, a different distribution for h is outputted. I just want to emulate what you have done, but am stuck at this roadblock and I m uncertain of what is causing this issue.
Note: I am just wanting to generalize your approach to generate DNA sequences (not necessarily for outbreak modelling).
Can you please shed some light?
Thank you in advance.
Sincerely,
Jarrett Phillips
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20180926/408416d5/attachment.html>
More information about the adegenet-forum
mailing list