[adegenet-forum] Nucleotide substitution model used by haploGen()/SeqTrack()

Wed Sep 26 17:46:07 CEST 2018

Dear Thibaut,

This question stems from  ones I asked in August.

I've been experimenting with the R code from haploGen() and include it below (with my own slight modifications for my specific problem). Hopefully you can lend some advice as to the issue I am facing (see below).

Note here I refer to a haplotype as a unique DNA sequence.

library(pegas)

set.seed(17)

sim.seqs <- TRUE

length.seqs <- 500

num.seqs <- 100 # number of DNA sequences

if (sim.seqs == TRUE) {

      nucl <- as.DNAbin(c('a','c','g','t'))

     res <- sample(nucl, size = length.seqs, replace = TRUE, prob = rep(0.25, 4))

     if (subst.model == "K80") {

        transi.set <- list('a' = as.DNAbin('g'),

                           'c' = as.DNAbin('t'),

                           'g' = as.DNAbin('a'),

                           't' = as.DNAbin('c'))

        transv.set <- list('a' = as.DNAbin(c('c', 't')),

                           'c' = as.DNAbin(c('a', 'g')),

                           'g' = as.DNAbin(c('c', 't')),

                           't' = as.DNAbin(c('a', 'g')))

        transi <- function(res) {

          unlist(transi.set[as.character(res)])

        }

        transv <- function(res) {

          sapply(transv.set[as.character(res)], sample, 1)

        }

        duplicate.seq <- function(res) {

          num.transi <- rbinom(n = 1, size = length.seqs, prob = transi.rate) # total number of transitions

          if (num.transi > 0) {

            idx <- sample(length.seqs, size = num.transi, replace = FALSE)

            res[idx] <- transi(res[idx])

          }

          num.transv <- rbinom(n = 1, size = length.seqs, prob = transv.rate) # total number of transversions

          if (num.transv > 0) {

            idx <- sample(length.seqs, size = num.transv, replace = FALSE)

            res[idx] <- transv(res[idx])

          }

          res

          }

        }

      res <- matrix(replicate(num.seqs, duplicate.seq(res)), byrow = TRUE, nrow = num.seqs)

      class(res) <- "DNAbin"

      # write.dna(res, file = "seqs.fas", format = "fasta")

      h <- sort(haplotype(res), decreasing = TRUE, what = "frequencies")

      rownames(h) <- 1:nrow(h)

    }

## Output

h

Haplotypes extracted from: res

    Number of haplotypes: 5

         Sequence length: 500

Haplotype labels and frequencies:

  1  2  3  4  5

96  1  1  1  1

If the code is run multiple times (without the seed), the same pattern emerges: h is always skewed toward the most dominant haplotype.

Is there a way that you've implemented in haploGen() (and that I have potentially glossed over) that gives different trends each time the code is run?

For example, I want to be able to run the code (without seed) and get something like

h

Haplotypes extracted from: res

    Number of haplotypes: 5

         Sequence length: 500

Haplotype labels and frequencies:

  1    2    3    4   5

35  15  25  20  5

haploGen() does exactly this. Each time haploGen() is run, a different distribution for h  is outputted. I just want to emulate what you have done, but am stuck at this roadblock and I m uncertain of what is causing this issue.

Note: I am just wanting to generalize your approach to generate DNA sequences (not necessarily for outbreak modelling).

Can you please shed some light?

Thank you in advance.

Sincerely,

Jarrett Phillips
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20180926/408416d5/attachment.html>