[adegenet-forum] Nucleotide substitution model used by haploGen()/SeqTrack()

Jarrett Phillips jphill01 at uoguelph.ca
Wed Sep 26 17:46:07 CEST 2018


Dear Thibaut,


This question stems from  ones I asked in August.


I've been experimenting with the R code from haploGen() and include it below (with my own slight modifications for my specific problem). Hopefully you can lend some advice as to the issue I am facing (see below).


Note here I refer to a haplotype as a unique DNA sequence.



library(pegas)


set.seed(17)


sim.seqs <- TRUE

length.seqs <- 500

num.seqs <- 100 # number of DNA sequences



if (sim.seqs == TRUE) {



      nucl <- as.DNAbin(c('a','c','g','t'))


     res <- sample(nucl, size = length.seqs, replace = TRUE, prob = rep(0.25, 4))


     if (subst.model == "K80") {



        transi.set <- list('a' = as.DNAbin('g'),

                           'c' = as.DNAbin('t'),

                           'g' = as.DNAbin('a'),

                           't' = as.DNAbin('c'))

        transv.set <- list('a' = as.DNAbin(c('c', 't')),

                           'c' = as.DNAbin(c('a', 'g')),

                           'g' = as.DNAbin(c('c', 't')),

                           't' = as.DNAbin(c('a', 'g')))



        transi <- function(res) {

          unlist(transi.set[as.character(res)])

        }



        transv <- function(res) {

          sapply(transv.set[as.character(res)], sample, 1)

        }



        duplicate.seq <- function(res) {

          num.transi <- rbinom(n = 1, size = length.seqs, prob = transi.rate) # total number of transitions

          if (num.transi > 0) {

            idx <- sample(length.seqs, size = num.transi, replace = FALSE)

            res[idx] <- transi(res[idx])

          }



          num.transv <- rbinom(n = 1, size = length.seqs, prob = transv.rate) # total number of transversions

          if (num.transv > 0) {

            idx <- sample(length.seqs, size = num.transv, replace = FALSE)

            res[idx] <- transv(res[idx])

          }

          res

          }

        }



      res <- matrix(replicate(num.seqs, duplicate.seq(res)), byrow = TRUE, nrow = num.seqs)



      class(res) <- "DNAbin"



      # write.dna(res, file = "seqs.fas", format = "fasta")



      h <- sort(haplotype(res), decreasing = TRUE, what = "frequencies")

      rownames(h) <- 1:nrow(h)



    }


## Output


h


Haplotypes extracted from: res


    Number of haplotypes: 5

         Sequence length: 500


Haplotype labels and frequencies:


  1  2  3  4  5

96  1  1  1  1



If the code is run multiple times (without the seed), the same pattern emerges: h is always skewed toward the most dominant haplotype.


Is there a way that you've implemented in haploGen() (and that I have potentially glossed over) that gives different trends each time the code is run?


For example, I want to be able to run the code (without seed) and get something like


h


Haplotypes extracted from: res


    Number of haplotypes: 5

         Sequence length: 500


Haplotype labels and frequencies:


  1    2    3    4   5

35  15  25  20  5



haploGen() does exactly this. Each time haploGen() is run, a different distribution for h  is outputted. I just want to emulate what you have done, but am stuck at this roadblock and I m uncertain of what is causing this issue.


Note: I am just wanting to generalize your approach to generate DNA sequences (not necessarily for outbreak modelling).


Can you please shed some light?



Thank you in advance.



Sincerely,


Jarrett Phillips
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/adegenet-forum/attachments/20180926/408416d5/attachment.html>


More information about the adegenet-forum mailing list