[adegenet-forum] reading dipolid genetic sequence data for use in SPCA

Jombart, Thibaut t.jombart at imperial.ac.uk
Tue Nov 15 12:06:21 CET 2011


Hi Nevil, 

This functionality does not exist yet. The function read.dna does not formally allow for storing 2 sequences per individual, although if you keep track of individuals ID as in your case, this is possible.

There is a kludge that should do the job though. Convert your data to a genind with twice the number of individuals, then aggregate data as per population (genind2genpop) only using the individual ID as population; revert back to allele frequencies; create a genind from this output.
Here's an example with your sample data (I had to change the sequences, there were no polymorphic site in the sample)
####
library(adegenet)
library(ape)
dna <- read.dna("dat.fa", format="fasta")
temp <- DNAbin2genind(dna)
> nInd(temp) # 4 sequences
[1] 4
> indNames(temp) # names are duplicated, all is fine
          1           2           3           4 
   "EYR018"    "EYR018" "ANWC45051" "ANWC45051" 

## now aggregate per individual
x <- genind2genpop(temp, pop=indNames(temp)) # this object has 2 'populations'
tab <- makefreq(x)$tab
####

The object 'tab' is a table of allele frequencies per individual; thus it can be used by the genind constructor to make a valid genind object:
####
x <- genind(tab)

> x

   #####################
   ### Genind object ### 
   #####################
- genotypes of individuals - 

S4 class:  genind
@call: genind(tab = tab)

@tab:  2 x 18 matrix of genotypes

@ind.names: vector of  2 individual names
@loc.names: vector of  9 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  18 columns of @tab
@all.names: list of  9 components yielding allele names for each locus
@ploidy:  2
@type:  codom

Optionnal contents: 
@pop:  - empty -
@pop.names:  - empty -

@other: - empty -
####

We can check the content:
####
## old sequences:
> as.character(dna)[,as.integer(locNames(x))]
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
EYR018    "t"  "t"  "g"  "t"  "t"  "t"  "t"  "g"  "a" 
EYR018    "a"  "a"  "c"  "c"  "t"  "t"  "t"  "g"  "a" 
ANWC45051 "a"  "a"  "g"  "t"  "t"  "t"  "t"  "g"  "a" 
ANWC45051 "a"  "a"  "g"  "t"  "a"  "a"  "a"  "c"  "t" 

## genind object
> genind2df(x, sep="/")
           10  11  80  81  84  85  86 193 196
EYR018    a/t a/t c/g c/t t/t t/t t/t g/g a/a
ANWC45051 a/a a/a g/g t/t a/t a/t a/t c/g a/t
####

Job done. Dirty, but that should work as long as each individual ID appears exactly twice in the original DNA file.

I'll implement this in DNAbin2genind when I have time, probably for the next release.

Cheers

Thibaut

________________________________________
From: adegenet-forum-bounces at r-forge.wu-wien.ac.at [adegenet-forum-bounces at r-forge.wu-wien.ac.at] on behalf of Nevil Amos [nevil.amos at gmail.com]
Sent: 15 November 2011 05:44
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] reading dipolid genetic sequence data for use in      SPCA

I am trying to read dipolid sequence  data in fasta format into a genind
object for spca.    The input data is as pairs of sequences arranged
with the Id of the individual before ach sequence so each individual
name appears before tow consecutive sequences.  there are 44 individuals
with 88 sequences of equal length.

when I read this using read.dna() and dnabin2genind()  the data is read
but there are  88 inds in geneind at ind.names, and 88 genotypes in
genind at tab,  each ind.name is repeated, ie the sequences are being
treated as from 88 individuals rather than paired sequences from 44.

I cannot see where I include an argument to inform that the data is dipoid.

Please can you advise how to proceed with this data?

I have pasted below the code and the data for the first two individuals:
 > AB4<-read.dna("d:/nevs_docs/EYR/NccGenes/AB4_phased.fas", format =
"fasta", skip = 0,
+          nlines = 0, comment.char = "[", seq.names = NULL,
+          as.character = F, as.matrix = T)
 > AB4genind<-DNAbin2genind(AB4)
 > AB4genind

    #####################
    ### Genind object ###
    #####################
- genotypes of individuals -

S4 class:  genind
@call: DNAbin2genind(x = AB4)

@tab:  88 x 4 matrix of genotypes

@ind.names: vector of  88 individual names
@loc.names: vector of  2 locus names
@loc.nall: number of alleles per locus
@loc.fac: locus factor for the  4 columns of @tab
@all.names: list of  2 components yielding allele names for each locus
@ploidy:  1
@type:  codom

Optionnal contents:
@pop:  - empty -
@pop.names:  - empty -

@other: - empty -

fasta file for first two individuals

 >'EYR018'   [by DnaSP Ver. 5.10.01, from file:
AB4_EYR44_196bp_phased.nex     Nov 11, 2011]
GAGAGATCTAAGGAGCCTAAGCTATGATCTGTGGAGCAATAGGTGGCTCACCTGAGACAC
AGCCTTGGCTGGAGTCACGGTACTTTCCAGAGCTCTGTGCTGAAGAGC-GGCTCTCAGTG
AAGCTTTGTCCTCCCTCTGCAGGGCTGGATGGGCTGGCTGAACGCTGTGCCCAGTACAAG
AAAGATGGTGCTGACA
 >'EYR018'
GAGAGATCTAAGGAGCCTAAGCTATGATCTGTGGAGCAATAGGTGGCTCACCTGAGACAC
AGCCTTGGCTGGAGTCACGGTACTTTCCAGAGCTCTGTGCTGAAGAGC-GGCTCTCAGTG
AAGCTTTGTCCTCCCTCTGCAGGGCTGGATGGGCTGGCTGAACGCTGTGCCCAGTACAAG
AAAGATGGTGCTGACA
 >'ANWC45051'
GAGAGATCTAAGGAGCCTAAGCTATGATCTGTGGAGCAATAGGTGGCTCACCTGAGACAC
AGCCTTGGCTGGAGTCACGGTACTTTCCAGAGCTCTGTGCTGAAGAGC-GGCTCTCAGTG
AAGCTTTGTCCTCCCTCTGCAGGGCTGGATGGGCTGGCTGAACGCTGTGCCCAGTACAAG
AAAGATGGTGCTGACA
 >'ANWC45051'
GAGAGATCTAAGGAGCCTAAGCTATGATCTGTGGAGCAATAGGTGGCTCACCTGAGACAC
AGCCTTGGCTGGAGTCACGGTACTTTCCAGAGCTCTGTGCTGAAGAGC-GGCTCTCAGTG
AAGCTTTGTCCTCCCTCTGCAGGGCTGGATGGGCTGGCTGAACGCTGTGCCCAGTACAAG
AAAGATGGTGCTGACA





_______________________________________________
adegenet-forum mailing list
adegenet-forum at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/adegenet-forum


More information about the adegenet-forum mailing list