[adegenet-forum] DNAbin and pop

Jombart, Thibaut t.jombart at imperial.ac.uk
Mon Dec 16 08:02:11 CET 2013


Hello, 

no, it does improve on your first script drastically.

Before, you had to enter the population factor manually in R; now you can extract this information automatically from the names of your sequences with one short command (gsub(...)).
In:
gsub("[[:digit:]]","",lab)
you just need to replace 'lab' with the labels of your sequences (labels(yourDNAbinObject)).
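
As a minimal sketch of that substitution, the example below writes a tiny fasta file with made-up sequences, reads it with read.dna from the ape package (used here instead of fasta2DNAbin only to keep dependencies small), and derives the population factor from the labels. The file contents and label values are assumptions for illustration, not real data.

```r
library(ape)  # provides read.dna() and the DNAbin class

# Write a tiny fasta file so the example is self-contained (made-up sequences)
fas <- tempfile(fileext = ".fas")
writeLines(c(">AD0101", "acgtacgt",
             ">AD0666", "acgtacga",
             ">FR1212", "acgaacgt",
             ">CD1495", "ccgtacgt"), fas)

dna <- read.dna(fas, format = "fasta")

# Remove every digit from the labels; what remains is the population code
pop <- factor(gsub("[[:digit:]]", "", labels(dna)))
table(pop)
```

The only line that matters for a real dataset is the gsub() call; the rest just builds a toy input.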

What you are asking for won't be possible because fasta files only store 1) one sequence label and 2) the sequence.
However, since we are talking about just one extra line of code, I think this is still an efficient way to do things.

DNAbin objects do not store population information, so pop(...) won't work on them. If you want to store both in a single object, you can use a list where $dna is your DNAbin and $pop is the population factor.
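
A rough sketch of such a list, assuming the ape package and a toy alignment built by hand (the sequences and the names 'dat', 'keep' and 'ad' are made up for illustration):

```r
library(ape)  # provides as.DNAbin() and the DNAbin class

# Toy alignment: one row per sequence, row names are the sequence labels
seqs <- rbind(AD0101 = c("a", "c", "g", "t"),
              AD0666 = c("a", "c", "g", "a"),
              FR1212 = c("a", "c", "c", "t"))
dna <- as.DNAbin(seqs)

# Keep the sequences and the population factor side by side in one list
dat <- list(dna = dna,
            pop = factor(gsub("[[:digit:]]", "", labels(dna))))

# When subsetting, apply the same index to both components
keep <- dat$pop == "AD"
ad <- list(dna = dat$dna[keep, ], pop = droplevels(dat$pop[keep]))
labels(ad$dna)
```

A plain list does not keep the two components in sync for you, so any subsetting or reordering has to be applied to $dna and $pop together, as in the last step.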

> It would be neat to have a way of reading from the fasta/phylip files the first two letters, and use them as factors

No, it would not, because this is not part of the format definition nor a common practice (though storing info in the sequence labels is). 

> because the departure examples include R.data, which are not very useful for the beginners.

RData files are usually easier to distribute alongside an R package, but they are not the only kind of input used. Examples with non-RData inputs include:
- read.genetix
- read.fstat
- read.genepop
- read.structure
- read.snp
- read.dna
- fasta2DNAbin
- ...

You can find at least 2 tutorials on adegenet's website with non-RData input files (actually, fasta files). 

Cheers
Thibaut


________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Rita Castilho [rita.castil at gmail.com]
Sent: 16 December 2013 06:42
To: adegenet-forum at lists.r-forge.r-project.org
Subject: Re: [adegenet-forum] DNAbin and pop

Dear Thibaut

Thanks for the prompt reply!
Unfortunately I do not see how that improves on the example given.
When one uses allelic data, there are simple (automatic) ways to build a genind object that includes the pop factor or even xy coordinates. That is because the available file-reading functions include that possibility (read.genepop retains the pop info; read.genalex retains the pop and xy info), and there is no need for further manipulation. So I was looking for something similar, perhaps not a file-reading function, because read.fasta does not include that, but a set of scripts that will do it.
I saw another previous suggestion of yours, but it still requires an extra file:
popFac <- read.csv("oneColumnFileWithMyGroupsInIt.csv")
popFac <- factor(unlist(popFac))
pop(obj) <- popFac

and in any case I could not understand how to use it, as I get an error:

data.dnabin <- fasta2DNAbin("Engraulis_P3_mtDNA.fas")
popFac <- read.csv("Engraulis_P3_mtDNA_pops.csv")
popFac <- factor(unlist(popFac))
pop(data.dnabin) <- popFac

Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘pop<-’ for signature ‘"DNAbin"’

It would be neat to have a way of reading from the fasta/phylip files the first two letters, and use them as factors. I am not familiarized with R enough to be able to do it. I just use the packages, and most of the times I have a hard time to get things working, because the departure examples include R.data, which are not very useful for the beginners.

In any case I appreciate your efforts towards programming for the community!


Best
Rita




Jombart, Thibaut t.jombart at imperial.ac.uk
December 16, 2013 5:33 AM

Hello,

yes, there are simpler ways. sub/gsub and regular expressions are immensely useful to extract information contained in the labels of sequences.

For instance:
##
lab <- c("AD01012","AD666","FR1212","AD0101","FR9873")
lab
[1] "AD01012" "AD666"   "FR1212"  "AD0101"  "FR9873"

pop <- gsub("[[:digit:]]","",lab)
pop
[1] "AD" "AD" "FR" "AD" "FR"
##

For some useful examples, see ?sub and ?regexp
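
As an illustration of what those help pages cover, here is a small base-R sketch of three ways to get the same prefix from labels made of two letters followed by digits. The label values are made up.

```r
lab <- c("AD0101", "CD1495", "FR9873", "AD0666")

pop1 <- gsub("[[:digit:]]", "", lab)           # remove all digits
pop2 <- substr(lab, 1, 2)                      # keep the first two characters
pop3 <- sub("^([[:alpha:]]+).*$", "\\1", lab)  # keep the leading run of letters

# For this label format the three approaches agree
stopifnot(identical(pop1, pop2), identical(pop2, pop3))
factor(pop1)
```

The approaches only differ on less regular labels: with something like "AD01B2", removing digits would give "ADB", whereas substr() and the anchored sub() would both give "AD".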

Cheers
Thibaut

________________________________________
From: adegenet-forum-bounces at lists.r-forge.r-project.org [adegenet-forum-bounces at lists.r-forge.r-project.org] on behalf of Rita Castilho [rita.castil at gmail.com]
Sent: 16 December 2013 05:02
To: adegenet-forum at lists.r-forge.r-project.org
Subject: [adegenet-forum] DNAbin and pop

Hi!
I am new to R and I have a lot of trouble going from a phylip or fasta file to a genind object, or to a DNAbin object (from fasta2DNAbin), that contains pop information.
My files are always phylip or fasta files, and each sequence has a label composed of two letters followed by 4 numeric digits (e.g. CD1495). The first two letters determine the population to which the sequence belongs.

Is there a quicker way than the one below? The grouping factor can easily be deduced from the current individual labels, which would save the task of reading that info into R separately.

#reading data
dna <- fasta2DNAbin('data.fas')
# setting pops
data.pop <- as.factor(rep(c('AD', 'CD', 'FR', 'GE', 'RE', 'OT', 'YU', 'AU'), c(17, 11, 12, 12, 25, 14, 13, 20)))

Many thanks
Rita



