[Seqinr-forum] searching for sequences from Aspergillus nidulans in 'genbank'
Coghlan, Avril
A.Coghlan at ucc.ie
Wed Nov 18 18:45:18 CET 2009
Dear Jean and Simon,
Thank you for your helpful replies.
That is much clearer to me now.
To help figure out why there is a difference between the 2,100 sequences retrieved via acnuc, and the 2,404 sequences retrieved by searching the NCBI website directly, would it be helpful if I printed out the sequences found by both searches for you?
Also, I have several other cases where I find a different number of sequences when I search ACNUC using SeqinR than when I search the NCBI website directly. In some cases, the numbers are quite different. I am not sure why; I guess it is because I don't understand very well which data are stored in ACNUC. Would you mind explaining these cases if I gave you a list of them?
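One way to narrow down such a discrepancy is to compare the two result sets by accession number instead of only by count. The sketch below is a minimal illustration in base R. The accession values and the variable names `acnuc_acc` and `ncbi_acc` are made up for the example; real vectors would have to be exported from each search.

```r
# Made-up accession vectors standing in for the two result sets.
# These values are placeholders, not real search results.
acnuc_acc <- c("FAKE0001", "FAKE0002", "FAKE0003")
ncbi_acc  <- c("FAKE0002", "FAKE0003", "FAKE0004", "FAKE0005")

# Entries found by one search but not by the other
only_in_ncbi  <- setdiff(ncbi_acc, acnuc_acc)
only_in_acnuc <- setdiff(acnuc_acc, ncbi_acc)

# Entries found by both searches
shared <- intersect(acnuc_acc, ncbi_acc)

print(only_in_ncbi)
print(only_in_acnuc)
print(length(shared))
```

Looking at a few entries from `only_in_ncbi` (their division, molecule type or submission date) is usually more informative than the difference in totals alone.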
Regards, and thanks,
Avril
-----Original Message-----
From: penel at biomserv.univ-lyon1.fr [mailto:penel at biomserv.univ-lyon1.fr]
Sent: 16 November 2009 17:42
To: Jean lobry
Cc: seqinr-forum at r-forge.wu-wien.ac.at; Coghlan, Avril
Subject: Re: [Seqinr-forum] searching for sequences from Aspergillus nidulans in 'genbank'
Dear Avril and Jean,
Jean is right when he says that GenBank is divided into several divisions,
but the EST division is actually present in GenBank under ACNUC.
(By the way, you can select sequences from a given division in ACNUC by using
the keyword "division division_name":
'k=division est' will select sequences from the EST division.)
The case of the differences in number of sequences may be explained by
the following:
When you query NCBI with
"Aspergillus nidulans"[ORGN]
you obtain the following line:
Found 29119 nucleotide sequences. Nucleotide [12271] EST [16848]
If you check "Nucleotide" you get:
All: 12271
RefSeq: 9867
mRNA: 9760
The problem is that RefSeq sequences are not in GenBank under ACNUC
(but in RefSeq under ACNUC), so these sequences are not found.
Alternatively, if you check "EST" you get:
All: 16848
mRNA: 16848
In ACNUC, if you type
"sp=Aspergillus nidulans et k=division est"
you will get exactly the same number, 16848 sequences.
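The same EST-only query can be sketched from R with SeqinR. This is a hedged illustration, not a tested recipe: `build_acnuc_query` is a hypothetical helper (it is not part of seqinr), the query string is the one quoted above, and the network part assumes a recent seqinr release in which `query()` returns the list and the ACNUC server is reachable.

```r
# Hypothetical helper (not part of seqinr): build an ACNUC query string
# from a species name and an optional GenBank division.
build_acnuc_query <- function(species, division = NULL) {
  q <- paste0("sp=", species)
  if (!is.null(division)) {
    # "et" is the operator used in the query quoted in this thread
    q <- paste0(q, " et k=division ", division)
  }
  q
}

est_query <- build_acnuc_query("Aspergillus nidulans", "est")
print(est_query)

# Only attempted in an interactive session with seqinr installed,
# because it needs a network connection to the ACNUC server.
if (interactive() && requireNamespace("seqinr", quietly = TRUE)) {
  seqinr::choosebank("genbank")
  anid_est <- seqinr::query("anid_est", est_query)
  print(anid_est$nelem)  # number of sequences in the resulting list
  seqinr::closebank()
}
```

Since the live 'genbank' bank is updated regularly, the count returned today should not be expected to match the 16848 reported here.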
I have no explanation yet for the difference between the sequences
in Nucleotide without RefSeq (i.e. 2,404 sequences) and the ACNUC sequences
from Aspergillus nidulans which are not ESTs (i.e. 2,100 sequences). I
will check further...
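For reference, the arithmetic behind these figures can be written out in a few lines of R. All counts are the ones quoted in this thread; only the variable names are introduced for the illustration.

```r
# Counts quoted in this thread
ncbi_total      <- 29119  # all NCBI hits for "Aspergillus nidulans"[ORGN]
ncbi_nucleotide <- 12271  # NCBI "Nucleotide" hits
ncbi_refseq     <- 9867   # RefSeq entries among the "Nucleotide" hits
ncbi_est        <- 16848  # NCBI "EST" hits
acnuc_total     <- 18948  # anidulans$nelem from ACNUC 'genbank'
acnuc_est       <- 16848  # ACNUC hits restricted to the EST division

# The NCBI total is the sum of the two categories
stopifnot(ncbi_nucleotide + ncbi_est == ncbi_total)

# NCBI Nucleotide entries that are not RefSeq
ncbi_non_refseq <- ncbi_nucleotide - ncbi_refseq

# ACNUC entries that are not ESTs
acnuc_non_est <- acnuc_total - acnuc_est

# Sequences still unaccounted for
unexplained <- ncbi_non_refseq - acnuc_non_est

print(c(ncbi_non_refseq = ncbi_non_refseq,
        acnuc_non_est   = acnuc_non_est,
        unexplained     = unexplained))
```

So the EST counts agree exactly, RefSeq accounts for 9867 of the missing entries, and 304 sequences remain unexplained.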
all the best,
Simon
Jean Lobry wrote:
>> Dear SeqinR forum,
>>
>> Today I used SeqinR to retrieve sequences from the fungus Aspergillus
>> nidulans from the ACNUC 'genbank' database, using the commands:
>>
>>> choosebank("genbank")
>>> query("anidulans","SP=aspergillus nidulans")
>>> anidulans$nelem
>>>
>> [1] 18948
>> This means that there were 18948 sequences from Aspergillus nidulans
>> found.
>>
>> As far as I understand it, the ACNUC 'genbank' database corresponds to
>> the NCBI Nucleotide database, is that right?
>>
>> I also did a search directly of the NCBI Nucleotide database on the NCBI
>> website for Aspergillus nidulans sequences, by going to
>> http://www.ncbi.nlm.nih.gov/nucleotide/ and searching for "Aspergillus
>> nidulans"[ORGN]. That search found 29119 nucleotide sequences (12271 of
>> which are ESTs).
>>
>> I am wondering why there is a difference between the search that I did
>> of the ACNUC 'genbank' database, and on the NCBI Nucleotide Database
>> website?
>> I don't think it can be due to the ACNUC database missing some sequences
>> recently submitted to NCBI, as the ACNUC website says that the ACNUC
>> 'genbank' database was very recently updated, on Nov 13, 2009 (from
>> http://pbil.univ-lyon1.fr/cgi-bin/get_relnum?db=GenBank&ident=1929541324
>> ).
>>
>
> Dear Avril,
>
> genbank is organized into general divisions:
>
> http://www.ncbi.nlm.nih.gov/HTGS/divisions.html
>
> They are not all included in our ACNUC database for genbank.
>
> IIRC, the "functional divisions" (viz. EST, STS, GSS and HTG) are
> not included. This explains the difference between the two results.
>
> @Simon: am I correct here?
>
>
>> I will be very grateful for your advice, as I would like to use the
>> SeqinR library for a bioinformatics practical for students, and want to
>> make sure I understand how it works.
>>
>
> For a practical for students I would suggest using frozen databases.
> They are accessible with the special value "TP" for the "tagbank"
> argument on opening (TP means "Travaux Pratiques", which is French
> for practicals).
>
> ######
>
>> library(seqinr)
>> choosebank(tagbank = "TP")
>>
> [1] "emblTP" "swissprotTP" "hoverprotTP" "hovernuclTP" "trypano"
>
>> choosebank("emblTP")
>> banknameSocket$details
>>
> [1] " **** ACNUC Data Base Content **** "
> [2] " EMBL Library Release 78 WITHOUT ESTs (March 2004)"
> [3] "27,571,397,913 bases; 12,533,594 sequences; 1,604,500 subseqs; 339,186 refers."
> [4] "Software by M. Gouy & M. Jacobzone, Laboratoire de biometrie, Universite Lyon I "
>
>> query("anidulans","SP=aspergillus nidulans")
>> anidulans$nelem
>>
> [1] 218
> ######
>
> There are only 218 sequences for Aspergillus nidulans in this frozen
> version of EMBL, but the advantage is that the results are stable over
> time. Your practical will be ready unchanged for next year.
>
> Best,
>
--
Simon Penel
Laboratoire de Biometrie et Biologie Evolutive
Bat 711 - CNRS UMR 5558 - Universite Lyon 1
43 bd du 11 novembre 1918
69622 Villeurbanne Cedex
Tel: 04 72 43 29 04 Fax: 04 72 43 13 88
http://lbbe.univ-lyon1.fr/-Penel-Simon-.html?lang=fr