[Seqinr-forum] searching for sequences from Aspergillus nidulans in 'genbank'
penel at biomserv.univ-lyon1.fr
Thu Nov 26 15:21:59 CET 2009
Dear Jean and Avril,
It seems that the missing sequences come from whole genome shotgun (WGS) projects.
These sequences are not included in ACNUC-Genbank because these data
are included in the ACNUC database "EMBL-wgs".
If you query emblwgs, you will find 248 sequences from Aspergillus
nidulans: the sequence AACD00000000 contains the 248 sequences.
Warning: you can access this sequence only via its accession number,
"ac=AACD00000000", not via its name.
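A minimal sketch of that accession-based query in R follows. It assumes a recent seqinr in which query() returns the list object (older versions assigned it in the workspace instead), and that the WGS bank is available under the name "emblwgs" as written above. The helper names accession_query and count_by_accession are hypothetical, introduced only for illustration.

```r
# Build an ACNUC selection string for an accession number,
# e.g. "AC=AACD00000000".
accession_query <- function(accession) {
  paste0("AC=", accession)
}

# Count the entries matching an accession number in a given bank.
# Assumption: the bank name "emblwgs" is valid on the server; calling
# seqinr::choosebank() with no argument lists the available names.
count_by_accession <- function(accession, bank = "emblwgs") {
  seqinr::choosebank(bank)
  on.exit(seqinr::closebank())
  # Assumption: query() returns the list (recent seqinr versions).
  res <- seqinr::query("res", accession_query(accession))
  res$nelem
}
```

Calling count_by_accession("AACD00000000") needs the seqinr package and network access to the ACNUC server; nothing is contacted until the function is called.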
Note: in the new seqinr function, it may be useful to check both the
accession number and the name to avoid this type of problem.
All the best,
Simon
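
The counts quoted in this thread can be reconciled with a few subtractions. The sketch below only restates numbers given in the messages; the variable names are hypothetical. The thread does not say whether the 248 WGS entries account for the whole gap, so the last line is plain arithmetic, not a finding.

```r
# Counts quoted in the thread (November 2009).
ncbi_nucleotide <- 12271  # NCBI "Nucleotide", all
ncbi_refseq     <- 9867   # of which RefSeq
acnuc_all       <- 18948  # ACNUC genbank, SP=aspergillus nidulans
acnuc_est       <- 16848  # of which in the EST division
wgs_entries     <- 248    # Aspergillus nidulans entries found in emblwgs

ncbi_non_refseq <- ncbi_nucleotide - ncbi_refseq    # NCBI, RefSeq excluded
acnuc_non_est   <- acnuc_all - acnuc_est            # ACNUC, ESTs excluded
gap             <- ncbi_non_refseq - acnuc_non_est  # the "missing" sequences

# Left over if all WGS entries fall inside the gap (an assumption,
# not something the thread confirms).
leftover <- gap - wgs_entries

print(c(ncbi_non_refseq = ncbi_non_refseq,
        acnuc_non_est   = acnuc_non_est,
        gap             = gap,
        leftover        = leftover))
```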
penel at biomserv.univ-lyon1.fr wrote:
> Dear Avril and Jean,
>
> Sorry for the delay in answering. I will compare the sequences to check
> where the 304 missing sequences come from.
>
>
> Do not hesitate to send me the list of problematic sequences you find in
> the other cases.
>
>
> See you, all the best,
>
> Simon
>
>
>
>
>
>
> Coghlan, Avril wrote:
>
>> Dear Jean and Simon,
>>
>> Thank you for your helpful replies.
>>
>> That is much clearer to me now.
>>
>> To help figure out why there is a difference between the 2,100 sequences retrieved via acnuc, and the 2,404 sequences retrieved by searching the NCBI website directly, would it be helpful if I printed out the sequences found by both searches for you?
>>
>> Also, I have several other cases where I find a different number of sequences when I search acnuc using SeqinR and when I search the NCBI website directly. In some cases, the numbers of sequences are quite different. I am not sure why they differ; I guess it is because I don't understand very well which data are stored in ACNUC. Would you mind explaining these cases if I gave you a list of them?
>>
>> Regards, and thanks,
>> Avril
>>
>> -----Original Message-----
>> From: penel at biomserv.univ-lyon1.fr [mailto:penel at biomserv.univ-lyon1.fr]
>> Sent: 16 November 2009 17:42
>> To: Jean lobry
>> Cc: seqinr-forum at r-forge.wu-wien.ac.at; Coghlan, Avril
>> Subject: Re: [Seqinr-forum] searching for sequences from Aspergillus nidulans in 'genbank'
>>
>> Dear Avril and Jean,
>>
>> Jean is right when he says that GenBank is divided into several divisions,
>> but the EST division is actually present in GenBank under ACNUC.
>> (By the way, you can select sequences from a given division in ACNUC by using
>> the keyword "division division_name":
>> 'k=division est' will select sequences from the EST division.)
>>
>> The case of the differences in number of sequences may be explained by
>> the following:
>>
>> When you query NCBI with
>>
>> "Aspergillus nidulans"[ORGN]
>>
>> you obtain the following line:
>>
>> Found 29119 nucleotide sequences. Nucleotide [12271] EST [16848]
>>
>> If you check "Nucleotide", you get:
>> All: 12271
>> RefSeq: 9867
>> mRNA: 9760
>>
>> The problem is that RefSeq sequences are not in GenBank under ACNUC
>> (but in RefSeq under ACNUC), so these sequences are not found.
>>
>> Alternatively, if you check "EST", you get:
>> All: 16848
>> mRNA: 16848
>>
>> In ACNUC, if you type
>> "sp=Aspergillus nidulans et k=division est"
>> you will get the same number, 16848 sequences.
>>
>>
>> I have no explanation yet for the difference between the sequences
>> in Nucleotide without RefSeq (i.e. 2,404 sequences) and the ACNUC sequences
>> from Aspergillus nidulans which are not ESTs (i.e. 2,100 sequences). I
>> will check further...
>>
>>
>> all the best,
>> Simon
>>
>>
>>
>>
>> Jean lobry wrote:
>>
>>
>>>> Dear SeqinR forum,
>>>>
>>>> Today I used SeqinR to retrieve sequences from the fungus Aspergillus
>>>> nidulans from the ACNUC 'genbank' database, using the commands:
>>>>
>>>>
>>>>
>>>>> choosebank("genbank")
>>>>> query("anidulans","SP=aspergillus nidulans")
>>>>> anidulans$nelem
>>>>>
>>>>>
>>>>>
>>>> [1] 18948
>>>> This means that there were 18948 sequences from Aspergillus nidulans
>>>> found.
>>>>
>>>> As far as I understand it, the ACNUC 'genbank' database corresponds to
>>>> the NCBI Nucleotide database, is that right?
>>>>
>>>> I also did a search directly of the NCBI Nucleotide database on the NCBI
>>>> website for Aspergillus nidulans sequences, by going to
>>>> http://www.ncbi.nlm.nih.gov/nucleotide/ and searching for "Aspergillus
>>>> nidulans"[ORGN]. That search found 29119 nucleotide sequences (16848 of
>>>> which are ESTs).
>>>>
>>>> I am wondering why there is a difference between the search that I did
>>>> of the ACNUC 'genbank' database, and on the NCBI Nucleotide Database
>>>> website?
>>>> I don't think it can be due to the ACNUC database missing some sequences
>>>> recently submitted to NCBI, as the ACNUC website says that the ACNUC
>>>> 'genbank' database was very recently updated, on Nov 13, 2009 (from
>>>> http://pbil.univ-lyon1.fr/cgi-bin/get_relnum?db=GenBank&ident=1929541324
>>>> ).
>>>>
>>>>
>>>>
>>> Dear Avril,
>>>
>>> genbank is organized into general divisions:
>>>
>>> http://www.ncbi.nlm.nih.gov/HTGS/divisions.html
>>>
>>> They are not all included in our ACNUC database for genbank.
>>>
>>> IIRC, the "functional divisions" (viz. EST, STS, GSS and HTG) are
>>> not included. This explains the difference between the two results.
>>>
>>> @Simon: am I correct here?
>>>
>>>
>>>
>>>
>>>> I will be very grateful for your advice, as I would like to use the
>>>> SeqinR library for a bioinformatics practical for students, and want to
>>>> make sure I understand how it works.
>>>>
>>>>
>>>>
>>> For a practical for students I would suggest using frozen databases.
>>> They are accessible with the special value "TP" for the "tagbank"
>>> argument on opening (TP means "Travaux Pratiques", which is French
>>> for practicals).
>>>
>>> ######
>>>
>>>
>>>
>>>> library(seqinr)
>>>> choosebank(tagbank = "TP")
>>>>
>>>>
>>>>
>>> [1] "emblTP" "swissprotTP" "hoverprotTP" "hovernuclTP" "trypano"
>>>
>>>
>>>
>>>> choosebank("emblTP")
>>>> banknameSocket$details
>>>>
>>>>
>>>>
>>> [1] " **** ACNUC Data Base Content ****
>>> "
>>> [2] " EMBL Library Release 78 WITHOUT ESTs (March 2004)"
>>> [3] "27,571,397,913 bases; 12,533,594 sequences; 1,604,500 subseqs;
>>> 339,186 refers."
>>> [4] "Software by M. Gouy & M. Jacobzone, Laboratoire de biometrie,
>>> Universite Lyon I "
>>>
>>>
>>>
>>>> query("anidulans","SP=aspergillus nidulans")
>>>> anidulans$nelem
>>>>
>>>>
>>>>
>>> [1] 218
>>> ######
>>>
>>> There are only 218 sequences for Aspergillus nidulans in this frozen
>>> version of EMBL, but the advantage is that the results are stable over
>>> time. Your practical will be ready unchanged for next year.
>>>
>>> Best,
>>>
>>>
>>>
>>
>>
>
>
>
--
Simon Penel
Laboratoire de Biometrie et Biologie Evolutive
Bat 711 - CNRS UMR 5558 - Universite Lyon 1
43 bd du 11 novembre 1918
69622 Villeurbanne Cedex
Tel: 04 72 43 29 04
Fax: 04 72 43 13 88
http://lbbe.univ-lyon1.fr/-Penel-Simon-.html?lang=fr