[Seqinr-forum] how to get sequence information of a certain gene?

penel at biomserv.univ-lyon1.fr penel at biomserv.univ-lyon1.fr
Wed Nov 5 11:05:11 CET 2008


Hi,

I am afraid there is no  simple way to do that.

As first steps, I suggest the following:

First of all,  query ensembl ( the current release is ensembl 49):

 >library("seqinr")
 >choosebank("ensembl")

then, to get the sequences associated to the gene ENSG00000171791 :


 >query("toto","k=ENSG00000171791")


you obtain 2 sequences which are the  CDS associated to the gene


 > toto$req[1]
[[1]]
          name          length           frame          ncbicg
"HOMO18_63.PE1"           "720"             "0"             "1"

 > toto$req[2]
[[1]]
          name          length           frame          ncbicg
"HOMO18_63.PE2"           "720"             "0"             "1"


It is possible that several CDS are associated to the same gene when 
alternative splicing occurs.
Unfortunately, it seems that in the current case, this is probably due 
in a redundancy in the ensembl annotations.



(Maybe this problem disapears in release 50)



Then for each of these CDS you can get its name

 >select <- toto$req[1]
 >nom <- getName(select)
 > nom
[1] "HOMO18_63.PE1"
 >

Afterwards  you should get the subsequences found in the sequence wich 
contains the CDS, ie the sequence named HOMO18_63 (the name of the CDS
without the extension)

For example, if you want ti obtain the for internal introns
 >query("intint","n=HOMO18_63 et t=int_int")

Finaly you should select the internal introns associated to the  CDS you 
want


and then you get all the subsequences associated to this sequence

 >modifylist("intint", op = "\"HOMO18_63.PE1       \"", type = "scan")
 > intint$req
[[1]]
          name          length           frame          ncbicg
"HOMO18_63.IN1"        "189322"             "0"             "1"

It means there is 1 internal intron with a length of 189322 in the gene


With these information you can calculate the number of intron and exons 
and the total length of introns and exons.


In the same manner you can get the 3'ncr region of the CDS:

 >query("ncr3","n=HOMO18_63 et t=3'ncr")
 >modifylist("ncr3", op = "\"HOMO18_63.PE1       \"", type = "scan")
 > ncr3$req
[[1]]
          name          length           frame          ncbicg
"HOMO18_63.3F1"         "85401"             "0"             "1"


For the 5'ncr:

 > query("ncr5","n=HOMO18_63 et t=5'ncr")
 > modifylist("ncr5", op = "\"HOMO18_63.PE1       \"", type = "scan")
 > ncr5$req
[[1]]
          name          length           frame          ncbicg
"HOMO18_63.5F1"         "13115"             "0"             "1"

 >


Concerning 3'UTR and CpG islands, it depends if this information is 
present in Ensembl annotations...
It seems not.


Hope this help,

Simon





#LI YONGJIN# a écrit :
> Hi Seqinr list,
> Given a gene name (ENSG00000171791), how to get the information listed below.
> number of exons, total exon length, total intron length; 
> number of promoter CpG islands, 3'UTR length.
> Best Regards,
> Rocky
>  
>
>   
> ------------------------------------------------------------------------
>
> _______________________________________________
> Seqinr-forum mailing list
> Seqinr-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/seqinr-forum
>   


-- 
Simon Penel
Laboratoire de Biometrie et Biologie Evolutive           Bat 711  -   CNRS UMR 5558  -    Universite Lyon 1              43 bd du 11 novembre 1918 69622 Villeurbanne Cedex       Tel:   04 72 43 29 04      Fax:  04 72 43 13 88
http://lbbe.univ-lyon1.fr/-Penel-Simon-.html?lang=fr



More information about the Seqinr-forum mailing list