[Seqinr-forum] how to get sequence information of a certain gene?
penel at biomserv.univ-lyon1.fr
penel at biomserv.univ-lyon1.fr
Wed Nov 5 11:05:11 CET 2008
Hi,
I am afraid there is no simple way to do that.
As first steps, I suggest the following:
First of all, query ensembl ( the current release is ensembl 49):
>library("seqinr")
>choosebank("ensembl")
then, to get the sequences associated to the gene ENSG00000171791 :
>query("toto","k=ENSG00000171791")
you obtain 2 sequences which are the CDS associated to the gene
> toto$req[1]
[[1]]
name length frame ncbicg
"HOMO18_63.PE1" "720" "0" "1"
> toto$req[2]
[[1]]
name length frame ncbicg
"HOMO18_63.PE2" "720" "0" "1"
It is possible that several CDS are associated to the same gene when
alternative splicing occurs.
Unfortunately, it seems that in the current case, this is probably due
in a redundancy in the ensembl annotations.
(Maybe this problem disapears in release 50)
Then for each of these CDS you can get its name
>select <- toto$req[1]
>nom <- getName(select)
> nom
[1] "HOMO18_63.PE1"
>
Afterwards you should get the subsequences found in the sequence wich
contains the CDS, ie the sequence named HOMO18_63 (the name of the CDS
without the extension)
For example, if you want ti obtain the for internal introns
>query("intint","n=HOMO18_63 et t=int_int")
Finaly you should select the internal introns associated to the CDS you
want
and then you get all the subsequences associated to this sequence
>modifylist("intint", op = "\"HOMO18_63.PE1 \"", type = "scan")
> intint$req
[[1]]
name length frame ncbicg
"HOMO18_63.IN1" "189322" "0" "1"
It means there is 1 internal intron with a length of 189322 in the gene
With these information you can calculate the number of intron and exons
and the total length of introns and exons.
In the same manner you can get the 3'ncr region of the CDS:
>query("ncr3","n=HOMO18_63 et t=3'ncr")
>modifylist("ncr3", op = "\"HOMO18_63.PE1 \"", type = "scan")
> ncr3$req
[[1]]
name length frame ncbicg
"HOMO18_63.3F1" "85401" "0" "1"
For the 5'ncr:
> query("ncr5","n=HOMO18_63 et t=5'ncr")
> modifylist("ncr5", op = "\"HOMO18_63.PE1 \"", type = "scan")
> ncr5$req
[[1]]
name length frame ncbicg
"HOMO18_63.5F1" "13115" "0" "1"
>
Concerning 3'UTR and CpG islands, it depends if this information is
present in Ensembl annotations...
It seems not.
Hope this help,
Simon
#LI YONGJIN# a écrit :
> Hi Seqinr list,
> Given a gene name (ENSG00000171791), how to get the information listed below.
> number of exons, total exon length, total intron length;
> number of promoter CpG islands, 3'UTR length.
> Best Regards,
> Rocky
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Seqinr-forum mailing list
> Seqinr-forum at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/seqinr-forum
>
--
Simon Penel
Laboratoire de Biometrie et Biologie Evolutive Bat 711 - CNRS UMR 5558 - Universite Lyon 1 43 bd du 11 novembre 1918 69622 Villeurbanne Cedex Tel: 04 72 43 29 04 Fax: 04 72 43 13 88
http://lbbe.univ-lyon1.fr/-Penel-Simon-.html?lang=fr
More information about the Seqinr-forum
mailing list