[Seqinr-forum] Suggestions for web-scraping a webpage with R ?

Tal Galili tal.galili at gmail.com
Sun Mar 14 21:39:14 CET 2010


Hi all,

After searching for a way to do a tRNA alignment based on the secondary
structure folding, I abandoned this strategy and am now trying to rely on
the processed data that is available online. But now I need to download
and parse it, and this is where I would love any suggestions or help.


I would like to go through all the "species pages" present in this link:

http://gtrnadb.ucsc.edu/

So for each of them I will go to:

   1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/ )
   2. And then to the "Secondary Structures" page link (for example:
   http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
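
To make the two steps concrete, here is a minimal sketch in R of how the
URLs could be built for one species. The helper name species_urls is made
up for illustration, and the "<species>-structs.html" pattern is only
inferred from the single Aero_pern example above, so it may not hold for
every species on the site.

```r
# Base address taken from the links above.
base_url <- "http://gtrnadb.ucsc.edu"

# Hypothetical helper: build both page URLs for one species directory name.
# Assumption: every species follows the same naming pattern as Aero_pern.
species_urls <- function(species) {
  list(
    species_page = sprintf("%s/%s/", base_url, species),
    structs_page = sprintf("%s/%s/%s-structs.html", base_url, species, species)
  )
}

urls <- species_urls("Aero_pern")
print(urls$structs_page)
```

The list of species directory names would still have to be collected from
the front page first; that part is not shown here.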

On that page I wish to scrape the data so that I end up with a long list
containing records like this one (for example):

chr.trna3 (1-77)    Length: 77 bp
Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....

Where each line will have its own list entry (inside the list for each
"trna", inside the list for each species)
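
As a rough sketch of the nested structure I have in mind, the base R code
below parses the four example lines into one list and nests it by species
and by tRNA name. The helper names capture and parse_trna, the field names,
and the assumption that every record is exactly four lines in this order
are mine; the real pages may need more forgiving patterns.

```r
# The example record from above, one element per line.
record <- c(
  "chr.trna3 (1-77)    Length: 77 bp",
  "Type: Ala   Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)

# Hypothetical helper: return the first capture group of `pattern` in `x`,
# or NA if the pattern does not match.
capture <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Hypothetical helper: turn one four-line record into a named list.
parse_trna <- function(lines) {
  list(
    name      = capture("^([^[:space:]]+)", lines[1]),
    length    = as.integer(capture("Length: ([0-9]+) bp", lines[1])),
    type      = capture("Type: ([^[:space:]]+)", lines[2]),
    anticodon = capture("Anticodon: ([^[:space:]]+)", lines[2]),
    score     = as.numeric(capture("Score: ([0-9.]+)", lines[2])),
    seq       = capture("^Seq: ([^[:space:]]+)", lines[3]),
    str       = capture("^Str: ([^[:space:]]+)", lines[4])
  )
}

trna <- parse_trna(record)

# One list per species, holding one list per tRNA.
all_trna <- list(Aero_pern = setNames(list(trna), trna$name))
str(all_trna, max.level = 2)
```

With this layout a single field would be reached as, for example,
all_trna$Aero_pern$chr.trna3$anticodon.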

I remember coming across the packages RCurl and XML (in R) that can be used
for such a task, but I don't know how to use them. So what I would love to
have is: 1. Some suggestions on how to build such code. 2. Recommendations
on how to learn what is needed to perform such a task.

Thanks for any help,

Tal

----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------