[Seqinr-forum] Suggestions for web-scraping a webpage with R ?
Tal Galili
tal.galili at gmail.com
Sun Mar 14 21:39:14 CET 2010
Hi all,
After searching for a way to do a tRNA alignment based on secondary-structure
folding, I abandoned that strategy and am now trying to rely on the processed
data that is available online. But now I need to download and parse it, and
this is where I would love any suggestions or help.
I would like to go through all the "species pages" present in this link:
http://gtrnadb.ucsc.edu/
So for each of them I will go to:
1. The species page link (for example: http://gtrnadb.ucsc.edu/Aero_pern/)
2. And then to the "Secondary Structures" page link (for example:
http://gtrnadb.ucsc.edu/Aero_pern/Aero_pern-structs.html)
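The two example links above suggest a simple URL pattern. A minimal sketch in
base R, assuming every species follows the same pattern as Aero_pern (this is
inferred from the single example only, and `species_urls` is a hypothetical
helper name):

```r
# Base address of the database, as given in the post.
base_url <- "http://gtrnadb.ucsc.edu/"

# Hypothetical helper: build the species page URL and the
# "Secondary Structures" URL from a species directory name.
# ASSUMPTION: the pattern seen for Aero_pern holds for other species.
species_urls <- function(species, base = base_url) {
  list(
    page    = paste0(base, species, "/"),
    structs = paste0(base, species, "/", species, "-structs.html")
  )
}

u <- species_urls("Aero_pern")
print(u$structs)

# Downloading the page itself could then be done with RCurl, e.g.:
# html <- RCurl::getURL(u$structs)
```

The list of species directory names would still have to be collected from the
links on the front page.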
On that page I wish to scrape the data so that I end up with a long list
containing records like this one (for example):
chr.trna3 (1-77) Length: 77 bp
Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45
Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<....
Each line should get its own list element (inside the list for each "trna",
which sits inside the list for each species).
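The nested list described above could be built roughly as follows. This is
only a sketch in base R: it assumes the page text has already been extracted
into a character vector of lines, and that every record begins with a line
containing "Length:" as in the example. The function name
`parse_trna_records` and the field names are hypothetical.

```r
# Parse plain-text tRNA records into a named list, one element per tRNA.
# ASSUMPTION: each record starts at a line containing "Length:" and is
# followed by "Type:", "Seq:" and "Str:" lines, as in the example record.
parse_trna_records <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  starts <- grep("Length:", lines, fixed = TRUE)
  if (length(starts) == 0) return(list())
  ends <- c(starts[-1] - 1L, length(lines))

  records <- Map(function(s, e) {
    block <- lines[s:e]
    # Return the text after a given prefix, or NA if the line is missing.
    field <- function(prefix) {
      hit <- grep(paste0("^", prefix), block, value = TRUE)
      if (length(hit) == 0) return(NA_character_)
      trimws(sub(paste0("^", prefix), "", hit[1]))
    }
    type_line <- field("Type:")
    list(
      header    = block[1],
      type      = sub("^([^[:space:]]+).*$", "\\1", type_line),
      anticodon = sub("^.*Anticodon:[[:space:]]*([^[:space:]]+).*$", "\\1",
                      type_line),
      score     = as.numeric(sub("^.*Score:[[:space:]]*([-0-9.]+).*$", "\\1",
                                 type_line)),
      seq       = field("Seq:"),
      str       = field("Str:")
    )
  }, starts, ends)

  # Name each record by the first token of its header line.
  names(records) <- sub("[[:space:]].*$", "", lines[starts])
  records
}

example <- c(
  "chr.trna3 (1-77) Length: 77 bp",
  "Type: Ala Anticodon: CGC at 35-37 (35-37) Score: 93.45",
  "Seq: GGGCCGGTAGCTCAGCCtGGAAGAGCGCCGCCCTCGCACGGCGGAGGcCCCGGGTTCAAATCCCGGCCGGTCCACCA",
  "Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<...."
)

# Outer list: one element per species; inner list: one element per tRNA.
trna_db <- list(Aero_pern = parse_trna_records(example))
str(trna_db$Aero_pern$chr.trna3)
```

How the text lines are pulled out of the HTML is left open here, since it
depends on the page layout.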
I remember coming across the packages RCurl and XML (in R), which should allow
for such a task, but I don't know how to use them. So what I would love to
have is:
1. Some suggestions on how to build such code.
2. Recommendations for how to learn what is needed to perform such a task.
Thanks for any help,
Tal
----------------Contact Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------