[Phylobase-devl] Phylobase GSoC idea
Steve Kembel
skembel at berkeley.edu
Mon Mar 10 21:51:34 CET 2008
Hi all,
Here's a Google Summer of Code 'idea'. Deadline for getting these up
on the wiki is today. Thoughts? Edits? Anyone else want to sign up to
be a mentor? Any other ideas? People suggested plotting, RUnit/
testing, linking with nexml or phyloxml...?
Rationale
There is a need for efficient phylogenetic tree manipulation methods
in the R statistical package to take advantage of the statistical
computing ability of R for bioinformatics and comparative phylogenetic
analyses. NESCent sponsored a hackathon focused on integration of
comparative methods within the R statistical package to promote
interoperability, the support of data exchange standards, and greater
usability of tools and methods in evolutionary bioinformatics. One
result of this hackathon has been the development of the phylobase
package, which seeks to provide a set of S4 classes and methods for
representing and manipulating phylogenetic trees and data in R.
Currently phylobase contains structures for representing phylogenetic
trees and associated data, but methods for tree manipulation remain
incomplete or have not been optimized. Current implementations of
phylogenetic tree storage and manipulation are inadequate for working
with the large trees and multiple-tree datasets that are increasingly
common in bioinformatics and comparative biology.
Approach
The R programming language, an object-oriented statistical programming
language, has recently introduced a new object-oriented class system
(S4). Phylogenetic trees in phylobase are currently represented as S4
data objects. The methods for tree manipulation are currently a
mixture of S3 and S4 methods and C/C++ extensions. The approach for
this project will be to identify obstacles to manipulating large trees
and datasets, which could include optimizing tree or data
representation in memory, and to develop and implement efficient
algorithms for tree representation and manipulation using object-
oriented S4 classes and methods or C/C++ extensions.
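To make the S4 idea concrete, here is a rough sketch of what a tree
class with one method could look like. The class name SketchTree, the
generic tipIds and the slot layout are hypothetical and chosen only for
illustration; this is not phylobase's actual class definition.

```r
library(methods)

# Hypothetical class for illustration only (not the phylobase definition).
# The tree is stored as a two-column edge matrix: ancestor, descendant.
setClass("SketchTree",
  representation(edge = "matrix", tip.label = "character"),
  validity = function(object) {
    if (ncol(object@edge) != 2L)
      return("edge must have two columns: ancestor, descendant")
    TRUE
  })

# Hypothetical generic: return the node ids of the tips.
setGeneric("tipIds", function(x) standardGeneric("tipIds"))

# Tips are nodes that appear as a descendant but never as an ancestor.
setMethod("tipIds", "SketchTree", function(x) {
  setdiff(x@edge[, 2], x@edge[, 1])
})

# Tree ((A,B),C): root is node 4, internal node is 5, tips are 1 to 3.
edge <- matrix(c(4L, 5L,
                 4L, 3L,
                 5L, 1L,
                 5L, 2L), ncol = 2, byrow = TRUE)
tr <- new("SketchTree", edge = edge, tip.label = c("A", "B", "C"))
tipIds(tr)
```

Because the method works on whole columns of the edge matrix rather
than walking the tree node by node, it avoids explicit recursion in R.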
Challenges
While the R statistical programming language is extremely powerful and
provides a rich feature set, it is inefficient at handling very large
objects and at computationally heavy work (recursion, for-loops). The
general challenge for this project will be to identify data structures
and methods that have the greatest impact on the ability to work with
very large trees and datasets, and to implement these structures and
methods in a more efficient way. This will require profiling and
testing of existing code, the use of S4 classes and methods, and
possibly the R API and C/C++ extensions to the R language.
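As a small illustration of the kind of comparison the profiling step
might involve, the sketch below counts the children of each node with
an explicit for-loop and with a vectorized call, then times both. The
function names and the random "ancestor" column are assumptions made
for the example; the data only stand in for a large edge matrix and do
not form a valid tree.

```r
# Hypothetical example data: a random ancestor column standing in for
# the first column of a large tree's edge matrix (not a real tree).
set.seed(1)
n <- 1e5
anc <- sample.int(n, n, replace = TRUE)

# Loop version: increment one counter per edge.
childCountLoop <- function(anc, n) {
  counts <- integer(n)
  for (a in anc) counts[a] <- counts[a] + 1L
  counts
}

# Vectorized version: tabulate() counts occurrences of 1..n in one call.
childCountVec <- function(anc, n) tabulate(anc, nbins = n)

# Both versions must agree before their timings are worth comparing.
stopifnot(identical(childCountLoop(anc, n), childCountVec(anc, n)))

system.time(childCountLoop(anc, n))
system.time(childCountVec(anc, n))
```

For whole functions rather than single calls, Rprof() and
summaryRprof() from the utils package can show where the time goes.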
Involved toolkits or projects
phylobase, R, S4 classes
Mentors
Steven Kembel, ?