[Phylobase-devl] Phylobase GSoC idea

Mon Mar 10 21:51:34 CET 2008

Hi all,

Here's a Google Summer of Code 'idea'. Deadline for getting these up  
on the wiki is today. Thoughts? Edits? Anyone else want to sign up to  
be a mentor? Any other ideas? People suggested plotting, RUnit/testing,
linking with nexml or phyloxml...?

Rationale

There is a need for efficient phylogenetic tree manipulation methods  
in the R statistical package to take advantage of the statistical  
computing ability of R for bioinformatics and comparative phylogenetic  
analyses. NESCent sponsored a hackathon focused on integration of  
comparative methods within the R statistical package to promote  
interoperability, the support of data exchange standards, and greater  
usability of tools and methods in evolutionary bioinformatics. One  
result of this hackathon has been the development of the phylobase  
package, which seeks to provide a set of S4 classes and methods for  
representing and manipulating phylogenetic trees and data in R.  
Currently phylobase contains structures for representing phylogenetic  
trees and associated data, but methods for tree manipulation remain  
incomplete or have not been optimized. Current implementations of  
phylogenetic tree storage and manipulation are inadequate for working  
with the large-tree and multiple-tree datasets that are increasingly  
common in bioinformatics and comparative biology.

Approach

The R programming language, an object-oriented statistical programming  
language, has recently introduced a new object-oriented class system  
(S4). Phylogenetic trees in phylobase are currently represented as S4  
data objects. The methods for tree manipulation are currently a  
mixture of S3 and S4 methods and C/C++ extensions. The approach for  
this project will be to identify obstacles to manipulating large trees  
and datasets, which could include optimizing tree or data  
representation in memory, and to develop and implement efficient  
algorithms for tree representation and manipulation using  
object-oriented S4 classes and methods or C/C++ extensions.
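
To make the S4 part concrete, here is a minimal sketch of a tree stored as an S4 object with one method defined on it. The class name SketchTree, its slots, and the generic nTipsSketch are made up for illustration and are not phylobase's actual definitions; the only assumption is that a tree can be held as a two-column (ancestor, descendant) edge matrix.

```r
library(methods)

# Hypothetical class, NOT phylobase's real definition: a rooted tree held
# as a two-column integer edge matrix (ancestor, descendant) plus tip labels.
setClass("SketchTree",
         representation(edge = "matrix", tip.label = "character"))

# Validity check: the edge matrix must have exactly two columns
setValidity("SketchTree", function(object) {
  if (ncol(object@edge) != 2L) return("edge must have two columns")
  TRUE
})

# Hypothetical generic and method: count the tips of the tree
setGeneric("nTipsSketch", function(x) standardGeneric("nTipsSketch"))
setMethod("nTipsSketch", "SketchTree", function(x) {
  # tips are the nodes that never appear in the ancestor column
  length(setdiff(x@edge[, 2], x@edge[, 1]))
})

# The tree ((A,B),C): root is node 4, internal node 5, tips 1 to 3
tr <- new("SketchTree",
          edge = matrix(c(4L, 5L,
                          4L, 3L,
                          5L, 1L,
                          5L, 2L), ncol = 2, byrow = TRUE),
          tip.label = c("A", "B", "C"))
nTipsSketch(tr)
```

Whether an edge matrix is the right in-memory representation for very large trees is exactly the kind of question the project would need to answer.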

Challenges

While the R statistical programming language is extremely powerful and  
provides a rich feature set, it is inefficient at handling very large  
objects and heavy computational lifting (recursion, for-loops). The  
general challenge for this project will be to identify data structures  
and methods that have the greatest impact on the ability to work with  
very large trees and datasets, and to implement these structures and  
methods in a more efficient way. This will require profiling and  
testing of existing code, the use of S4 classes and methods, and  
possibly the R API and C/C++ extensions to the R language.
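
As a rough illustration of the loop overhead mentioned above, the sketch below counts the children of each node twice: once with an explicit for-loop and once with the vectorized base function tabulate(). The edge matrix is synthetic (random ancestor/descendant pairs, not a valid tree), and the function names are hypothetical. For real profiling of existing code, base R's Rprof() and summaryRprof() would be the natural starting point.

```r
# Synthetic data, an assumption for illustration only: column 1 holds an
# ancestor id, column 2 a descendant id. This is not a valid tree.
set.seed(1)
n <- 20000L
edge <- cbind(sample.int(n, n, replace = TRUE), seq_len(n))

# Loop version: walk the edges one at a time and increment a counter
countChildrenLoop <- function(edge, n) {
  out <- integer(n)
  for (i in seq_len(nrow(edge))) {
    out[edge[i, 1]] <- out[edge[i, 1]] + 1L
  }
  out
}

# Vectorized version: tabulate() does the same count in compiled code
countChildrenVec <- function(edge, n) tabulate(edge[, 1], nbins = n)

# Time both versions, then confirm they give the same answer
system.time(a <- countChildrenLoop(edge, n))
system.time(b <- countChildrenVec(edge, n))
identical(a, b)
```

The timings will vary by machine, so none are quoted here; the point is only that the same result can often be obtained without an interpreted loop.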

Involved toolkits or projects

phylobase, R, S4 classes

Mentors

Steven Kembel, ?