[Phylobase-devl] Phylobase GSoC idea

Tue Mar 11 05:29:53 CET 2008

Hi all,

>   Hmmm.  Maybe I could co-mentor ... (5 hours/week sounds like
> a lot ...)

It does sound like a serious commitment but also potentially very  
valuable. And the shirt is tempting. :) Sharing mentoring  
responsibilities for a single project is more realistic than having  
several projects going. Peter was interested in mentoring but he'd  
also be eligible to apply to GSoC as a student and can't do both. I'd  
be happy to co-mentor if I had the right skill set for the project  
(R/C/C++ all ok).

>  Brian did mention that there were some existing C++ libraries
> for tree manipulation etc. ... patching into these might be
> the (an?) answer?

I rewrote the idea I proposed before to open it up to a wider range of  
potential projects, from tree manipulation to multi-tree, metadata or  
even building an interface with nexml or nexus (i.e. more work on the  
ioNCL code). See below. Too vague now?

It did sound like there was more interest in the plotting idea if we  
were to go with just one phylobase-related project.

Steve

---
Rationale

NESCent sponsored a hackathon focused on integration of comparative  
methods within the R statistical package to promote interoperability,  
the support of data exchange standards, and greater usability of tools  
and methods in evolutionary bioinformatics. One result of this  
hackathon has been the development of the phylobase package, which  
seeks to provide a set of S4 classes and methods for representing and  
manipulating phylogenetic trees and associated data in R. Phylobase  
contains structures for representing phylogenetic trees and associated  
data, but methods for tree manipulation, representation of multiple  
trees and metadata, and interfaces with other data formats (e.g.  
nexus, nexml) remain incomplete or have not been optimized for use  
with the large, multi-tree datasets that are increasingly common in  
bioinformatics and comparative biology.

Approach

Phylogenetic trees and associated data in phylobase are represented as  
S4 data objects. The methods for tree/data manipulation and import are  
currently a mixture of S3 and S4 methods and C/C++ extensions. The  
approach for this project will be to implement efficient algorithms  
for tree and data representation and manipulation using object- 
oriented S4 classes and methods, or C/C++ extensions where necessary  
for performance. We would suggest focusing on methods such as tree  
pruning, subsetting, and manipulation of multiple tree objects that  
are currently incomplete and will have the greatest impact on the  
ability to work with very large trees and datasets. It would also be  
useful to improve interfaces with other data formats such as nexus and  
nexml that will be the likely source for import of trees, data and  
metadata.
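
To make the pruning case concrete, here is a rough sketch of the kind of C++ routine such an extension might contain. It is not phylobase code: the Edge struct and the pruneTips function are made-up names for illustration, and it assumes the tree is stored as a table of ancestor/descendant node ids. It drops the requested tips and then removes any internal node left without descendants, using a work list rather than recursion. It does not collapse internal nodes that end up with a single descendant.

```cpp
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical edge-table row: one branch from an ancestor node to a descendant node.
struct Edge {
    int ancestor;
    int descendant;
};

// Remove the requested tips, then any internal node left with no descendants.
// Ids in tipsToDrop that are not tips (or not in the tree) are ignored.
// Surviving edges are returned in their original order.
std::vector<Edge> pruneTips(const std::vector<Edge>& edges,
                            const std::unordered_set<int>& tipsToDrop) {
    std::unordered_map<int, int> childCount;          // node -> number of remaining children
    std::unordered_map<int, std::size_t> parentEdge;  // node -> index of the edge leading to it
    for (std::size_t i = 0; i < edges.size(); ++i) {
        ++childCount[edges[i].ancestor];
        parentEdge[edges[i].descendant] = i;
    }

    std::vector<bool> removed(edges.size(), false);
    std::vector<int> work(tipsToDrop.begin(), tipsToDrop.end());

    // Iterative work list: no recursion, each edge is visited at most once.
    while (!work.empty()) {
        int node = work.back();
        work.pop_back();

        // Skip nodes that still have children (i.e. not a tip).
        auto kids = childCount.find(node);
        if (kids != childCount.end() && kids->second > 0) continue;

        // Skip the root and ids that do not occur in the tree.
        auto up = parentEdge.find(node);
        if (up == parentEdge.end() || removed[up->second]) continue;

        removed[up->second] = true;
        int parent = edges[up->second].ancestor;
        // If the parent just lost its last child, prune it as well.
        if (--childCount[parent] == 0) work.push_back(parent);
    }

    std::vector<Edge> kept;
    for (std::size_t i = 0; i < edges.size(); ++i) {
        if (!removed[i]) kept.push_back(edges[i]);
    }
    return kept;
}
```

For a tree with edges 5->6, 5->3, 6->1, 6->2, dropping tips 1 and 2 also removes node 6 and leaves only the edge 5->3. A real implementation would additionally have to keep the associated data and node labels in sync with the pruned tree.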

Challenges

While the R statistical programming language is extremely powerful and  
provides a rich feature set, it is inefficient at handling very large  
objects and at computationally heavy work such as recursion and for-loops. The  
general challenge for this project will be to optimize the data  
structures (trees, multi-trees, associated data, metadata) and methods  
(pruning, subsetting of trees and data) that have the greatest impact  
on the ability to work with very large trees and datasets. This will  
require profiling and testing of existing code, implementing existing  
algorithms using S4 classes and methods and C/C++ extensions, or  
writing interfaces with data formats such as nexus and nexml.

Involved toolkits or projects

phylobase, R, C/C++, nexus class library, nexml

Mentors
???