[Phylobase-devl] Phylobase GSoC idea
Steve Kembel
skembel at berkeley.edu
Tue Mar 11 05:29:53 CET 2008
Hi all,
> Hmmm. Maybe I could co-mentor ... (5 hours/week sounds like
> a lot ...)
It does sound like a serious commitment but also potentially very
valuable. And the shirt is tempting. :) Sharing mentoring
responsibilities for a single project is more realistic than having
several projects going. Peter was interested in mentoring but he'd
also be eligible to apply to GSoC as a student and can't do both. I'd
be happy to co-mentor if I had the right skill set for the project
(R/C/C++ all ok).
> Brian did mention that there were some existing C++ libraries
> for tree manipulation etc. ... patching into these might be
> the (an?) answer?
I rewrote the idea I proposed before to open it up to a wider range of
potential projects, from tree manipulation to multi-tree, metadata or
even building an interface with nexml or nexus (i.e. more work on the
ioNCL code). See below. Too vague now?
It did sound like there was more interest in the plotting idea if we
were to go with just one phylobase-related project.
Steve
---
Rationale
NESCent sponsored a hackathon focused on integration of comparative
methods within the R statistical package to promote interoperability,
the support of data exchange standards, and greater usability of tools
and methods in evolutionary bioinformatics. One result of this
hackathon has been the development of the phylobase package, which
seeks to provide a set of S4 classes and methods for representing and
manipulating phylogenetic trees and associated data in R. Phylobase
contains structures for representing phylogenetic trees and associated
data, but methods for tree manipulation, representation of multiple
trees and metadata, and interfaces with other data formats (e.g.
nexus, nexml) remain incomplete or have not been optimized for use
with the large, multi-tree datasets that are increasingly common in
bioinformatics and comparative biology.
Approach
Phylogenetic trees and associated data in phylobase are represented as
S4 data objects. The methods for tree/data manipulation and import are
currently a mixture of S3 and S4 methods and C/C++ extensions. The
approach for this project will be to implement efficient algorithms
for tree and data representation and manipulation using object-
oriented S4 classes and methods, or C/C++ extensions where necessary
for performance. We would suggest focusing on methods such as tree
pruning, subsetting, and manipulation of multiple tree objects that
are currently incomplete and will have the greatest impact on the
ability to work with very large trees and datasets. It would also be
useful to improve interfaces with other data formats such as nexus and
nexml that will be the likely source for import of trees, data and
metadata.
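
To give applicants a feel for the kind of C/C++ extension meant here, below
is a rough sketch of tip pruning. It is not phylobase code. It assumes,
purely for illustration, that a tree is stored as a list of (ancestor,
descendant) edges with non-negative integer node ids; the names Edge and
pruneTips are made up for this example. It walks the tree without recursion,
drops the requested tips, removes internal nodes left with no descendants,
and splices out internal nodes left with a single child.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical representation: one (ancestor, descendant) pair per edge.
using Edge = std::pair<int, int>;

// Return the edges of the tree that remains after removing the tips in
// `drop`. Node ids are assumed to be non-negative; -1 marks "nothing left".
std::vector<Edge> pruneTips(const std::vector<Edge>& edges,
                            const std::set<int>& drop, int root) {
    std::map<int, std::vector<int>> children;
    for (const Edge& e : edges) children[e.first].push_back(e.second);

    // Iterative pre-order walk with an explicit stack (no recursion).
    // Reading `order` backwards visits every child before its parent.
    std::vector<int> order;
    std::vector<int> stack{root};
    while (!stack.empty()) {
        int node = stack.back();
        stack.pop_back();
        order.push_back(node);
        auto it = children.find(node);
        if (it != children.end())
            for (int c : it->second) stack.push_back(c);
    }
    // A rooted tree reachable from `root` has one more node than edges.
    assert(order.size() == edges.size() + 1);

    std::map<int, int> rep;  // node -> surviving node standing in for its subtree
    std::vector<Edge> kept;
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        int node = *it;
        auto ch = children.find(node);
        if (ch == children.end()) {  // tip: keep it unless it was dropped
            rep[node] = drop.count(node) ? -1 : node;
            continue;
        }
        std::vector<int> alive;  // representatives of surviving child subtrees
        for (int c : ch->second)
            if (rep[c] != -1) alive.push_back(rep[c]);
        if (alive.empty()) {
            rep[node] = -1;            // whole subtree was pruned away
        } else if (alive.size() == 1) {
            rep[node] = alive[0];      // single child left: splice this node out
        } else {
            rep[node] = node;          // still a real branching point
            for (int a : alive) kept.emplace_back(node, a);
        }
    }
    return kept;
}
```

For example, with edges (5,6), (5,3), (6,1), (6,2) and root 5, dropping tip 2
leaves node 6 with a single child, so it is spliced out and the result is the
two edges (5,1) and (5,3). Branch lengths and attached data are ignored here;
a real implementation would have to sum lengths along spliced edges and keep
the data in sync with the remaining tips.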
Challenges
While the R statistical programming language is extremely powerful and
provides a rich feature set, it can be slow when handling very large
objects or doing heavy computation in interpreted code (recursion,
for-loops). The
general challenge for this project will be to optimize the data
structures (trees, multi-trees, associated data, metadata) and methods
(pruning, subsetting of trees and data) that have the greatest impact
on the ability to work with very large trees and datasets. This will
require profiling and testing of existing code, implementing existing
algorithms using S4 classes and methods and C/C++ extensions, or
writing interfaces with data formats such as nexus and nexml.
Involved toolkits or projects
phylobase, R, C/C++, nexus class library, nexml
Mentors
???