[Phylobase-devl] New phylobase build approach using static libncl (Was: Rcpp and OS X compiliation)

Peter Cowan pdc at berkeley.edu
Wed Mar 3 23:55:06 CET 2010


On Mar 3, 2010, at 2:12 PM, Mark Holder wrote:

> Hi,
> On Mar 3, 2010, at 3:43 PM, Peter Cowan wrote:
>> I agree that this is something to address.  Not only might there be clashes, but changing names will be annoying to users.  Brian or Derrick could answer better, but I assume this is because some of the code used to parse the tree string can't handle the underscores and spaces.
>> 
>> Which brings me to one of the questions I've had about NCL.  What are the export options for trees?  Does NCL parse the tree block and have an internal storage that we could convert more directly into our tree format?
>> 
>> Currently I think a tree string (essentially newick?) is passed back to the R code which parses it with regular expressions.  The RegEx code is lifted directly from APE and is complicated and somewhat fragile.
> 
> NCL is geared toward C++ client code.  So it has an internal representation of trees that can be used by clients, or it can return the newick tree string with taxon labels or with numbers instead of taxon labels.

Okay, I figured something like that was the case.  Is the internal representation a node-based pointer tree?  We might already have some code that can convert between that and our edge matrix format, where each row of a two-column matrix holds a node and its ancestor.  It's a pretty clunky format, but R doesn't have a nice equivalent of pointers.

For the newick string below (unrooted) it looks like:

     ancestor descendant
[1,]        5          1
[2,]        5          2
[3,]        5          6
[4,]        6          3
[5,]        6          4

1-4 refer to the tips, which is why they are only in the descendant column, and 5,6 are the internal nodes.  If the tree is rooted there is another row like:

     ancestor descendant
[6,]        0          5

Branch lengths and taxa labels are stored separately as vectors.
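
To make the layout concrete, here is a minimal base R sketch of the same example. The variable names (edge, edge_length, tip_label) are only for illustration and are not necessarily the slot names phylobase uses; the lengths and labels are taken from the newick string quoted below.

```r
# Edge matrix for the unrooted example tree (taxA:0.9,taxB:10,(taxC:3e-08,taxD:5):3.1).
# Tips are numbered 1-4; internal nodes are 5 and 6.
edge <- matrix(
  c(5, 1,
    5, 2,
    5, 6,
    6, 3,
    6, 4),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("ancestor", "descendant"))
)

# Branch lengths are kept in a separate vector, one entry per row of `edge`.
edge_length <- c(0.9, 10, 3.1, 3e-08, 5)

# Tip labels are kept in another vector, indexed by tip number.
tip_label <- c("taxA", "taxB", "taxC", "taxD")

# Tips are the nodes that never appear in the ancestor column.
tips <- setdiff(edge[, "descendant"], edge[, "ancestor"])

# A rooted version would carry one extra row linking a dummy ancestor 0 to node 5.
rooted_edge <- rbind(edge, c(0, 5))
```

Keeping lengths and labels as parallel vectors means every lookup goes through a row index or a tip number, which is the clunky part.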

If you can point me to a description of the internal representation, I will toy around with converting it directly.


> If you describe to me what the phylobase tree string in R would look like, I can probably write something that spits out a string of R that is more to your liking.  I don't know enough about R/C++ interactions to construct complex objects on the fly from the NCL wrapper code.
> 
> 
> Two newick-based alternatives that would be very easy to implement and may be more robust than returning newick strings like this:
> 
> 	'(taxA:0.9,taxB:10,(taxC:3e-08,taxD:5):3.1)'
> 
> 
> would be:
> 
> 
> 	1. return the taxon labels as one list and return the tree with taxa numbered (starting at 1 is the NEXUS and R way).  So the tree above would still come back as newick, but would be:
> 	'(1:0.9,2:10,(3:3e-08,4:5):3.1)'
> 
> This would help you deal with complex taxon names (they won't interfere with the regex code from APE).

This would be very easy for us to deal with.  
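
As a rough illustration of why, here is a base R sketch that assumes option 1 hands R a character vector of labels plus a numbered newick string. The names taxon_labels and newick are hypothetical, the odd labels are made up, and the pattern assumes every tip carries a branch length.

```r
# Option 1 as it might arrive in R: labels in one vector, tree with numbered tips.
# The first two labels are invented to show a space and an underscore.
taxon_labels <- c("tax A", "tax_B", "taxC", "taxD")
newick <- "(1:0.9,2:10,(3:3e-08,4:5):3.1)"

# A tip number follows "(" or "," and precedes ":"; branch lengths follow ":".
tip_pattern <- "(?<=[(,])[0-9]+(?=:)"
matches <- regmatches(newick, gregexpr(tip_pattern, newick, perl = TRUE))[[1]]
tip_ids <- as.integer(matches)

# Labels are looked up by number, so they never pass through the regex.
tip_labels_in_tree_order <- taxon_labels[tip_ids]
```

Because the string only ever contains digits, punctuation and branch lengths, spaces or underscores in the names cannot trip up the parsing.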


> 
> or
> 
> 
> 	2. Have NCL return a list of strings, each of which is a token in newick parsing. So the tree would come back as:
> 
> 
> 	c("(", "1", ":", "0.9", ",", "2", ":", "10", ",", "(", "3", ":", "3e-08", ",", "4", ":", "5", ")", ":", "3.1", ")")
> 
> 
> Obviously that is verbose, but it allows you to deal with strange taxon names (NCL will figure out how to tokenize them even if there are spaces in them), and scientific notation in branch lengths.

This might be the way to go if we decide to rewrite a parser ourselves, but I think one of the two above options is probably better at this point.
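
For what it's worth, a parser over such a token vector could be fairly short. The sketch below is base R only and is not existing phylobase or NCL code; tokens_to_edges is a hypothetical name, and it assumes numbered tips and numbers internal nodes in the order their opening parentheses appear.

```r
# Tokens for the example tree (1:0.9,2:10,(3:3e-08,4:5):3.1).
tokens <- c("(", "1", ":", "0.9", ",", "2", ":", "10", ",", "(", "3", ":",
            "3e-08", ",", "4", ":", "5", ")", ":", "3.1", ")")

# Hypothetical helper: turn newick tokens into an edge matrix plus lengths.
tokens_to_edges <- function(tokens) {
  prev <- c("", tokens[-length(tokens)])
  is_len <- prev == ":"                      # a token after ":" is a branch length
  is_tip <- !(tokens %in% c("(", ")", ",", ":")) & !is_len
  ancestor <- integer(0)
  descendant <- integer(0)
  len <- numeric(0)
  stack <- integer(0)                        # open internal nodes, innermost last
  next_internal <- sum(is_tip)               # internal nodes are numbered after tips
  last_node <- NA_integer_                   # node that the next length belongs to
  for (i in seq_along(tokens)) {
    tok <- tokens[i]
    if (tok == "(" || is_tip[i]) {
      if (tok == "(") {
        next_internal <- next_internal + 1L
        node <- next_internal
      } else {
        node <- as.integer(tok)
      }
      if (length(stack) > 0) {               # every node except the root gets a row
        ancestor <- c(ancestor, stack[length(stack)])
        descendant <- c(descendant, node)
        len <- c(len, NA_real_)
      }
      if (tok == "(") stack <- c(stack, node) else last_node <- node
    } else if (tok == ")") {
      last_node <- stack[length(stack)]      # the clade just closed
      stack <- stack[-length(stack)]
    } else if (is_len[i]) {
      len[descendant == last_node] <- as.numeric(tok)
    }
  }
  list(edge = cbind(ancestor = ancestor, descendant = descendant),
       edge.length = len)
}

res <- tokens_to_edges(tokens)
```

For the example tokens, res$edge has the same five rows as the edge matrix shown earlier, and res$edge.length lines up with those rows.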

Thanks

Peter

> all the best,
> Mark
> 
> 
> 
> 
> 


