[Phylobase-devl] prune/subset questions

Sat Aug 29 03:33:45 CEST 2009

Hi all,

As far as I can tell, the new phylo4 prune method I've written is 
working just fine, and supports both trim.internal=TRUE and 
trim.internal=FALSE. It only does subtree=FALSE; more on that below. 
Some questions for the group:

1. Is there a compelling reason to keep both subset *and* prune methods? 
Or is this just a historical artifact? I think the only differences are: 
(1) you can only pass the trim.internal and subtree arguments to prune, 
but not subset, and (2) subset accepts tips.include, tips.exclude, mrca, 
and node.subtree, whereas prune only does tips.exclude. Why not just 
expose trim.internal and subtree (if desired) via the subset methods, 
and eliminate prune? Or if someone really wants a prune function, it can 
simply be an inflexible wrapper for subset, only accepting tips.exclude.
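
A minimal sketch of that wrapper idea, in base R only. To keep it 
self-contained, the "tree" here is just a character vector of tip 
labels, and subset_tips() and prune_tips() are hypothetical names, not 
phylobase functions. It shows only the delegation pattern, not real 
tree surgery:

```r
# Hypothetical stand-in for a tree: a character vector of tip labels.
# subset_tips() plays the role of the flexible method.
subset_tips <- function(labels, tips.include = NULL, tips.exclude = NULL) {
  if (!is.null(tips.include) && !is.null(tips.exclude)) {
    stop("specify only one of tips.include and tips.exclude")
  }
  if (!is.null(tips.include)) {
    return(labels[labels %in% tips.include])
  }
  if (!is.null(tips.exclude)) {
    return(labels[!(labels %in% tips.exclude)])
  }
  labels
}

# prune_tips() is the inflexible wrapper: it accepts tips.exclude only
# and hands everything else to subset_tips().
prune_tips <- function(labels, tips.exclude) {
  subset_tips(labels, tips.exclude = tips.exclude)
}

tips <- c("A", "B", "C", "D")
kept <- prune_tips(tips, c("B", "D"))
```

With this layout any new option only has to be added in one place 
(the subset-like function), and the wrapper stays a one-liner.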

2. Do we need/want to support a subtree=TRUE option? I haven't worked on 
this at all. For what it's worth, even using the current ape-based 
subset method, this option is unreliable for phylo4(d):

require(phylobase)
data(geospiza)
geotree <- extractTree(geospiza)
prune(geotree, c(1,3), subtree=TRUE)
## Error in checkTree(object) : All labels must be unique
## In addition: Warning message:
## In asMethod(object) : trees with unknown order may be unsafe in ape

Here it's because the resulting tree would have two tip labels called 
"[1_tips]". Anyway, I would be happy with leaving subtree as a future 
feature possibility for now.

3. Any opinions on dealing with root edge length during subsetting? The 
current method (using ape::drop.tip) just loses that information. In the 
new method, the root edge essentially accumulates the edges associated 
with any singletons that form along it as a consequence of the pruning. 
Of course, that could make for a long root edge when retaining just two 
closely related species in a large tree. Alternatively, albeit somewhat 
arbitrarily, we could make it be the length of the edge connecting the 
new root to its parent node in the original tree. Of course, this could 
also be computed after the fact, e.g. with:

edgeLength(phy, MRCA(phy, tips.included))

where phy was the full (pre-subset) tree.
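
To make the two options concrete, here is a small base R sketch on a 
toy tree. The edge table, the root edge length of 0.2, and the helper 
names ancestors_path() and mrca_of() are all made up for illustration 
and are not part of phylobase; the sketch only does the arithmetic, it 
does not prune anything:

```r
# Toy tree: node 5 is the root, nodes 1-4 are tips, 6 and 7 are internal.
edges <- data.frame(
  parent = c(5, 5, 6, 6, 7, 7),
  child  = c(6, 4, 7, 3, 1, 2),
  length = c(1.0, 3.0, 0.5, 2.0, 1.0, 1.0)
)
root_edge <- 0.2  # assumed length of the original root edge

# Nodes on the path from `node` up to the root, starting with `node`.
ancestors_path <- function(node, edges) {
  path <- node
  repeat {
    p <- edges$parent[edges$child == node]
    if (length(p) == 0) break  # reached the root
    path <- c(path, p)
    node <- p
  }
  path
}

# Most recent common ancestor: first node shared by all tip-to-root paths.
mrca_of <- function(tips, edges) {
  paths <- lapply(tips, ancestors_path, edges = edges)
  Reduce(intersect, paths)[1]
}

# Keep only tips 1 and 2; their MRCA (node 7) becomes the new root.
new_root <- mrca_of(c(1, 2), edges)

# Option A: root edge accumulates every edge between old and new root.
above <- ancestors_path(new_root, edges)
above <- above[-length(above)]  # the old root has no row in the edge table
accumulated <- root_edge + sum(edges$length[match(above, edges$child)])

# Option B: only the edge that led into the new root in the original tree.
parent_only <- edges$length[edges$child == new_root]
```

Here option A gives 0.2 + 1.0 + 0.5 = 1.7 and option B gives 0.5, 
which shows how quickly the accumulated root edge grows when the 
retained tips sit deep in the tree.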

4. This new method was initially kinda slow, but mostly because it makes 
a bunch of descendants() calls in one part, and that can be slow. So I 
rewrote descendants() to use a (very simple) C function that works on a 
preordered edge matrix, which helps a lot with speed. I'll commit if 
there are no objections. The new subset is still slower than ape's 
drop.tip, but not horribly so.
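
For readers wondering why preorder helps, the idea can be sketched in 
plain R (this is an illustration of the general single-pass technique, 
not the C function mentioned above; descendants_preorder() and the toy 
edge matrix are hypothetical). In a preordered edge matrix every 
node's incoming edge appears before any of its outgoing edges, so one 
top-to-bottom pass is enough to flag all descendants:

```r
# Toy edge matrix in preorder; columns are (ancestor, descendant).
# Node 5 is the root, nodes 1-4 are tips.
edge <- matrix(c(5, 6,
                 6, 7,
                 7, 1,
                 7, 2,
                 6, 3,
                 5, 4), ncol = 2, byrow = TRUE)

# Return all descendants (tips and internal nodes) of `node`.
# Assumes `edge` is in preorder and nodes are numbered 1..max(edge).
descendants_preorder <- function(edge, node) {
  is_desc <- logical(max(edge))
  is_desc[node] <- TRUE                # seed with the focal node
  for (i in seq_len(nrow(edge))) {
    # a child is a descendant if its parent has already been flagged
    if (is_desc[edge[i, 1]]) is_desc[edge[i, 2]] <- TRUE
  }
  is_desc[node] <- FALSE               # a node is not its own descendant
  which(is_desc)
}

d6 <- descendants_preorder(edge, 6)    # 1 2 3 7
d7 <- descendants_preorder(edge, 7)    # 1 2
d1 <- descendants_preorder(edge, 1)    # integer(0), tips have none
```

The loop touches each edge exactly once, so the cost is linear in the 
number of edges, and the same loop translates almost line for line 
into C.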

Cheers,
Jim