[Phylobase-devl] prune/subset questions

Wed Sep 2 19:43:01 CEST 2009

On Wed, Sep 02, 2009 at 12:59:38AM -0700, Jim Regetz wrote:
> Some follow-up...
> 
> Jim Regetz wrote:
> > Hi all,
> > 
> > As far as I can tell, the new phylo4 prune method I've written is 
> > working just fine, and supports both trim.internal=TRUE and 
> > trim.internal=FALSE. It only does subtree=FALSE, more on that below.
> 
> I just committed this. It is currently a prune method (for consistency 
> with previous organization), but as mentioned below I would prefer to 
> migrate it into the subset method.
> 
> > Some questions for the group:
> > 
> > 1. Is there a compelling reason to keep both subset *and* prune methods? 
> > Or is this just a historical artifact? I think the only differences are: 
> > (1) you can only pass the trim.internal and subtree arguments to prune, 
> > but not subset, and (2) subset accepts tips.include, tips.exclude, mrca, 
> > and node.subtree, whereas prune only does tips.exclude. Why not just 
> > expose trim.internal and subtree (if desired) via the subset methods, 
> > and eliminate prune? Or if someone really wants a prune function, it can 
> > simply be an inflexible wrapper for subset, only accepting tips.exclude.
> 
> I am increasingly inclined to drop prune as a formal method, making 
> subset (and associated "[" extraction) the sole interface to this 
> functionality.
> 
> Objections?

Fine by me.

> Going a step further, should we drop all the ape-based phylo subsetting 
> methods (including pruning and tip dropping)? Seems like ape 
> import/export are a sufficient link; maintaining additional coupling 
> through wrapper functions will mean duplication -- and maintenance -- of 
> code and documentation, which could become a headache as both phylobase 
> and ape continue to evolve.

+1

> > 2. Do we need/want to support a subtree=TRUE option? I haven't worked on 
> > this at all. For what it's worth, even using the current ape-based 
> > subset method, this option unreliable for phylo4(d):
> > 
> > require(phylobase)
> > data(geospiza)
> > geotree <- extractTree(geospiza)
> > prune(geotree, c(1,3), subtree=TRUE)
> > ## Error in checkTree(object) : All labels must be unique
> > ## In addition: Warning message:
> > ## In asMethod(object) : trees with unknown order may be unsafe in ape
> > 
> > Here it's because the resulting tree would have two tip labels called 
> > "[1_tips]". Anyway, I would be happy with leaving subtree as a future 
> > feature possibility for now.
> 
> Still not done. Currently, if you use subtree=TRUE, you get routed 
> through phylo coercion and ape::drop.tip, as before. Of course, this 
> won't be possible any longer if we drop ape-based subsetting altogether. 
> I don't have a good sense of how much of a priority this is.
> 
> > 3. Any opinions on dealing with root edge length during subsetting? The 
> > current method (using ape::drop.tip) just loses that information. In the 
> > new method, the root edge essentially accumulates the edges associated 
> > with any singletons that form along it as a consequence of the pruning. 
> > Of course, that could make for a long root edge when retaining just two 
> > closely related species in a large tree. Alternatively, albeit somewhat 
> > arbitrarily, we could make it be the length of the edge connecting the 
> > new root to its parent node in the original tree. Of course, this could 
> > also be computed after the fact, e.g. with:
> > 
> > edgeLength(phy, MRCA(phy, tips.included))
> > 
> > where phy was the full (pre-subset) tree.
> 
> Currently implemented as described. In case someone is interested in 
> obtaining the alternative root edge length, I can just add that line of 
> code to the example in the documentation.
> 
> > 4. This new method was initially kinda slow, but mostly because it makes 
> > a bunch of descendants() calls in one part, and that can be slow. So I 
> > rewrote descendants() to use a (very simple) C function that works on a 
> > preordered edge matrix, which helps a lot with speed. I'll commit if 
> > this are no objections. The new subset is still slower than ape's 
> > drop.tip, but not horribly so.
> 
> I've since modified phylo4 subset/prune to reduce the reliance on 
> descendants(), but I can still commit the C-based descendants 
> replacement. The subsetting does, however, now require at least one 
> reorder(), which is fine for small trees, but slow for large ones. So I 
> wrote a C function for reorder() as well (thanks to Peter for the push 
> in that direction). There are actually two alternative C functions, one 
> faster but only called if the tree is a true binary tree (no singletons 
> or polytomies), and the other slower but usable on any valid phylo4 tree 
> (I think). Performance of both is no better than the pure R reorder on 
> small trees, maybe even trivially slower, but much better on large trees.
> 
> Are folks fine with having these things reimplemented as compiled C 
> functions? What I've done so far is pretty simple, and uses only the .C 
> interface, for what it's worth. I could imagine writing an analogous C 
> function for ancestors(), but that's probably enough at least for the 
> time being. Opinions?

I've been hoping to get the reorder command rewritten in C for a while, but don't know enough C to make it happen.  I spent a lot of time optimizing it, so its gratifying that it's at least fast for small trees.  Nonetheless, I'd love to see this committed.  

I think it would be nice to keep the pure R in the sources, just commented out.  I know that I've looked at and benefited from code that was rewritten in C but still left as a comment next to the .Call().  Python has a similar policy of keeping both a C and a python version of the same function, and even updating them together.

Peter

> 
> Thanks,
> Jim
> _______________________________________________
> Phylobase-devl mailing list
> Phylobase-devl at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl