[Phylobase-devl] prune/subset questions

Sat Sep 5 00:30:53 CEST 2009

Hi all,

Peter D. Cowan wrote:
> On Wed, Sep 02, 2009 at 12:59:38AM -0700, Jim Regetz wrote:
>> Some follow-up...
>>
>> Jim Regetz wrote:
>>> Hi all,
>>>
>>> As far as I can tell, the new phylo4 prune method I've written is 
>>> working just fine, and supports both trim.internal=TRUE and 
>>> trim.internal=FALSE. It only does subtree=FALSE, more on that below.
>> I just committed this. It is currently a prune method (for consistency 
>> with previous organization), but as mentioned below I would prefer to 
>> migrate it into the subset method.
>>
>>> Some questions for the group:
>>>
>>> 1. Is there a compelling reason to keep both subset *and* prune methods? 
>>> Or is this just a historical artifact? I think the only differences are: 
>>> (1) you can only pass the trim.internal and subtree arguments to prune, 
>>> but not subset, and (2) subset accepts tips.include, tips.exclude, mrca, 
>>> and node.subtree, whereas prune only does tips.exclude. Why not just 
>>> expose trim.internal and subtree (if desired) via the subset methods, 
>>> and eliminate prune? Or if someone really wants a prune function, it can 
>>> simply be an inflexible wrapper for subset, only accepting tips.exclude.
>> I am increasingly inclined to drop prune as a formal method, making 
>> subset (and associated "[" extraction) the sole interface to this 
>> functionality.
>>
>> Objections?
> 
> Fine by me.

I changed my mind about removing prune altogether, but otherwise did the 
above. Particularly given that prune is called twice when subsetting a 
phylo4d object, I decided it was better to have one wrapper function 
that does the argument matching, error checking, etc., and a second 
function that does the real work. Those are pretty much the roles that 
subset and prune, respectively, were already playing, so I decided just 
to leave it that way. But really this is just an implementation detail. 
As far as the user is concerned, and as far as the revised documentation 
suggests, subset is the main function, and prune is a special case (with 
"[" as third way to subset).

>> Going a step further, should we drop all the ape-based phylo subsetting 
>> methods (including pruning and tip dropping)? Seems like ape 
>> import/export are a sufficient link; maintaining additional coupling 
>> through wrapper functions will mean duplication -- and maintenance -- of 
>> code and documentation, which could become a headache as both phylobase 
>> and ape continue to evolve.
> 
> +1

Done.

>>> 2. Do we need/want to support a subtree=TRUE option? I haven't worked on 
>>> this at all. For what it's worth, even using the current ape-based 
>>> subset method, this option unreliable for phylo4(d):
>>>
>>> require(phylobase)
>>> data(geospiza)
>>> geotree <- extractTree(geospiza)
>>> prune(geotree, c(1,3), subtree=TRUE)
>>> ## Error in checkTree(object) : All labels must be unique
>>> ## In addition: Warning message:
>>> ## In asMethod(object) : trees with unknown order may be unsafe in ape
>>>
>>> Here it's because the resulting tree would have two tip labels called 
>>> "[1_tips]". Anyway, I would be happy with leaving subtree as a future 
>>> feature possibility for now.
>> Still not done. Currently, if you use subtree=TRUE, you get routed 
>> through phylo coercion and ape::drop.tip, as before. Of course, this 
>> won't be possible any longer if we drop ape-based subsetting altogether. 
>> I don't have a good sense of how much of a priority this is.

Left for the future.

>>> 3. Any opinions on dealing with root edge length during subsetting? The 
>>> current method (using ape::drop.tip) just loses that information. In the 
>>> new method, the root edge essentially accumulates the edges associated 
>>> with any singletons that form along it as a consequence of the pruning. 
>>> Of course, that could make for a long root edge when retaining just two 
>>> closely related species in a large tree. Alternatively, albeit somewhat 
>>> arbitrarily, we could make it be the length of the edge connecting the 
>>> new root to its parent node in the original tree. Of course, this could 
>>> also be computed after the fact, e.g. with:
>>>
>>> edgeLength(phy, MRCA(phy, tips.included))
>>>
>>> where phy was the full (pre-subset) tree.
>> Currently implemented as described. In case someone is interested in 
>> obtaining the alternative root edge length, I can just add that line of 
>> code to the example in the documentation.

Done.

>>> 4. This new method was initially kinda slow, but mostly because it makes 
>>> a bunch of descendants() calls in one part, and that can be slow. So I 
>>> rewrote descendants() to use a (very simple) C function that works on a 
>>> preordered edge matrix, which helps a lot with speed. I'll commit if 
>>> this are no objections. The new subset is still slower than ape's 
>>> drop.tip, but not horribly so.
>> I've since modified phylo4 subset/prune to reduce the reliance on 
>> descendants(), but I can still commit the C-based descendants 
>> replacement. The subsetting does, however, now require at least one 
>> reorder(), which is fine for small trees, but slow for large ones. So I 
>> wrote a C function for reorder() as well (thanks to Peter for the push 
>> in that direction). There are actually two alternative C functions, one 
>> faster but only called if the tree is a true binary tree (no singletons 
>> or polytomies), and the other slower but usable on any valid phylo4 tree 
>> (I think). Performance of both is no better than the pure R reorder on 
>> small trees, maybe even trivially slower, but much better on large trees.
>>
>> Are folks fine with having these things reimplemented as compiled C 
>> functions? What I've done so far is pretty simple, and uses only the .C 
>> interface, for what it's worth. I could imagine writing an analogous C 
>> function for ancestors(), but that's probably enough at least for the 
>> time being. Opinions?
> 
> I've been hoping to get the reorder command rewritten in C for a while, but don't know enough C to make it happen.  I spent a lot of time optimizing it, so its gratifying that it's at least fast for small trees.  Nonetheless, I'd love to see this committed.  
> 
> I think it would be nice to keep the pure R in the sources, just commented out.  I know that I've looked at and benefited from code that was rewritten in C but still left as a comment next to the .Call().  Python has a similar policy of keeping both a C and a python version of the same function, and even updating them together.

Done and done.

BTW, in the Rd file there is no longer an alias for subset itself, nor 
for "[", both of which have generics defined elsewhere. AFAICT that is 
what the current version of "Writing R Extensions" recommends (on page 40):

"If a package has methods for a function defined originally somewhere 
else, and does not change the underlying default method for the 
function, the package is responsible for documenting the methods it 
creates, but not for the function itself or the default method."

These still do the trick though:

methods?subset
methods?"["
?subset(phy) # for phylo4(d) object phy

Cheers,
Jim

> Peter
> 
>> Thanks,
>> Jim
>> _______________________________________________
>> Phylobase-devl mailing list
>> Phylobase-devl at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl