[Phylobase-devl] accessor and replace method consistency

Sat Sep 5 02:37:41 CEST 2009

Hi all,

I promise this will be my last biggie proposal before the release.

We've found and fixed a number of bugs resulting from misaligning the 
elements of various slots (labels and nodes, edges and lengths, nodes 
and data, etc). I've noticed a few new ones that I haven't reported yet 
(in plot, prune, and tdata). Partly these occur because we sometimes 
assume too much about slot data when accessing them directly via the @ 
operator -- assumptions that tend to be true of the test data we usually 
use, but aren't in fact enforced. And partly this is because the 
accessor methods themselves are not consistent across slots (and have 
changed over time).

Rather than go with a band-aid approach, I think we should step back and 
do a little accessor cleanup. It won't be that hard -- I'm happy to make 
the changes, test things, and fix any downstream functions/methods that 
break as a result. If we go with this, then I would recommend we use 
accessors more routinely in the code. If in some case someone elects to 
circumvent the getter/setter methods for performance reasons, fine, but 
tread carefully (and consider adding a code comment for clarification).

So, for the details. As I understand it, we place no restrictions on the 
order of elements in the vector slots (tip.label, node.label, 
edge.length, edge.label) nor on the order of rows in the tabular slots 
(edge, tip.data, node.data). The association between slots is done via 
the element names for the label and length slots, row names for the data 
slots, and of course the actual matrix values for the edge matrix.

That's fine, and provides some flexibility, but it also imposes extra 
overhead on most phylobase functions. In lieu of enforcing particular 
ordering as part of the phylo[4] definition (which we may yet decide to 
do in the future, for efficiency!), I propose making the following 
consistent ordering rules a feature of the getter/setter methods. Note 
that some of these already work this way, especially coming out of 
Francois' branch, but the point is to really codify and test it:

* The following should always return the vectors ordered by node ID, in 
ascending order:

nodeId
nodeType (already does)
labels
tipLabels
nodeLabels

This means the ordering will *always* correspond to 1:nTips for tips, 
(nTips+1):(nTips+nNodes) for internal nodes, and 1:(nTips+nNodes) for 
the 'all' case. In the case of missing labels, the accessor should still 
return a complete vector of NAs.

* Similarly, tdata (whether tip, internal, or all) should always return 
a data frame with rows ordered by node ID, in ascending order.

* edge should always returns the edge matrix as-is, because the actual 
row order can matter for traversal methods.

* edgeLength and edgeLabel should always return the vectors in an order 
that matches the rows of the current edge matrix. As with node labels, 
in the case of missing edge lengths or labels, the accessor should still 
return a complete vector of NAs.

* For Replace methods, when names are not used, the replacement ordering 
should corresponding directly to accessor ordering (and not the existing 
internal ordering). For example, even if the tip.label slot is:

x at tip.label
    2    3    1
"t2" "t3" "t1"

then tipLabels(x) <- c("tA", "tB", "tC") should still produce this:

x at tip.label
    1    2    3
"tA" "tB" "tC"

It already does, happily, but not all of the setters are like this.

Incidentally, if we modify nodeId as I proposed, then the pretty print 
method would show the object with tips first, then the internal nodes. I 
realize that's different from what we have now, and maybe I missed an 
earlier discussion about this, but I would be totally fine with that. Of 
course, if folks always want internal nodes first, then tips, that's 
fine, and we can tweak the print method accordingly. Either way would be 
better than now IMHO, as the printed order can change when the edge 
order changes (even with the pretty method).

Whew. I'm happy to debate these (and can certainly change my mind on 
details), but in any case I think it would be a boon for developers and 
users alike to have more consistency on these fronts.

Thoughts?

Jim