[Phylobase-devl] accessor and replace method consistency
Jim Regetz
regetz at nceas.ucsb.edu
Sat Sep 5 02:37:41 CEST 2009
Hi all,
I promise this will be my last biggie proposal before the release.
We've found and fixed a number of bugs resulting from misaligning the
elements of various slots (labels and nodes, edges and lengths, nodes
and data, etc). I've noticed a few new ones that I haven't reported yet
(in plot, prune, and tdata). Partly these occur because we sometimes
assume too much about slot data when accessing them directly via the @
operator -- assumptions that tend to be true of the test data we usually
use, but aren't in fact enforced. And partly this is because the
accessor methods themselves are not consistent across slots (and have
changed over time).
Rather than go with a band-aid approach, I think we should step back and
do a little accessor cleanup. It won't be that hard -- I'm happy to make
the changes, test things, and fix any downstream functions/methods that
break as a result. If we go with this, then I would recommend we use
accessors more routinely in the code. If in some case someone elects to
circumvent the getter/setter methods for performance reasons, fine, but
tread carefully (and consider adding a code comment for clarification).
So, for the details. As I understand it, we place no restrictions on the
order of elements in the vector slots (tip.label, node.label,
edge.length, edge.label) nor on the order of rows in the tabular slots
(edge, tip.data, node.data). The association between slots is done via
the element names for the label and length slots, row names for the data
slots, and of course the actual matrix values for the edge matrix.
That's fine, and provides some flexibility, but it also imposes extra
overhead on most phylobase functions. In lieu of enforcing particular
ordering as part of the phylo[4] definition (which we may yet decide to
do in the future, for efficiency!), I propose making the following
consistent ordering rules a feature of the getter/setter methods. Note
that some of these already work this way, especially coming out of
Francois' branch, but the point is to really codify and test it:
* The following should always return the vectors ordered by node ID, in
ascending order:
nodeId
nodeType (already does)
labels
tipLabels
nodeLabels
This means the ordering will *always* correspond to 1:nTips for tips,
(nTips+1):(nTips+nNodes) for internal nodes, and 1:(nTips+nNodes) for
the 'all' case. In the case of missing labels, the accessor should still
return a complete vector of NAs.
* Similarly, tdata (whether tip, internal, or all) should always return
a data frame with rows ordered by node ID, in ascending order.
* edge should always returns the edge matrix as-is, because the actual
row order can matter for traversal methods.
* edgeLength and edgeLabel should always return the vectors in an order
that matches the rows of the current edge matrix. As with node labels,
in the case of missing edge lengths or labels, the accessor should still
return a complete vector of NAs.
* For Replace methods, when names are not used, the replacement ordering
should corresponding directly to accessor ordering (and not the existing
internal ordering). For example, even if the tip.label slot is:
x at tip.label
2 3 1
"t2" "t3" "t1"
then tipLabels(x) <- c("tA", "tB", "tC") should still produce this:
x at tip.label
1 2 3
"tA" "tB" "tC"
It already does, happily, but not all of the setters are like this.
Incidentally, if we modify nodeId as I proposed, then the pretty print
method would show the object with tips first, then the internal nodes. I
realize that's different from what we have now, and maybe I missed an
earlier discussion about this, but I would be totally fine with that. Of
course, if folks always want internal nodes first, then tips, that's
fine, and we can tweak the print method accordingly. Either way would be
better than now IMHO, as the printed order can change when the edge
order changes (even with the pretty method).
Whew. I'm happy to debate these (and can certainly change my mind on
details), but in any case I think it would be a boon for developers and
users alike to have more consistency on these fronts.
Thoughts?
Jim
More information about the Phylobase-devl
mailing list