[Phylobase-devl] labeling order

Sat Dec 27 19:47:47 CET 2008

Hi all,

>> Here are some opinions, in case it is still time to express some (after
>> the battle). I recognize most of them consist in encouraging not to
>> change data formats as much as possible -- basically because I have now
>> a working package based on our current data representation. Also, from
>> what I and some of my colleagues working with phylobase have experienced
>> so far, it works pretty well and in a sensible way.
>
>  I think the attempt is to make things consistent, which they weren't
> entirely before.  I agree that changing things as little as possible is
> a good idea!
>
>>>  Hmmm.
>>>
>>>  For the record, here's Steve's statement:
>>>
>>> SWK - This is crucial and we should decide soon, needs to be sorted
>>> out for 1.5. I think that many of the problems we're having with
>>> labels and reordering are due to the fact that until now we treated
>>> nodes and edges as interchangeable, i.e. we had node labels in edge
>>> matrix order, but these labels should really be associated with
>>> nodes, not with edges.
>> I could not agree more.
>>> This assumption caused things to break once edges
>>> and nodes were not equivalent (now that root edge is in the edge matrix
>>> and we allow edge matrix reordering, or for unrooted trees). I
>>> think we need to be very clear about whether methods are actually
>>> operating on nodes or edges.
>>> I suggest that edge, edge.labels and edge.lengths (branch lengths)
>>> are in 'edge' order.
>> I can hardly see how it would make sense otherwise. All information
>> provided for a given item should be sorted according to this item. Tip
>> labels should be in the tip order, internal node labels sorted as node
>> numbers, etc.
>
>  Here's where it gets tricky.  Of course it's sensible for edge
> lengths and labels to be in edge matrix order ... for the others
> (tip labels, node labels), what do you mean by "tip order", "node numbers"?

This is my question as well, especially for unrooted trees, where
there is not an edge for every node and vice versa, so tips, nodes,
and edges can't all be in the same order. As we suggest ways to
modify the tree structure, I think we should provide examples of what
the edge matrix, node labels, etc. would actually look like for an
unrooted and a rooted tree, and how tree reordering would work. That
might help clarify things.
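
As a starting point, here is a minimal sketch in base R of what such a side-by-side example might look like. The unrooted matrix is the one from unroot(tree.owls)$edge quoted later in this message; the rooted matrix is a made-up four-tip tree, and storing the root edge as a row with an NA ancestor is an assumption for illustration, not necessarily how phylobase stores it. The helper names are hypothetical.

```r
# Unrooted four-tip tree: tips are 1-4, internal nodes are 5 and 6.
unrooted <- matrix(c(5, 6,
                     6, 1,
                     6, 2,
                     5, 3,
                     5, 4),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("ancestor", "descendant")))

# Hypothetical rooted four-tip tree with the root edge kept in the matrix.
# ASSUMPTION: the root edge is the row whose ancestor is NA.
rooted <- matrix(c(NA, 5,
                    5, 6,
                    6, 7,
                    7, 1,
                    7, 2,
                    6, 3,
                    5, 4),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("ancestor", "descendant")))

# Hypothetical helper: all node ids mentioned anywhere in the matrix.
node_ids <- function(edge) sort(unique(edge[!is.na(edge)]))

# Hypothetical helper: nodes that never appear in the descendant column,
# i.e. nodes that have no row of their own.
nodes_without_edge <- function(edge) {
  setdiff(node_ids(edge), edge[, "descendant"])
}

c(nodes = length(node_ids(rooted)),   edges = nrow(rooted))
c(nodes = length(node_ids(unrooted)), edges = nrow(unrooted))
nodes_without_edge(rooted)
nodes_without_edge(unrooted)
```

In the rooted case every node is the descendant of exactly one row, so node order and edge order can be mapped one to one. In the unrooted case node 5 has no row of its own, which is the mismatch described above.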

>>> Marguerite:
>>> This one is very important, and I think it's a very bad idea to unlink
>>> the edges and nodes. Edges and nodes are intimately linked. In my
>>> mind, the edge is simply the branch below the node. So to have edges
>>> in one order and nodes in another order makes no sense to me at all.
>>> Why don't we simply give node IDs in "edge" order as you are using
>>> it? Otherwise, there is HUGE potential for confusion. And we would
>>> need yet another index that indicates a mapping of the node ID to the
>>> edge matrix.
>>>
>> Again, I completely agree. Edges are uniquely identified by their
>> descending node, and this is what we have used from the beginning.
>> Moreover, this is what is used in ape, and I think we should diverge
>> from it only when it is mandatory (e.g. plotting trees with singletons,
>> if these make sense). Most phylobase users are and will be primarily
>> ape users.
>
>  We're not diverging from this.
>  We're saying that we will keep data and the lists of node labels
> (tips and internal nodes) in order of node numbers, and not rearrange
> them every time we reorder the edge matrix.

Following up on the previous point, maybe what we really need is to  
spell out how we want tree structures to look, similar to the  
whitepaper on the phylo class.
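
To make the point above concrete, here is a rough sketch in base R (not phylobase code) of keeping node labels in node-number order while the edge matrix is reordered. The edge lengths and labels are made-up values, and sorting rows by descendant is just an arbitrary reordering chosen for illustration.

```r
# Unrooted four-tip edge matrix (ancestor, descendant).
edge <- matrix(c(5, 6,
                 6, 1,
                 6, 2,
                 5, 3,
                 5, 4),
               ncol = 2, byrow = TRUE)

# Made-up edge lengths, stored in edge-matrix order.
edge_length <- c(0.5, 1, 1, 2, 3)

# Made-up node labels, stored in node-number order and named by node id.
node_label <- c("1" = "t1", "2" = "t2", "3" = "t3", "4" = "t4",
                "5" = "N5", "6" = "N6")

# An arbitrary reordering of the edges: sort rows by descendant node.
ord <- order(edge[, 2])
edge_reordered   <- edge[ord, , drop = FALSE]
length_reordered <- edge_length[ord]   # edge data moves with the rows

# node_label is left alone; labels are looked up by node id when needed.
labels_for_rows <- unname(node_label[as.character(edge_reordered[, 2])])
labels_for_rows
```

The edge lengths have to follow the rows, but the node labels never move: whatever order the edge matrix is in, a label is found through the node number in the descendant column.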

I understand the desire to not break existing code and provide a  
phylogeny class that is intuitive for users and developers, but I  
don't agree that we should feel bound to follow the ape phylo  
structure. If we're just implementing phylo in S4 then we should be  
upfront about it and follow the phylo class specification exactly. I  
don't think we're doing that, though. There are a number of features  
that we might want to implement that aren't in phylo, including  
singleton nodes, reordering of the edges or nodes, root edges in the  
edge matrix, reticulations, how to represent rooted versus unrooted  
trees, separate labels and data for edges and nodes, and so on.

>>> Instead, why don't we just decide on a standard ordering for phylobase,
>>> number the node IDs in this way, and then allow the edge matrix and
>>> nodeID (and all data vectors) to be reordered as needed for whatever
>>> functions.  Using the node ID, we can easily put everything back to
>>> the "default" phylobase order, BUT ONLY IF all objects (edge matrix,
>>> branch lengths, labels, etc.) are in the SAME order. Don't "break"
>>> the integrity of the object just for programming convenience. There is
>>> just too much danger for confusion. I, for one, would stop using
>>> phylobase, because it's just too hard to remember the peculiarities of
>>> the way the object is constructed. Every time I wanted to do something,
>>> I'd have to relearn the rules.
>>>
>> Same for me.
>>
>
>  Hmm.
>  I've been working to try to make everything consistent in node order
> (as Steve suggested).  Thibaut/Marguerite, what do you suggest for the
> case of unrooted trees?  Thibaut, how often do you match up edges with
> data and labels?
>   I've done a bunch of stuff, and I'd like to commit it, because it's
> all reasonably consistent now, but I'd like to hear some more
> conversation -- I'm willing to work back through while it's fresh in
> my mind and do everything the opposite way (keeping everything
> in edge-matrix order all the time), provided we know how to handle
> unrooted trees (and are willing to live with not being able to handle
> reticulations).

I agree with Ben. I don't mind undoing what we did, but only if
there's a clear plan of exactly how we want things to work, spelled
out in detail, having thought about how it will work for unrooted and
rooted trees. I'm not arguing that the changes that were made to edge
and node numbering are the only way to go, or even the best, but they
are what we came up with to deal with the fact that, once you add the
root edge to the edge matrix, unrooted trees break most of the
existing code because they have an edge that is shared between two
nodes. I understand now why the ape developers kept the root edge out
of the edge matrix: putting it in makes it complicated to deal with
rooted and unrooted trees using the same methods.

From Ben's example code:
> unroot(tree.owls)$edge
>     [,1] [,2]
> [1,]    5    6
> [2,]    6    1
> [3,]    6    2
> [4,]    5    3
> [5,]    5    4
>
>  node 5 does not appear in the second (descendant) column
> of the edge matrix, so the node information has to be somewhat
> distinct from the edge information -- it's one unit longer.
> ape dealt with this by having root information (if any) hanging
> out in a separate place within the data structure, but we got
> rid of that ...

Say I want labels for nodes 5 and 6. Where do those labels go? That is,
what does labels() look like for this edge matrix, and how do we
reorder this tree for plotting or traversal? What about after we root
this tree at node 5?
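
One possible answer, sketched in base R under the assumption that labels are stored per node and keyed by node number. The label values "N5" and "N6" and the NA-ancestor root row are hypothetical; this is not what labels() currently returns.

```r
# Edge matrix from the unroot(tree.owls)$edge output above.
edge <- matrix(c(5, 6,
                 6, 1,
                 6, 2,
                 5, 3,
                 5, 4),
               ncol = 2, byrow = TRUE)

# ASSUMPTION: internal node labels live in node-number order, named by id.
node_label <- c("5" = "N5", "6" = "N6")

# Which row of the edge matrix is "the edge below" each labelled node?
# Node 5 never appears as a descendant, so it gets NA.
row_below <- match(as.numeric(names(node_label)), edge[, 2])
row_below

# Labels forced into edge-matrix order: N5 has nowhere to go.
in_edge_order <- unname(node_label[as.character(edge[, 2])])
in_edge_order

# After rooting at node 5, with a hypothetical root-edge row (NA, 5),
# both labelled nodes own a row again.
rooted_edge <- rbind(c(NA, 5), edge)
row_below_rooted <- match(as.numeric(names(node_label)), rooted_edge[, 2])
row_below_rooted
```

With labels kept in node order, nothing is lost for the unrooted tree. With labels kept in edge order, the label for node 5 can only be stored once the tree is rooted and the root edge is a row of the matrix.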

I could imagine a few ways to deal with this:

0) Undo. Revert to where we were a week ago - take the root edge out  
of the edge matrix.
1) Do what we just did - put the root edge in the edge matrix; nodes
have their own set of attributes, so do edges, and we write accessors
to translate between node IDs and edges (transparent to the end user
for rooted trees?).
2) Keep root in the edge matrix but split rooted and unrooted trees  
into separate classes each with their own methods. I think this would  
be more confusing for programmers and users, but we could basically  
follow the ape tree structure for rooted trees and have a slightly  
modified structure for unrooted trees.
3) Arbitrarily root unrooted trees at one of the nodes that share an  
edge and strip out the imaginary root edge for printing and plotting  
methods. This was an idea that Brian suggested. I'm not sure how hard  
this would be to implement.
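
For what it's worth, option 3 might be sketched roughly as follows in base R. This is only an illustration: the function names are hypothetical, and marking the imaginary root edge with an NA ancestor is an assumption.

```r
# Unrooted edge matrix from the example above.
edge <- matrix(c(5, 6,
                 6, 1,
                 6, 2,
                 5, 3,
                 5, 4),
               ncol = 2, byrow = TRUE)

# Hypothetical: add an imaginary root edge above `root` so that every
# node, including the chosen root, is the descendant of exactly one row.
add_imaginary_root <- function(edge, root) {
  # The chosen root must not already hang below another node.
  stopifnot(!(root %in% edge[, 2]))
  rbind(c(NA, root), edge)
}

# Hypothetical: drop the imaginary root edge again before printing
# or plotting.
strip_imaginary_root <- function(edge) {
  edge[!is.na(edge[, 1]), , drop = FALSE]
}

with_root <- add_imaginary_root(edge, root = 5)
nrow(with_root)                                    # one row per node now
identical(strip_imaginary_root(with_root), edge)   # round trip
```

Rooting at node 6 instead would need the 5-6 edge flipped first, which this sketch does not attempt.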

Cheers,
Steve