[Phylobase-devl] labeling order

Sun Dec 28 21:34:04 CET 2008

Hi Folks,

Sorry for falling off. Christmas was a pretty big deal with my family,  
and then we had a near-24 hour blackout. All is back to normal now.

You guys should go ahead and do what you think best, as you are doing  
all the work. A lot has transpired over the emails and I don't think I  
caught all of it, but here are my opinions for what it's worth.

Thanks Peter for the clarification that the node ordering is the "1",  
"2", etc., in the edge matrix. I don't know what problem this solves  
for some of you, so my concern is just that it's harder to understand.  
I don't like such things because mistakes start to creep into my  
programming.

UNROOTED TREES:
I don't have any experience working with unrooted trees -- all of my  
critters have ancient origins, and it's more or less clear which nodes  
should be most basal. I *think* all comparative methods assume that we  
know the directionality of the tree (which are most basal vs. more  
recent), so I'm not really clear on when it becomes really useful to  
have an unrooted tree. As far as I know, if you are really not sure,  
you must do the analysis each possible way (with the "root" at the  
alternative positions). If we are imagining using these methods for  
population data, then I don't know. Anyway, with that caveat, I prefer  
option 3 in Steve's email:

> 3) Arbitrarily root unrooted trees at one of the nodes that share an
> edge and strip out the imaginary root edge for printing and plotting
> methods. This was an idea that Brian suggested. I'm not sure how hard
> this would be to implement.

Ideally, the user should specify where the "fake root" should be,  
but it can be done arbitrarily. If we take this option, then we  
probably need another attribute for unrooted trees:

@rooted
[1] FALSE

LABELLING:
I prefer the following one. Whether the root is first or not doesn't  
matter to me so much; what matters is that the edge matrix and all  
attributes are in the same order. It is clear and simple.

> If we reorder the edge matrix to:
>
> @edge
>      [,1] [,2]
> [1,]   NA    5
> [2,]    7    3
> [3,]    7    4
> [4,]    5    6
> [5,]    6    1
> [6,]    6    2
> [7,]    5    7
>
> ======== Edge order ordering of single label vector ========
>
> @label
> [1] "root" "t3" "t4" "n1" "t1" "t2" "n2"
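To make the "same order" idea concrete, here is a minimal base-R sketch using the edge matrix and label vector quoted above. It assumes each label describes the descendant node (column 2) of its row. The helper `reorder_edges` is hypothetical, not an existing phylobase function; it only illustrates that one permutation applied to everything keeps the object consistent.

```r
# Edge matrix from the quoted example: column 1 = ancestor, column 2 = descendant.
edge <- matrix(c(NA, 5,
                  7, 3,
                  7, 4,
                  5, 6,
                  6, 1,
                  6, 2,
                  5, 7),
               ncol = 2, byrow = TRUE)

# One label per row, describing the descendant node of that edge (assumption).
label <- c("root", "t3", "t4", "n1", "t1", "t2", "n2")

# Hypothetical helper: apply the same permutation to the edge matrix and to
# every per-edge attribute, so they cannot drift apart.
reorder_edges <- function(edge, label, perm) {
  stopifnot(length(perm) == nrow(edge), length(label) == nrow(edge))
  list(edge = edge[perm, , drop = FALSE], label = label[perm])
}

# Example: sort the rows by descendant node number.
perm  <- order(edge[, 2])
tree2 <- reorder_edges(edge, label, perm)
tree2$label
```

After the reordering, row i of `tree2$edge` and element i of `tree2$label` still refer to the same node.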

I think that any rules we have should be taken care of in the tree  
constructor. For example, if the tips should have a certain  
numbering, we should assign them there. If the tree should be  
cladewise or whatever, that should be reorganized there. We should  
have as few rules as possible, but what we have should be clear and  
validated.

In addition, we should probably have a vector that indicates which are  
the tips, or tip/node, etc. It could be something like:

@tips
[1] 3 4 1 2

Or
@nodetype
[1] "r" "t" "t" "i" "t" "t" "i"
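Both vectors can be derived from the edge matrix rather than stored by hand. The following is a rough base-R sketch, assuming a rooted tree whose root edge is the row with NA in the ancestor column, as in the reordered matrix above. The variable names are only illustrative.

```r
# Reordered edge matrix from the example: ancestor, descendant.
edge <- matrix(c(NA, 5,
                  7, 3,
                  7, 4,
                  5, 6,
                  6, 1,
                  6, 2,
                  5, 7),
               ncol = 2, byrow = TRUE)

ancestor   <- edge[, 1]
descendant <- edge[, 2]

# The root is the node whose ancestor is missing.
is_root <- is.na(ancestor)
# A tip is a descendant that never appears as an ancestor.
is_tip  <- !(descendant %in% ancestor)

# "r" = root, "t" = tip, "i" = internal node, in edge-matrix order.
nodetype <- ifelse(is_root, "r", ifelse(is_tip, "t", "i"))
tips     <- descendant[is_tip]
```

With this matrix, `tips` and `nodetype` come out in edge-matrix order, so a constructor could compute them once and validate them there.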

If you want to be more hygienic and have all internal nodes first,  
followed by the tips in number (and then sorted that way in the edge  
matrix, or whatever), that is fine, but it should be created in the  
constructor. I can work on this later when I am back from SICB  
(mid-Jan).

My concern is that hidden rules will inhibit new programmers from  
other fields from joining in. I would like the structure to be obvious  
to anyone who looks at it. The "ape-like" rules may seem completely  
normal if you are familiar with them, but they can be really confusing  
if you are coming in from the outside. It then becomes easier to just  
make up your own scheme -- thereby fragmenting the programming efforts  
again.

APE & PHYLOBASE:
I think Thibaut brought up a good point regarding what the mission of  
phylobase is. I thought it was to be a universal translator between  
all of the packages. If it is to be a partner package to ape, then the  
goals are different and we should indeed just adopt the ape standards.

SICB:
I am wondering if a working version of the package will be ready for  
teaching at SICB? The workshop is Jan 4th and 5th.

Marguerite

On Dec 28, 2008, at 5:41 AM, Ben Bolker wrote:

>  Very quick comments:
>
> * it's very easy to make tip and node labels named vectors
> (with the node numbers as names) -- I've done this in my branch,
> at least for phylo4 (may need to be checked/replicated for
> the coercions?)
>

The naming is great, especially for input to the constructor. But if  
the structure is obvious and strictly enforced, then it shouldn't be  
necessary to have the vectors inside the phylo4 objects named. (I  
guess it would help with reordering -- so maybe this is better to have  
than not).
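As a sketch of how names could help with reordering: if the label vectors are named by node number, the labels for any edge ordering can be looked up on demand and the label vectors themselves never need rearranging. This is plain base R, not code from the "newlabels" branch; `labels_in_edge_order` is a hypothetical helper, and the two matrices are the two edge orderings that appear in this thread.

```r
# Label vectors named by node number (tips 1-4, internal nodes 5-7).
tip_label  <- c("1" = "t1", "2" = "t2", "3" = "t3", "4" = "t4")
node_label <- c("5" = "root", "6" = "n1", "7" = "n2")
all_label  <- c(tip_label, node_label)

# Hypothetical helper: label of the descendant node for each edge row.
labels_in_edge_order <- function(edge, all_label) {
  unname(all_label[as.character(edge[, 2])])
}

# Two orderings of the same edges: ancestor, descendant.
edge_a <- matrix(c(5, 6,  6, 1,  6, 2,  5, 7,  7, 3,  7, 4,  NA, 5),
                 ncol = 2, byrow = TRUE)
edge_b <- matrix(c(NA, 5,  7, 3,  7, 4,  5, 6,  6, 1,  6, 2,  5, 7),
                 ncol = 2, byrow = TRUE)

labels_a <- labels_in_edge_order(edge_a, all_label)
labels_b <- labels_in_edge_order(edge_b, all_label)
```

The lookup is by name, so `all_label` is identical before and after; only the derived per-edge vector changes with the edge order.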

> * absolutely agree that conversion to/from ape is essential.
> I think we should actually pull Emmanuel in on some of the  
> conversations
> about ordering and what should be done (i.e., ape seems to depend
> on cladewise ordering for several functions, but this does not
> seem to be specified/enforced very strongly)
>
Yes, this is essential. We should have a robust converter from phylo4  
to phylo.

> * can someone who does have access poke around at my branch
> ("newlabels") and see what they think?  We're coming up on our
> nominal CRAN release date, and we're getting slowed down (I think)
> by failing to make a decision on this issue ...

I've forgotten how to do this. Can you please remind me?

Thanks,
Marguerite

p.s. Happy New Year! Hope you had a Merry Christmas!
>
>
>  cheers
>    Ben
>
> Thibaut Jombart wrote:
>> Hello again,
>>
>> there are plenty of things in the past emails, so I might be  
>> missing a few.
>>>
>>>>>  Hmmm.
>>>>>
>>>>>  For the record, here's Steve's statement:
>>>>>
>>>>> SWK - This is crucial and we should decide soon, needs to be  
>>>>> sorted
>>>>> out for 1.5. I think that many of the problems we're having with
>>>>> labels and reordering are due to the fact that until now we  
>>>>> treated
>>>>> nodes and edges as interchangeable. i.e. we had node labels in edge
>>>>> matrix order, but these labels should really be associated with
>>>>> nodes, not with edges.
>>>> I could not agree more.
>>>>
>>>>> This assumption caused things to break once edges
>>>>> and nodes were not equivalent (now that root edge is in the edge
>>>>> matrix
>>>>> and we allow edge matrix reordering, or for unrooted trees). I
>>>>> think we need to be very clear about whether methods are actually
>>>>> operating on nodes or edges.
>>>>> I suggest that edge, edge.labels and edge.lengths (branch lengths)
>>>>> are in 'edge' order.
>>>> I can hardly see how it would make sense otherwise. All information
>>>> provided for a given item should be sorted according to this  
>>>> item. Tips
>>>> labels should be in the tip order, node (internal nodes) label  
>>>> sorted as
>>>> node numbers, etc.
>>>>
>>>
>>>  Here's where it gets tricky.  Of course it's sensible for edge
>>> lengths and labels to be in edge matrix order ... for the others
>>> (tip labels, node labels), what do you mean by "tip order", "node
>>> numbers"?
>>>
>> Sorry, I have no working R from here, so I can provide no clear  
>> example.
>> Say a tree has T tips and N internal nodes.
>> Tip labels should be provided for nodes 1:T, and so-called node labels
>> (internal nodes) for (T+1):(T+N). That is, the ordering tagged as "Node
>> number ordering of labels" from Peter's (useful, thanks!) example. As
>> Peter has shown, only this ordering still holds when changing the
>> ordering of edges.
>>>>> Everything else (node labels, tip labels) should
>>>>> be in node id order. nodeId can translate between these two  
>>>>> orders.
>>>>> Reorder can act on the edge* only since the underlying node ids
>>>>> will not change.
>>>>>
>>>>> Francois: It's definitely a crucial issue. Perhaps we could track
>>>>> node.labels and tip.labels by using named vectors, the names of  
>>>>> the
>>>>> vector would be the nodeId.
>>>>>
>>>> I may be missing something here, but isn't this what we do when using
>>>> getnodes?
>>>>
>>>
>>>  I think we don't need more identifiers than node numbers ...
>>>
>> I don't get this answer. What I meant was: it would be clearer if we
>> used named vectors for node/tip labels. Possibly even for edges.  
>> Taking
>> back Peter's example:
>>
>> @edge
>>     [,1] [,2]
>> [1,]    5    6
>> [2,]    6    1
>> [3,]    6    2
>> [4,]    5    7
>> [5,]    7    3
>> [6,]    7    4
>> [7,]   NA    5
>>
>> -> Use named vectors
>>
>> @tip.label
>>   1    2    3    4
>> "t1" "t2" "t3" "t4"
>>
>> @node.label
>>  5     6    7
>> "root" "n1" "n2"
>>
>> So we make it clear what ordering is used. In the doc, we can then just
>> say that the names of the label vectors for internal nodes and tips are
>> the numbers identifying these items in @edge.
>>> Marguerite:
>>> This one is very important, and I think it's a very bad idea to  
>>> unlink
>>> the edges and nodes. Edges and nodes are intimately linked. In my
>>> mind, the edge is simply the branch below the node. So to have edges
>>> in one order and nodes in another order makes no sense to me at all.
>>> Why don't we simply give node ID's in "edge" order as you are using
>>> it? otherwise, there is HUGE potential for confusion. And we would
>>> need yet another index that indicates a mapping of the node ID to  
>>> the
>>> edge matrix.
>>>
>>>> Again, I completely agree. Edges are uniquely identified by their
>>>> descendant node, and this is what we have used from the beginning.
>>>> Moreover, this is what is used in ape, and I think we should diverge
>>>> from it only when it is mandatory (e.g. plotting trees with singletons
>>>> if these make sense). Most phylobase users are and will be primarily
>>>> ape users.
>>>>
>>>
>>>  We're not diverging from this.
>>>  We're saying that we will keep data and the lists of node labels
>>> (tips and internal nodes) in order of node numbers, and not  
>>> rearrange
>>> them every time we reorder the edge matrix.
>>>
>> Yes, so no disagreement for me here.
>>>>> Instead, why don't we just decide on a standard ordering for phylobase,
>>>>> number the node ID's in this way, and then allow the edge matrix  
>>>>> and
>>>>> nodeID (and all data vectors) to be reordered as needed for  
>>>>> whatever
>>>>> functions.  Using the node ID, we can easily  put everything  
>>>>> back to
>>>>> the "default" phylobase order, BUT ONLY IF all objects (edge  
>>>>> matrix,
>>>>> branch lengths, labels, etc etc are in the SAME order. Don't  
>>>>> "break"
>>>>> the integrity of the object just for programming convenience.  
>>>>> There is
>>>>> just too much danger for confusion. I, for one, would stop using
>>>>> phylobase, because it's just too hard to remember the  
>>>>> peculiarities of
>>>>> the way the object is constructed. Every time I wanted to do  
>>>>> something,
>>>>> I'd have to relearn the rules.
>>>>>
>>>> Same for me.
>>>>
>>>>
>>>
>>>  Hmm.
>>>  I've been working to try to make everything consistent in node  
>>> order
>>> (as Steve suggested).  Thibaut/Marguerite, what do you suggest for  
>>> the
>>> case of unrooted trees?
>> Order by node numbers, as in Peter's example.
>>>  Thibaut, how often do you match up edges with
>>> data and labels?
>>>
>> Never. Not sure I will ever need to do so. All matching I use are
>> data/labels with tips and internal nodes.
>>>   I've done a bunch of stuff, and I'd like to commit it, because  
>>> it's
>>> all reasonably consistent now, but I'd like to hear some more
>>> conversation -- I'm willing to work back through while it's fresh in
>>> my mind and do everything the opposite way (keeping everything
>>> in edge-matrix order all the time), provided we know how to handle
>>> unrooted trees (and are willing to live with not being able to  
>>> handle
>>> reticulations).
>>>
>>>  More discussion please?
>>>
>>> Ben
>> Peter wrote:
>>> Following up on the previous point, maybe what we really need is to
>>> spell out how we want tree structures to look, similar to the
>>> whitepaper on the phylo class.
>>>
>>> I understand the desire to not break existing code and provide a
>>> phylogeny class that is intuitive for users and developers, but I  
>>> don't
>>> agree that we should feel bound to follow the ape phylo structure.  
>>> If
>>> we're just implementing phylo in S4 then we should be upfront  
>>> about it
>>> and follow the phylo class specification exactly. I don't think  
>>> we're
>>> doing that, though. There are a number of features that we might  
>>> want
>>> to implement that aren't in phylo, including singleton nodes,
>>> reordering of the edges or nodes, root edges in the edge matrix,
>>> reticulations, how to represent rooted versus unrooted trees,  
>>> separate
>>> labels and data for edges and nodes, and so on.
>>>
>> I think I had a different idea of what phylobase was about, but no
>> problem there. To me, the first purpose of phylobase was handling
>> phylogeny+data associated to tips and possibly internal nodes. That  
>> is,
>> leave what concerns phylogeny alone to ape, as it already set the  
>> basis
>> for handling phylogeny, and did that pretty well. Of course we can  
>> think
>> of improvements, but to me they might belong to ape more than to
>> phylobase (that was my point with all 'treewalk' functions, that  
>> would
>> be useful for ape's phylogenies as well). For instance, I am pretty  
>> sure
>> that if we provide Emmanuel a good example of a tree where handling
>> singletons is needed, and possibly a patch to the code, that would be
>> implemented quickly in ape. I understand that diverging from ape is
>> quicker and more straightforward than making ape and phylobase evolve
>> together. My position would be minimizing such changes and making  
>> sure
>> tree conversion remains possible both ways.
>>
>> Best,
>>
>> Thibaut.
>
>
> -- 
> Ben Bolker
> Associate professor, Biology Dep't, Univ. of Florida
> bolker at ufl.edu / www.zoology.ufl.edu/bolker
> GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc