[Phylobase-devl] Conference call minutes

Sun Mar 23 03:19:08 CET 2008

On Mar 22, 2008, at 12:26 PM, Peter Cowan wrote:

>
>>> The problem with non-unique node labels happens when you try to
>>> create a phylo4d object, this will be fixed when we switch to using
>>> a vector/list of labels to identify nodes in place of the current
>>> use of the row.names of the data frame for node ID and data
>>> attachment:
>
>> Yes, I realize this. That is why I was insisting earlier that
>> matching shouldn't be done on labels. Either they are labels for
>> convenience or they are not. We can't have it both ways. If users
>> want to match their data, then they should make sure that the data
>> are assigned to the proper node by providing the "node index" that
>> they match to. It would be a simple thing, so long as they can print
>> the node index and enter them into a spreadsheet with the data. It
>> is a lot easier, for example, than making sure that each species
>> name is spelled correctly in each dataset.
>>
>> Sometimes this is a big pain because you get one set of species
>> names from PAUP or whatever, but you have a different abbreviation
>> in your phenotypic data. Then all species must be renamed in one
>> dataset or the other.  It's a lot easier just to make sure that a
>> number matches.
>
> Marguerite I don't fully understand this example.
>
> For my own edification let me see if I understand what folks are
> thinking.  There are two aspects of a phylo object under discussion:
> the node index and the node labels.  As far as node labels are
> concerned there is agreement that these need to be arbitrary with no
> restrictions on being unique, or even existing.
>
> However, there is a discussion about node indices, and whether they
> should be enforced to be  an ordered vector from 1:Nnodes.
>
> One argument for keeping them 1:Nnodes is that it is easier to iterate
> over the nodes this way.  I can see that, but looping for each in R is
> easy, is there an example where this would be difficult?
>
> One argument for just using a number is that it becomes easier to
> compare trees, but this requires exposing the node indices to end users.
> I'm not sure this has much value for the end users who generally don't
> care much for the internal representation of the tree.  Is there a
> value to developers to have non consecutive node indices?
>
> Steve's proposed solution to the tree comparison/tracking issue is to
> use node labels (not indices), this would require a richer node label
> model than the one currently implemented.  I think Steve has a node
> label data frame in mind.  That would allow unique node label
> information to sit next to potentially non-unique node label
> information.  This seems overly complex to me.  A phylo4d object
> already has a node data data frame, which has unique row names,
> perhaps this should be used instead?
>
As I see it, our needs from a design perspective are:

1) node indices:
These are required to be unique so that we can refer to specific
nodes. It is thus logical to do any matching by this node index. It
happens to be convenient to use numerals, because they are the most
compact (and easy for human minds to quickly differentiate) unique
representation. This is not a big deal, but in OUCH the node index
(it's called node label, to be totally confusing with our usage here)
is of type character, mainly to take advantage of R's great string
matching facilities.

2) taxon names:
These are convenience labels for the user and are not required to be
unique. Users generally use them to specify labels to print on the
tree plot. For example, they may want some node labels to appear but
not others. Thus, some nodes will have missing values.

As I understand it, the proposal is to have yet another node index  
that is used to match phylo4d objects to the tree. This seems to be a  
bit redundant. We already have a unique index, why would we want to  
reinvent the wheel? Furthermore, this is more difficult for the user.  
They have to associate it with both the tree and the data object in  
order for the matching to work, instead of just copying it from tree  
to data. We'll have to do checking to make sure these are unique or  
the merge will produce wonky results or fail. And they may make less
than ideal choices as to the index (probably some words or long names
or something that might get misspelled in one but not the other,
etc., if they are not used to thinking about efficiency).  But
really, I don't see the need for more than one unique index. They  
serve the same function.
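
To illustrate the "wonky results" concern, here is a minimal sketch in
base R. The data frames, column names, and labels are all made up for
illustration; none of this is phylobase code. It shows the uniqueness
check we would need, and what merge() does when the key is not unique.

```r
# Hypothetical tree-side table: two nodes share the label "sp_A",
# and one node has no label at all.
tree_nodes <- data.frame(index = 1:4,
                         label = c("sp_A", "sp_A", "sp_B", NA),
                         stringsAsFactors = FALSE)

# Hypothetical user data keyed by label.
traits <- data.frame(label = c("sp_A", "sp_B"),
                     trait = c(52.1, 67.4),
                     stringsAsFactors = FALSE)

# The check we would have to run before matching on labels:
# anyDuplicated() returns 0 only when every value is unique.
labels_present <- tree_nodes$label[!is.na(tree_nodes$label)]
label_is_unique <- anyDuplicated(labels_present) == 0

# Merging on the non-unique label attaches the single "sp_A" record
# to both nodes that carry that label.
by_label <- merge(tree_nodes, traits, by = "label")
```

Here the one "sp_A" record ends up on two different nodes, which is
the kind of silent mismatch a single unique index avoids.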

All in all, it seems much simpler to generate the node index for the  
phylo4 object as an integer index as we already do, and ask the users  
to copy this index into their data object.  The node numbers are  
printed in the print() function.
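
As a rough sketch of what copying the index into the data would look
like, assuming the user's data carry a column (called node here, a
made-up name) holding the copied node index. This is plain base R with
invented values, not the phylobase implementation.

```r
# Hypothetical node index as printed for a tree (need not be 1:N).
node.index <- c(1L, 2L, 3L, 5L, 8L)

# User data keyed by the copied node index. Rows may be in any order,
# and nodes without data are simply absent.
dat <- data.frame(node = c(8L, 1L, 3L),
                  trait = c(2.5, 0.7, 1.9))

# Every index in the data must exist in the tree and appear only once.
stopifnot(all(dat$node %in% node.index),
          anyDuplicated(dat$node) == 0)

# Align the data to the tree's node order; nodes without data get NA.
aligned <- dat$trait[match(node.index, dat$node)]
```

No spelling has to agree between the two sources; only the integers
have to match.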

Whether or not the index needs to be consecutively numbered is a  
separate issue. I don't think it's necessary, because I'd like the  
node index to remain stable after pruning or minor manipulations of  
the tree.  As Peter says, you can iterate over elements of a vector  
very easily.

It works like this:

for (i in node.index) { ... }

On each iteration, i is an element of node.index.
If you want its position in the vector, you can simply do
node.order = which(node.index == i)
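
A slightly fuller sketch of the same idea, assuming a tree whose nodes
4, 6 and 7 have been pruned so that the surviving indices are no longer
consecutive. The vector below is invented for illustration.

```r
# Hypothetical node index after pruning: surviving nodes keep their numbers.
node.index <- c(1L, 2L, 3L, 5L, 8L)

# Iterate over the indices themselves and record each one's position
# in the vector, stored under the index as a name.
pos <- integer(0)
for (i in node.index) {
  pos[as.character(i)] <- which(node.index == i)
}

# The same lookup for a single node, without a loop.
pos_of_5 <- match(5L, node.index)
```

Node 5 is still called 5 after pruning, even though it now sits in the
fourth position of the vector.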

Steve mentions extra labels. To me these are just extra data fields.

Marguerite

>
>>>
>>>>> There was also a proposal to relax the restriction on node
>>>>> numbers being 1:length(nodes).
>>>
>>> I feel like we're mixing up what I am going to call node indexing
>>> and node labelling. Node indexing is purely for internal/
>>> development purposes - currently nodes are indexed as 1:NNodes, all
>>> functions and methods can safely assume that they can iterate over
>>> nodes in this way, end users never need to think about these
>>> numbers unless they want to. Node labelling encompasses any other
>>> sort of data or identifier that you want to associate with a node,
>>> i.e. for end-users who want to be able to identify nodes that are
>>> the 'same node' across multiple trees, which could be implemented
>>> as actual node labels accessed via labels() or could be included as
>>> node data in a phylo4d object, since both labels and data persist
>>> across subset operations.
>
>>>>> Pros:
>>>>> Easier diffing of trees. For example, if I have a large tree of
>>>>> birds, but only have beak trait data for a subset and tarsus
>>>>> length for a different subset, comparing the two subsets is
>>>>> easier if the nodes are NOT renumbered.
>>>
>>>
>>> If I understand the example, it sounds like what you want is a set
>>> of unique node labels on the large tree of birds that would allow
>>> an end-user to match nodes between subsequent subsets of the large
>>> tree:
>>> intersect(labels(subTree1),labels(subTree2))
>>>
>>> I think this is a problem that is best solved by adding node labels
>>> to the large tree, not by changing the way nodes are indexed by all
>>> functions and methods in phylobase. It sounds like we do need a
>>> method to create unique node labels, either as labels() or phylo4d
>>> data, when users need them? I may just be missing the point of
>>> changing the way nodes are indexed, I think about this stuff as
>>> someone who writes functions that iterate over the nodes on a tree,
>>> which would be more complicated if nodes had arbitrary index  
>>> numbers.
>>>
>>> Cheers,
>>> Steve
>>
>> ____________________________________________
>> Marguerite A. Butler
>> Department of Zoology
>> University of Hawaii
>> 2538 McCarthy Mall, Edmondson 259
>> Honolulu, HI  96822
>>
>> Phone: 808-956-4713
>> FAX:   808-956-9812
>> Dept: 808-956-8617
>> http://www2.hawaii.edu/~mbutler
>> http://www.hawaii.edu/zoology/
>>
>> _______________________________________________
>> Phylobase-devl mailing list
>> Phylobase-devl at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
>
