[Phylobase-devl] Conference call minutes

Sun Mar 23 23:19:37 CET 2008

Marguerite Butler wrote:
> On Mar 22, 2008, at 12:26 PM, Peter Cowan wrote:
> 
>>
>>>> The problem with non-unique node labels happens when you try to
>>>> create a phylo4d object; this will be fixed when we switch to using
>>>> a vector/list of labels to identify nodes in place of the current
>>>> use of the row.names of the data frame for node ID and data
>>>> attachment:
>>
>>> Yes, I realize this. That is why I was insisting earlier that
>>> matching shouldn't be done on labels. Either they are labels for
>>> convenience or they are not. We can't have it both ways. If users
>>> want to match their data, then they should make sure that the data
>>> are assigned to the proper node by providing the "node index" that
>>> they match to. It would be a simple thing, so long as they can print
>>> the node index and enter them into a spreadsheet with the data. It
>>> is a lot easier, for example, than making sure that each species
>>> name is spelled correctly in each dataset.
>>>
>>> Sometimes this is a big pain because you get one set of species
>>> names from PAUP or whatever, but you have a different abbreviation
>>> in your phenotypic data. Then all species must be renamed in one
>>> dataset or the other.  It's a lot easier just to make sure that a
>>> number matches.
>>
>> Marguerite, I don't fully understand this example.
>>
>> For my own edification let me see if I understand what folks are
>> thinking.  There are two aspects of a phylo object under discussion:
>> the node index and the node labels.  As far as node labels are
>> concerned there is agreement that these need to be arbitrary with no
>> restrictions on being unique, or even existing.
>>
>> However, there is a discussion about node indices, and whether they
>> should be enforced to be an ordered vector from 1:Nnodes.
>>
>> One argument for keeping them 1:Nnodes is that it is easier to iterate
>> over the nodes this way.  I can see that, but looping over each element
>> in R is easy. Is there an example where this would be difficult?
>>
>> One argument for just using a number is that it becomes easier to
>> compare trees, but this requires exposing the node indices to end users.
>> I'm not sure this has much value for the end users, who generally don't
>> care much for the internal representation of the tree.  Is there a
>> value to developers to have non-consecutive node indices?
>>
>> Steve's proposed solution to the tree comparison/tracking issue is to
>> use node labels (not indices); this would require a richer node label
>> model than the one currently implemented.  I think Steve has a node
>> label data frame in mind.  That would allow unique node label
>> information to sit next to potentially non-unique node label
>> information.  This seems overly complex to me.  A phylo4d object
>> already has a data frame of node data, which has unique row names;
>> perhaps this should be used instead?

   Back to where we started -- we can't do this because **data frame
row names are required to be unique**.

   A data frame is overkill for node labels, but a character vector
(enforced to have the appropriate length) is fine ...
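
   A minimal sketch of that constraint in base R, using made-up labels
(node.labels and node.data are illustrative names, not phylobase slots):
a character vector accepts duplicated or missing labels, while the same
labels are rejected as data frame row names.

```r
# Hypothetical node labels: duplicates and NA are fine in a plain vector
node.labels <- c("root", "cladeA", "cladeA", NA)
node.data <- data.frame(size = c(1.2, 3.4, 5.6, 7.8))

# Trying to use duplicated labels as row names raises an error
res <- tryCatch({
  row.names(node.data) <- c("root", "cladeA", "cladeA", "cladeB")
  "ok"
}, error = function(e) "error")

# The only check the vector needs: one label slot per node
stopifnot(length(node.labels) == nrow(node.data))
```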

>>
> I guess what I see as our needs from a design perspective are:
> 
> 1) node indices:
> Which are required to be unique so that we can refer to specific nodes. 
> It is thus logical to do any matching by this node index. It happens to
> be convenient to use numerals, because they are the most compact (and
> easy for human minds to quickly differentiate) unique representation.
> This is not a big deal, but in OUCH the node index (it's called node 
> label to be totally confusing with our usage here) is of type character, 
> mainly to take advantage of R's great string matching facilities.
> 
> 2) taxon names:
> These are convenience labels for the user and not required to be unique. 
> They are generally used by users to specify labels to print on
> the tree plot.  For example, they may want some node labels to appear
> but not others. Thus, some nodes will have missing values.
> 
> As I understand it, the proposal is to have yet another node index that 
> is used to match phylo4d objects to the tree. This seems to be a bit 
> redundant. We already have a unique index, why would we want to reinvent 
> the wheel? Furthermore, this is more difficult for the user. They have 
> to associate it with both the tree and the data object in order for the 
> matching to work, instead of just copying it from tree to data. We'll 
> have to do checking to make sure these are unique or the merge will 
> produce wonky results or fail. And they may make less than ideal choices 
> as to the index (probably some words or long names or something that 
> might get misspelled in one but not the other, etc., if they are not 
> used to thinking about efficiency).  But really, I don't see the need
> for more than one unique index. They serve the same function.

   I didn't think we were proposing to have another node index -- just
to make sure that the node labels were appropriately dissociated from
the rownames of the data matrix ....

> 
> All in all, it seems much simpler to generate the node index for the 
> phylo4 object as an integer index as we already do, and ask the users to 
> copy this index into their data object.  The node numbers are printed in 
> the print() function.
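
   One way the copy-the-index workflow could look in base R. This is only
a sketch under assumptions: node.index stands in for whatever the tree
prints, and the trait data frame with its "node" column is hypothetical.

```r
# Hypothetical node indices as printed from a tree (non-consecutive after pruning)
node.index <- c(1L, 2L, 5L, 7L)

# Hypothetical user data keyed by the copied node index, in arbitrary order
trait <- data.frame(node = c(7L, 1L, 5L), beak = c(10.1, 8.3, 9.7))

# The index column must be unique and must refer to existing nodes
stopifnot(!anyDuplicated(trait$node), all(trait$node %in% node.index))

# Align the data to the tree's node order; nodes without data get NA
aligned <- trait$beak[match(node.index, trait$node)]
```

Matching on integers avoids the spelling mismatches that label-based
matching is prone to, at the cost of exposing the index to the user.
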
> 
> Whether or not the index needs to be consecutively numbered is a 
> separate issue. I don't think it's necessary, because I'd like the node 
> index to remain stable after pruning or minor manipulations of the 
> tree.  As Peter says, you can iterate over elements of a vector very 
> easily.
> 
> It works like this:
> 
> for (i in node.index) { ... }
> 
> Each iteration of i is an element of node.index
> If you want its order in the vector you can simply do  node.order = 
> which(node.index == i)
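
   A small runnable version of that loop, assuming a made-up
non-consecutive node.index. Note that which() expects a logical vector,
so the position is found by comparing the vector against i.

```r
# Hypothetical node index left non-consecutive after pruning
node.index <- c(1L, 2L, 5L, 7L)

positions <- integer(0)
for (i in node.index) {
  # Position of node i within the index vector
  node.order <- which(node.index == i)
  positions <- c(positions, node.order)
}

# When only positions are needed, seq_along() gives them directly
stopifnot(identical(positions, seq_along(node.index)))
```
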
> 
> Steve mentions extra labels. To me these are just extra data fields.
> 
> Marguerite
> 
>>
>>>>
>>>>>> There was also a proposal to relax the restriction on node
>>>>>> numbers being 1:length(nodes).
>>>>
>>>> I feel like we're mixing up what I am going to call node indexing
>>>> and node labelling. Node indexing is purely for internal/
>>>> development purposes - currently nodes are indexed as 1:NNodes, all
>>>> functions and methods can safely assume that they can iterate over
>>>> nodes in this way, end users never need to think about these
>>>> numbers unless they want to. Node labelling encompasses any other
>>>> sort of data or identifier that you want to associate with a node,
>>>> i.e. for end-users who want to be able to identify nodes that are
>>>> the 'same node' across multiple trees, which could be implemented
>>>> as actual node labels accessed via labels() or could be included as
>>>> node data in a phylo4d object, since both labels and data persist
>>>> across subset operations.
>>
>>>>>> Pros:
>>>>>> Easier diffing of trees. For example, if I have a large tree of
>>>>>> birds, but only have beak trait data for a subset and tarsus
>>>>>> length for a different subset, comparing the two subsets is
>>>>>> easier if the nodes are NOT renumbered.
>>>>
>>>>
>>>> If I understand the example, it sounds like what you want is a set
>>>> of unique node labels on the large tree of birds that would allow
>>>> an end-user to match nodes between subsequent subsets of the large
>>>> tree:
>>>> intersect(labels(subTree1),labels(subTree2))
>>>>
>>>> I think this is a problem that is best solved by adding node labels
>>>> to the large tree, not by changing the way nodes are indexed by all
>>>> functions and methods in phylobase. It sounds like we do need a
>>>> method to create unique node labels, either as labels() or phylo4d
>>>> data, when users need them? I may just be missing the point of
>>>> changing the way nodes are indexed, I think about this stuff as
>>>> someone who writes functions that iterate over the nodes on a tree,
>>>> which would be more complicated if nodes had arbitrary index numbers.
>>>>
>>>> Cheers,
>>>> Steve
>>>
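
   A sketch of the label-based matching Steve describes, in base R only.
The label vectors below are invented stand-ins for what labels() would
return on each subset, and make.unique() is just one possible way to
generate unique labels; it is not a phylobase method.

```r
# Hypothetical unique node labels assigned once on the full tree
full.labels <- c("n1", "n2", "n3", "n4", "n5", "n6")

# Hypothetical subsets: nodes kept for beak data and for tarsus data
subTree1.labels <- full.labels[c(1, 2, 3, 5)]
subTree2.labels <- full.labels[c(2, 3, 4, 6)]

# Nodes present in both subsets, identified by label rather than by index
shared <- intersect(subTree1.labels, subTree2.labels)

# Non-unique labels can be made unique by appending a numeric suffix
unique.labels <- make.unique(c("cladeA", "cladeA", "cladeB"))
```
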
>>> ____________________________________________
>>> Marguerite A. Butler
>>> Department of Zoology
>>> University of Hawaii
>>> 2538 McCarthy Mall, Edmondson 259
>>> Honolulu, HI  96822
>>>
>>> Phone: 808-956-4713
>>> FAX:   808-956-9812
>>> Dept: 808-956-8617
>>> http://www2.hawaii.edu/~mbutler
>>> http://www.hawaii.edu/zoology/
>>>
>>> _______________________________________________
>>> Phylobase-devl mailing list
>>> Phylobase-devl at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl 
