[Phylobase-devl] Conference call minutes
Ben Bolker
bolker at zoo.ufl.edu
Sun Mar 23 23:19:37 CET 2008
Marguerite Butler wrote:
> On Mar 22, 2008, at 12:26 PM, Peter Cowan wrote:
>
>>
>>>> The problem with non-unique node labels happens when you try to
>>>> create a phylo4d object. This will be fixed when we switch to using
>>>> a vector/list of labels to identify nodes in place of the current
>>>> use of the row.names of the data frame for node ID and data
>>>> attachment:
>>
>>> Yes, I realize this. That is why I was insisting earlier that
>>> matching shouldn't be done on labels. Either they are labels for
>>> convenience or they are not. We can't have it both ways. If users
>>> want to match their data, then they should make sure that the data
>>> are assigned to the proper node by providing the "node index" that
>>> they match to. It would be a simple thing, so long as they can print
>>> the node index and enter them into a spreadsheet with the data. It
>>> is a lot easier, for example, than making sure that each species
>>> name is spelled correctly in each dataset.
>>>
>>> Sometimes this is a big pain because you get one set of species
>>> names from PAUP or whatever, but you have a different abbreviation
>>> in your phenotypic data. Then all species must be renamed in one
>>> dataset or the other. It's a lot easier just to make sure that a
>>> number matches.
>>
>> Marguerite, I don't fully understand this example.
>>
>> For my own edification let me see if I understand what folks are
>> thinking. There are two aspects of a phylo object under discussion:
>> the node index and the node labels. As far as node labels are
>> concerned there is agreement that these need to be arbitrary with no
>> restrictions on being unique, or even existing.
>>
>> However, there is a discussion about node indices, and whether they
>> should be enforced to be an ordered vector from 1:Nnodes.
>>
>> One argument for keeping them 1:Nnodes is that it is easier to iterate
>> over the nodes this way. I can see that, but looping for each in R is
>> easy. Is there an example where this would be difficult?
>>
>> One argument for just using a number is that it becomes easier to
>> compare trees, but this requires exposing the node indices to end users.
>> I'm not sure this has much value for the end users, who generally don't
>> care much for the internal representation of the tree. Is there a
>> value to developers in having non-consecutive node indices?
>>
>> Steve's proposed solution to the tree comparison/tracking issue is to
>> use node labels (not indices); this would require a richer node label
>> model than the one currently implemented. I think Steve has a node
>> label data frame in mind. That would allow unique node label
>> information to sit next to potentially non-unique node label
>> information. This seems overly complex to me. A phylo4d object
>> already has a data frame of node data, which has unique row names;
>> perhaps this should be used instead?
Back to where we started -- we can't do this because **data frame
row names are required to be unique**.
A data frame is overkill for node labels, but a character vector
(enforced to have the appropriate length) is fine ...
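
As a minimal sketch of that constraint (the label values and the helper
name make_node_labels() are made up for illustration and are not
phylobase code): R refuses duplicated row names on a data frame, while a
plain character vector only needs its length checked.

```r
# Hypothetical labels for four nodes; "cladeA" is deliberately repeated.
node_labels <- c("root", "cladeA", "cladeA", "tipB")
node_data <- data.frame(size = c(1.2, 3.4, 5.6, 7.8))

# Using the labels as row names fails: R rejects duplicated row names.
msg <- tryCatch(
  suppressWarnings({
    row.names(node_data) <- node_labels
    NA_character_               # only reached if the assignment succeeded
  }),
  error = function(e) conditionMessage(e)
)
assignment_failed <- !is.na(msg)

# A plain character vector has no such restriction; only its length is
# checked. make_node_labels() is a made-up helper, not a phylobase function.
make_node_labels <- function(labels, n_nodes) {
  labels <- as.character(labels)
  if (length(labels) != n_nodes) {
    stop("expected ", n_nodes, " labels, got ", length(labels))
  }
  labels
}

labs <- make_node_labels(node_labels, n_nodes = nrow(node_data))
n_duplicated <- sum(duplicated(labs))   # duplicates are kept, not rejected
```

Here the row-name assignment errors out and leaves node_data untouched,
whereas the vector keeps the repeated label as-is.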
>>
> I guess what I see as our needs from a design perspective is:
>
> 1) node indices:
> Which are required to be unique so that we can refer to specific nodes.
> It is thus logical to do any matching by this node index. It happens to
> be convenient to use numerals, because they are the most compact (and
> easy for human minds to quickly differentiate) unique representation.
> This is not a big deal, but in OUCH the node index (it's called node
> label to be totally confusing with our usage here) is of type character,
> mainly to take advantage of R's great string matching facilities.
>
> 2) taxon names:
> These are convenience labels for the user and not required to be unique.
> They are generally used to specify the labels to print on the tree
> plot. For example, users may want some node labels to appear but not
> others. Thus, some nodes will have missing values.
>
> As I understand it, the proposal is to have yet another node index that
> is used to match phylo4d objects to the tree. This seems to be a bit
> redundant. We already have a unique index, why would we want to reinvent
> the wheel? Furthermore, this is more difficult for the user. They have
> to associate it with both the tree and the data object in order for the
> matching to work, instead of just copying it from tree to data. We'll
> have to do checking to make sure these are unique or the merge will
> produce wonky results or fail. And they may make less than ideal choices
> as to the index (probably some words or long names or something that
> might get misspelled in one but not the other, etc., if they are not
> used to thinking about efficiency). But really, I don't see the need
> for more than one unique index. They serve the same function.
I didn't think we were proposing to have another node index -- just
to make sure that the node labels were appropriately dissociated from
the rownames of the data matrix ....
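
A rough sketch of what that dissociation could look like, assuming the
data rows are keyed by the unique node index while the labels live in a
separate vector (all values and variable names below are invented for
illustration; this is not the phylobase implementation):

```r
# Made-up node numbers, deliberately non-consecutive (e.g. after pruning).
node_index <- c(5L, 8L, 11L)

# Labels are a separate character vector in the same order as node_index;
# duplicates and missing values are allowed.
node_labels <- c("cladeA", "cladeA", NA)

# Row names of the node data carry the unique index, never the labels.
node_data <- data.frame(mass = c(4.1, 3.3, 2.5),
                        row.names = as.character(node_index))

# Look up the data and the label for one node via its index.
pos_11 <- match(11L, node_index)     # position of node 11 in the vectors
mass_11 <- node_data[as.character(11L), "mass"]
label_11 <- node_labels[pos_11]

# Iterating needs positions in the vector, not consecutive node numbers.
for (k in seq_along(node_index)) {
  cat(node_index[k], node_labels[k], node_data$mass[k], "\n")
}
```

With this layout the labels can repeat or be missing without touching
the row names, and the loop does not care whether the index is 1:Nnodes.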
>
> All in all, it seems much simpler to generate the node index for the
> phylo4 object as an integer index as we already do, and ask the users to
> copy this index into their data object. The node numbers are printed in
> the print() function.
>
> Whether or not the index needs to be consecutively numbered is a
> separate issue. I don't think it's necessary, because I'd like the node
> index to remain stable after pruning or minor manipulations of the
> tree. As Peter says, you can iterate over elements of a vector very
> easily.
>
> It works like this:
>
> for (i in node.index) { ... }
>
> Each iteration of i is an element of node.index
> If you want its order in the vector you can simply do node.order =
> which(node.index == i)
>
> Steve mentions extra labels. To me these are just extra data fields.
>
> Marguerite
>
>>
>>>>
>>>>>> There was also a proposal to relax the restriction on node
>>>>>> numbers being 1:length(nodes).
>>>>
>>>> I feel like we're mixing up what I am going to call node indexing
>>>> and node labelling. Node indexing is purely for internal/development
>>>> purposes: currently nodes are indexed as 1:NNodes, all functions
>>>> and methods can safely assume that they can iterate over nodes in
>>>> this way, and end users never need to think about these numbers
>>>> unless they want to. Node labelling encompasses any other sort of
>>>> data or identifier that you want to associate with a node, i.e.
>>>> for end-users who want to be able to identify nodes that are the
>>>> 'same node' across multiple trees. This could be implemented as
>>>> actual node labels accessed via labels() or could be included as
>>>> node data in a phylo4d object, since both labels and data persist
>>>> across subset operations.
>>
>>>>>> Pros:
>>>>>> Easier diffing of trees. For example, if I have a large tree of
>>>>>> birds, but only have beak trait data for a subset and tarsus
>>>>>> length for a different subset, comparing the two subsets is
>>>>>> easier if the nodes are NOT renumbered.
>>>>
>>>>
>>>> If I understand the example, it sounds like what you want is a set
>>>> of unique node labels on the large tree of birds that would allow
>>>> an end-user to match nodes between subsequent subsets of the large
>>>> tree:
>>>> intersect(labels(subTree1),labels(subTree2))
>>>>
>>>> I think this is a problem that is best solved by adding node labels
>>>> to the large tree, not by changing the way nodes are indexed by all
>>>> functions and methods in phylobase. It sounds like we do need a
>>>> method to create unique node labels, either as labels() or phylo4d
>>>> data, when users need them? I may just be missing the point of
>>>> changing the way nodes are indexed; I think about this stuff as
>>>> someone who writes functions that iterate over the nodes on a tree,
>>>> which would be more complicated if nodes had arbitrary index numbers.
>>>>
>>>> Cheers,
>>>> Steve
>>>
>>> ____________________________________________
>>> Marguerite A. Butler
>>> Department of Zoology
>>> University of Hawaii
>>> 2538 McCarthy Mall, Edmondson 259
>>> Honolulu, HI 96822
>>>
>>> Phone: 808-956-4713
>>> FAX: 808-956-9812
>>> Dept: 808-956-8617
>>> http://www2.hawaii.edu/~mbutler
>>> http://www.hawaii.edu/zoology/
>>>
>>> _______________________________________________
>>> Phylobase-devl mailing list
>>> Phylobase-devl at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
>>>