[Phylobase-devl] Conference call minutes

Marguerite Butler mbutler at hawaii.edu
Mon Mar 24 22:06:33 CET 2008


On Mar 24, 2008, at 11:49 AM, Ben Bolker wrote:

> Peter Cowan wrote:
>> On Mar 23, 2008, at 3:19 PM, Ben Bolker wrote:
>>>>> For my own edification, let me see if I understand what folks are
>>>>> thinking.  There are two aspects of a phylo object under discussion:
>>>>> the node index and the node labels.  As far as node labels are
>>>>> concerned, there is agreement that these need to be arbitrary, with
>>>>> no restrictions on being unique, or even existing.
>>>>>
>>>>> However, there is a discussion about node indices, and whether they
>>>>> should be enforced to be an ordered vector from 1:Nnodes.
>>>>>
>>>>> One argument for keeping them 1:Nnodes is that it is easier to
>>>>> iterate over the nodes this way.  I can see that, but for-each
>>>>> looping in R is easy; is there an example where this would be
>>>>> difficult?
>>>>>
>>>>> One argument for just using a number is that it becomes easier to
>>>>> compare trees, but this requires exposing the node indices to end
>>>>> users.  I'm not sure this has much value for the end users, who
>>>>> generally don't care much about the internal representation of the
>>>>> tree.  Is there a value to developers in having non-consecutive
>>>>> node indices?
>>>>>
>>>>> Steve's proposed solution to the tree comparison/tracking issue is
>>>>> to use node labels (not indices); this would require a richer node
>>>>> label model than the one currently implemented.  I think Steve has a
>>>>> node label data frame in mind.  That would allow unique node label
>>>>> information to sit next to potentially non-unique node label
>>>>> information.  This seems overly complex to me.  A phylo4D object
>>>>> already has a node data data frame, which has unique row names;
>>>>> perhaps this should be used instead?
>>>
>>> Back to where we started -- we can't do this because **data frame
>>> row names are required to be unique**.
>> This will teach me to flippantly throw out an idea without thinking
>> about it carefully; I'm not sure it would even work.  However, the
>> idea was that if we are looking for a unique index by which to
>> compare two trees derived from a larger tree, then the row names of
>> the node data data frame should show the differences, because row
>> names are retained after subsetting.
>
>   If you look at prune.R, you'll see that when pruning (which is exactly
> such a case where we need to keep track of correspondence, so that we
> can drop the appropriate node data), we create a temporary set of
> "tags" to use in matching before vs. after -- assign them to
> the rownames of the node data -- and use them to subset the node data.
>
>    This strategy would work generally -- if we spent a lot of time
> generating such tags I guess I could see a point to saving them
> internally, but I don't think this will change the external API.
> So we would only need to tell people about them if we wanted developers
> to be able to use them.
>
>   Looking at prune.R makes me a little nervous that node labels aren't
> getting handled correctly, but I would want to check carefully before
> I went in there and started breaking stuff ...
>


OK, last comment on this, because I don't seem to be making sense to
people :)

In Ben's example above, all you have to do is use the node numbers  
themselves, rather than generate some temporary tags.
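
As a rough illustration of this point, here is a minimal base R sketch. It is not the actual prune.R code. The toy node data and the `keep` vector are made up, and it assumes that node numbers are kept as they are through the prune rather than being renumbered:

```r
# Toy node data keyed by node number (row names are the node numbers).
# All values here are invented for illustration.
node_data <- data.frame(trait = c(1.2, 3.4, 5.6, 7.8, 9.0),
                        row.names = as.character(1:5))

# Hypothetical outcome of a prune: the node numbers that survive.
keep <- c(1, 2, 5)

# Subset by node number directly; no temporary tags are generated, and
# the surviving row names still show which original nodes were kept.
pruned_data <- node_data[as.character(keep), , drop = FALSE]
print(pruned_data)
```

Because the row names of the result are the original node numbers, a user can check the pruned data against a printout of the original tree.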

More important than developer convenience in this case is user trust.
It is easier for users to double-check that they've really done what
they wanted to do if there is an easily identifiable node number that
they can refer to.  Many users are uncomfortable when they can't
verify.  And remember, most people can't write forensic code to verify
tree manipulations on their own.

Thibaut made some good points and a suggestion:

>> Yes, I realize this. That is why I was insisting earlier that
>> matching shouldn't be done on labels. Either they are labels for
>> convenience or they are not. We can't have it both ways.
>>
> I think we can, and should. One may want his/her data matched to the
> tree according to the tip labels. We can match data using node
> numbers by default, but an option in the constructor should still
> allow one to match data as we did previously. One argument for this
> is that node numbers are tree-dependent, while taxon names are not.
> A user may try to match a single data.frame of traits with several
> different trees, and it would be a pain to have to rename/reorder
> the row.names each time.

I agree: we should match on node numbers by default, with tip labels as
an option.

So just to be clear, my suggestion is to use the node number as the
matching value by default. Users can use the node numbers to associate
their data with the tree. Printing a tree object will show the node
numbers, and there is also a phylo4-to-data.frame function that they
can use to retrieve them.

As an option, users can instead match by species name (tip label).
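
To make the proposal concrete, here is a small base R sketch of matching by node number by default, with tip labels as an option. The function `match_tip_data` and its arguments are hypothetical and are not part of phylobase; the sketch assumes tips are numbered 1..n in the order of the `tip_labels` vector:

```r
# Hypothetical helper (not a phylobase function): reorder trait data so
# that row i corresponds to tip node i.
#   tip_labels: character vector of tip labels, in node-number order
#   dat:        data.frame of traits
#   match_by:   "node" (default) treats row names as node numbers;
#               "label" treats row names as tip labels
match_tip_data <- function(tip_labels, dat, match_by = c("node", "label")) {
  match_by <- match.arg(match_by)
  node <- seq_along(tip_labels)
  key <- if (match_by == "node") as.character(node) else tip_labels
  idx <- match(key, rownames(dat))
  if (anyNA(idx)) stop("some tips have no matching row in the data")
  out <- dat[idx, , drop = FALSE]
  rownames(out) <- node  # result is always keyed by node number
  out
}

tips <- c("A", "B", "C")
by_node  <- data.frame(size = c(10, 20, 30), row.names = c("1", "2", "3"))
by_label <- data.frame(size = c(30, 10, 20), row.names = c("C", "A", "B"))

res_node  <- match_tip_data(tips, by_node)                       # default
res_label <- match_tip_data(tips, by_label, match_by = "label")  # option
```

With label matching, the same trait data.frame can be reused across several trees without renaming or reordering its rows, which is the case Thibaut describes.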

Marguerite


____________________________________________
Marguerite A. Butler
Department of Zoology
University of Hawaii
2538 McCarthy Mall, Edmondson 259
Honolulu, HI  96822

Phone: 808-956-4713
Lab:  808-956-5867
FAX:   808-956-9812
Dept: 808-956-8617
http://www.hawaii.edu/zoology/faculty/butler.html
http://www2.hawaii.edu/~mbutler
http://www.hawaii.edu/zoology/

