[Phylobase-devl] Conference call minutes

Mon Mar 24 23:34:15 CET 2008

Hi,

>>> Back to where we started -- we can't do this because **data frame
>>> row names are required to be unique**.
>> This will teach me to flippantly throw out an idea without
>> thinking about it carefully; I'm not sure it would even work.
>> However, the idea was that if we are looking for a unique index by
>> which to compare two trees derived from a larger tree, then the row
>> names of the node-data data frame should show the differences,
>> because row names are retained after subsetting.

Pruning while keeping node labels seems to work properly, and so could  
be used to match nodes among subsets of a larger tree:
    example(phylo4)
    mytree
    prune(mytree, tip="speciesA")
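
The row-name behaviour under discussion can also be checked in plain base R, without phylobase. This is only a sketch: the data frame and its labels are made up for illustration, and the row names merely stand in for node labels.

```r
# Hypothetical node data; row names stand in for tip and node labels.
node_data <- data.frame(trait = c(1.2, 3.4, 5.6, 7.8),
                        row.names = c("speciesA", "speciesB", "speciesC", "N1"))

# Subsetting keeps the original row names, so they still identify the rows.
sub <- node_data[c("speciesB", "N1"), , drop = FALSE]
rownames(sub)

# Assigning duplicate row names fails, which is the constraint quoted above.
dup <- try(rownames(node_data) <- c("N1", "N1", "speciesC", "speciesA"),
           silent = TRUE)
inherits(dup, "try-error")
```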

I think I've lost track of what I'm arguing in favour of! :) But let  
me reiterate that the current problem with unique phylo4d data  
row.names is just a consequence of the initial (arbitrary) decision to  
use tree tip and internal node labels as the names of the rows of the  
phylo4d data frames. This will be solved very easily by switching to a  
system where the rows of the phylo4d data frames have no names (they  
are in the same order as the edge matrix), but there is a list or  
vector of labels associated with each row of the data.frame. I'm not  
proposing yet another set of labels, just that we maintain the status  
quo of allowing users to use labels to match trees and data if they  
want, regardless of how we decide to index nodes internally.
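
To make the proposal concrete, here is a rough base R sketch of unnamed data rows kept in edge-matrix order, with labels held in a separate vector. It is not phylobase code: the edge matrix, the trait values, and the names edge, node_data and node_labels are all invented for illustration, and I am assuming one data row per row of the edge matrix.

```r
# Hypothetical edge matrix (ancestor, descendant) for a 3-tip tree; node 4 is the root.
edge <- matrix(c(4, 5,
                 5, 1,
                 5, 2,
                 4, 3),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("ancestor", "descendant")))

# Data rows carry no names; row i describes the node in edge[i, "descendant"].
node_data <- data.frame(trait = c(NA, 0.5, 0.7, 1.1))

# Labels live in a separate vector in the same order. Unlike row names,
# they may be missing or repeated.
node_labels <- c(NA, "speciesA", "speciesB", "speciesA")

# Look up rows by label without relying on row names.
hits <- which(node_labels == "speciesA")
node_data[hits, , drop = FALSE]
```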

>>>> Yes, I realize this. That is why I was insisting earlier that  
>>>> matching shouldn't be done on labels. Either they are labels for  
>>>> convenience or they are not. We can't have it both ways.
>>> I think we can, and should. One may want his/her data matched to  
>>> the tree according to the tip labels. We can match data using node  
>>> numbers by default, but an option in the constructor should still  
>>> allow one to match data as we did previously. One argument for  
>>> this is that node numbers are tree-dependent, while taxon names  
>>> are not. A user may try to match a single data.frame of traits
>>> with several different trees, and it would be a pain to have to  
>>> rename/reorder the row.names each time.
>> I agree, we should match on node numbers by default and tip labels  
>> as an option.
>> So just to be clear, my suggestion is to use node number as the  
>> matching value by default. Users can use the node numbers to  
>> associate their data with the tree. Printing a tree object will  
>> dump the node numbers. Alternatively, there is a phylo4 to  
>> data.frame function that they can use to retrieve the node numbers.

My own bias is that I almost always match trees and data using tip
labels, so I would suggest that the defaults be to match tip data by
label and node data by index/number (which is the current behaviour).
Here are the use cases I think are going to be common:
1) I have a species tree and a trait data set including some, all, or  
a superset of those species. I need to be able to match the trait data  
to the tips of the tree using the names of the species, which could be  
provided as the row.names of the trait data.frame, or in a column of  
the trait data.frame.
2) I have a big supertree where all the tips and some but not all of  
the internal nodes have labels (e.g. a species tree with family names
on some nodes). I want to prune the tree to match my species-level  
trait data set, do some analyses of the tip data, and plot the tree  
with node labels intact for publication purposes. So no need to match  
data to the internal node labels, but I do want to match species names  
between tree and data, and I want to keep the internal node labels  
around for when I plot my tree, or to allow me to do things like prune  
the tree to just include particular families.
3) I have data associated with internal nodes from some external  
software (bootstrap values, or ancestral trait values) and need to  
match that data to the tree, but I don't have any meaningful labels  
for the internal nodes. If the external software lists internal node  
data in Newick string traversal order, I can attach the internal node  
data without matching to node labels  
(phylo4d(phy,node.data=dat,use.node.names=FALSE)), and attach the tip  
data either with or without matching to labels.
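
Use cases 1 and 3 above can be sketched in base R as follows. This only illustrates the two matching styles (by label versus by position). It does not use the phylo4d constructor, and the trait table, tip labels, node numbers and the helper match_by_label are all hypothetical.

```r
# Hypothetical trait table covering a superset of the species in any one tree.
traits <- data.frame(species = c("speciesA", "speciesB", "speciesC", "speciesD"),
                     mass = c(10, 20, 30, 40),
                     stringsAsFactors = FALSE)

# Tip labels of two hypothetical trees, each in its own tip order.
tips_tree1 <- c("speciesC", "speciesA")
tips_tree2 <- c("speciesB", "speciesD", "speciesE")

# Use case 1: reorder the trait rows to follow a tree's tip labels.
# Tips with no trait data get NA for every trait.
match_by_label <- function(tip_labels, traits) {
  matched <- traits[match(tip_labels, traits$species), , drop = FALSE]
  matched$species <- tip_labels
  rownames(matched) <- NULL
  matched
}

m1 <- match_by_label(tips_tree1, traits)
m2 <- match_by_label(tips_tree2, traits)

# Use case 3: unlabelled internal nodes, data attached purely by position.
node_ids  <- c(4, 5)       # internal node numbers, in traversal order
bootstrap <- c(100, 87)    # values from external software, in the same order
node_dat  <- data.frame(node = node_ids, bootstrap = bootstrap)
```

The same trait table is reused for both trees without renaming or reordering its rows, which is the convenience argued for earlier in the thread.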

Cheers,
Steve