[Phylobase-devl] Conference call minutes
Steve Kembel
skembel at berkeley.edu
Mon Mar 24 23:34:15 CET 2008
Hi,
>>> Back to where we started -- we can't do this because **data frame
>>> row names are required to be unique**.
>> This will teach me to flippantly throw out an idea without
>> thinking about it carefully; I'm not sure it would even work.
>> However, the idea was, that if we are looking for a unique index by
>> which to compare two trees derived from a larger tree, then the row
>> names of the node data data frame should show the differences,
>> because row names are retained after subsetting.
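For reference, both properties of data frame row names mentioned above
(they must be unique, and they are retained after subsetting) can be
checked in base R. This is only a minimal sketch with made-up labels; it
does not use phylobase at all:

```r
# Toy node data; the labels are made up for illustration.
nodes <- data.frame(value = c(1.2, 3.4, 5.6, 7.8),
                    row.names = c("speciesA", "speciesB", "speciesC", "nodeX"))

# Row names survive subsetting, so they identify which rows were kept.
sub <- nodes[c(2, 4), , drop = FALSE]
rownames(sub)   # "speciesB" "nodeX"

# Duplicate row names are rejected when the data frame is built.
dup <- try(data.frame(value = 1:2, row.names = c("nodeX", "nodeX")),
           silent = TRUE)
inherits(dup, "try-error")   # TRUE
```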
Pruning while keeping node labels seems to work properly, and so could
be used to match nodes among subsets of a larger tree:
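    library(phylobase)
    example(phylo4)                # the example code defines 'mytree'
    mytree                         # print the full tree
    prune(mytree, tip="speciesA")  # drop one tip; node labels are kept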
I think I've lost track of what I'm arguing in favour of! :) But let
me reiterate that the current problem with requiring unique row.names
in phylo4d data is just a consequence of the initial (arbitrary)
decision to use tree tip and internal node labels as the row names of
the phylo4d data frames. This would be solved very easily by switching
to a system where the rows of the phylo4d data frames have no names
(they are simply in the same order as the edge matrix), with a separate
list or vector holding the label for each row of the data.frame. I'm not
proposing yet another set of labels, just that we maintain the status
quo of allowing users to use labels to match trees and data if they
want, regardless of how we decide to index nodes internally.
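To make the idea concrete, here is a rough base R sketch of unnamed data
rows plus a parallel label vector. The object names (node_data,
node_labels) and the values are hypothetical, and this is not how
phylo4d currently stores anything:

```r
# Hypothetical layout: data rows are unnamed and follow edge-matrix order.
node_data <- data.frame(trait = c(0.5, 1.5, 2.5, 3.5, 4.5))

# Labels live in a parallel vector, so missing or repeated labels are allowed.
node_labels <- c("speciesA", "speciesB", NA, "Rosaceae", "Rosaceae")

# Internal indexing is by row position.
node_data[4, "trait"]   # 3.5

# Label-based matching stays available as an option.
wanted <- c("speciesB", "speciesA")
by_label <- node_data[match(wanted, node_labels), , drop = FALSE]
by_label$trait   # 1.5 0.5
```

Because the labels are no longer row names, the uniqueness requirement
on data frame row names simply does not apply to them.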
>>>> Yes, I realize this. That is why I was insisting earlier that
>>>> matching shouldn't be done on labels. Either they are labels for
>>>> convenience or they are not. We can't have it both ways.
>>> I think we can, and should. One may want his/her data matched to
>>> the tree according to the tip labels. We can match data using node
>>> numbers by default, but an option in the constructor should still
>>> allow one to match data as we did previously. One argument for
>>> this is that node numbers are tree-dependent, while taxon names
>>> are not. A user may try to match a single data.frame of traits
>>> with several different trees, and it would be a pain to have to
>>> rename/reorder the row.names each time.
>> I agree, we should match on node numbers by default and tip labels
>> as an option.
>> So just to be clear, my suggestion is to use node number as the
>> matching value by default. Users can use the node numbers to
>> associate their data with the tree. Printing a tree object will
>> dump the node numbers. Alternatively, there is a phylo4 to
>> data.frame function that they can use to retrieve the node numbers.
My own bias is that I almost always match trees and data using tip
labels, so I would suggest we make matching tip data by labels and node
data by index/number the defaults (which is the current behaviour).
Here are the use cases I think are going to be common:
1) I have a species tree and a trait data set including some, all, or
a superset of those species. I need to be able to match the trait data
to the tips of the tree using the names of the species, which could be
provided as the row.names of the trait data.frame, or in a column of
the trait data.frame.
2) I have a big supertree where all the tips and some but not all of
the internal nodes have labels (i.e. a species tree with family names
on some nodes). I want to prune the tree to match my species-level
trait data set, do some analyses of the tip data, and plot the tree
with node labels intact for publication purposes. So no need to match
data to the internal node labels, but I do want to match species names
between tree and data, and I want to keep the internal node labels
around for when I plot my tree, or to allow me to do things like prune
the tree to just include particular families.
3) I have data associated with internal nodes from some external
software (bootstrap values, or ancestral trait values) and need to
match that data to the tree, but I don't have any meaningful labels
for the internal nodes. If the external software lists internal node
data in Newick string traversal order, I can attach the internal node
data without matching to node labels
(phylo4d(phy,node.data=dat,use.node.names=FALSE)), and attach the tip
data either with or without matching to labels.
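As a rough illustration of use cases 1 and 3, the matching logic can be
sketched in base R without any phylobase calls. Everything here
(tip_labels, traits, bootstrap and their values) is made up for the
example, and the real constructor may well handle these steps
differently:

```r
# Hypothetical tip labels, in the order the tree stores its tips.
tip_labels <- c("speciesA", "speciesB", "speciesC")

# Trait data in a different order: it includes a species that is not in
# the tree (speciesZ) and has no entry for speciesB.
traits <- data.frame(
  species = c("speciesC", "speciesZ", "speciesA"),
  height  = c(3.0, 9.9, 1.0),
  stringsAsFactors = FALSE
)

# Use case 1: align rows to the tips by name. Tips without data become NA,
# and species that are not in the tree are dropped.
tip_data <- traits[match(tip_labels, traits$species), "height", drop = FALSE]
rownames(tip_data) <- tip_labels
tip_data$height   # 1 NA 3

# Use case 3: node values from external software, assumed to be in tree
# traversal order already, are attached by position with no label matching.
bootstrap <- c(100, 87)
node_data <- data.frame(node = seq_along(bootstrap), bootstrap = bootstrap)
```

If the species names were supplied as row.names instead of a column,
the same match() call would work on rownames(traits).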
Cheers,
Steve