[Phylobase-devl] unification of tree data slots

Thu Sep 17 00:17:01 CEST 2009

Hi all,

Promises, promises. I know I swore off any more pre-release proposals, 
but because this involves class modification, I see this as a speak now 
or forever hold my peace sort of thing!

While fleshing out unit tests, I ran into some minor issues involving 
tdata. What I thought would be quick fixes raised thorny questions about 
how to coordinate node.data and tip.data more robustly, and I kept 
coming back to the idea that one structural change in the phylo4d class 
would do wonders for simplifying tree data management:

What would folks think about having a single 'data' slot (it could be 
called something else), rather than keeping 'tip.data' and 'node.data' 
in separate slots?

This basically amounts to the difference between maintaining data in two 
places (then having to merge every time all data are needed) versus 
maintaining a single data frame (then simply extracting appropriate rows 
whenever someone requests tip or node data). I can come up with a bunch 
of reasons to make this change, mostly revolving around the fact that 
this merge step is a minefield:

* Putting the tip and node data together involves decisions. Should 
tip.data and node.data columns with the same name be treated as the same 
variable? What if there are also two columns with that name *within* 
tip.data? What if the data types are different? These are decisions that 
need to be made based on user input at the time of construction. In 
fact, some of these decisions *are* guided by inputs to the phylo4d 
constructor, and the two slots are populated in a way that tries to 
preserve those rules. But I see this is fragile. It's easy to break 
things afterwards by modifying the slot contents in valid ways, and even 
if the slots haven't been modified, the merge can be imperfect (see 
below). I'd much prefer to see a data assembly step happen once during 
construction (and as needed for data addition and/or replacement), with 
those decisions unambiguously reflected in the structure of that single, 
unified table.

* The only requirement of the tip.data and node.data slots is that they 
are data frames in which row.names exactly match the IDs of tips and 
internal nodes, respectively. Merging these can have unexpected side 
effects, making this a hard-to-debug process. Some examples:
   1. Merge discards row names and makes new "automatic" ones, which is 
dangerous in our case because the row names are node IDs! It usually 
works because the automatic row names are 1:N, which matches how we 
currently produce node IDs. But it doesn't work now if the tip.data 
and/or node.data rows are not in node order (this is perfectly legal), 
and of course it won't work if the scheme for numbering IDs ever changes 
(I know, not a huge deal, but still...). This is a good reason to 
assemble and validate the data frame up front, rather than doing it on 
the fly.
   2. The current approach can produce column name inconsistencies in 
what tdata() returns. Below, note the difference in names between the 
tip/internal vs all cases:

phy <- phylo4(rtree(3))
phyd <- phylo4d(phy, tip.data=data.frame(a=1:3),
     node.data=data.frame(a=4:5), match.data=FALSE, merge.data=FALSE)
tdata(phyd, "tip")
##    a    a
## t3 1 <NA>
## t1 2 <NA>
## t2 3 <NA>
tdata(phyd, "internal")
##    a a
## 4 NA d
## 5 NA e
tdata(phyd, "all")
##     a  a.1
## t3  1 <NA>
## t1  2 <NA>
## t2  3 <NA>
## 4  NA    d
## 5  NA    e

   3. The current approach can produce factor/character inconsistencies. 
In particular, if you supply a constructor method node.data containing a 
factor column, along with either tip.data or all.data, the factor gets 
converted to character:

phy <- phylo4(rtree(3))
phyd <- phylo4d(phy, tip.data=data.frame(tdt=letters[1:3]),
     node.data=data.frame(ndt=letters[4:5]), match.data=FALSE)
sapply(tdata(phyd), class)
##         tdt         ndt
##    "factor" "character"

* In my mind, tip data and internal data may have different scientific 
interpretations, but are otherwise equivalent. Node data (read: data for 
both tip and internal nodes) are node data, regardless of what kind of 
nodes we're talking about. Keeping them separate adds coding complexity. 
Of course, for convenience, we absolutely should continue to provide 
separate programmatic access to tip vs internal data via the arguments 
of tdata() and related methods.

Other than the fact that it will take some programmer time (I am happy 
to contribute as much as is necessary), the only possible disadvantage I 
can think of is that we'd explicitly have to store NAs in cases where a 
particular data column (or any data at all) is missing for one of the 
node types. But at worst this means a doubling of the amount of data 
stored. For what it's worth, I think the constructor methods essentially 
do this already (i.e., creating NAs columns for variables that don't 
exist for one node type but do for the other).

Practicalities: This would require some modification of the following: 
tdata, addData, formatData, .phylo4Data, and checkPhylo4Data. On the 
construction side, I think most of the necessary logic is somewhere in 
those functions already. As for the tdata accessor, it's just a matter 
of replacing the current merges with row indexing to retrieve the 
appropriate data. On the whole I think this change would make these 
functions *simpler*. Meanwhile, because tdata would still return the 
same thing as before (but with even greater consistency!), the change 
would have minimal downstream effects -- by my count, direct tip.data or 
node.data slot access occurs only 6 times in the code, ignoring the 
functions above, and these could all be fixed with trivial changes. We'd 
also have to update geospiza, the docs, and tests, naturally.

Addendum: In case anyone else's mind happens to wander in this 
direction, yes, I think a similar argument could be made for combining 
the slots for tip and internal _labels_ into a single label slot, 
because each label is now unambiguously identified by its name (node 
ID). Seems like the separation is a historical artifact? Combining them 
would simplify the corresponding accessor/replace methods, which 
currently have to look conditionally in either tip.label or node.label 
depending on the arguments. And it wouldn't be hard at all to make this 
change in the code base. Of course, I'm not going to ask for the moon 
*and* the stars, but if someone else proposed it... :)

And, shoot, while I'm thinking about slots, I can't resist: what purpose 
does the Nnode slot serve?

Thanks,
Jim