[Phylobase-devl] unification of tree data slots
Jim Regetz
regetz at nceas.ucsb.edu
Thu Sep 17 00:17:01 CEST 2009
Hi all,
Promises, promises. I know I swore off any more pre-release proposals,
but because this involves class modification, I see this as a speak now
or forever hold my peace sort of thing!
While fleshing out unit tests, I ran into some minor issues involving
tdata. What I thought would be quick fixes raised thorny questions about
how to coordinate node.data and tip.data more robustly, and I kept
coming back to the idea that one structural change in the phylo4d class
would do wonders for simplifying tree data management:
What would folks think about having a single 'data' slot (it could be
called something else), rather than keeping 'tip.data' and 'node.data'
in separate slots?
This basically amounts to the difference between maintaining data in two
places (then having to merge every time all data are needed) versus
maintaining a single data frame (then simply extracting appropriate rows
whenever someone requests tip or node data). I can come up with a bunch
of reasons to make this change, mostly revolving around the fact that
this merge step is a minefield:
* Putting the tip and node data together involves decisions. Should
tip.data and node.data columns with the same name be treated as the same
variable? What if there are also two columns with that name *within*
tip.data? What if the data types are different? These are decisions that
need to be made based on user input at the time of construction. In
fact, some of these decisions *are* guided by inputs to the phylo4d
constructor, and the two slots are populated in a way that tries to
preserve those rules. But I see this is fragile. It's easy to break
things afterwards by modifying the slot contents in valid ways, and even
if the slots haven't been modified, the merge can be imperfect (see
below). I'd much prefer to see a data assembly step happen once during
construction (and as needed for data addition and/or replacement), with
those decisions unambiguously reflected in the structure of that single,
unified table.
* The only requirement of the tip.data and node.data slots is that they
are data frames in which row.names exactly match the IDs of tips and
internal nodes, respectively. Merging these can have unexpected side
effects, making this a hard-to-debug process. Some examples:
1. Merge discards row names and makes new "automatic" ones, which is
dangerous in our case because the row names are node IDs! It usually
works because the automatic row names are 1:N, which matches how we
currently produce node IDs. But it doesn't work now if the tip.data
and/or node.data rows are not in node order (this is perfectly legal),
and of course it won't work if the scheme for numbering IDs ever changes
(I know, not a huge deal, but still...). This is a good reason to
assemble and validate the data frame up front, rather than doing it on
the fly.
2. The current approach can produce column name inconsistencies in
what tdata() returns. Below, note the difference in names between the
tip/internal vs all cases:
phy <- phylo4(rtree(3))
phyd <- phylo4d(phy, tip.data=data.frame(a=1:3),
node.data=data.frame(a=4:5), match.data=FALSE, merge.data=FALSE)
tdata(phyd, "tip")
## a a
## t3 1 <NA>
## t1 2 <NA>
## t2 3 <NA>
tdata(phyd, "internal")
## a a
## 4 NA d
## 5 NA e
tdata(phyd, "all")
## a a.1
## t3 1 <NA>
## t1 2 <NA>
## t2 3 <NA>
## 4 NA d
## 5 NA e
3. The current approach can produce factor/character inconsistencies.
In particular, if you supply a constructor method node.data containing a
factor column, along with either tip.data or all.data, the factor gets
converted to character:
phy <- phylo4(rtree(3))
phyd <- phylo4d(phy, tip.data=data.frame(tdt=letters[1:3]),
node.data=data.frame(ndt=letters[4:5]), match.data=FALSE)
sapply(tdata(phyd), class)
## tdt ndt
## "factor" "character"
* In my mind, tip data and internal data may have different scientific
interpretations, but are otherwise equivalent. Node data (read: data for
both tip and internal nodes) are node data, regardless of what kind of
nodes we're talking about. Keeping them separate adds coding complexity.
Of course, for convenience, we absolutely should continue to provide
separate programmatic access to tip vs internal data via the arguments
of tdata() and related methods.
Other than the fact that it will take some programmer time (I am happy
to contribute as much as is necessary), the only possible disadvantage I
can think of is that we'd explicitly have to store NAs in cases where a
particular data column (or any data at all) is missing for one of the
node types. But at worst this means a doubling of the amount of data
stored. For what it's worth, I think the constructor methods essentially
do this already (i.e., creating NAs columns for variables that don't
exist for one node type but do for the other).
Practicalities: This would require some modification of the following:
tdata, addData, formatData, .phylo4Data, and checkPhylo4Data. On the
construction side, I think most of the necessary logic is somewhere in
those functions already. As for the tdata accessor, it's just a matter
of replacing the current merges with row indexing to retrieve the
appropriate data. On the whole I think this change would make these
functions *simpler*. Meanwhile, because tdata would still return the
same thing as before (but with even greater consistency!), the change
would have minimal downstream effects -- by my count, direct tip.data or
node.data slot access occurs only 6 times in the code, ignoring the
functions above, and these could all be fixed with trivial changes. We'd
also have to update geospiza, the docs, and tests, naturally.
Addendum: In case anyone else's mind happens to wander in this
direction, yes, I think a similar argument could be made for combining
the slots for tip and internal _labels_ into a single label slot,
because each label is now unambiguously identified by its name (node
ID). Seems like the separation is a historical artifact? Combining them
would simplify the corresponding accessor/replace methods, which
currently have to look conditionally in either tip.label or node.label
depending on the arguments. And it wouldn't be hard at all to make this
change in the code base. Of course, I'm not going to ask for the moon
*and* the stars, but if someone else proposed it... :)
And, shoot, while I'm thinking about slots, I can't resist: what purpose
does the Nnode slot serve?
Thanks,
Jim
More information about the Phylobase-devl
mailing list