[Phylobase-devl] unification of tree data slots
Jim Regetz
regetz at nceas.ucsb.edu
Sat Sep 26 01:32:49 CEST 2009
Hi everyone,
Last but not least, phylo4d now has a single 'data' slot in the
slot-mods branch. The code for tdata, tdata<-, and .phylo4Data is
definitely shorter now. The formatData function is mostly unchanged,
although I did move a small section of code from .phylo4Data to
formatData. I updated the documentation, geospiza.rda, and tests to
match, including reinstating some unit tests that were failing before. I
otherwise only had to make very minimal changes to repair indirect
breakage associated with this set of modifications.
As part of this process, I did make some user-level changes. Details are
below for anyone interested. How should we proceed on determining
whether to apply these changes to trunk? We also have Peter's NA->0
changes in the queue, and if we adopt both, we'll need to coordinate our
merges-to-trunk in a sensible way.
One somewhat superficial question -- I called the new slot 'data'. Any
strong feelings on whether it should be 'tdata' instead?
Cheers,
Jim
User-level changes:
(1) The phyo4d tree construction is slightly different. I re-wrote the
documentation to match the changes (I think), but here is a recap:
- If both tip.data and node.data are supplied, *any* columns with
common names are merged if merge.data=TRUE (the default), and they are
kept separate but the names appended with .tip and .node if
merge.data=FALSE. If columns of different types are merged, standard R
rules will apply, so it's up to the user to make sure the desired
outcome is achieved. Columns with different names will always be kept
separate, of course. Previously, the data were merged only if *all*
columns had the same name and merge.data=TRUE.
(Note: if multiple columns have the same name *within* input data
frame, things get complicated albeit in a predictable way, especially if
these columns are also involved in tip-internal data merging. For now
I'm happy to just leave that as user-beware.)
- The old documentation said that all.data was split up and matched to
tip.data and node.data before being matched to the actual tree node
labels or numbers. I'm not sure if that's truly how it was recently
working, but in any case, now each of tip.data, node.data, and all.data
are separately matched to the tree nodes as specified by function
arguments (all via formatData), then the cleaned-up data frames are
combined as appropriate.
- After the data are combined, any/all rows containing only NA values
are removed before being assigned to the data slot.
- It is now an error if any columns of all.data have the same name as
columns in either tip.data or node.data. If people feel strongly, I
could relax this error and instead e.g. make it so that ".all" is
appended to the offending all.data column name, but (not surprisingly :)
I think the stricter approach is reasonable in this case.
(2) tdata is also a little different:
- It *always* returns a data.frame in which the number of rows is
equal to the number of nodes of the specified type (tip, internal, or
all). If label.type="row.names" and labels exist, the row names are node
labels, otherwise the row names are node numbers. This is exactly as
before if data actually exist, BUT is slightly different if there are no
data. Example:
## let's say phyd has no data
> phyd at data
data frame with 0 columns and 0 rows
## accessor still returns a data frame with rows, just no columns
> tdata(phyd, "tip")
data frame with 0 columns and 4 rows
## and the row names of this otherwise empty data.frame are still
## the labels (or node numbers if no labels)
> row.names(tdata(phyd, "tip"))
[1] "A" "B" "D" "E"
- I changed type option "allnode" to "all", completing this
transition for all methods (which I believe folks agreed to in list
discussion some months ago, and which I also prefer).
(3) Miscellany:
* addData basically acts just like the constructor in how it combines
its multiple data arguments. The resulting data frame is then merged to
the existing @data. If there are column name conflicts, ".old" is
appended to the original column name and ".new" is appended to the new
column name. I'm happy to make this an error (or throw a warning?)
instead, similar to what you get if all.data has a column name conflict
with either tip.data and node.data. I guess I figured the difference is
that it's *really* an error to have name conflicts across arguments
within a single function call (phylo4d constructor), but we can be more
lenient in the case of adding data after the fact (addData method).
* In treePlot, if plot.data=TRUE but the tree !hasTipData, plot.data is
now automatically switched to FALSE with a warning. This "unbreaks" one
of the test cases in test/phylo4dtests.R. An alternative way to achieve
the same result would be to change the default plot.data value from
is(phy, 'phylo4d') to hasTipData, as long as we also added a new
hasTipData method for phylo4 that is hard-coded to return FALSE.
* In checkPhylo4Data, I commented out the part that checks that the
number of data rows matches the number of corresponding nodes. In the
new formulation, you don't need to have data for all (tip or internal)
nodes. The only real @data slot requirement is that all row names in the
data frame correspond to existing node numbers in the tree. If no one
disagrees, we can remove the commented-out code entirely.
* tbind didn't seem to be working in the first place, but I nevertheless
updated it to work using the new slot configuration, mostly by replacing
a direct tip.data access with tdata(x, "tip"), and replacing a check for
node data with hasNodeData. A few other pieces of the function look
slightly fishy (like, just because an object is phylo4d doesn't mean it
actually has data), but I didn't muck around any more.
TODO:
(These can all be post-release, and I'll try to get around to putting
them in the tracker.)
* Make prune(phylo4) a wrapper for prune(phylo4d), rather than the
reverse. Currently prune(phylo4d) has calls prune(phylo4) twice, once to
prune the user-supplied tree, and another time to prune a fully labeled
tree in order to reconstruct which tips/nodes were pruned. (This was
necessary when prune was just a wrapper for ape::drop.tip.) I plan to
make prune(phylo4d) do the work that prune(phylo4) now does, but also
deal with the data (if it exists), then switch prune(phylo4) to wrap
prune(phylo4d).
* Add tipData and nodeData accessor methods for convenience, then
probably change tdata to have default type "all"? This would be similar
to the labels-and-friends accessors.
* Re-examine checkTree and related validity methods to make sure we're
checking everything that should be checked, while not being stricter
than necessary.
* Write some addData unit tests.
More information about the Phylobase-devl
mailing list