[Phylobase-devl] unification of tree data slots

Sat Sep 26 01:32:49 CEST 2009

Hi everyone,

Last but not least, phylo4d now has a single 'data' slot in the 
slot-mods branch. The code for tdata, tdata<-, and .phylo4Data is 
definitely shorter now. The formatData function is mostly unchanged, 
although I did move a small section of code from .phylo4Data to 
formatData. I updated the documentation, geospiza.rda, and tests to 
match, including reinstating some unit tests that were failing before. I 
otherwise only had to make very minimal changes to repair indirect 
breakage associated with this set of modifications.

As part of this process, I did make some user-level changes. Details are 
below for anyone interested. How should we proceed on determining 
whether to apply these changes to trunk? We also have Peter's NA->0 
changes in the queue, and if we adopt both, we'll need to coordinate our 
merges-to-trunk in a sensible way.

One somewhat superficial question -- I called the new slot 'data'. Any 
strong feelings on whether it should be 'tdata' instead?

Cheers,
Jim

User-level changes:

(1) The phylo4d tree construction is slightly different. I re-wrote the 
documentation to match the changes (I think), but here is a recap:

  - If both tip.data and node.data are supplied, *any* columns with 
common names are merged if merge.data=TRUE (the default), and they are 
kept separate but the names appended with .tip and .node if 
merge.data=FALSE. If columns of different types are merged, standard R 
rules will apply, so it's up to the user to make sure the desired 
outcome is achieved. Columns with different names will always be kept 
separate, of course. Previously, the data were merged only if *all* 
columns had the same name and merge.data=TRUE.
     (Note: if multiple columns have the same name *within* an input data 
frame, things get complicated albeit in a predictable way, especially if 
these columns are also involved in tip-internal data merging. For now 
I'm happy to just leave that as user-beware.)

  - The old documentation said that all.data was split up and matched to 
tip.data and node.data before being matched to the actual tree node 
labels or numbers. I'm not sure if that's truly how it was recently 
working, but in any case, now each of tip.data, node.data, and all.data 
is separately matched to the tree nodes as specified by function 
arguments (all via formatData), then the cleaned-up data frames are 
combined as appropriate.

  - After the data are combined, any/all rows containing only NA values 
are removed before being assigned to the data slot.

  - It is now an error if any columns of all.data have the same name as 
columns in either tip.data or node.data. If people feel strongly, I 
could relax this error and instead e.g. make it so that ".all" is 
appended to the offending all.data column name, but (not surprisingly :) 
I think the stricter approach is reasonable in this case.
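
A minimal sketch of the combining rules in (1), using base R only. This is 
not the phylobase code: combine_data and check_all_data are hypothetical 
helper names, and the sketch assumes each input data frame is non-empty 
and already has node numbers as row names (the job formatData does in the 
real constructor).

```r
## Hypothetical helpers, NOT the phylobase implementation.
## Assumes row names are node numbers and the inputs are non-empty.
combine_data <- function(tip.data, node.data, merge.data = TRUE) {
    common <- intersect(names(tip.data), names(node.data))
    if (!merge.data && length(common) > 0) {
        ## keep same-named columns apart by suffixing them
        rename <- function(df, suffix) {
            hit <- names(df) %in% common
            names(df)[hit] <- paste0(names(df)[hit], suffix)
            df
        }
        tip.data <- rename(tip.data, ".tip")
        node.data <- rename(node.data, ".node")
    }
    ## pad each data frame with NA columns so both share the same columns
    all.cols <- union(names(tip.data), names(node.data))
    pad <- function(df) {
        for (col in setdiff(all.cols, names(df))) df[[col]] <- NA
        df[all.cols]
    }
    combined <- rbind(pad(tip.data), pad(node.data))
    ## drop rows that contain only NA values
    combined[rowSums(!is.na(combined)) > 0, , drop = FALSE]
}

## all.data may not share column names with tip.data or node.data
check_all_data <- function(all.data, tip.data, node.data) {
    clash <- intersect(names(all.data),
                       c(names(tip.data), names(node.data)))
    if (length(clash) > 0)
        stop("all.data column(s) also in tip.data/node.data: ",
             paste(clash, collapse = ", "))
    invisible(TRUE)
}

tip.data <- data.frame(size = c(1.2, 3.4, NA),
                       color = c("red", "blue", NA),
                       row.names = c("1", "2", "3"))
node.data <- data.frame(size = c(5.6, 7.8), row.names = c("4", "5"))

merged <- combine_data(tip.data, node.data)                 # one 'size' column
separate <- combine_data(tip.data, node.data, merge.data = FALSE)
names(separate)                                             # size.tip, color, size.node
```

In both calls the row for node 3 is dropped, because every one of its 
values is NA after the data frames are combined.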

(2) tdata is also a little different:

   - It *always* returns a data.frame in which the number of rows is 
equal to the number of nodes of the specified type (tip, internal, or 
all). If label.type="row.names" and labels exist, the row names are node 
labels, otherwise the row names are node numbers. This is exactly as 
before if data actually exist, BUT is slightly different if there are no 
data. Example:

## let's say phyd has no data
 > phyd@data
data frame with 0 columns and 0 rows

## accessor still returns a data frame with rows, just no columns
 > tdata(phyd, "tip")
data frame with 0 columns and 4 rows

## and the row names of this otherwise empty data.frame are still
## the labels (or node numbers if no labels)
 > row.names(tdata(phyd, "tip"))
[1] "A" "B" "D" "E"

   - I changed type option "allnode" to "all", completing this 
transition for all methods (which I believe folks agreed to in list 
discussion some months ago, and which I also prefer).
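
The "always one row per node" behavior in (2) can be sketched in base R as 
follows. expand_data is a hypothetical stand-in for the accessor, not the 
tdata implementation; it assumes the stored data frame uses node numbers as 
row names and may cover only some of the requested nodes.

```r
## Hypothetical stand-in, NOT the phylobase tdata method.
## data:   data frame whose row names are node numbers (may be empty)
## nodes:  node numbers of the requested type
## labels: optional labels for those nodes, in the same order
expand_data <- function(data, nodes, labels = NULL) {
    rn <- if (is.null(labels)) as.character(nodes) else labels
    ## start with one row per node and no columns
    out <- data.frame(row.names = rn)
    ## copy each column, leaving NA where a node has no data
    idx <- match(as.character(nodes), row.names(data))
    for (col in names(data)) out[[col]] <- data[[col]][idx]
    out
}

tips <- 1:4
tip.labels <- c("A", "B", "D", "E")

## no data at all: still four rows, just no columns
res <- expand_data(data.frame(), tips, tip.labels)
dim(res)
row.names(res)

## data for only some tips: missing tips get NA, row names are node numbers
partial <- data.frame(size = c(1.2, 3.4), row.names = c("1", "3"))
res2 <- expand_data(partial, tips)
res2$size
```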

(3) Miscellany:

* addData basically acts just like the constructor in how it combines 
its multiple data arguments. The resulting data frame is then merged to 
the existing @data. If there are column name conflicts, ".old" is 
appended to the original column name and ".new" is appended to the new 
column name. I'm happy to make this an error (or throw a warning?) 
instead, similar to what you get if all.data has a column name conflict 
with either tip.data or node.data. I guess I figured the difference is 
that it's *really* an error to have name conflicts across arguments 
within a single function call (phylo4d constructor), but we can be more 
lenient in the case of adding data after the fact (addData method).

* In treePlot, if plot.data=TRUE but the tree !hasTipData, plot.data is 
now automatically switched to FALSE with a warning. This "unbreaks" one 
of the test cases in test/phylo4dtests.R. An alternative way to achieve 
the same result would be to change the default plot.data value from 
is(phy, 'phylo4d') to hasTipData, as long as we also added a new 
hasTipData method for phylo4 that is hard-coded to return FALSE.

* In checkPhylo4Data, I commented out the part that checks that the 
number of data rows matches the number of corresponding nodes. In the 
new formulation, you don't need to have data for all (tip or internal) 
nodes. The only real @data slot requirement is that all row names in the 
data frame correspond to existing node numbers in the tree. If no one 
disagrees, we can remove the commented-out code entirely.

* tbind didn't seem to be working in the first place, but I nevertheless 
updated it to work using the new slot configuration, mostly by replacing 
a direct tip.data access with tdata(x, "tip"), and replacing a check for 
node data with hasNodeData. A few other pieces of the function look 
slightly fishy (like, just because an object is phylo4d doesn't mean it 
actually has data), but I didn't muck around any more.
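
Two of the points above lend themselves to a quick base R illustration: 
the ".old"/".new" suffixing on column name conflicts, and the relaxed 
validity rule that row names only need to be existing node numbers. This 
is a sketch, not the addData or checkPhylo4Data code; it assumes base 
merge() is an acceptable stand-in for the merging step, and 
rows_match_nodes is a hypothetical helper name.

```r
## Sketch only: base merge() standing in for the addData merge step.
old <- data.frame(size = c(1.2, 3.4), row.names = c("1", "2"))
new <- data.frame(size = c(9.9, 8.8), mass = c(10, 20),
                  row.names = c("2", "3"))

## merge on row names, keep all rows, suffix the conflicting 'size' column
merged <- merge(old, new, by = "row.names", all = TRUE,
                suffixes = c(".old", ".new"))
row.names(merged) <- as.character(merged$Row.names)
merged$Row.names <- NULL
names(merged)    # size.old, size.new, mass

## Hypothetical validity check: every row name must be an existing node
## number; there is no requirement that every node has a row.
rows_match_nodes <- function(data, node.ids) {
    all(row.names(data) %in% as.character(node.ids))
}
rows_match_nodes(merged, 1:5)   # rows 1-3 all exist in a 5-node tree
rows_match_nodes(merged, 1:2)   # row "3" has no matching node
```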

TODO:

(These can all be post-release, and I'll try to get around to putting 
them in the tracker.)

* Make prune(phylo4) a wrapper for prune(phylo4d), rather than the 
reverse. Currently prune(phylo4d) calls prune(phylo4) twice, once to 
prune the user-supplied tree, and another time to prune a fully labeled 
tree in order to reconstruct which tips/nodes were pruned. (This was 
necessary when prune was just a wrapper for ape::drop.tip.) I plan to 
make prune(phylo4d) do the work that prune(phylo4) now does, but also 
deal with the data (if it exists), then switch prune(phylo4) to wrap 
prune(phylo4d).

* Add tipData and nodeData accessor methods for convenience, then 
probably change tdata to have default type "all"? This would be similar 
to the labels-and-friends accessors.

* Re-examine checkTree and related validity methods to make sure we're 
checking everything that should be checked, while not being stricter 
than necessary.

* Write some addData unit tests.