[Phylobase-devl] unification of tree data slots

Sat Sep 26 06:59:33 CEST 2009

On Sep 25, 2009, at 4:32 PM, Jim Regetz wrote:

> Hi everyone,
>
> Last but not least, phylo4d now has a single 'data' slot in the
> slot-mods branch. The code for tdata, tdata<-, and .phylo4Data is
> definitely shorter now. The formatData function is mostly unchanged,
> although I did move a small section of code from .phylo4Data to
> formatData. I updated the documentation, geospiza.rda, and tests to
> match, including reinstating some unit tests that were failing  
> before. I
> otherwise only had to make very minimal changes to repair indirect
> breakage associated with this set of modifications.
>
> As part of this process, I did make some user-level changes. Details  
> are
> below for anyone interested. How should we proceed on determining
> whether to apply these changes to trunk? We also have Peter's NA->0
> changes in the queue, and if we adopt both, we'll need to coordinate  
> our
> merges-to-trunk in a sensible way.

Yeah, I think that the NA->0 involves fewer changes, so whether that  
should be first or second, I'm not sure.  I guessing that the merge  
will actually be straight forward with only one conflict  
(geospiza.rda, because it's a binary file there's no simple merge).

> One somewhat superficial question -- I called the new slot 'data'. Any
> strong feelings on whether it should be 'tdata' instead?

+1 for data

> Cheers,
> Jim
>
>
> User-level changes:
>
> (1) The phyo4d tree construction is slightly different. I re-wrote the
> documentation to match the changes (I think), but here is a recap:
>
> - If both tip.data and node.data are supplied, *any* columns with
> common names are merged if merge.data=TRUE (the default), and they are
> kept separate but the names appended with .tip and .node if
> merge.data=FALSE. If columns of different types are merged, standard R
> rules will apply, so it's up to the user to make sure the desired
> outcome is achieved. Columns with different names will always be kept
> separate, of course. Previously, the data were merged only if *all*
> columns had the same name and merge.data=TRUE.
>    (Note: if multiple columns have the same name *within* input data
> frame, things get complicated albeit in a predictable way,  
> especially if
> these columns are also involved in tip-internal data merging. For now
> I'm happy to just leave that as user-beware.)
>
> - The old documentation said that all.data was split up and matched to
> tip.data and node.data before being matched to the actual tree node
> labels or numbers. I'm not sure if that's truly how it was recently
> working, but in any case, now each of tip.data, node.data, and  
> all.data
> are separately matched to the tree nodes as specified by function
> arguments (all via formatData), then the cleaned-up data frames are
> combined as appropriate.
>
> - After the data are combined, any/all rows containing only NA values
> are removed before being assigned to the data slot.
>
> - It is now an error if any columns of all.data have the same name as
> columns in either tip.data or node.data. If people feel strongly, I
> could relax this error and instead e.g. make it so that ".all" is
> appended to the offending all.data column name, but (not  
> surprisingly :)
> I think the stricter approach is reasonable in this case.
>
> (2) tdata is also a little different:
>
>  - It *always* returns a data.frame in which the number of rows is
> equal to the number of nodes of the specified type (tip, internal, or
> all). If label.type="row.names" and labels exist, the row names are  
> node
> labels, otherwise the row names are node numbers. This is exactly as
> before if data actually exist, BUT is slightly different if there  
> are no
> data. Example:
>
> ## let's say phyd has no data
>> phyd at data
> data frame with 0 columns and 0 rows
>
> ## accessor still returns a data frame with rows, just no columns
>> tdata(phyd, "tip")
> data frame with 0 columns and 4 rows
>
> ## and the row names of this otherwise empty data.frame are still
> ## the labels (or node numbers if no labels)
>> row.names(tdata(phyd, "tip"))
> [1] "A" "B" "D" "E"
>
>  - I changed type option "allnode" to "all", completing this
> transition for all methods (which I believe folks agreed to in list
> discussion some months ago, and which I also prefer).
>
> (3) Miscellany:
>
> * addData basically acts just like the constructor in how it combines
> its multiple data arguments. The resulting data frame is then merged  
> to
> the existing @data. If there are column name conflicts, ".old" is
> appended to the original column name and ".new" is appended to the new
> column name. I'm happy to make this an error (or throw a warning?)
> instead, similar to what you get if all.data has a column name  
> conflict
> with either tip.data and node.data. I guess I figured the difference  
> is
> that it's *really* an error to have name conflicts across arguments
> within a single function call (phylo4d constructor), but we can be  
> more
> lenient in the case of adding data after the fact (addData method).
>
> * In treePlot, if plot.data=TRUE but the tree !hasTipData, plot.data  
> is
> now automatically switched to FALSE with a warning. This "unbreaks"  
> one
> of the test cases in test/phylo4dtests.R. An alternative way to  
> achieve
> the same result would be to change the default plot.data value from
> is(phy, 'phylo4d') to hasTipData, as long as we also added a new
> hasTipData method for phylo4 that is hard-coded to return FALSE.

why not plot.data = is(phy, 'phylo4d) && hasTipData.  I don't think we  
need to warn when users try to plot data that isn't there.   Though  
this wouldn't handle the case of a user setting the plot.data=TRUE  
even when there is no data (which the first solution does).   Perhaps  
both the first change and the class + data check are in order.

> * In checkPhylo4Data, I commented out the part that checks that the
> number of data rows matches the number of corresponding nodes. In the
> new formulation, you don't need to have data for all (tip or internal)
> nodes. The only real @data slot requirement is that all row names in  
> the
> data frame correspond to existing node numbers in the tree. If no one
> disagrees, we can remove the commented-out code entirely.
>
> * tbind didn't seem to be working in the first place, but I  
> nevertheless
> updated it to work using the new slot configuration, mostly by  
> replacing
> a direct tip.data access with tdata(x, "tip"), and replacing a check  
> for
> node data with hasNodeData. A few other pieces of the function look
> slightly fishy (like, just because an object is phylo4d doesn't mean  
> it
> actually has data), but I didn't muck around any more.
>
> TODO:
>
> (These can all be post-release, and I'll try to get around to putting
> them in the tracker.)
>
> * Make prune(phylo4) a wrapper for prune(phylo4d), rather than the
> reverse. Currently prune(phylo4d) has calls prune(phylo4) twice,  
> once to
> prune the user-supplied tree, and another time to prune a fully  
> labeled
> tree in order to reconstruct which tips/nodes were pruned. (This was
> necessary when prune was just a wrapper for ape::drop.tip.) I plan to
> make prune(phylo4d) do the work that prune(phylo4) now does, but also
> deal with the data (if it exists), then switch prune(phylo4) to wrap
> prune(phylo4d).
>
> * Add tipData and nodeData accessor methods for convenience, then
> probably change tdata to have default type "all"? This would be  
> similar
> to the labels-and-friends accessors.

+ 1

> * Re-examine checkTree and related validity methods to make sure we're
> checking everything that should be checked, while not being stricter
> than necessary.
>
> * Write some addData unit tests.
>
> _______________________________________________
> Phylobase-devl mailing list
> Phylobase-devl at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl