[Phylobase-devl] unification of tree data slots

Thu Sep 17 03:32:06 CEST 2009

On Sep 16, 2009, at 5:47 PM, Ben Bolker wrote:

>
> To the best of my recollection, our reason for keeping them separate
> was that we thought that the majority of trees would have tip data  
> only,
> and that having node data would typically double the amount of data
> storage ... ?  or am I missing something here?

Ben, that's one issue, and there are some related ones.  See my  
comments below.

> François Michonneau wrote:
>>
>> Hi Jim,
>>
>> I totally agree with you. After rewriting the phylo4d constructor it
>> seemed to me that the distinction in the code between tip and node  
>> data
>> was unnecessary. It also seems to me that the advantages of having a
>> single slots for the data outweighs the annoyances it could create.
>>
>> I am also in favor on having a single slots for the labels. As for  
>> the
>> data, the same functions are used to build them and match them, so it
>> would be natural to have them in a single place.
>>
>> Finally, I think the Nnode slot was there to match the structure of
>> phylo objects. But, it seems superfluous with the tests that S4  
>> methods
>> bring.

That's my recollection also, +1 for removing it.

>> I'm quite busy but would be keen on finding some time to help
>> implementing these changes if other people agree...
>>
>> Cheers,
>>
>> -- François
>>
>> On Wed, 2009-09-16 at 15:17 -0700, Jim Regetz wrote:
>>> Hi all,
>>>
>>> Promises, promises. I know I swore off any more pre-release  
>>> proposals,
>>> but because this involves class modification, I see this as a  
>>> speak now
>>> or forever hold my peace sort of thing!

Please, keep the proposed improvements coming!  That said, I'm not  
sure this is the last chance to make this type of change.  There are  
good accessors to both of these slots, so we should be able to change  
what is really an implementation detail at a later date.  I know it  
wishful thinking, but people shouldn't complain about class changes,  
if we give them a way to be future proofed.

The other thing lurking in the background here is the implementation  
of pdata which is/was to be a class for storing data for trees.  It  
would be capable of handling more complicated data types (such as a  
vector for a tip trait instead of a single value.  It never really got  
going, and I don't know what it's future is...

>>> While fleshing out unit tests, I ran into some minor issues  
>>> involving
>>> tdata. What I thought would be quick fixes raised thorny questions  
>>> about
>>> how to coordinate node.data and tip.data more robustly, and I kept
>>> coming back to the idea that one structural change in the phylo4d  
>>> class
>>> would do wonders for simplifying tree data management:
>>>
>>> What would folks think about having a single 'data' slot (it could  
>>> be
>>> called something else), rather than keeping 'tip.data' and  
>>> 'node.data'
>>> in separate slots?

I think this was discussed at least once in the past.  Others will  
need to fill in the details of that conversation.

>>> This basically amounts to the difference between maintaining data  
>>> in two
>>> places (then having to merge every time all data are needed) versus
>>> maintaining a single data frame (then simply extracting  
>>> appropriate rows
>>> whenever someone requests tip or node data). I can come up with a  
>>> bunch
>>> of reasons to make this change, mostly revolving around the fact  
>>> that
>>> this merge step is a minefield:
>>>
>>> * Putting the tip and node data together involves decisions. Should
>>> tip.data and node.data columns with the same name be treated as  
>>> the same
>>> variable?

I'd tend to say no, that we should append .node or .tip to duplicate  
names the way merge does, but I can envision situations where you  
would like the names to be the same (ancestral trait reconstruction  
for one)

>>> What if there are also two columns with that name *within*
>>> tip.data? What if the data types are different? These are  
>>> decisions that
>>> need to be made based on user input at the time of construction. In
>>> fact, some of these decisions *are* guided by inputs to the phylo4d
>>> constructor, and the two slots are populated in a way that tries to
>>> preserve those rules. But I see this is fragile. It's easy to break
>>> things afterwards by modifying the slot contents in valid ways,  
>>> and even
>>> if the slots haven't been modified, the merge can be imperfect (see
>>> below). I'd much prefer to see a data assembly step happen once  
>>> during
>>> construction (and as needed for data addition and/or replacement),  
>>> with
>>> those decisions unambiguously reflected in the structure of that  
>>> single,
>>> unified table.
>>>
>>> * The only requirement of the tip.data and node.data slots is that  
>>> they
>>> are data frames in which row.names exactly match the IDs of tips and
>>> internal nodes, respectively. Merging these can have unexpected side
>>> effects, making this a hard-to-debug process. Some examples:
>>>  1. Merge discards row names and makes new "automatic" ones, which  
>>> is
>>> dangerous in our case because the row names are node IDs! It usually
>>> works because the automatic row names are 1:N, which matches how we
>>> currently produce node IDs. But it doesn't work now if the tip.data
>>> and/or node.data rows are not in node order (this is perfectly  
>>> legal),
>>> and of course it won't work if the scheme for numbering IDs ever  
>>> changes
>>> (I know, not a huge deal, but still...). This is a good reason to
>>> assemble and validate the data frame up front, rather than doing  
>>> it on
>>> the fly.

This is a problem, and while it appears possible to preserve them,  
it's not the most elegant [^1]

[1] <http://tolstoy.newcastle.edu.au/R/help/05/11/14978.html >

>>>  2. The current approach can produce column name inconsistencies in
>>> what tdata() returns. Below, note the difference in names between  
>>> the
>>> tip/internal vs all cases:
>>>
>>> phy <- phylo4(rtree(3))
>>> phyd <- phylo4d(phy, tip.data=data.frame(a=1:3),
>>>    node.data=data.frame(a=4:5), match.data=FALSE, merge.data=FALSE)
>>> tdata(phyd, "tip")
>>> ##    a    a
>>> ## t3 1 <NA>
>>> ## t1 2 <NA>
>>> ## t2 3 <NA>
>>> tdata(phyd, "internal")
>>> ##    a a
>>> ## 4 NA d
>>> ## 5 NA e
>>> tdata(phyd, "all")
>>> ##     a  a.1
>>> ## t3  1 <NA>
>>> ## t1  2 <NA>
>>> ## t2  3 <NA>
>>> ## 4  NA    d
>>> ## 5  NA    e
>>>
>>>  3. The current approach can produce factor/character  
>>> inconsistencies.
>>> In particular, if you supply a constructor method node.data  
>>> containing a
>>> factor column, along with either tip.data or all.data, the factor  
>>> gets
>>> converted to character:
>>>
>>> phy <- phylo4(rtree(3))
>>> phyd <- phylo4d(phy, tip.data=data.frame(tdt=letters[1:3]),
>>>    node.data=data.frame(ndt=letters[4:5]), match.data=FALSE)
>>> sapply(tdata(phyd), class)
>>> ##         tdt         ndt
>>> ##    "factor" "character"
>>>
>>> * In my mind, tip data and internal data may have different  
>>> scientific
>>> interpretations, but are otherwise equivalent. Node data (read:  
>>> data for
>>> both tip and internal nodes) are node data, regardless of what  
>>> kind of
>>> nodes we're talking about. Keeping them separate adds coding  
>>> complexity.
>>> Of course, for convenience, we absolutely should continue to provide
>>> separate programmatic access to tip vs internal data via the  
>>> arguments
>>> of tdata() and related methods.

Agreed

>>> Other than the fact that it will take some programmer time (I am  
>>> happy
>>> to contribute as much as is necessary), the only possible  
>>> disadvantage I
>>> can think of is that we'd explicitly have to store NAs in cases  
>>> where a
>>> particular data column (or any data at all) is missing for one of  
>>> the
>>> node types. But at worst this means a doubling of the amount of data
>>> stored. For what it's worth, I think the constructor methods  
>>> essentially
>>> do this already (i.e., creating NAs columns for variables that don't
>>> exist for one node type but do for the other).
>>>
>>> Practicalities: This would require some modification of the  
>>> following:
>>> tdata, addData, formatData, .phylo4Data, and checkPhylo4Data. On the
>>> construction side, I think most of the necessary logic is  
>>> somewhere in
>>> those functions already. As for the tdata accessor, it's just a  
>>> matter
>>> of replacing the current merges with row indexing to retrieve the
>>> appropriate data. On the whole I think this change would make these
>>> functions *simpler*. Meanwhile, because tdata would still return the
>>> same thing as before (but with even greater consistency!), the  
>>> change
>>> would have minimal downstream effects -- by my count, direct  
>>> tip.data or
>>> node.data slot access occurs only 6 times in the code, ignoring the
>>> functions above, and these could all be fixed with trivial  
>>> changes. We'd
>>> also have to update geospiza, the docs, and tests, naturally.
>>>
>>> Addendum: In case anyone else's mind happens to wander in this
>>> direction, yes, I think a similar argument could be made for  
>>> combining
>>> the slots for tip and internal _labels_ into a single label slot,
>>> because each label is now unambiguously identified by its name (node
>>> ID). Seems like the separation is a historical artifact? Combining  
>>> them
>>> would simplify the corresponding accessor/replace methods, which
>>> currently have to look conditionally in either tip.label or  
>>> node.label
>>> depending on the arguments. And it wouldn't be hard at all to make  
>>> this
>>> change in the code base. Of course, I'm not going to ask for the  
>>> moon
>>> *and* the stars, but if someone else proposed it... :)

Again, I think performance was the reason here.  The assumption that  
more often than not trees will not have any internal node labels.

>>> And, shoot, while I'm thinking about slots, I can't resist: what  
>>> purpose
>>> does the Nnode slot serve?
>>>
>>> Thanks,
>>> Jim
>>> _______________________________________________
>>> Phylobase-devl mailing list
>>> Phylobase-devl at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
>
>
> -- 
> Ben Bolker
> Associate professor, Biology Dep't, Univ. of Florida
> bolker at ufl.edu / www.zoology.ufl.edu/bolker
> GPG key: www.zoology.ufl.edu/bolker/benbolker-publickey.asc
>
> _______________________________________________
> Phylobase-devl mailing list
> Phylobase-devl at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl