[Phylobase-devl] unification of tree data slots

Thu Sep 17 03:11:23 CEST 2009

If truly you mean most trees will have tip data *only*, as opposed to 
tip data *mostly*, a solution is that we don't require every node to 
have a row in the data frame. So if you only supply tip data, then we'll 
only store rows for the tips, and for internal tips the NAs will be 
implied (but not stored) for all columns.

The caveat is that if you want even just a single column of data for 
nodes, then you'll actually have to store NAs for all the other columns 
as well.

So for a 3-tip (2-internal node) tree, if you only have tip data (but no 
node data) for columns a and b, then this is your space-efficient data 
frame:

@data
     a   b
1   x   x
2   x   x
3   x   x

But if you decide to add a column c for node data, then yes, we'd need 
to expand it to this:

@data:
     a   b   c
1   x   x  NA
2   x   x  NA
3   x   x  NA
4  NA  NA   x
5  NA  NA   x

Does that help at all? I guess I don't really blink much at the idea of 
"doubling the amount of data storage", but if others have very practical 
reasons for avoiding this, I understand.

BTW, I think the phylo4d constructor basically does the second thing 
above already if you supply both tip and node data, but it would store 
the first 3 rows in the @tip.data data frame and the last 2 rows in the 
@node.data data frame.

Jim

Ben Bolker wrote:
>   To the best of my recollection, our reason for keeping them separate
> was that we thought that the majority of trees would have tip data only,
> and that having node data would typically double the amount of data
> storage ... ?  or am I missing something here?
> 
>   Ben
> 
> François Michonneau wrote:
>> Hi Jim,
>>
>>   I totally agree with you. After rewriting the phylo4d constructor it
>> seemed to me that the distinction in the code between tip and node data
>> was unnecessary. It also seems to me that the advantages of having a
>> single slots for the data outweighs the annoyances it could create.
>>
>>   I am also in favor on having a single slots for the labels. As for the
>> data, the same functions are used to build them and match them, so it
>> would be natural to have them in a single place.
>>
>>   Finally, I think the Nnode slot was there to match the structure of
>> phylo objects. But, it seems superfluous with the tests that S4 methods
>> bring.
>>
>>   I'm quite busy but would be keen on finding some time to help
>> implementing these changes if other people agree...
>>
>>   Cheers,
>>
>>   -- François 
>>
>> On Wed, 2009-09-16 at 15:17 -0700, Jim Regetz wrote:
>>> Hi all,
>>>
>>> Promises, promises. I know I swore off any more pre-release proposals, 
>>> but because this involves class modification, I see this as a speak now 
>>> or forever hold my peace sort of thing!
>>>
>>> While fleshing out unit tests, I ran into some minor issues involving 
>>> tdata. What I thought would be quick fixes raised thorny questions about 
>>> how to coordinate node.data and tip.data more robustly, and I kept 
>>> coming back to the idea that one structural change in the phylo4d class 
>>> would do wonders for simplifying tree data management:
>>>
>>> What would folks think about having a single 'data' slot (it could be 
>>> called something else), rather than keeping 'tip.data' and 'node.data' 
>>> in separate slots?
>>>
>>> This basically amounts to the difference between maintaining data in two 
>>> places (then having to merge every time all data are needed) versus 
>>> maintaining a single data frame (then simply extracting appropriate rows 
>>> whenever someone requests tip or node data). I can come up with a bunch 
>>> of reasons to make this change, mostly revolving around the fact that 
>>> this merge step is a minefield:
>>>
>>> * Putting the tip and node data together involves decisions. Should 
>>> tip.data and node.data columns with the same name be treated as the same 
>>> variable? What if there are also two columns with that name *within* 
>>> tip.data? What if the data types are different? These are decisions that 
>>> need to be made based on user input at the time of construction. In 
>>> fact, some of these decisions *are* guided by inputs to the phylo4d 
>>> constructor, and the two slots are populated in a way that tries to 
>>> preserve those rules. But I see this is fragile. It's easy to break 
>>> things afterwards by modifying the slot contents in valid ways, and even 
>>> if the slots haven't been modified, the merge can be imperfect (see 
>>> below). I'd much prefer to see a data assembly step happen once during 
>>> construction (and as needed for data addition and/or replacement), with 
>>> those decisions unambiguously reflected in the structure of that single, 
>>> unified table.
>>>
>>> * The only requirement of the tip.data and node.data slots is that they 
>>> are data frames in which row.names exactly match the IDs of tips and 
>>> internal nodes, respectively. Merging these can have unexpected side 
>>> effects, making this a hard-to-debug process. Some examples:
>>>    1. Merge discards row names and makes new "automatic" ones, which is 
>>> dangerous in our case because the row names are node IDs! It usually 
>>> works because the automatic row names are 1:N, which matches how we 
>>> currently produce node IDs. But it doesn't work now if the tip.data 
>>> and/or node.data rows are not in node order (this is perfectly legal), 
>>> and of course it won't work if the scheme for numbering IDs ever changes 
>>> (I know, not a huge deal, but still...). This is a good reason to 
>>> assemble and validate the data frame up front, rather than doing it on 
>>> the fly.
>>>    2. The current approach can produce column name inconsistencies in 
>>> what tdata() returns. Below, note the difference in names between the 
>>> tip/internal vs all cases:
>>>
>>> phy <- phylo4(rtree(3))
>>> phyd <- phylo4d(phy, tip.data=data.frame(a=1:3),
>>>      node.data=data.frame(a=4:5), match.data=FALSE, merge.data=FALSE)
>>> tdata(phyd, "tip")
>>> ##    a    a
>>> ## t3 1 <NA>
>>> ## t1 2 <NA>
>>> ## t2 3 <NA>
>>> tdata(phyd, "internal")
>>> ##    a a
>>> ## 4 NA d
>>> ## 5 NA e
>>> tdata(phyd, "all")
>>> ##     a  a.1
>>> ## t3  1 <NA>
>>> ## t1  2 <NA>
>>> ## t2  3 <NA>
>>> ## 4  NA    d
>>> ## 5  NA    e
>>>
>>>    3. The current approach can produce factor/character inconsistencies. 
>>> In particular, if you supply a constructor method node.data containing a 
>>> factor column, along with either tip.data or all.data, the factor gets 
>>> converted to character:
>>>
>>> phy <- phylo4(rtree(3))
>>> phyd <- phylo4d(phy, tip.data=data.frame(tdt=letters[1:3]),
>>>      node.data=data.frame(ndt=letters[4:5]), match.data=FALSE)
>>> sapply(tdata(phyd), class)
>>> ##         tdt         ndt
>>> ##    "factor" "character"
>>>
>>> * In my mind, tip data and internal data may have different scientific 
>>> interpretations, but are otherwise equivalent. Node data (read: data for 
>>> both tip and internal nodes) are node data, regardless of what kind of 
>>> nodes we're talking about. Keeping them separate adds coding complexity. 
>>> Of course, for convenience, we absolutely should continue to provide 
>>> separate programmatic access to tip vs internal data via the arguments 
>>> of tdata() and related methods.
>>>
>>> Other than the fact that it will take some programmer time (I am happy 
>>> to contribute as much as is necessary), the only possible disadvantage I 
>>> can think of is that we'd explicitly have to store NAs in cases where a 
>>> particular data column (or any data at all) is missing for one of the 
>>> node types. But at worst this means a doubling of the amount of data 
>>> stored. For what it's worth, I think the constructor methods essentially 
>>> do this already (i.e., creating NAs columns for variables that don't 
>>> exist for one node type but do for the other).
>>>
>>> Practicalities: This would require some modification of the following: 
>>> tdata, addData, formatData, .phylo4Data, and checkPhylo4Data. On the 
>>> construction side, I think most of the necessary logic is somewhere in 
>>> those functions already. As for the tdata accessor, it's just a matter 
>>> of replacing the current merges with row indexing to retrieve the 
>>> appropriate data. On the whole I think this change would make these 
>>> functions *simpler*. Meanwhile, because tdata would still return the 
>>> same thing as before (but with even greater consistency!), the change 
>>> would have minimal downstream effects -- by my count, direct tip.data or 
>>> node.data slot access occurs only 6 times in the code, ignoring the 
>>> functions above, and these could all be fixed with trivial changes. We'd 
>>> also have to update geospiza, the docs, and tests, naturally.
>>>
>>> Addendum: In case anyone else's mind happens to wander in this 
>>> direction, yes, I think a similar argument could be made for combining 
>>> the slots for tip and internal _labels_ into a single label slot, 
>>> because each label is now unambiguously identified by its name (node 
>>> ID). Seems like the separation is a historical artifact? Combining them 
>>> would simplify the corresponding accessor/replace methods, which 
>>> currently have to look conditionally in either tip.label or node.label 
>>> depending on the arguments. And it wouldn't be hard at all to make this 
>>> change in the code base. Of course, I'm not going to ask for the moon 
>>> *and* the stars, but if someone else proposed it... :)
>>>
>>> And, shoot, while I'm thinking about slots, I can't resist: what purpose 
>>> does the Nnode slot serve?
>>>
>>> Thanks,
>>> Jim
>>> _______________________________________________
>>> Phylobase-devl mailing list
>>> Phylobase-devl at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
> 
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Phylobase-devl mailing list
> Phylobase-devl at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl