[Phylobase-devl] Issues with NCL and/or NCL interface
Peter Cowan
pdc at berkeley.edu
Thu Mar 11 23:50:42 CET 2010
On Mar 11, 2010, at 2:00 PM, Brian O'Meara wrote:
[snip]
>>>> 2. polymorphic.convert=TRUE/FALSE (if TRUE converts polymorphic
>>>> characters to missing characters)
>>>> 2.1. polymorphic characters
>>>> In this case, the string returned by ReadCharsWithNCL differs depending
>>>> on the option. If polymorphic.convert=TRUE, NA is returned for
>>>> polymorphic states. If polymorphic.convert=FALSE, then
>>>> ReadCharsWithNCL returns all the states using curly brackets (e.g.
>>>> {0,1}), which produces an error message when evaluated within R. I
>>>> wrote a workaround (in R) for this problem that I should be able to
>>>> commit tomorrow. So, at least for now, it's not a crucial issue.
>>>
>>> Good. When writing this part of phylobase, I wanted to keep the option
>>> of using polymorphic characters, though I don't think any R
>>> phylogenetic packages could use this (but maybe I'm wrong). Coding
>>> this to use whatever is standard in R for showing polymorphism would
>>> be good.
>>
>> How are you handling this on the R side? I don't think we've really discussed polymorphic characters before, have we? Are there any functions out there that can handle them? Should we add some functions for checking for or removing polymorphic data?
>
> François, Ben, and I talked about this separately, and it seems that a new level will be made for polymorphic data: {0,1} becomes 01. But this might become a todo for later. I don't think any functions handle polymorphic data, but some phylogenetic algorithms do (for example, dealing with biogeography, where something could be in both Africa and South America). I imagine that whatever solution we come up with would be the standard for this in R phylogenetics software. Has this come up with BioConductor or DNA data elsewhere in R?
That sounds like a reasonable solution. If we wanted to, we could also add something to the metadata slot to reconstruct the states, for functions that need this information. I don't know what, if anything, other packages might have done.
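
A minimal sketch of the recoding described above ({0,1} becomes 01), in case it helps the discussion. The helper name collapse_polymorphic is hypothetical and not part of phylobase, and the sketch assumes single-character state symbols, so that a combined level can be split back into its component states:

```r
# Hypothetical helper (not part of phylobase): collapse NEXUS-style
# polymorphic tokens such as "{0,1}" into a single label ("01") so they
# can be used as ordinary factor levels.
collapse_polymorphic <- function(states) {
  poly <- grepl("^\\{.*\\}$", states)          # NA entries give FALSE
  states[poly] <- gsub("[^[:alnum:]]", "", states[poly])
  states
}

raw <- c("0", "1", "{0,1}", NA, "1")
collapsed <- collapse_polymorphic(raw)
char1 <- factor(collapsed, levels = c("0", "1", "01"))

# Assumption: state symbols are single characters, so each level can be
# split back into its states, e.g. for storage in a metadata slot.
state_sets <- strsplit(levels(char1), "")
names(state_sets) <- levels(char1)
```

Here state_sets[["01"]] holds the two component states, which is the kind of information a function that understands polymorphism could look up.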
>>>> 2.2. factor levels
>>>> Another somewhat related issue is the way the data frame based on the
>>>> data contained in the NEXUS file is created. Each character is treated
>>>> as a factor which is constructed using a call like:
>>>> Test1=factor(c(1,NA,1,1,0,1,0,NA,NA,1,0,1,0,1,1,NA,
>>>> 0,0),levels=c(0,1,2,3),labels=c("test1A","test1B","",""))
>>>> However, this kind of call produces warning messages because
>>>> duplicated labels aren't allowed anymore. The string created by
>>>> ReadCharsWithNCL creates unnecessary levels. The number of levels is
>>>> the same for all the characters in the data set. From the few tests I
>>>> have run, it looks like this number matches the maximum number of
>>>> states for a given character +1 (in the example file, only the
>>>> character "Test3" has 3 levels). I have also written a workaround for
>>>> this problem, but there is the risk that this problem will turn into an
>>>> error message in the next few releases of R.
>>>
>>> It's good to fix the problem of duplicated labels. As for having the
>>> number of levels the same for all characters, regardless of how many
>>> states they have, this was deliberate. For example, you might have a
>>> data matrix for colors of flower parts, and use the same state coding
>>> (0=red, 1=white, 2=yellow) for three different flower parts (inner
>>> whorl of petals, outer whorl, stamen).
>>
>> Does the NEXUS format allow one state specification for multiple characters like that? Or does each character get its own "translation table", like the file François uploaded, which has this line:
>>
>> 1 Test1 / test1A test1B, 2 Test2 / test2A test2B, 3 Test3 / test3A test3B test3C ;
>
> No, it doesn't, but people often don't write state names into the actual nexus file but instead keep them in their head or written somewhere else (i.e., in the Excel spreadsheet they used when coding characters, they might have a header that says "flower symmetry (0=actinomorphic, 1=zygomorphic)").
Ah yes, this makes things clearer. I assume, then, that in these cases most programs don't bother to record this as:
1 flower symmetry / 0, 1;
>>> If the first two parts are any
>>> of the three colors, and stamens are only red (0) or yellow (2), you
>>> don't want to recode it so that the 0 and 2 for the stamen in nexus
>>> become a 0 and 1 in R. This would make plants that have yellow petals
>>> and stamens (222) be recoded as having white stamens (221), which
>>> could affect later analyses.
>>
>> Do you mean later analyses in R, or outside of R? Once they are imported into R, users shouldn't try to refer to the underlying factor levels, but should use the labels that we associate with the character.
>
> True, but I think a common use case will have people loading files without state names, just 0, 1, 2, etc., and our default labels would match levels, right?
Hmm, I haven't checked to see what happens if no state labels are provided by the nexus file. If the unused levels become a problem in the future, we might be able to drop levels in cases where state names are provided, but not otherwise.
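
To make the trade-off concrete, here is a small base-R sketch (not phylobase code) of a character coded 0 and 2 with state 1 never observed. It only uses factor() and droplevels(); the variable names are made up for illustration:

```r
# Observed codes for one character; state 1 never occurs.
codes <- c(0, 2, 2, NA, 0)

# Uniform coding: the character keeps the shared levels 0, 1, 2.
uniform <- factor(codes, levels = 0:2)

# Per-character coding: keep only the levels that actually occur.
compact <- droplevels(uniform)

levels(uniform)      # "0" "1" "2"
levels(compact)      # "0" "2"

# The labels survive either way, but the underlying integer codes shift.
as.integer(uniform)  # 1 3 3 NA 1
as.integer(compact)  # 1 2 2 NA 1
```

So dropping levels is harmless as long as code downstream works from the labels. It only bites when something relies on the integer codes, or when default labels are regenerated from the position of each level.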
>>> If you do want this recoding so that
>>> characters with two states only have two levels, you could use
>>> levels.uniform=FALSE. levels.uniform=TRUE is the default because this
>>> is how most people code traits.
>>
>> Interesting, but I'm a bit confused (probably because I've never coded traits). If I'm making a character matrix for a flower with two traits (pubescent leaves and flower color), both traits could have states TRUE, FALSE, RED, and WHITE? Even though the first two states are only ever associated with the first trait, and likewise for the second?
>
> No, but you might think "pubescent and white are both primitive states for the group, so I'll give them 0, and the other states will be 1". But the behavior of phylobase would be the same in this case regardless of levels.uniform, because both traits have the same number of states. But if you had states red, white, and yellow for color, but just true/false for pubescent leaves, levels.uniform=TRUE would make three levels for each char, while levels.uniform=FALSE would make two levels for leaves but three for color. Perhaps we should give the user a warning whenever the number of observed states is not the same for all characters, as this is confusing. Pseudocode:
>
> # minnumstates, maxnumstates: smallest and largest number of observed states per character
> if (minnumstates != maxnumstates) {
>   if (isTRUE(levels.uniform)) {
>     warning("Some characters had fewer states than others. All characters were assumed to have the same number of states (so a character with states 0 and 2 has a missing state 1 included, for example). To change this behavior, set levels.uniform=FALSE in readNexus")
>   } else {
>     warning("Some characters had fewer states than others. Each character was set to have its own required number of levels. If state labels were not given, the default labels may be incorrect. For example, a character with states 0 and 2 may be recoded to have labels 0 and 1. To change this behavior, set levels.uniform=TRUE in readNexus")
>   }
> }
I don't know enough about the character matrices that most people make, but this might end up producing a warning any time a file has differing numbers of states (which might be pretty common, right?). Either way, this could be a good thing to add to the details section of the readNexus help file.
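
For what it's worth, the check could be sketched as a standalone function along these lines. The function name check_state_counts and its chars argument (a list of state vectors, one per character) are assumptions made for illustration. Only levels.uniform comes from the discussion above, and the warning messages are shortened:

```r
# Hypothetical helper, not part of phylobase: warn when characters differ
# in their number of observed states. `chars` is a list of state vectors.
check_state_counts <- function(chars, levels.uniform = TRUE) {
  # Count distinct non-missing states for each character.
  n_states <- vapply(chars,
                     function(x) length(unique(x[!is.na(x)])),
                     integer(1))
  if (min(n_states) == max(n_states)) {
    return(invisible(FALSE))   # nothing to report
  }
  if (isTRUE(levels.uniform)) {
    warning("Some characters had fewer states than others; all characters ",
            "were given the same number of levels. ",
            "Set levels.uniform=FALSE to change this.", call. = FALSE)
  } else {
    warning("Some characters had fewer states than others; each character ",
            "was given its own number of levels, so default labels may be ",
            "shifted. Set levels.uniform=TRUE to change this.", call. = FALSE)
  }
  invisible(TRUE)
}

# Example: leaves have two observed states, color has three.
chars <- list(leaves = c(0, 1, 1, NA), color = c(0, 1, 2, 2))
msg <- tryCatch(check_state_counts(chars),
                warning = function(w) conditionMessage(w))
```

With a matrix like this one the warning fires every time, which is the concern above: if mixed state counts are the norm, a note in the help file may serve users better than a warning.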
Peter
> I forget how we set default labels, and whether a character with states 0 and 2 gets labels 0 and 1 or 0 and 2.
>
> Best,
> Brian
>
>>
>>> Hope this helps,
>>
>> It's helping me, thanks!
>>
>> Peter
>>
>>> Brian
>>>
>>>
>>>> _______________________________________________
>>>> Phylobase-devl mailing list
>>>> Phylobase-devl at lists.r-forge.r-project.org
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
>>>
>>> ------------------------------------------------------
>>> Brian O'Meara
>>> http://www.brianomeara.info
>>> Assistant Prof.
>>> Dept. Ecology & Evolutionary Biology
>>> U. of Tennessee, Knoxville
>>>
>