[Phylobase-devl] Issues with NCL and/or NCL interface

Thu Mar 11 23:00:49 CET 2010

On Mar 11, 2010, at 4:37 PM, Peter Cowan wrote:

>
> On Mar 11, 2010, at 8:36 AM, Brian O'Meara wrote:
>
>>
>> On Mar 10, 2010, at 9:47 PM, François Michonneau wrote:
>>
>>> Hello all,
>>>
>>> While writing tests for readNexus I faced a few bugs in the way data
>>> included in NEXUS files are imported in phylobase. I am definitely
>>> more familiar with trees than with data when it comes to NEXUS files
>>> so I might have done something wrong.
>>>
>>> I created another NEXUS file with Mesquite which includes
>>> polymorphic characters and excluded characters (file
>>> treeplucharV02.nex). I am not sure if the problems described below are
>>> caused by NCL or by the interface, so it would be great if someone
>>> with more knowledge could look into it.
>>>
>>> Let me know if you want more details/clarifications about these
>>> issues.
>>>
>>> Cheers,
>>> -- François
>>
>> Thanks for working on this, François.
>
> Yes, thanks for pushing this forward.
>
>>>
>>> 1. char.all=TRUE/FALSE (if TRUE includes even excluded characters in
>>> the NEXUS file)
>>> This doesn't seem to work. In the example file, the character Test3 is
>>> supposed to be excluded (in the ASSUMPTIONS block), but the option has
>>> no effect on the string returned by ReadCharsWithNCL. We could
>>> temporarily remove this option.
>>
>>
>> This is due to changes in the NCL. The way we got all vs some chars
>> (NCLInterface.cpp) is
>>
>> 			if (allchar) {
>> 				nchartoreturn=characters->GetNCharTotal();
>> 			}
>> 			else {
>> 				nchartoreturn=characters->GetNChar();
>> 			}
>>
>> but
>>
>> nxscharactersblock.h:|	The old GetNChar() function is now called
>> GetNumIncludedChars();
>>
>> Changing GetNChar to GetNumIncludedChars should help (I haven't coded
>> in phylobase lately, so I don't want to start committing code, but
>> this is where I'd start looking).
>>
>>
>>>
>>> 2. polymorphic.convert=TRUE/FALSE (if TRUE converts polymorphic
>>> characters to missing characters)
>>> 2.1. polymorphic characters
>>> In this case, the string returned by ReadCharsWithNCL differs depending
>>> on the option. If polymorphic.convert=TRUE, NAs are returned for
>>> polymorphic states. If polymorphic.convert=FALSE, then
>>> ReadCharsWithNCL returns all the states using curly brackets (e.g.
>>> {0,1}), which produces an error message when evaluated within R. I
>>> wrote a workaround (in R) for this problem that I should be able to
>>> commit tomorrow. So, at least for now, it's not a crucial issue.
>>
>> Good. When writing this part of phylobase, I wanted to keep the option
>> of using polymorphic characters, though I don't think any R
>> phylogenetic packages could use this (but maybe I'm wrong). Coding
>> this to use whatever is standard in R for showing polymorphism would
>> be good.
>
> How are you handling this on the R side?  I don't think we've really  
> discussed polymorphic characters before, have we?  Are there any  
> functions out there that can handle them?  Should we add some  
> functions for checking for or removing polymorphic data?

François, Ben, and I talked about this separately, and it seems that a  
new level will be made for polymorphic data: {0,1} becomes 01. But  
this might become a todo for later. I don't think any R functions handle
polymorphic data, but some phylogenetic algorithms do (for example,
dealing with biogeography, where something could be in both Africa and  
South America). I imagine that whatever solution we come up with would  
be the standard for this in R phylogenetics software. Has this come up  
with BioConductor or DNA data elsewhere in R?
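
A minimal sketch of the "{0,1} becomes 01" idea, assuming the states arrive in R as character strings. The helper name collapsePolymorphic is hypothetical and is not part of phylobase; it only illustrates how a combined state could become its own factor level.

```r
# Hypothetical helper: collapse NEXUS-style polymorphic states such as
# "{0,1}" into a single combined state "01". Not part of phylobase.
collapsePolymorphic <- function(states) {
  # Drop the curly brackets and the commas between states; NA stays NA
  gsub("[{},]", "", states)
}

raw <- c("0", "{0,1}", "1", NA, "{0,1}")
collapsed <- collapsePolymorphic(raw)

# The combined state gets its own level, listed after the plain states
trait <- factor(collapsed, levels = c("0", "1", "01"))
table(trait, useNA = "ifany")
```

Listing the levels explicitly keeps the plain states in a predictable order, so the polymorphic level is always the last one.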

>
>>> 2.2. factor levels
>>> Another somewhat related issue is the way the data frame based on the
>>> data contained in the NEXUS file is created. Each character is treated
>>> as a factor which is constructed using a call like:
>>> Test1=factor(c(1,NA,1,1,0,1,0,NA,NA,1,0,1,0,1,1,NA,
>>> 0,0),levels=c(0,1,2,3),labels=c("test1A","test1B","",""))
>>> However, this kind of call produces warning messages because
>>> duplicated labels aren't allowed anymore. The string created by
>>> ReadCharsWithNCL creates unnecessary levels. The number of levels is
>>> the same for all the characters in the data set. From the few tests I
>>> have run, it looks like this number matches the maximum number of
>>> states for a given character +1 (in the example file, only the
>>> character "Test3" has 3 levels). I have also written a workaround for
>>> this problem, but there is the risk that this problem will turn into an
>>> error message in the next few releases of R.
>>
>> It's good to fix the problem of duplicated labels. As for having the
>> number of levels the same for all characters, regardless of how many
>> states they have, this was deliberate. For example, you might have a
>> data matrix for colors of flower parts, and use the same state coding
>> (0=red, 1=white, 2=yellow) for three different flower parts (inner
>> whorl of petals, outer whorl, stamen).
>
> Does the NEXUS format allow one state specification for multiple  
> characters like that? Or, does each character get its own  
> "translation table" like the file François uploaded, which has this  
> line:
>
> 1 Test1 /  test1A test1B, 2 Test2 /  test2A test2B, 3 Test3 /  test3A test3B test3C ;

No, it doesn't, but people often don't write state names into the  
actual nexus file but instead keep them in their head or written  
somewhere else (i.e., in the Excel spreadsheet they used when coding  
characters, they might have a header that says "flower symmetry  
(0=actinomorphic, 1=zygomorphic)").

>
>> If the first two parts are any
>> of the three colors, and stamens are only red (0) or yellow (2), you
>> don't want to recode it so that the 0 and 2 for the stamen in nexus
>> become a 0 and 1 in R. This would make plants that have yellow petals
>> and stamens (222) be recoded as having white stamens (221), which
>> could affect later analyses.
>
> Do you mean later analyses in R, or outside of R?  Once they are  
> imported into R users shouldn't try to refer to the underlying  
> factor levels, but use the labels that we associate with the  
> character for them.

True, but I think a common use case will have people loading files  
without state names, just 0, 1, 2, etc., and our default labels would  
match levels, right?
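
A small sketch of the concern, using base R only. The data are made up, and the second call shows what a renumbering of labels would do in principle; it is not a statement about what phylobase actually does.

```r
# A stamen-colour character that only uses states 0 and 2
stamen <- c(0, 2, 2, 0)

# Uniform levels: keep the unobserved state 1, so labels match the file
uniform <- factor(stamen, levels = 0:2)
levels(uniform)        # "0" "1" "2"

# Per-character levels with labels renumbered from 0: state 2 becomes "1"
recoded <- factor(stamen, levels = c(0, 2), labels = c("0", "1"))
as.character(recoded)  # "0" "1" "1" "0"
```

With the uniform levels a yellow stamen still reads as 2, while the renumbered version reports it as 1, which is the 222 versus 221 problem described earlier.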

>
>> If you do want this recoding so that
>> characters with two states only have two levels, you could use
>> levels.uniform=FALSE. levels.uniform=TRUE is the default because this
>> is how most people code traits.
>
> Interesting, but I'm a bit confused (probably because I've never  
> coded traits).  If I'm making a character matrix for a flower with  
> two traits (pubescent leaves and flower color), both traits could  
> have states TRUE, FALSE, RED, and WHITE? Even though the first two  
> states are only ever associated with the first trait, and likewise  
> for the second?

No, but you might think "pubescent and white are both primitive states  
for the group, so I'll give them 0, and the other states will be 1".  
But the behavior of phylobase would be the same in this case  
regardless of levels.uniform, because both traits have the same number  
of states. But if you had states red, white, and yellow for color, but  
just true/false for pubescent leaves, levels.uniform=TRUE would make  
three levels for each char, while levels.uniform=FALSE would make two  
levels for leaves but three for color. Perhaps we should give the user a
warning whenever the number of observed states is not the same for all
characters, as this is confusing. Pseudocode:

if (minnumstates != maxnumstates) {
	if (levels.uniform) {
		# warning() concatenates its arguments into one message
		warning("Some characters had fewer states than others. ",
			"All characters were assumed to have the same number of ",
			"states (so a character with states 0 and 2 has a missing ",
			"state 1 included, for example). To change this behavior, ",
			"set levels.uniform=FALSE in readNexus")
	} else {
		warning("Some characters had fewer states than others. ",
			"Each character was set to have its own required number of ",
			"levels. If state labels were not given, the default labels ",
			"may be incorrect. For example, a character with states 0 ",
			"and 2 may be recoded to have labels 0 and 1. To change ",
			"this behavior, set levels.uniform=TRUE in readNexus")
	}
}
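
One way minnumstates and maxnumstates in the pseudocode above might be computed, as a sketch only. It assumes the characters are already in a data frame with one column per character and NA for missing data; the data frame chars and its contents are made up for illustration.

```r
# Hypothetical example data: one character per column, NA for missing
chars <- data.frame(
  leaves = c(0, 1, 1, NA, 0),
  color  = c(0, 2, 1, 2, 0)
)

# Count the distinct observed (non-missing) states in each character
numstates <- vapply(chars, function(x) length(unique(x[!is.na(x)])),
                    integer(1))
minnumstates <- min(numstates)
maxnumstates <- max(numstates)
```

Here leaves has two observed states and color has three, so the check would trigger a warning. Counting observed states is only one possible definition; counting up to the highest state code would treat a character with states 0 and 2 as having three states.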

I forget how we set default labels, and whether a character with  
states 0 and 2 gets labels 0 and 1 or 0 and 2.

Best,
Brian

>
>> Hope this helps,
>
> It's helping me, thanks!
>
> Peter
>
>> Brian
>>
>>
>>> _______________________________________________
>>> Phylobase-devl mailing list
>>> Phylobase-devl at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
>>
>> ------------------------------------------------------
>> Brian O'Meara
>> http://www.brianomeara.info
>> Assistant Prof.
>> Dept. Ecology & Evolutionary Biology
>> U. of Tennessee, Knoxville
>>
