[Phylobase-devl] Issues with NCL and/or NCL interface

Thu Mar 11 23:51:20 CET 2010

>> 1. char.all=TRUE/FALSE (if TRUE includes even excluded characters in
>> the NEXUS file)
>> This doesn't seem to work. In the example file, the character Test3 is
>> supposed to be excluded (in the ASSUMPTIONS block), but the option has
>> no effect on the string returned by ReadCharsWithNCL. We could
>> temporarily remove this option.
>
>
> This is due to changes in the NCL. The way we got all vs some chars
> (NCLInterface.cpp) is
>
>                        if (allchar) {
>                                nchartoreturn=characters->GetNCharTotal();
>                        }
>                        else {
>                                nchartoreturn=characters->GetNChar();
>                        }
>
> but
>
> nxscharactersblock.h:|  The old GetNChar() function is now called
> GetNumIncludedChars();
>
> Changing GetNChar to GetNumIncludedChars should help (I haven't coded in
> phylobase lately, so I don't want to start committing code, but this is
> where I'd start looking).

>> 2. polymorphic.convert=TRUE/FALSE (if TRUE converts polymorphic
>> characters to missing characters)
>> 2.1. polymorphic characters
>> In this case, the string returned by ReadCharsWithNCL differs depending
>> on the option. If polymorphic.convert=TRUE, NAs are returned for
>> polymorphic states. If polymorphic.convert=FALSE, then
>> ReadCharsWithNCL returns all the states using curly brackets (e.g.
>> {0,1}), which produces an error message when evaluated within R. I
>> wrote a workaround (in R) for this problem that I should be able to
>> commit tomorrow. So, at least for now, it's not a crucial issue.
>
> Good. When writing this part of phylobase, I wanted to keep the option of
> using polymorphic characters, though I don't think any R phylogenetic
> packages could use this (but maybe I'm wrong). Coding this to use whatever
> is standard in R for showing polymorphism would be good.
>
>> 2.2. factor levels
>> Another somewhat related issue is the way the data frame based on the
>> data contained in the NEXUS file is created. Each character is treated
>> as a factor which is constructed using a call like:
>>
>> Test1=factor(c(1,NA,1,1,0,1,0,NA,NA,1,0,1,0,1,1,NA,0,0),levels=c(0,1,2,3),labels=c("test1A","test1B","",""))
>> However, this kind of call produces warning messages because
>> duplicated labels aren't allowed anymore. The string created by
>> ReadCharsWithNCL creates unnecessary levels. The number of levels is
>> the same for all the characters in the data set. From the few tests I
>> have run, it looks like this number matches the maximum number of
>> states for a given character +1 (in the example file, only the
>> character "Test3" has 3 levels). I have also written a workaround for
>> this problem, but there is a risk that the warning will turn into an
>> error message in the next few releases of R.
>
> It's good to fix the problem of duplicated labels. As for having the number
> of levels the same for all characters, regardless of how many states they
> have, this was deliberate. For example, you might have a data matrix for
> colors of flower parts, and use the same state coding (0=red, 1=white,
> 2=yellow) for three different flower parts (inner whorl of petals, outer
> whorl, stamen). If the first two parts are any of the three colors, and
> stamens are only red (0) or yellow (2), you don't want to recode it so that
> the 0 and 2 for the stamen in nexus become a 0 and 1 in R. This would make
> plants that have yellow petals and stamens (222) be recoded as having white
> stamens (221), which could affect later analyses. If you do want this
> recoding so that characters with two states only have two levels, you could
> use levels.uniform=FALSE. levels.uniform=TRUE is the default because this is
> how most people code traits.

I changed the way the characters are returned to R by the NCL
interface and it should behave as it was originally intended (I hope).

I put quotes around the levels (i.e. states) of the characters. Then,
it becomes unnecessary to use the argument 'levels'. Indeed, if
levels.uniform is FALSE, then R does what it's supposed to do and
creates unique levels for each character. If levels.uniform is TRUE,
then I force a posteriori all characters to have the same levels (the
code for this part isn't the most elegant but it seems to do the job).
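
A minimal sketch of this behaviour in plain R, reusing the flower-colour
example quoted above. This is not the actual phylobase code; the data frame
and the names chars, all_states and uniform are made up for illustration.

```r
# Character states arrive as quoted strings, so factor() needs no 'levels'.
chars <- data.frame(
  petal  = factor(c("0", "1", "2", "2")),
  stamen = factor(c("0", "2", "2", "0"))
)

# levels.uniform = FALSE: each character keeps only the states it contains.
levels(chars$stamen)   # "0" "2"

# levels.uniform = TRUE: a posteriori, give every character the union of
# all states, so that state "2" means the same thing in every column.
all_states <- sort(unique(unlist(lapply(chars, levels))))
uniform <- as.data.frame(lapply(chars, factor, levels = all_states))
levels(uniform$stamen) # "0" "1" "2"
```

With uniform levels, a yellow stamen ("2") keeps integer code 3 in both
columns; without them it would get code 2 in the stamen column.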

Using the quotes also makes it possible to return polymorphic characters "as is"
(i.e. with the curly brackets); and these polymorphic characters are
thus treated as additional levels of the factors. It seems to me that
the user should be able to deal with it if s/he wants to use the
polymorphism in the analysis.
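
As an illustration (plain R, with a made-up vector of states rather than real
readNexus output), a polymorphic state simply becomes one more level, and a
user who prefers missing values can convert it afterwards:

```r
# Hypothetical states for one character; "{0,1}" is a polymorphic entry.
states <- c("0", "1", "{0,1}", "1", NA)

# Quoted states: the polymorphic entry is just an additional factor level.
x <- factor(states)
nlevels(x)                 # 3
"{0,1}" %in% levels(x)     # TRUE

# One possible way for the user to treat polymorphism as missing data.
is_poly <- grepl("^\\{.*\\}$", states)   # NA entries give FALSE here
y <- factor(replace(states, is_poly, NA))
levels(y)                  # "0" "1"
```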

It seemed that levels.uniform wasn't really doing what it was supposed
to do before the changes I committed today. Instead it was returning
the labels associated with the character states. I thus added the new
option 'return.labels' to readNexus to do this. Instead of returning
the code for the state (e.g. 1) it returns its value (e.g.
"nocturnal").

Obviously, this feature doesn't play nice with polymorphic characters.
So, if you try to use 'return.labels' with a dataset that includes
polymorphic characters you obtain an error message saying that it's
not implemented.
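
The idea behind 'return.labels' can be sketched as follows. Everything here
is hypothetical: codes_to_labels is not a phylobase function, the "diurnal"
label and the wording of the error message are assumptions, and the real
conversion happens inside the NCL interface.

```r
# Hypothetical helper: map state codes to their labels, refusing
# polymorphic entries such as "{0,1}".
codes_to_labels <- function(codes, state_labels) {
  if (any(grepl("[{}]", codes))) {
    stop("return.labels is not implemented for polymorphic characters")
  }
  factor(unname(state_labels[codes]), levels = unname(state_labels))
}

# Assumed labels for a two-state character.
state_labels <- c("0" = "diurnal", "1" = "nocturnal")

activity <- codes_to_labels(c("0", "1", "1", "0"), state_labels)
as.character(activity)   # "diurnal" "nocturnal" "nocturnal" "diurnal"

# A polymorphic entry triggers the error instead of a silent guess.
res <- try(codes_to_labels(c("0", "{0,1}"), state_labels), silent = TRUE)
inherits(res, "try-error")   # TRUE
```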

I still have to make a few changes to my unit tests, which I'll commit tomorrow.

  Cheers,
  -- François