[Phylobase-devl] Issues with NCL and/or NCL interface

Fri Mar 12 00:00:29 CET 2010

I realized that an example might illustrate the results of these
changes better. Below are two cases for the same data set: the first with
"polymorphic.convert=FALSE, return.labels=FALSE" and the second with
"polymorphic.convert=TRUE, return.labels=TRUE".

readNexus(file="treepluscharV02.nex", polymorphic.convert=F,
levels.uniform=F, return.labels=F)
                     label node ancestor edge.length node.type Test1 Test2
1   Myrmecocystussemirufus    1       27    1.724765       tip     0     0
2   Myrmecocystusplacodops    2       27    1.724765       tip     0     0
3      Myrmecocystusmendax    3       26    4.650818       tip     1     0
4    Myrmecocystuskathjuli    4       28    1.083870       tip     1     0
5    Myrmecocystuswheeleri    5       28    1.083870       tip     0     0
6     Myrmecocystusmimicus    6       30    2.708942       tip  <NA>  <NA>
7     Myrmecocystusdepilis    7       30    2.708942       tip     1     0
8    Myrmecocystusromainei    8       32    2.193845       tip     1     1
9  Myrmecocystusnequazcatl    9       32    2.193845       tip     1     0
10       Myrmecocystusyuma   10       31    4.451425       tip     0     1
11   Myrmecocystuskennedyi   11       23    6.044804       tip     0     1
12 Myrmecocystuscreightoni   12       22   10.569191       tip  <NA> {0,1}
13  Myrmecocystussnellingi   13       33    2.770378       tip     1  <NA>
14 Myrmecocystustenuinodis   14       33    2.770378       tip     1     0
15  Myrmecocystustestaceus   15       20   12.300701       tip  <NA>  <NA>
16  Myrmecocystusmexicanus   16       34    5.724923       tip     0     0
17   Myrmecocystuscfnavajo   17       35    2.869547       tip     1 {0,1}
18     Myrmecocystusnavajo   18       35    2.869547       tip  <NA>     1

readNexus(file="treepluscharV02.nex", polymorphic.convert=T,
levels.uniform=F, return.labels=T)
                     label node ancestor edge.length node.type  Test1  Test2
1   Myrmecocystussemirufus    1       27    1.724765       tip test1A test2A
2   Myrmecocystusplacodops    2       27    1.724765       tip test1A test2A
3      Myrmecocystusmendax    3       26    4.650818       tip test1B test2A
4    Myrmecocystuskathjuli    4       28    1.083870       tip test1B test2A
5    Myrmecocystuswheeleri    5       28    1.083870       tip test1A test2A
6     Myrmecocystusmimicus    6       30    2.708942       tip   <NA>   <NA>
7     Myrmecocystusdepilis    7       30    2.708942       tip test1B test2A
8    Myrmecocystusromainei    8       32    2.193845       tip test1B test2B
9  Myrmecocystusnequazcatl    9       32    2.193845       tip test1B test2A
10       Myrmecocystusyuma   10       31    4.451425       tip test1A test2B
11   Myrmecocystuskennedyi   11       23    6.044804       tip test1A test2B
12 Myrmecocystuscreightoni   12       22   10.569191       tip   <NA>   <NA>
13  Myrmecocystussnellingi   13       33    2.770378       tip test1B   <NA>
14 Myrmecocystustenuinodis   14       33    2.770378       tip test1B test2A
15  Myrmecocystustestaceus   15       20   12.300701       tip   <NA>   <NA>
16  Myrmecocystusmexicanus   16       34    5.724923       tip test1A test2A
17   Myrmecocystuscfnavajo   17       35    2.869547       tip test1B   <NA>
18     Myrmecocystusnavajo   18       35    2.869547       tip   <NA> test2B
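
The difference between the two outputs for polymorphic tips (e.g. Test2 for
Myrmecocystuscreightoni, "{0,1}" versus NA) can be sketched in plain R. This
is only an illustration of how factor() treats quoted state strings, not the
code used in phylobase; the vector test2 and the helper poly_to_na are
hypothetical names made up for the example.

```r
# Hypothetical state codes for one character, passed as quoted strings;
# "{0,1}" marks a polymorphic tip (values invented for illustration).
test2 <- c("0", "0", "1", NA, "{0,1}", "1")

# Without a 'levels' argument, factor() derives the levels from the data,
# so the polymorphic state simply becomes an additional level.
as_is <- factor(test2)
levels(as_is)

# Hypothetical helper mimicking polymorphic.convert=TRUE: any state written
# with curly brackets is replaced by NA before the factor is built.
poly_to_na <- function(x) {
  x[grepl("^\\{.*\\}$", x)] <- NA
  x
}
converted <- factor(poly_to_na(test2))
levels(converted)
```

In the first factor the polymorphic state is kept as its own level; in the
second it is counted as missing, like the tip that was already NA.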

On Thu, Mar 11, 2010 at 17:51, François Michonneau
<francois.michonneau at gmail.com> wrote:
>>> 1. char.all=TRUE/FALSE (if TRUE includes even excluded characters in
>>> the NEXUS file)
>>> This doesn't seem to work. In the example file, the character Test3 is
>>> supposed to be excluded (in the ASSUMPTIONS block), but the option has
>>> no effect on the string returned by ReadCharsWithNCL. We could
>>> temporarily remove this option.
>>
>>
>> This is due to changes in the NCL. The way we got all vs some chars
>> (NCLInterface.cpp) is
>>
>>                        if (allchar) {
>>                                nchartoreturn=characters->GetNCharTotal();
>>                        }
>>                        else {
>>                                nchartoreturn=characters->GetNChar();
>>                        }
>>
>> but
>>
>> nxscharactersblock.h:|  The old GetNChar() function is now called
>> GetNumIncludedChars();
>>
>> Changing GetNChar to GetNumIncludedChars should help (I haven't coded in
>> phylobase lately, so I don't want to start committing code, but this is
>> where I'd start looking).
>
>
>>> 2. polymorphic.convert=TRUE/FALSE (if TRUE converts polymorphic
>>> characters to missing characters)
>>> 2.1. polymorphic characters
>>> In this case, the string returned by ReadCharsWithNCL differs depending
>>> on the option. If polymorphic.convert=TRUE, NA is returned for
>>> polymorphic states. If polymorphic.convert=FALSE, then
>>> ReadCharsWithNCL returns all the states using curly brackets (e.g.
>>> {0,1}), which produces an error message when evaluated within R. I
>>> wrote a workaround (in R) for this problem that I should be able to
>>> commit tomorrow. So, at least for now, it's not a crucial issue.
>>
>> Good. When writing this part of phylobase, I wanted to keep the option of
>> using polymorphic characters, though I don't think any R phylogenetic
>> packages could use this (but maybe I'm wrong). Coding this to use whatever
>> is standard in R for showing polymorphism would be good.
>>
>>> 2.2. factor levels
>>> Another somewhat related issue is the way the data frame based on the
>>> data contained in the NEXUS file is created. Each character is treated
>>> as a factor which is constructed using a call like:
>>>
>>> Test1=factor(c(1,NA,1,1,0,1,0,NA,NA,1,0,1,0,1,1,NA,0,0),levels=c(0,1,2,3),labels=c("test1A","test1B","",""))
>>> However, this kind of call produces warning messages because
>>> duplicated labels aren't allowed anymore. The string created by
>>> ReadCharsWithNCL creates unnecessary levels. The number of levels is
>>> the same for all the characters in the data set. From the few tests I
>>> have run, it looks like this number matches the maximum number of
>>> states for a given character +1 (in the example file, only the
>>> character "Test3" has 3 levels). I have also written a workaround for
>>> this problem, but there is a risk that this warning will turn into an
>>> error message in the next few releases of R.
>>
>> It's good to fix the problem of duplicated labels. As for having the number
>> of levels the same for all characters, regardless of how many states they
>> have, this was deliberate. For example, you might have a data matrix for
>> colors of flower parts, and use the same state coding (0=red, 1=white,
>> 2=yellow) for three different flower parts (inner whorl of petals, outer
>> whorl, stamen). If the first two parts are any of the three colors, and
>> stamens are only red (0) or yellow (2), you don't want to recode it so that
>> the 0 and 2 for the stamen in nexus become a 0 and 1 in R. This would make
>> plants that have yellow petals and stamens (222) be recoded as having white
>> stamens (221), which could affect later analyses. If you do want this
>> recoding so that characters with two states only have two levels, you could
>> use levels.uniform=FALSE. levels.uniform=TRUE is the default because this is
>> how most people code traits.
>
> I changed the way the characters are returned to R by the NCL
> interface and it should behave as it was originally intended (I hope).
>
> I put quotes around the levels (i.e. states) of the characters. Then,
> it becomes unnecessary to use the argument 'levels'. Indeed, if
> levels.uniform is FALSE, then R does what it's supposed to do and
> creates unique levels for each character. If levels.uniform is TRUE,
> then I force a posteriori all characters to have the same levels (the
> code for this part isn't the most elegant but it seems to do the job).
>
> Using the quotes also makes it possible to return polymorphic
> characters "as is" (i.e. with the curly brackets); these polymorphic
> characters are thus treated as additional levels of the factors. It
> seems to me that the user should be able to deal with it if s/he wants
> to use the polymorphism in the analysis.
>
> It seemed that levels.uniform wasn't really doing what it was supposed
> to do before the changes I committed today. Instead it was returning
> the labels associated with the character states. I thus added the new
> option 'return.labels' to readNexus to do this. Instead of returning
> the code for the state (e.g. 1) it returns its value (e.g.
> "nocturnal").
>
> Obviously, this feature doesn't play nice with polymorphic characters.
> So, if you try to use 'return.labels' with a dataset that includes
> polymorphic characters you obtain an error message saying that it's
> not implemented.
>
> I have to make a few changes to my unit tests, which I'll commit tomorrow.
>
>  Cheers,
>  -- François
>
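
The levels.uniform behaviour discussed in the quoted thread (forcing all
characters to share one set of levels after the factors are built) could look
roughly like the sketch below. It reuses the flower-colour example from the
thread; the data frame chars and its values are invented for illustration,
and this is not the code committed to phylobase.

```r
# Hypothetical data: two characters coded 0=red, 1=white, 2=yellow.
# The stamen character only ever shows states 0 and 2.
chars <- data.frame(
  petal  = factor(c("0", "1", "2", "1")),
  stamen = factor(c("0", "2", "2", "0"))
)

# Union of every state observed in any character.
all_states <- sort(unique(unlist(lapply(chars, levels))))

# Rebuild each factor with the shared level set. Values are matched by
# name, so state "2" keeps its position and is not renumbered.
uniform <- as.data.frame(
  lapply(chars, function(f) factor(as.character(f), levels = all_states))
)

levels(uniform$stamen)      # shared levels, including the unobserved "1"
as.integer(chars$stamen)    # per-character levels: "2" is the 2nd level
as.integer(uniform$stamen)  # uniform levels: "2" is the 3rd level
```

With per-character levels, yellow stamens get the same integer code that
white has in the petal character, which is the recoding problem described in
the thread; with shared levels the codes stay comparable across characters.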