[Phylobase-devl] Issues with NCL and/or NCL interface

Fri Mar 12 16:24:30 CET 2010

On Mar 11, 2010, at 5:51 PM, François Michonneau wrote:
>> <snip>
>>
>>> 2.2. factor levels
>>> Another somewhat related issue is the way the data frame based on the
>>> data contained in the NEXUS file is created. Each character is treated
>>> as a factor which is constructed using a call like:
>>>
>>> Test1=factor(c(1,NA,1,1,0,1,0,NA,NA,1,0,1,0,1,1,NA,0,0),
>>>              levels=c(0,1,2,3),labels=c("test1A","test1B","",""))
>>> However, this kind of call produces warning messages because
>>> duplicated labels aren't allowed anymore. The string created by
>>> ReadCharsWithNCL creates unnecessary levels. The number of levels is
>>> the same for all the characters in the data set. From the few tests I
>>> have run, it looks like this number matches the maximum number of
>>> states for a given character +1 (in the example file, only the
>>> character "Test3" has 3 levels). I have also written a workaround for
>>> this problem, but there is a risk that this warning will turn into an
>>> error message in the next few releases of R.
>>
>> It's good to fix the problem of duplicated labels. As for having the
>> number of levels the same for all characters, regardless of how many
>> states they have, this was deliberate. For example, you might have a
>> data matrix for colors of flower parts, and use the same state coding
>> (0=red, 1=white, 2=yellow) for three different flower parts (inner whorl
>> of petals, outer whorl, stamen). If the first two parts are any of the
>> three colors, and stamens are only red (0) or yellow (2), you don't want
>> to recode it so that the 0 and 2 for the stamen in nexus become a 0 and 1
>> in R. This would make plants that have yellow petals and stamens (222)
>> be recoded as having white stamens (221), which could affect later
>> analyses. If you do want this recoding so that characters with two
>> states only have two levels, you could use levels.uniform=FALSE.
>> levels.uniform=TRUE is the default because this is how most people code
>> traits.
>
> I changed the way the characters are returned to R by the NCL
> interface and it should behave as it was originally intended (I hope).

Phylobase is still new enough that, in cases like this where people
probably haven't used the feature much, I think it's worth going with
whatever will be most useful for users. Hopefully this and the originally
intended behavior are the same, but much of the original behavior in this
particular area was designed by me during the original hackathon, and I'm
more than open to having it changed for better utility.

>
> I put quotes around the levels (i.e. states) of the characters. Then,
> it becomes unnecessary to use the argument 'levels'. Indeed, if
> levels.uniform is FALSE, then R does what it's supposed to do and
> creates unique levels for each character. If levels.uniform is TRUE,
> then I force a posteriori all characters to have the same levels (the
> code for this part isn't the most elegant but it seems to do the job).
>
> Using the quotes also makes it possible to return polymorphic characters
> "as is" (i.e. with the curly brackets); these polymorphic characters are
> thus treated as additional levels of the factors. It seems to me that
> the user should be able to deal with it if s/he wants to use the
> polymorphism in the analysis.

Nice.
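
For concreteness, a rough sketch of the two behaviours in plain R. This is
not the phylobase code: the toy data frame and the helper name
uniform_levels are made up for illustration, and the actual readNexus
implementation may well differ.

```r
# Hypothetical character matrix with states kept as quoted strings;
# "{0,2}" stands for a polymorphic observation returned "as is".
chars <- data.frame(
  petal  = c("0", "1", "2", "{0,2}", NA),
  stamen = c("0", "2", "2", "0", NA),
  stringsAsFactors = FALSE
)

# Analogue of levels.uniform = FALSE: each character only gets the
# levels that actually occur in it.
per_char <- lapply(chars, factor)

# Analogue of levels.uniform = TRUE: force the union of all observed
# states onto every character after the fact (illustrative helper).
uniform_levels <- function(df) {
  all_states <- sort(unique(unlist(lapply(df, as.character))))
  as.data.frame(lapply(df, factor, levels = all_states))
}
uniform <- uniform_levels(chars)

levels(per_char$stamen)  # only the states seen in the stamen character
levels(uniform$stamen)   # every state seen anywhere in the matrix
```

In both cases a stamen coded 2 in the file is still "2" in R; only the set
of levels attached to the factor changes.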

>
> It seemed that levels.uniform wasn't really doing what it was supposed
> to do before the changes I committed today. Instead it was returning
> the labels associated with the character states. I thus added the new
> option 'return.labels' to readNexus to do this. Instead of returning
> the code for the state (e.g. 1) it returns its value (e.g.
> "nocturnal").

Good idea.

>
> Obviously, this feature doesn't play nice with polymorphic characters.

Why not? Is there a difference between '{0, 1}' and '{nocturnal,  
diurnal}'? The latter would only be an issue if some state names had  
commas in them, but that's such an infrequent use case that we could  
just have a warning if a comma in a state name is detected and there  
is a polymorphic character.
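
Something along these lines might do it. This is only a sketch under
assumptions: label_states is a hypothetical helper, not an existing
phylobase function, and it assumes polymorphic states arrive as strings
such as "{0, 1}" and that labels are supplied as a named character vector.

```r
# Illustrative helper (not part of phylobase): translate state codes,
# including polymorphic ones such as "{0, 1}", into their labels.
label_states <- function(codes, labels) {
  # labels: named character vector, e.g. c("0" = "nocturnal", "1" = "diurnal")
  is_poly <- grepl("^\\{.*\\}$", codes)
  if (any(is_poly) && any(grepl(",", labels, fixed = TRUE))) {
    warning("a state label contains a comma; polymorphic states may be ambiguous")
  }
  translate_one <- function(code) {
    if (is.na(code)) return(NA_character_)
    if (!grepl("^\\{.*\\}$", code)) return(unname(labels[code]))
    # Strip the braces, split on commas, and look up each state in turn.
    parts <- trimws(strsplit(gsub("[{}]", "", code), ",", fixed = TRUE)[[1]])
    paste0("{", paste(labels[parts], collapse = ", "), "}")
  }
  vapply(codes, translate_one, character(1), USE.NAMES = FALSE)
}

activity <- c("0" = "nocturnal", "1" = "diurnal")
label_states(c("0", "1", "{0, 1}", NA), activity)
```

The warning only fires when both conditions hold (a polymorphic state is
present and some label contains a comma), so ordinary data sets would not
see it.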

> So, if you try to use 'return.labels' with a dataset that includes
> polymorphic characters you obtain an error message saying that it's
> not implemented.
>
> I have to make a few changes to my unit tests, which I'll commit
> tomorrow.

Thanks again for your work on this.

Best,
Brian

>
>  Cheers,
>  -- François

------------------------------------------------------
Brian O'Meara
http://www.brianomeara.info
Assistant Prof.
Dept. Ecology & Evolutionary Biology
U. of Tennessee, Knoxville
