[Phylobase-devl] Issues with NCL and/or NCL interface
François Michonneau
francois.michonneau at gmail.com
Thu Mar 11 03:47:25 CET 2010
Hello all,
While writing tests for readNexus I ran into a few bugs in the way data
included in NEXUS files are imported into phylobase. I am definitely
more familiar with trees than with data when it comes to NEXUS files,
so I might have done something wrong.
I created another NEXUS file with Mesquite which includes
polymorphic characters and excluded characters (file
treeplucharV02.nex). I am not sure if the problems described below are
caused by NCL or by the interface, so it would be great if someone
with more knowledge could look into it.
Let me know if you want more details/clarifications about these issues.
Cheers,
-- François
1. char.all=TRUE/FALSE (if TRUE includes even excluded characters in
the NEXUS file)
This doesn't seem to work. In the example file, the character Test3 is
supposed to be excluded (in the ASSUMPTIONS block), but the option has
no effect on the string returned by ReadCharsWithNCL. We could
temporarily remove this option.
2. polymorphic.convert=TRUE/FALSE (if TRUE converts polymorphic
characters to missing characters)
2.1. polymorphic characters
In this case, the string returned by ReadCharsWithNCL differs depending
on the option. If polymorphic.convert=TRUE, NA is returned for
polymorphic states. If polymorphic.convert=FALSE, then
ReadCharsWithNCL returns all the states using curly brackets (e.g.
{0,1}), which produces an error message when evaluated within R. I
wrote a workaround (in R) for this problem that I should be able to
commit tomorrow. So, at least for now, it's not a crucial issue.
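
To illustrate the kind of problem described in 2.1, here is a minimal sketch in R. It is not the workaround mentioned above, whose details are not given here. The helper name quote_polymorphic and the input fragment are hypothetical, made up for illustration. The idea is to wrap each curly-bracket token in quotes so that R can parse the string.

```r
# Hypothetical helper (not part of phylobase or NCL): wrap every {...}
# token in double quotes so the vector literal becomes valid R syntax.
quote_polymorphic <- function(x) {
  gsub("(\\{[^}]*\\})", "\"\\1\"", x)
}

# Made-up fragment in the style of the string described above.
# Parsing it as-is fails because {0,1} is not a valid R expression.
raw <- "c(1,{0,1},0,NA)"

fixed <- quote_polymorphic(raw)
states <- eval(parse(text = fixed))
```

With this approach the whole vector becomes a character vector, so the polymorphic state is kept as the string "{0,1}" rather than being turned into NA. Whether that is the right representation for phylobase is an open design question.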
2.2. factor levels
Another somewhat related issue is the way the data frame based on the
data contained in the NEXUS file is created. Each character is treated
as a factor which is constructed using a call like:
Test1=factor(c(1,NA,1,1,0,1,0,NA,NA,1,0,1,0,1,1,NA,0,0),levels=c(0,1,2,3),labels=c("test1A","test1B","",""))
However, this kind of call produces warning messages because
duplicated labels aren't allowed anymore. The string created by
ReadCharsWithNCL includes unnecessary levels. The number of levels is
the same for all the characters in the data set. From the few tests I
have run, it looks like this number matches the maximum number of
states for a given character +1 (in the example file, only the
character "Test3" has 3 levels). I have also written a workaround for
this problem, but there is a risk that this warning will turn into an
error in the next few releases of R.
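
One possible way to avoid the padded levels in 2.2 can be sketched as follows. This is only an illustration under assumptions, not the workaround referred to above: the helper make_char_factor is hypothetical, and it assumes that padded levels can always be recognised by an empty label.

```r
# Hypothetical helper (not part of phylobase): build the factor using only
# the levels whose label is a non-empty string, so no duplicated "" labels
# are passed to factor().
make_char_factor <- function(values, levels, labels) {
  keep <- nzchar(labels)
  factor(values, levels = levels[keep], labels = labels[keep])
}

# Same data as in the call quoted above.
Test1 <- make_char_factor(
  c(1, NA, 1, 1, 0, 1, 0, NA, NA, 1, 0, 1, 0, 1, 1, NA, 0, 0),
  levels = c(0, 1, 2, 3),
  labels = c("test1A", "test1B", "", "")
)

levels(Test1)
```

Note that any observed state whose label is empty would become NA with this sketch, so it only makes sense if empty labels really mark unused padding.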