[Phylobase-devl] New phylobase build approach using static libncl (Was: Rcpp and OS X compilation)

Wed May 5 11:54:05 CEST 2010

Hi,

I'd agree with this - we have to be able to cope with any taxon names and, ideally, if we can avoid having to make the taxon names syntactically valid R, then we should change them as little as possible (apart from the obligatory unquoted underscore to space translation) and keep them as character strings.

What would NCL pass as a token given these two?

'taxon\1'
'taxon"1'

I think these are the problem cases - R needs to escape the backslash and double quote, but handles the double quote better than the backslash:

> x <- c('taxon"1', 'taxon\1', 'taxon\\1', 'taxon\one', 'taxon\\one')
Warning messages:
1: '\o' is an unrecognized escape in a character string
2: unrecognized escape removed from "taxon\one"
> x
[1] "taxon\"1"   "taxon\001"  "taxon\\1"   "taxonone"   "taxon\\one"

Cheers,
David

On 29 Apr 2010, at 20:19, Mark Holder wrote:

Hi,
I'll chime in despite being fairly ignorant of a lot of the use cases for phylobase.

My experience dealing with files from users is that you have to be prepared for anything occurring in a taxon name.  So I'd say that you have to have a system in which the name can be any character string, not a system in which the taxon name has to be a valid variable name in R.  So I would recommend getting the character string from NCL, and then using whatever R-specific character escaping conventions are needed to generate the equivalent string literal in R (you'll have to use \ before any quote or \ characters).
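As a rough illustration of the escaping step described above, here is a minimal C++ sketch. The function name toRStringLiteral is hypothetical and is not part of NCL or phylobase. It assumes the target is a double-quoted R literal and only handles the two characters mentioned (double quote and backslash); control characters such as newlines would need extra cases.

```cpp
#include <string>

// Hypothetical helper (not part of NCL or phylobase): wrap a raw taxon name
// in double quotes and put a backslash before every '"' and '\' so that the
// result can be pasted into R source as a string literal.
std::string toRStringLiteral(const std::string& name) {
    std::string out = "\"";
    for (char c : name) {
        if (c == '"' || c == '\\') {
            out += '\\';  // escape character required by R
        }
        out += c;
    }
    out += '"';
    return out;
}
```

If the names are handed to R as character data through the C or Rcpp interface rather than as source text to be parsed, no escaping of this kind should be needed, since the bytes are never read as a literal.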

Phylobase *could* strip problematic symbols out of the names easily enough but:
1. it is confusing to users because the name of the taxon changes (usually only slightly, but it still changes), and
2. then you have to deal with name clashes (if a file has 'a b' and 'ab' as taxon names, then stripping spaces will cause a clash).
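The clash in point 2 can be reproduced with a small C++ sketch. Both function names are hypothetical, and stripping only spaces is an assumption made to keep the example short; any other lossy clean-up rule would show the same problem.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical clean-up rule: remove every space from a name.
std::string stripSpaces(const std::string& name) {
    std::string out;
    for (char c : name) {
        if (c != ' ') {
            out += c;
        }
    }
    return out;
}

// Return each stripped name that more than one input name maps to.
std::vector<std::string> findClashes(const std::vector<std::string>& names) {
    std::map<std::string, int> counts;
    for (const std::string& n : names) {
        ++counts[stripSpaces(n)];
    }
    std::vector<std::string> clashes;
    for (const auto& entry : counts) {
        if (entry.second > 1) {
            clashes.push_back(entry.first);
        }
    }
    return clashes;
}
```

With the names "a b", "ab" and "t1" as input, the only clash reported is "ab".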

NEXUS has syntactic rules for tokenizing.  They usually don't change the internal representation much, but NCL uses the syntactic rules to build a character string in memory.  In the following table I'll use double-quotes to delimit the character string that NCL would store in memory (the quotes are not part of the name).

Syntax                Internal representation
t1                    "t1"
't 2'                 "t 2"
t_2                   "t 2"
t.3                   "t.3"
't.3'                 "t.3"
t-4                   not a valid name; this is 3 tokens: "t", "-" and "4"
't-4'                 "t-4"
't''5 x'              "t'5 x"
't_6'                 "t_6"

Basically, the syntactic rules just let you group tokens with token-breaking symbols (white space and punctuation).  The exceptions are the cases t_2 and 't''5 x' above.  Both of those cases involve substitution (in the second case the internal pair of single quotes is collapsed to a single quote in the internal representation).
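The two substitution rules in the table can be sketched in C++ as follows. This is not NCL's actual code or API; nexusTokenToName is a hypothetical name, and the sketch assumes the input has already been split into a single token, so a case like t-4 (three tokens) never reaches it.

```cpp
#include <string>

// Hypothetical helper, not NCL's API: convert one NEXUS token, as written in
// the file, to the name that would be stored in memory.
std::string nexusTokenToName(const std::string& token) {
    std::string name;
    const bool quoted = token.size() >= 2 &&
                        token.front() == '\'' && token.back() == '\'';
    if (quoted) {
        // Drop the enclosing quotes; a doubled '' inside stands for one '.
        // Underscores inside quotes are kept as they are.
        for (std::string::size_type i = 1; i + 1 < token.size(); ++i) {
            name += token[i];
            if (token[i] == '\'' && i + 2 < token.size() &&
                token[i + 1] == '\'') {
                ++i;  // skip the second quote of the pair
            }
        }
    } else {
        // Unquoted: each underscore stands for a space.
        for (char c : token) {
            name += (c == '_') ? ' ' : c;
        }
    }
    return name;
}
```

For the rows above, t_2 and 't 2' both give "t 2", 't_6' stays "t_6", and 't''5 x' gives "t'5 x".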

What really confuses users is the fact that in some formats a_b might refer to the name "a_b", but in NEXUS it is "a b". So the same syntax has different interpretations under different formatting rules.  NCL uses the NEXUS tokenizing for NEXUS files and newick tree files (NCL supports fasta, phylip, and relaxed phylip formats, but does not impose NEXUS quoting rules on those formats). So if you use NCL to read a FASTA file and a NEXUS file, both of which use the name a_b (without any quotes), then in the taxa gleaned from the FASTA file the internal name will be "a_b", but the same name will have the internal representation "a b" when read from NEXUS.  That invariably trips people up, but I don't think that there is any way to avoid it because that behavior is mandated by both standards.

I think that if phylobase just includes a brief discussion of this issue in a README or other documentation, then that is about as good as you can do.

all the best,
Mark

On Apr 29, 2010, at 10:51 AM, Orme, David wrote:

Sorry - I take it back about the 'not problematic in a character string' - the quotes and the backslash are of course fairly problematic.

Cheers
David

On 29 Apr 2010, at 16:43, Orme, David wrote:

Hi,

I know this isn't the bible - but the PAUP manual specifies the following:

"Identifiers" are simply names given to taxa, characters, and other PAUP input elements such as character-sets, taxon-sets, and exclusion-sets. They may include any combination of upper- and lower-case alphabetic characters, digits, and punctuation. If the identifier contains any of the following characters:
( ) [ ] { } / \ , ; : = * ' " ` + - < >
or a blank, the entire identifier must be enclosed in single quotes.

They're going to be rare but they will happen. Any of those are problematic in a valid R name - although not in a character string. The taxon identifiers could come into R as character vectors just as they appear in the Nexus file (possibly stripping the enclosing single quotes). The next question is then whether we need them to be valid R names rather than character strings - in which case make.names() could be invoked.

Cheers,
David

On 28 Apr 2010, at 22:30, François Michonneau wrote:

Hi all,

 Sorry if this is a dumb question, but why do we need to remove spaces
and underscores from the species names when building the data frame? The
only character that I can think of that could be an issue is ", and I
don't think that it's allowed or used by software that reads NEXUS.

 In other words, do we really need to use RemoveUnderscoresAndSpaces in
NCLInterface.cpp?

 Thanks,
 -- François

On Mon, 2010-04-26 at 12:11 +0100, Orme, David wrote:
I'd guess we want the names to be syntactically valid R names - and ideally that would be through running make.names() across them. The problem is then that the NCLInterface can easily pass the raw PAUP identifiers for the data (which we can then make.names()) but that the tree input is currently via a text string. Again, probably easy enough to have the raw PAUP names in the string but these would be horrible to extract with regex. Is there any way that NCLInterface can pass the tree using numeric symbols and then pass a translate block as a vector? Then make.names() could be run easily on both the data names and the tree names...

Cheers,
David

On 23 Apr 2010, at 14:30, François Michonneau wrote:

Hi,

Ouch! We need to fix this.

There might be some hope if we use Rcpp to build the data frame
instead of building and parsing a string.

Let me talk to Dirk about it and see what we can do.

Cheers,
-- François

On Fri, 2010-04-23 at 13:59 +0100, Orme, David wrote:
Hi all,