[Phylobase-devl] New phylobase build approach using static libncl (Was: Rcpp and OS X compiliation)

Thu Apr 29 21:19:12 CEST 2010

Hi,
	I'll chime in despite being fairly ignorant of a lot of the use cases for phylobase.

My experience dealing with files from users is that you have to be prepared for anything occurring in a taxon name.  So I'd say that you need a system in which the name can be any character string, not one in which the taxon name has to be a valid variable name in R.  I would recommend getting the character string from NCL, and then using whatever R-specific escaping conventions are needed to generate the equivalent string literal in R (you'll have to put a \ before any quote or \ character).
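To make the escaping step concrete, here is a minimal sketch. It is written in Python purely for brevity and is not phylobase or NCL code; the function name to_r_string_literal is hypothetical. It assumes the name contains no control characters (newlines, tabs), which would need their own escapes in R.

```python
def to_r_string_literal(name: str) -> str:
    """Return R source text for a double-quoted string literal equal to `name`.

    Hypothetical helper: only backslash and double-quote are escaped.
    """
    # Escape backslashes first, so the backslashes added for quotes
    # in the next step are not themselves doubled.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


if __name__ == "__main__":
    for taxon in ["t 2", "t'5 x", 'odd"name', "back\\slash"]:
        print(taxon, "->", to_r_string_literal(taxon))
```

Single quotes need no escaping here because the literal is delimited with double quotes.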

Phylobase *could* strip problematic symbols out of the names easily enough, but:
	1. it is confusing to users because the name of the taxon changes (usually only slightly, but it still changes), and
	2. you then have to deal with name clashes (if a file has 'a b' and 'ab' as taxon names, then stripping spaces will cause a clash).
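The clash in point 2 is easy to demonstrate. The sketch below (Python, illustrative only; find_clashes is a hypothetical name, and stripping spaces and underscores is my assumption about what a stripping step would remove) groups names by their stripped form and reports any group with more than one member.

```python
from collections import defaultdict


def find_clashes(names, strip_chars=" _"):
    """Map each stripped name to the original names that collapse onto it.

    Only stripped names shared by two or more originals are returned.
    """
    groups = defaultdict(list)
    for name in names:
        stripped = "".join(ch for ch in name if ch not in strip_chars)
        groups[stripped].append(name)
    return {key: originals for key, originals in groups.items() if len(originals) > 1}


if __name__ == "__main__":
    print(find_clashes(["a b", "ab", "c d"]))
```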

NEXUS has syntactic rules for tokenizing.  They usually don't change the internal representation much, but NCL uses the syntactic rules to build a character string in memory.  In the following table I'll use double-quotes to delimit the character string that NCL would store in memory (the quotes are not part of the name).

Syntax                Internal representation
t1                    "t1"
't 2'                 "t 2"
t_2                   "t 2"
t.3                   "t.3"
't.3'                 "t.3"
t-4                   not a valid name; this is 3 tokens: "t", "-", and "4"
't-4'                 "t-4"
't''5 x'              "t'5 x"
't_6'                 "t_6" 

Basically, the syntactic rules just let you group text that contains token-breaking symbols (white space and punctuation) into a single token.  The exceptions are the cases t_2 and 't''5 x' above.  Both of those cases involve substitution (in the first case the unquoted underscore becomes a space; in the second case the internal pair of single quotes is collapsed to one single quote in the internal representation).
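The mapping in the table can be sketched in a few lines. This is Python for illustration, not how NCL implements it; nexus_token_to_name is a hypothetical name, and the function assumes it is handed exactly one already-tokenized label (so it does not detect that unquoted t-4 is really three tokens).

```python
def nexus_token_to_name(token: str) -> str:
    """Convert one NEXUS label token to the string stored in memory.

    Assumes `token` is a single, already-separated token.
    """
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        # Quoted token: drop the enclosing quotes and collapse each
        # doubled single quote to one. Underscores are kept as-is.
        return token[1:-1].replace("''", "'")
    # Unquoted token: an underscore stands for a space.
    return token.replace("_", " ")


if __name__ == "__main__":
    for tok in ["t1", "'t 2'", "t_2", "t.3", "'t-4'", "'t''5 x'", "'t_6'"]:
        print(tok.ljust(12), repr(nexus_token_to_name(tok)))
```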

What really confuses users is that in some formats a_b refers to the name "a_b", but in NEXUS it is "a b".  So the same syntax is interpreted differently under different formatting rules.  NCL uses the NEXUS tokenizing for NEXUS files and newick tree files (NCL also supports fasta, phylip, and relaxed phylip formats, but does not impose NEXUS quoting rules on those formats).  So if you use NCL to read a FASTA file and a NEXUS file that both use the name a_b (without any quotes), the taxon gleaned from the FASTA file will have the internal name "a_b", but the same name will have the internal representation "a b" when read from NEXUS.  That invariably trips people up, but I don't think that there is any way to avoid it because that behavior is mandated by both standards.
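For documentation purposes, the format dependence could be illustrated like this. It is a Python sketch with hypothetical names (internal_name_for_unquoted, the format strings); it covers only unquoted labels and is not NCL's API.

```python
# Formats for which NEXUS tokenizing applies, per the description above.
NEXUS_STYLE_FORMATS = {"nexus", "newick"}


def internal_name_for_unquoted(label: str, file_format: str) -> str:
    """Return the in-memory name for an unquoted label read from `file_format`."""
    if file_format.lower() in NEXUS_STYLE_FORMATS:
        # NEXUS rule: an unquoted underscore means a space.
        return label.replace("_", " ")
    # fasta, phylip, relaxed phylip: the label is taken verbatim.
    return label


if __name__ == "__main__":
    for fmt in ["fasta", "nexus"]:
        print(fmt, "->", repr(internal_name_for_unquoted("a_b", fmt)))
```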

I think that if phylobase just includes a brief discussion of this issue in a README or other documentation, then that is about as good as you can do.

all the best,
Mark

On Apr 29, 2010, at 10:51 AM, Orme, David wrote:

> Sorry - I take it back about the 'not problematic in a character string' - the quotes and the backslash are of course fairly problematic.
> 
> Cheers
> David
> 
> On 29 Apr 2010, at 16:43, Orme, David wrote:
> 
>> Hi,
>> 
>> I know this isn't the bible - but the PAUP manual specifies the following:
>> 
>> "Identifiers" are simply names given to taxa, characters, and other PAUP input elements such as character-sets, taxon-sets, and exclusion-sets. They may include any combination of upper- and lower-case alphabetic characters, digits, and punctuation. If the identifier contains any of the following characters:
>> ( ) [ ] { } / \ , ; : = * ' "` + - < >
>> or a blank, the entire identifier must be enclosed in single quotes.
>> 
>> They're going to be rare but they will happen. Any of those are problematic in a valid R name - although not in a character string. The taxon identifiers could come into R as character vectors just as they appear in the Nexus file (possibly stripping the enclosing single quotes). The next question is then whether we need them to be valid R names rather than character strings - in which case make.names() could be invoked.
>> 
>> Cheers,
>> David
>> 
>> 
>> On 28 Apr 2010, at 22:30, François Michonneau wrote:
>> 
>>> 
>>> Hi all,
>>> 
>>>  Sorry if this is a dumb question, but why do we need to remove spaces
>>> and underscore from the species names when building the data frame? The
>>> only character that I can think of that could be an issue is ", and I
>>> don't think that it's allowed by software using NEXUS/used.
>>> 
>>>  In other words, do we really need to use RemoveUnderscoresAndSpaces in
>>> NCLInterface.cpp?
>>> 
>>>  Thanks,
>>>  -- François 
>>> 
>>> On Mon, 2010-04-26 at 12:11 +0100, Orme, David wrote:
>>>> I'd guess we want the names to be syntactically valid R names - and ideally that would be through running make.names() across them. The problem is then that the NCLInterface can easily pass the raw PAUP identifiers for the data (which we can then make.names()) but that the tree input is currently via a text string. Again, probably easy enough to have the raw PAUP names in the string but these would be horrible to extract with regex. Is there any way that NCLInterface can pass the tree using numeric symbols and then pass a translate block as a vector? Then make.names() could be run easily on both the data names and the tree names...
>>>> 
>>>> Cheers,
>>>> David
>>>> 
>>>> 
>>>> 
>>>> On 23 Apr 2010, at 14:30, François Michonneau wrote:
>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Ouch! We need to fix this.
>>>>> 
>>>>> There might be some hope if we use Rcpp to build the data frame
>>>>> instead of building and parsing a string.
>>>>> 
>>>>> Let me talk to Dirk about it and see what we can do.
>>>>> 
>>>>> Cheers,
>>>>> -- François
>>>>> 
>>>>> On Fri, 2010-04-23 at 13:59 +0100, Orme, David wrote:
>>>>>> Hi all,
>>>>>> 
>>>>>> From an e-mail on 03/03/10:
>>>>>> 
>>>>>> Mark then Peter
>>>>>> 
>>>>>>>> The main potential problems that I see with the ways that phylobase is using NCL now are:
>>>>>>>> 	1. in NCLInterface.cpp there are lots of calls to RemoveUnderscoresAndSpaces to get rid of spaces and _ in names.  That makes names easier to deal with, but at some point will bite you (somebody will have a dataset with a taxon labelled "AB" and another with "A B"; after transformation there will be a name clash).
>>>>>>> 
>>>>>>> I agree that this is something to address.  Not only might there be clashes, but changing names will be annoying to users.  Brian or Derrick could answer better, but I assume this is because some of the code used to parse the tree string can't handle the underscores and spaces.
>>>>>> 
>>>>>> Has just bitten me! There is a deeper problem here in that readNexus uses the NCLInterface code to get the data frame as parsable R code - with stripped spaces and underscores - but the tree block is passed over as a block of raw text from the file. These names _aren't_ then stripped of underscores and spaces by read.nexustreestring() and so the name checking throws an error. 
>>>>>> 
>>>>>> Obviously there is an ongoing deeper discussion about how to handle passing the tree from NCL and how to handle the dismayingly wide range of official valid PAUP identifiers using regex but currently we've got a simpler problem of different handling. Underscores in names are very commonly used to avoid the quoting problem with spaces so I think this current problem will come up a lot. 
>>>>>> 
>>>>>> Cheers,
>>>>>> David
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> Phylobase-devl mailing list
>>>>>> Phylobase-devl at lists.r-forge.r-project.org
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
>>>>> 
>>>> 
>> 
> 

Mark Holder

mtholder at ku.edu
http://phylo.bio.ku.edu/mark-holder

==============================================
Department of Ecology and Evolutionary Biology
University of Kansas
6031 Haworth Hall
1200 Sunnyside Avenue
Lawrence, Kansas 66045

lab phone:  785.864.5789

fax (shared): 785.864.5860
==============================================
