[Phylobase-devl] New phylobase build approach using static libncl (Was: Rcpp and OS X compiliation)

François Michonneau francois.michonneau at gmail.com
Tue May 18 19:13:07 CEST 2010


Thanks Mark!

  I'm working on putting the code into phylobase and testing it. I'll keep
you all posted.

  Cheers,
  -- François 

On Tue, 2010-05-18 at 10:14 -0500, Mark Holder wrote:
> Hi,
> Sorry for the slow response.
> 
> 
> If you look at:
> 
> 
> URL:
> https://ncl.svn.sourceforge.net/svnroot/ncl/branches/v2.1/example/phylobaseinterface/NCLInterface.cpp
> URL:
> https://ncl.svn.sourceforge.net/svnroot/ncl/branches/v2.1/example/phylobaseinterface/NCLInterface.h
> SVN Revision: 524
> 
> 
> you'll see what I would do if I were a phylobase developer.
> 
> 
> I wrote a generic function for quoting NEXUS labels to R string
> literals. The function would work on most strings, but it does not
> deal with things (like alarms, backspace...) which are not expressible
> in NEXUS tokens.  Here it is:
> 
> 
> 
> 
> /************************************************************/
> #include <cstring>
> #include <string>
> 
> // Quote inp as a double-quoted R string literal: ', " and \ get a
> // leading backslash; newline and tab are written as \n and \t.
> std::string GetRStringLiteral(const std::string & inp)
> {
>     std::string withQuotes;
>     withQuotes.reserve(inp.length() + 4);
>     withQuotes.append(1, '"');
>     for (std::string::const_iterator sIt = inp.begin(); sIt != inp.end(); ++sIt)
>     {
>         if (std::strchr("'\"\\\n\t", *sIt) == NULL)
>             withQuotes.append(1, *sIt);      // ordinary character: copy as-is
>         else if (std::strchr("'\"\\", *sIt) != NULL)
>         {
>             withQuotes.append(1, '\\');      // quote or backslash: prefix with a backslash
>             withQuotes.append(1, *sIt);
>         }
>         else if (*sIt == '\t')
>         {
>             withQuotes.append(1, '\\');
>             withQuotes.append(1, 't');
>         }
>         else
>         {
>             withQuotes.append(1, '\\');      // only '\n' reaches this branch
>             withQuotes.append(1, 'n');
>         }
>     }
>     withQuotes.append(1, '"');
>     return withQuotes;
> }
> /************************************************************/
> 
> 
> 
> 
> Then instead of the construct:
> 
> 
> outputforR+='"';
> outputforR+=RemoveUnderscoresAndSpaces(taxa->GetTaxonLabel(taxon));
> outputforR+='"';
> 
> 
> 
> 
> I suggest just calling:
> 
> outputforR+=GetRStringLiteral(taxa->GetTaxonLabel(taxon));
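
For illustration, here is a minimal, self-contained sketch of how escaped labels might be joined into an R c(...) expression. EscapeForR and BuildRCharacterVector are hypothetical names, not part of NCL or phylobase; EscapeForR only restates the escaping rules of GetRStringLiteral so that the sketch compiles by itself.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper restating the escaping rules of GetRStringLiteral:
// wrap in double quotes; put a backslash before ', " and \; write
// newline and tab as \n and \t.
std::string EscapeForR(const std::string & inp)
{
    std::string out;
    out.reserve(inp.length() + 4);
    out.append(1, '"');
    for (std::string::const_iterator it = inp.begin(); it != inp.end(); ++it)
    {
        const char c = *it;
        if (c == '\n')
            out.append("\\n");
        else if (c == '\t')
            out.append("\\t");
        else
        {
            if (c == '\'' || c == '"' || c == '\\')
                out.append(1, '\\');
            out.append(1, c);
        }
    }
    out.append(1, '"');
    return out;
}

// Hypothetical: join the escaped labels into an R expression of the
// form c("label1", "label2", ...).
std::string BuildRCharacterVector(const std::vector<std::string> & labels)
{
    std::string out = "c(";
    for (std::size_t i = 0; i < labels.size(); ++i)
    {
        if (i > 0)
            out += ", ";
        out += EscapeForR(labels[i]);
    }
    out += ")";
    return out;
}
```

With the in-memory labels h uman, co'w and sh"eep this should give c("h uman", "co\'w", "sh\"eep"), which R parses back to the original three strings.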
> 
> 
> 
> 
> Note:
> To get this to look reasonable (I still have not fully tested it, but
> the output looks OK to me), I had to comment out a line like this (on
> lines 373 and 1164 of the version of NCLInterface.cpp that is in the
> phylobase repo):
> 
> 
> nexuscharacters=RemoveUnderscoresAndSpaces(nexuscharacters);
> 
> 
> Commenting this out means that labels that used to come across to R as
> dnaalignment1 would now come across as dna_alignment_1. This looks like
> it was an unintentional extra call to RemoveUnderscoresAndSpaces (the
> label dna_alignment_1 is written to nexuscharacters with the
> underscores, but then they are stripped).
> 
> 
> 
> 
> 
> 
> I tested the NCL code on the NEXUS content below, and it seemed to
> deal with the funky names.
> 
> 
> all the best,
> Mark Holder
> 
> 
> 
> 
> 
> 
> 
> 
> #NEXUS
> 
> 
> begin data;
>    dimensions ntax=17 nchar=432;
>    format datatype=dna missing=?;
>    matrix
>    'h uman'
> ctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaacgtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccagaggttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaaggtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggacaacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggcaaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaatgccctggcccacaagtatcac
>    t_arsier
> ctgactgctgaagagaaggccgccgtcactgccctgtggggcaaggtagacgtggaagatgttggtggtgaggccctgggcaggctgctggtcgtctacccatggacccagaggttctttgactcctttggggacctgtccactcctgccgctgttatgagcaatgctaaggtcaaggcccatggcaaaaaggtgctgaacgcctttagtgacggcatggctcatctggacaacctcaagggcacctttgctaagctgagtgagctgcactgtgacaaattgcacgtggatcctgagaatttcaggctcttgggcaatgtgctggtgtgtgtgctggcccaccactttggcaaagaattcaccccgcaggttcaggctgcctatcagaaggtggtggctggtgtggctactgccttggctcacaagtaccac
>    'b_ushbaby'
>  ctgactcctgatgagaagaatgccgtttgtgccctgtggggcaaggtgaatgtggaagaagttggtggtgaggccctgggcaggctgctggttgtctacccatggacccagaggttctttgactcctttggggacctgtcctctccttctgctgttatgggcaaccctaaagtgaaggcccacggcaagaaggtgctgagtgcctttagcgagggcctgaatcacctggacaacctcaagggcacctttgctaagctgagtgagctgcattgtgacaagctgcacgtggaccctgagaacttcaggctcctgggcaacgtgctggtggttgtcctggctcaccactttggcaaggatttcaccccacaggtgcaggctgcctatcagaaggtggtggctggtgtggctactgccctggctcacaaataccac
>    'ha re'
>  ctgtccggtgaggagaagtctgcggtcactgccctgtggggcaaggtgaatgtggaagaagttggtggtgagaccctgggcaggctgctggttgtctacccatggacccagaggttcttcgagtcctttggggacctgtccactgcttctgctgttatgggcaaccctaaggtgaaggctcatggcaagaaggtgctggctgccttcagtgagggtctgagtcacctggacaacctcaaaggcaccttcgctaagctgagtgaactgcattgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggttattgtgctgtctcatcactttggcaaagaattcactcctcaggtgcaggctgcctatcagaaggtggtggctggtgtggccaatgccctggctcacaaataccac
>    'ra\bbit'
>  ctgtccagtgaggagaagtctgcggtcactgccctgtggggcaaggtgaatgtggaagaagttggtggtgaggccctgggcaggctgctggttgtctacccatggacccagaggttcttcgagtcctttggggacctgtcctctgcaaatgctgttatgaacaatcctaaggtgaaggctcatggcaagaaggtgctggctgccttcagtgagggtctgagtcacctggacaacctcaaaggcacctttgctaagctgagtgaactgcactgtgacaagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggttattgtgctgtctcatcattttggcaaagaattcactcctcaggtgcaggctgcctatcagaaggtggtggctggtgtggccaatgccctggctcacaaataccac
>    'co''w'
> ctgactgctgaggagaaggctgccgtcaccgccttttggggcaaggtgaaagtggatgaagttggtggtgaggccctgggcaggctgctggttgtctacccctggactcagaggttctttgagtcctttggggacttgtccactgctgatgctgttatgaacaaccctaaggtgaaggcccatggcaagaaggtgctagattcctttagtaatggcatgaagcatctcgatgacctcaagggcacctttgctgcgctgagtgagctgcactgtgataagctgcatgtggatcctgagaacttcaagctcctgggcaacgtgctagtggttgtgctggctcgcaattttggcaaggaattcaccccggtgctgcaggctgactttcagaaggtggtggctggtgtggccaatgccctggcccacagatatcat
>    'sh"eep'
> ctgactgctgaggagaaggctgccgtcaccggcttctggggcaaggtgaaagtggatgaagttggtgctgaggccctgggcaggctgctggttgtctacccctggactcagaggttctttgagcactttggggacttgtccaatgctgatgctgttatgaacaaccctaaggtgaaggcccatggcaagaaggtgctagactcctttagtaacggcatgaagcatctcgatgacctcaagggcacctttgctcagctgagtgagctgcactgtgataagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgctggtggttgtgctggctcgccaccatggcaatgaattcaccccggtgctgcaggctgactttcagaaggtggtggctggtgttgccaatgccctggcccacaaatatcac
>    pig
> ctgtctgctgaggagaaggaggccgtcctcggcctgtggggcaaagtgaatgtggacgaagttggtggtgaggccctgggcaggctgctggttgtctacccctggactcagaggttcttcgagtcctttggggacctgtccaatgccgatgccgtcatgggcaatcccaaggtgaaggcccacggcaagaaggtgctccagtccttcagtgacggcctgaaacatctcgacaacctcaagggcacctttgctaagctgagcgagctgcactgtgaccagctgcacgtggatcctgagaacttcaggctcctgggcaacgtgatagtggttgttctggctcgccgccttggccatgacttcaacccgaatgtgcaggctgcttttcagaaggtggtggctggtgttgctaatgccctggcccacaagtaccac
>    elephseal
> ttgacggcggaggagaagtctgccgtcacctccctgtggggcaaagtgaaggtggatgaagttggtggtgaagccctgggcaggctgctggttgtctacccctggactcagaggttctttgactcctttggggacctgtcctctcctaatgctattatgagcaaccccaaggtcaaggcccatggcaagaaggtgctgaattcctttagtgatggcctgaagaatctggacaacctcaagggcacctttgctaagctcagtgagctgcactgtgaccagctgcatgtggatcccgagaacttcaagctcctgggcaatgtgctggtgtgtgtgctggcccgccactttggcaaggaattcaccccacagatgcagggtgcctttcagaaggtggtagctggtgtggccaatgccctcgcccacaaatatcac
>    rat
> ctaactgatgctgagaaggctgctgttaatgccctgtggggaaaggtgaaccctgatgatgttggtggcgaggccctgggcaggctgctggttgtctacccttggacccagaggtactttgatagctttggggacctgtcctctgcctctgctatcatgggtaaccctaaggtgaaggcccatggcaagaaggtgataaacgccttcaatgatggcctgaaacacttggacaacctcaagggcacctttgctcatctgagtgaactccactgtgacaagctgcatgtggatcctgagaacttcaggctcctgggcaatatgattgtgattgtgttgggccaccacctgggcaaggaattcaccccctgtgcacaggctgccttccagaaggtggtggctggagtggccagtgccctggctcacaagtaccac
>    mouse
> ctgactgatgctgagaagtctgctgtctcttgcctgtgggcaaaggtgaaccccgatgaagttggtggtgaggccctgggcaggctgctggttgtctacccttggacccagcggtactttgatagctttggagacctatcctctgcctctgctatcatgggtaatcccaaggtgaaggcccatggcaaaaaggtgataactgcctttaacgagggcctgaaaaacctggacaacctcaagggcacctttgccagcctcagtgagctccactgtgacaagctgcatgtggatcctgagaacttcaggctcctaggcaatgcgatcgtgattgtgctgggccaccacctgggcaaggatttcacccctgctgcacaggctgccttccagaaggtggtggctggagtggccactgccctggctcacaagtaccac
>    hamster
> ctgactgatgctgagaaggcccttgtcactggcctgtggggaaaggtgaacgccgatgcagttggcgctgaggccctgggcaggttgctggttgtctacccttggacccagaggttctttgaacactttggagacctgtctctgccagttgctgtcatgaataacccccaggtgaaggcccatggcaagaaggtgatccactccttcgctgatggcctgaaacacctggacaacctgaagggcgccttttccagcctgagtgagctccactgtgacaagctgcacgtggatcctgagaacttcaagctcctgggcaatatgatcatcattgtgctgatccacgacctgggcaaggacttcactcccagtgcacagtctgcctttcataaggtggtggctggtgtggccaatgccctggctcacaagtaccac
>    marsupial
> ttgacttctgaggagaagaactgcatcactaccatctggtctaaggtgcaggttgaccagactggtggtgaggcccttggcaggatgctcgttgtctacccctggaccaccaggttttttgggagctttggtgatctgtcctctcctggcgctgtcatgtcaaattctaaggttcaagcccatggtgctaaggtgttgacctccttcggtgaagcagtcaagcatttggacaacctgaagggtacttatgccaagttgagtgagctccactgtgacaagctgcatgtggaccctgagaacttcaagatgctggggaatatcattgtgatctgcctggctgagcactttggcaaggattttactcctgaatgtcaggttgcttggcagaagctcgtggctggagttgcccatgccctggcccacaagtaccac
>    duck
>  tggacagccgaggagaagcagctcatcaccggcctctggggcaaggtcaatgtggccgactgtggagctgaggccctggccaggctgctgatcgtctacccctggacccagaggttcttcgcctccttcgggaacctgtccagccccactgccatccttggcaaccccatggtccgtgcccatggcaagaaagtgctcacctccttcggagatgctgtgaagaacctggacaacatcaagaacaccttcgcccagctgtccgagctgcactgcgacaagctgcacgtggaccctgagaacttcaggctcctgggtgacatcctcatcatcgtcctggccgcccacttcaccaaggatttcactcctgactgccaggccgcctggcagaagctggtccgcgtggtggcccacgctctggcccgcaagtaccac
>    chicken
> tggactgctgaggagaagcagctcatcaccggcctctggggcaaggtcaatgtggccgaatgtggggccgaagccctggccaggctgctgatcgtctacccctggacccagaggttctttgcgtcctttgggaacctctccagccccactgccatccttggcaaccccatggtccgcgcccacggcaagaaagtgctcacctcctttggggatgctgtgaagaacctggacaacatcaagaacaccttctcccaactgtccgaactgcattgtgacaagctgcatgtggaccccgagaacttcaggctcctgggtgacatcctcatcattgtcctggccgcccacttcagcaaggacttcactcctgaatgccaggctgcctggcagaagctggtccgcgtggtggcccatgccctggctcgcaagtaccac
>    xenlaev
> tggacagctgaagagaaggccgccatcacttctgtatggcagaaggtcaatgtagaacatgatggccatgatgccctgggcaggctgctgattgtgtacccctggacccagagatacttcagtaactttggaaacctctccaattcagctgctgttgctggaaatgccaaggttcaagcccatggcaagaaggttctttcagctgttggcaatgccattagccatattgacagtgtgaagtcctctctccaacaactcagtaagatccatgccactgaactgtttgtggaccctgagaactttaagcgttttggtggagttctggtcattgtcttgggtgccaaactgggaactgccttcactcctaaagttcaggctgcttgggagaaattcattgcagttttggttgatggtcttagccagggctataac
>    xentrop
> tggacagctgaagaaaaagcaaccattgcttctgtgtgggggaaagtcgacattgaacaggatggccatgatgcattatccaggctgctggttgtttatccctggactcagaggtacttcagcagttttggaaacctctccaatgtctccgctgtctctggaaatgtcaaggttaaagcccatggaaataaagtcctgtcagctgttggcagtgcaatccagcatctggatgatgtgaagagccaccttaaaggtcttagcaagagccatgctgaggatcttcatgtggatcccgaaaacttcaagcgccttgcggatgttctggtgatcgttctggctgccaaacttggatctgccttcactccccaagtccaagctgtctgggagaagctcaatgcaactctggtggctgctcttagccatggctacttc
>    ;
> end;
> 
> 
> begin mrbayes;
>    [The following block illustrates how to set up two data partitions
>     and use different models for the different partitions.]
>    charset non_coding = 1-90 358-432;
>    charset coding     = 91-357;
>    partition region = 2:non_coding,coding;
>    set partition = region;
>    
>    [The following lines set a codon model for the second data partition
>     (coding) and allows the non_coding and coding partitions to have
>     different overall rates.]
>    lset applyto=(2) nucmodel=codon;
>    prset ratepr=variable;
>    
>    [Codon models are computationally complex so the following lines set
>     the parameters of the MCMC such that only 1 chain is run for 100
>     generations and results are printed to screen and to file every tenth
>     generation. To start this chain, you need to type 'mcmc' after
>     executing this block. You need to run the chain longer to get
>     adequate convergence.]
>    mcmcp ngen=100 nchains=1 printfreq=10 samplefreq=10;
> end;
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> [################################NEXUS############################################]
> 
> 
> 
> On May 5, 2010, at 4:54 AM, Orme, David wrote:
> 
> > Hi,
> > 
> > 
> > I'd agree with this - we have to be able to cope with any taxon
> > names and, ideally, if we can avoid having to make the taxon names
> > syntactically valid R, then we should change them as little as
> > possible (apart from the obligatory unquoted underscore to space
> > translation) and keep them as character strings.
> > 
> > 
> > What would NCL pass as a token given these two?
> > 
> > 
> > 'taxon\1'
> > 'taxon"1'
> > 
> > 
> > I think these are the problem cases - R needs to escape the
> > backslash and double quote, but handles the double quote better than
> > the backslash:
> > 
> > 
> > > x <- c('taxon"1', 'taxon\1', 'taxon\\1', 'taxon\one', 'taxon\\one')
> > Warning messages:
> > 1: '\o' is an unrecognized escape in a character string 
> > 2: unrecognized escape removed from "taxon\one" 
> > > x
> > [1] "taxon\"1"   "taxon\001"  "taxon\\1"   "taxonone"   "taxon\\one"
> > 
> > 
> > Cheers,
> > David
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On 29 Apr 2010, at 20:19, Mark Holder wrote:
> > 
> > > Hi,
> > > I'll chime in despite being fairly ignorant of a lot of the use
> > > cases for phylobase.
> > > 
> > > 
> > > My experience dealing with files from users is that you have to be
> > > prepared for anything occurring in a taxon name.  So I'd say that
> > > you have to have a system in which the name can be any character
> > > string, not a system in which the taxon name can be a variable
> > > name in R.  So I would recommend getting the character string from
> > > NCL, and then using whatever R-specific character escaping
> > > conventions are needed to generate the equivalent string literal
> > > in R (you'll have to use \ before any quote or \ characters).
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > Phylobase *could* strip problematic symbols out of the names
> > > easily enough but:
> > > 1. it is confusing to users because the name of the taxon changes
> > > (usually only slightly, but it still changes), and
> > > 2. then you have to deal with name clashes (if a file has 'a b'
> > > and 'ab' as taxa names, then stripping spaces will cause a clash).
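
The clash in point 2 can be sketched as follows. StripSpacesAndUnderscores is a hypothetical stand-in for a stripping step (it is not the RemoveUnderscoresAndSpaces code from NCLInterface.cpp), and FindClashes is likewise an illustrative name.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a stripping step: drop every space and
// underscore from the name.
std::string StripSpacesAndUnderscores(const std::string & name)
{
    std::string out;
    for (std::string::const_iterator it = name.begin(); it != name.end(); ++it)
        if (*it != ' ' && *it != '_')
            out.append(1, *it);
    return out;
}

// Returns every stripped name that more than one input name maps to.
std::vector<std::string> FindClashes(const std::vector<std::string> & names)
{
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i < names.size(); ++i)
        ++counts[StripSpacesAndUnderscores(names[i])];

    std::vector<std::string> clashes;
    for (std::map<std::string, int>::const_iterator it = counts.begin();
         it != counts.end(); ++it)
        if (it->second > 1)
            clashes.push_back(it->first);
    return clashes;
}
```

For the taxa 'a b', 'ab' and 'pig', FindClashes reports the single clash "ab"; keeping the names untouched as character strings avoids the problem entirely.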
> > > 
> > > 
> > > NEXUS has syntactic rules for tokenizing.  They usually don't
> > > change the internal representation much, but NCL uses the
> > > syntactic rules to build a character string in memory.  In the
> > > following table I'll use double-quotes to delimit the character
> > > string that NCL would store in memory (the quotes are not part of
> > > the name).
> > > 
> > > 
> > > Syntax                Internal representation
> > > t1                    "t1"
> > > 't 2'                 "t 2"
> > > t_2                   "t 2"
> > > t.3                   "t.3"
> > > 't.3'                 "t.3"
> > > t-4                   not a valid name; this is 3 tokens: "t", "-" and "4"
> > > 't-4'                 "t-4"
> > > 't''5 x'              "t'5 x"
> > > 't_6'                 "t_6" 
> > > 
> > > 
> > > Basically, the syntactic rules just let you group tokens with
> > > token-breaking symbols (white space and punctuation).  The
> > > exceptions are the cases t_2 and 't''5 x' above.  Both of those
> > > cases involve substitution (in the first case the unquoted
> > > underscore becomes a space; in the second case the internal pair of
> > > single-quotes is collapsed to a single quote in the internal
> > > representation).
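
The rows of the table can be reproduced with a simplified sketch like the one below. It assumes the token has already been isolated (so it does not split t-4 into three tokens), and NexusTokenToInternal is a hypothetical name; this is not NCL's actual tokenizer.

```cpp
#include <string>

// Simplified sketch: convert one already-isolated NEXUS token into the
// character string that would be stored in memory.
std::string NexusTokenToInternal(const std::string & token)
{
    const std::string::size_type n = token.length();
    std::string out;
    if (n >= 2 && token[0] == '\'' && token[n - 1] == '\'')
    {
        // Quoted token: drop the enclosing single quotes, keep everything
        // else verbatim, and collapse each doubled '' to a single '.
        for (std::string::size_type i = 1; i + 1 < n; ++i)
        {
            out.append(1, token[i]);
            if (token[i] == '\'' && i + 2 < n && token[i + 1] == '\'')
                ++i;  // skip the second quote of the doubled pair
        }
        return out;
    }
    // Unquoted token: each underscore stands for a space.
    for (std::string::size_type i = 0; i < n; ++i)
        out.append(1, token[i] == '_' ? ' ' : token[i]);
    return out;
}
```

Here t_2 and 't 2' both map to "t 2", while 't_6' keeps its underscore because the quotes protect it.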
> > > 
> > > 
> > > 
> > > 
> > > What really confuses users is the fact that in some formats a_b
> > > might refer to the name "a_b", but in NEXUS it is "a b". So the
> > > same syntax has different interpretation in different formatting
> > > rules.  NCL uses the NEXUS tokenizing for NEXUS files and newick
> > > tree files (NCL supports fasta, phylip, and relaxed phylip
> > > formats, but does not impose NEXUS quoting rules on those
> > > formats). So if you use NCL to read a FASTA and NEXUS file both of
> > > which use the name a_b (without any quotes) then in the taxa
> > > gleaned from the FASTA file, the internal name will be "a_b", but
> > > the same name will have the internal representation of "a b" when
> > > read from NEXUS.  That invariably trips people up, but I don't
> > > think that there is any way to avoid it because that behavior is
> > > mandated by both standards.
> > > 
> > > 
> > > I think that if phylobase just includes a brief discussion of this
> > > issue in a README or other documentation, then that is about as
> > > good as you can do.
> > > 
> > > 
> > > 
> > > 
> > > all the best,
> > > Mark
> > > 
> > > 
> > > On Apr 29, 2010, at 10:51 AM, Orme, David wrote:
> > > 
> > > > Sorry - I take it back about the 'not problematic in a character
> > > > string' - the quotes and the backslash are of course fairly
> > > > problematic.
> > > > 
> > > > 
> > > > Cheers
> > > > David
> > > > 
> > > > On 29 Apr 2010, at 16:43, Orme, David wrote:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > 
> > > > > I know this isn't the bible - but the PAUP manual specifies
> > > > > the following:
> > > > > 
> > > > > 
> > > > > "Identifiers" are simply names given to taxa, characters, and
> > > > > other PAUP input elements such as character-sets, taxon-sets,
> > > > > and exclusion-sets. They may include any combination of upper-
> > > > > and lower-case alphabetic characters, digits, and punctuation.
> > > > > If the identifier contains any of the following characters:
> > > > > ( ) [ ] { } / \ , ; : = * ' "` + - < >
> > > > > or a blank, the entire identifier must be enclosed in single
> > > > > quotes.
> > > > > 
> > > > > 
> > > > > They're going to be rare but they will happen. Any of those
> > > > > are problematic in a valid R name - although not in a
> > > > > character string. The taxon identifiers could come into R as
> > > > > character vectors just as they appear in the Nexus file
> > > > > (possibly stripping the enclosing single quotes). The next
> > > > > question is then whether we need them to be valid R names
> > > > > rather than character strings - in which case make.names()
> > > > > could be invoked.
> > > > > 
> > > > > 
> > > > > Cheers,
> > > > > David
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > On 28 Apr 2010, at 22:30, François Michonneau wrote:
> > > > > 
> > > > > > 
> > > > > > Hi all,
> > > > > > 
> > > > > >  Sorry if this is a dumb question, but why do we need to remove
> > > > > > spaces and underscores from the species names when building the
> > > > > > data frame? The only character that I can think of that could be
> > > > > > an issue is ", and I don't think that it's allowed/used by
> > > > > > software using NEXUS.
> > > > > > 
> > > > > >  In other words, do we really need to use
> > > > > > RemoveUnderscoresAndSpaces in NCLInterface.cpp?
> > > > > > 
> > > > > >  Thanks,
> > > > > >  -- François 
> > > > > > 
> > > > > > On Mon, 2010-04-26 at 12:11 +0100, Orme, David wrote:
> > > > > > > I'd guess we want the names to be syntactically valid R
> > > > > > > names - and ideally that would be through running
> > > > > > > make.names() across them. The problem is then that the
> > > > > > > NCLInterface can easily pass the raw PAUP identifiers for
> > > > > > > the data (which we can then make.names()) but that the
> > > > > > > tree input is currently via a text string. Again, probably
> > > > > > > easy enough to have the raw PAUP names in the string but
> > > > > > > these would be horrible to extract with regex. Is there
> > > > > > > any way that NCLInterface can pass the tree using numeric
> > > > > > > symbols and then pass a translate block as a vector? Then
> > > > > > > make.names() could be run easily on both the data names
> > > > > > > and the tree names...
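
A rough sketch of the numeric-symbols idea: write the tree with 1-based taxon numbers and hand the labels over separately as a plain vector, so that position i in the vector plays the role of translate key i. TreeNode, WriteNumericNewick and NumericNewick are hypothetical names and a made-up tree representation, not NCL or phylobase code.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical minimal tree node stored in a flat vector of nodes.
struct TreeNode
{
    std::vector<std::size_t> children;  // indices into the node vector; empty for a leaf
    std::size_t taxonIndex;             // 0-based index into the label vector (leaves only)
    TreeNode() : taxonIndex(0) {}
};

// Write the subtree rooted at nodeIndex in Newick form, using the 1-based
// taxon number in place of the label.
void WriteNumericNewick(const std::vector<TreeNode> & nodes,
                        std::size_t nodeIndex, std::ostream & out)
{
    const TreeNode & node = nodes[nodeIndex];
    if (node.children.empty())
    {
        out << (node.taxonIndex + 1);
        return;
    }
    out << '(';
    for (std::size_t i = 0; i < node.children.size(); ++i)
    {
        if (i > 0)
            out << ',';
        WriteNumericNewick(nodes, node.children[i], out);
    }
    out << ')';
}

// Returns the whole tree as a numeric Newick string ending in ';'.
std::string NumericNewick(const std::vector<TreeNode> & nodes, std::size_t root)
{
    std::ostringstream out;
    WriteNumericNewick(nodes, root, out);
    out << ';';
    return out.str();
}
```

Because the tree string then contains only digits, parentheses and commas, no regex has to cope with quoted labels, and the same label vector can serve both the data and the tree.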
> > > > > > > 
> > > > > > > Cheers,
> > > > > > > David
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > On 23 Apr 2010, at 14:30, François Michonneau wrote:
> > > > > > > 
> > > > > > > > 
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > Ouch! We need to fix this.
> > > > > > > > 
> > > > > > > > There might be some hope if we use Rcpp to build the
> > > > > > > > data frame
> > > > > > > > instead of building and parsing a string.
> > > > > > > > 
> > > > > > > > Let me talk to Dirk about it and see what we can do.
> > > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > -- François
> > > > > > > > 
> > > > > > > > On Fri, 2010-04-23 at 13:59 +0100, Orme, David wrote:
> > > > > > > > > Hi all,
> > > > > > > > > 
> > > > > > > > > From an e-mail on 03/03/10:
> > > > > > > > > 
> > > > > > > > > Mark then Peter
> > > > > > > > > 
> > > > > > > > > > > The main potential problems that I see with the
> > > > > > > > > > > ways that phylobase is using NCL now are:
> > > > > > > > > > > 1. in NCLInterface.cpp there are lots of calls to
> > > > > > > > > > > RemoveUnderscoresAndSpaces to get rid of spaces
> > > > > > > > > > > and _ in names.  That makes names easier to deal
> > > > > > > > > > > with, but at some point will bite you (somebody
> > > > > > > > > > > will have a dataset with a taxon labelled "AB" and
> > > > > > > > > > > another with "A B"; after transformation there
> > > > > > > > > > > will be a name clash).
> > > > > > > > > > 
> > > > > > > > > > I agree that this is something to address.  Not only
> > > > > > > > > > might there be clashes, but changing names will be
> > > > > > > > > > annoying to users.  Brian or Derrick could answer
> > > > > > > > > > better, but I assume this is because some of the
> > > > > > > > > > code used to parse the tree string can't handle the
> > > > > > > > > > underscores and spaces.
> > > > > > > > > 
> > > > > > > > > Has just bitten me! There is a deeper problem here in
> > > > > > > > > that readNexus uses the NCLInterface code to get the
> > > > > > > > > data frame as parsable R code - with stripped spaces
> > > > > > > > > and underscores - but the tree block is passed over as
> > > > > > > > > a block of raw text from the file. These names
> > > > > > > > > _aren't_ then stripped of underscores and spaces by
> > > > > > > > > read.nexustreestring() and so the name checking throws
> > > > > > > > > an error. 
> > > > > > > > > 
> > > > > > > > > Obviously there is an ongoing deeper discussion about
> > > > > > > > > how to handle passing the tree from NCL and how to
> > > > > > > > > handle the dismayingly wide range of official valid
> > > > > > > > > PAUP identifiers using regex but currently we've got a
> > > > > > > > > simpler problem of different handling. Underscores in
> > > > > > > > > names are very commonly used to avoid the quoting
> > > > > > > > > problem with spaces so I think this current problem
> > > > > > > > > will come up a lot. 
> > > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > David
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > _______________________________________________
> > > > > > > > > Phylobase-devl mailing list
> > > > > > > > > Phylobase-devl at lists.r-forge.r-project.org
> > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
> > > > > > > > 
> > > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > 
> > > Mark Holder
> > > 
> > > 
> > > mtholder at ku.edu
> > > http://phylo.bio.ku.edu/mark-holder
> > > 
> > > 
> > > ==============================================
> > > Department of Ecology and Evolutionary Biology
> > > University of Kansas
> > > 6031 Haworth Hall
> > > 1200 Sunnyside Avenue
> > > Lawrence, Kansas 66045
> > > 
> > > 
> > > lab phone:  785.864.5789
> > > 
> > > 
> > > fax (shared): 785.864.5860
> > > ==============================================
> > 
> > 
> 
> 