No subject


Thu Apr 29 12:28:57 CEST 2010


Mark then Peter

The main potential problems that I see with the ways that phylobase is
using NCL now are:
1. In NCLInterface.cpp there are lots of calls to
RemoveUnderscoresAndSpaces to get rid of spaces and _ in names.  That
makes names easier to deal with, but at some point will bite you
(somebody will have a dataset with a taxon labelled "AB" and another
with "A B"; after transformation there will be a name clash).
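
The clash described in point 1 can be illustrated with a minimal sketch. It assumes a hypothetical helper, stripUnderscoresAndSpaces, that simply drops spaces and underscores; it is a stand-in for the idea and not the actual NCL or phylobase code. The second hypothetical function, findClashes, groups labels by their stripped form and keeps only the groups where two or more labels collide.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the stripping step discussed above; this is
// NOT the NCL/phylobase implementation, only the same idea.
std::string stripUnderscoresAndSpaces(const std::string& name) {
    std::string out;
    for (char c : name) {
        if (c != ' ' && c != '_') out.push_back(c);
    }
    return out;
}

// Map each stripped name to the input labels that collapse onto it,
// keeping only stripped names shared by two or more labels.
std::map<std::string, std::vector<std::string>>
findClashes(const std::vector<std::string>& labels) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const auto& label : labels) {
        groups[stripUnderscoresAndSpaces(label)].push_back(label);
    }
    for (auto it = groups.begin(); it != groups.end();) {
        if (it->second.size() < 2) {
            it = groups.erase(it);  // no collision for this stripped name
        } else {
            ++it;
        }
    }
    return groups;
}
```

Running a check like this before renaming would at least let the reader report the colliding labels instead of silently merging them.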

I agree that this is something to address.  Not only might there be
clashes, but changing names will be annoying to users.  Brian or Derrick
could answer better, but I assume this is because some of the code used
to parse the tree string can't handle the underscores and spaces.

Has just bitten me! There is a deeper problem here in that readNexus
uses the NCLInterface code to get the data frame as parsable R code -
with stripped spaces and underscores - but the tree block is passed over
as a block of raw text from the file. These names _aren't_ then stripped
of underscores and spaces by read.nexustreestring() and so the name
checking throws an error.
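
A minimal sketch of why the name check fails, under the assumption that the data names pass through a stripping step while the tree tip labels do not. Both stripLabel and unmatchedTips are hypothetical names used for illustration; neither is taken from phylobase or NCL.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stripping step, applied to the data-frame names only.
std::string stripLabel(const std::string& name) {
    std::string out;
    for (char c : name) {
        if (c != ' ' && c != '_') out.push_back(c);
    }
    return out;
}

// Return the tree tip labels that have no exact match among the data names.
std::vector<std::string> unmatchedTips(
        const std::vector<std::string>& dataNames,
        const std::vector<std::string>& tipLabels) {
    std::set<std::string> known(dataNames.begin(), dataNames.end());
    std::vector<std::string> missing;
    for (const auto& tip : tipLabels) {
        if (known.count(tip) == 0) missing.push_back(tip);
    }
    return missing;
}
```

If the data names are stripped and the tip labels are left raw, every label containing an underscore or space ends up in the unmatched list; treating both sides the same way (both raw or both stripped) leaves it empty.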

Obviously there is an ongoing deeper discussion about how to handle
passing the tree from NCL and how to handle the dismayingly wide range
of officially valid PAUP identifiers using regex, but currently we've
got a simpler problem of different handling. Underscores in names are
very commonly used to avoid the quoting problem with spaces, so I think
this current problem will come up a lot.

Cheers,
David






_______________________________________________
Phylobase-devl mailing list
Phylobase-devl at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl


Hi,

I know this isn't the bible - but the PAUP manual specifies the
following:

  "Identifiers" are simply names given to taxa, characters, and other
  PAUP input elements such as character-sets, taxon-sets, and
  exclusion-sets. They may include any combination of upper- and
  lower-case alphabetic characters, digits, and punctuation. If the
  identifier contains any of the following characters:
  ( ) [ ] { } / \ , ; : = * ' "` + - < >
  or a blank, the entire identifier must be enclosed in single quotes.

They're going to be rare but they will happen. Any of those are
problematic in a valid R name - although not in a character string. The
taxon identifiers could come into R as character vectors just as they
appear in the Nexus file (possibly stripping the enclosing single
quotes). The next question is then whether we need them to be valid R
names rather than character strings - in which case make.names() could
be invoked.

Cheers,
David


On 28 Apr 2010, at 22:30, François Michonneau wrote:

> Hi all,
>
>   Sorry if this is a dumb question, but why do we need to remove spaces
> and underscore from the species names when building the data frame? The
> only character that I can think of that could be an issue is ", and I
> don't think that it's allowed by software using NEXUS/used.
>
>   In other words, do we really need to use RemoveUnderscoresAndSpaces in
> NCLInterface.cpp?
>
>   Thanks,
>   -- François
>
> On Mon, 2010-04-26 at 12:11 +0100, Orme, David wrote:
>> I'd guess we want the names to be syntactically valid R names - and
>> ideally that would be through running make.names() across them. The
>> problem is then that the NCLInterface can easily pass the raw PAUP
>> identifiers for the data (which we can then make.names()) but that
>> the tree input is currently via a text string. Again, probably easy
>> enough to have the raw PAUP names in the string but these would be
>> horrible to extract with regex. Is there any way that NCLInterface
>> can pass the tree using numeric symbols and then pass a translate
>> block as a vector? Then make.names() could be run easily on both the
>> data names and the tree names...
>>
>> Cheers,
>> David
>>
>> On 23 Apr 2010, at 14:30, François Michonneau wrote:
>>
>>> Hi,
>>>
>>>  Ouch! We need to fix this.
>>>
>>>  There might be some hope if we use Rcpp to build the data frame
>>> instead of building and parsing a string.
>>>
>>>  Let me talk to Dirk about it and see what we can do.
>>>
>>>  Cheers,
>>>  -- François
>>>
>>> On Fri, 2010-04-23 at 13:59 +0100, Orme, David wrote:
>>>> Hi all,
>>>>
>>>> From an e-mail on 03/03/10:
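
The quoting rule from the PAUP manual excerpt can be sketched as a simple character test, together with the "strip the enclosing single quotes" step suggested in the reply. The function names needsSingleQuotes and stripEnclosingQuotes are hypothetical, and the sketch treats only the space character as a "blank" and does not attempt to handle quote characters embedded inside an identifier.

```cpp
#include <string>

// True if the identifier contains a blank or one of the punctuation
// characters listed in the PAUP manual excerpt, i.e. it would have to be
// enclosed in single quotes. Assumption: "blank" means a space only.
bool needsSingleQuotes(const std::string& id) {
    static const std::string special = "()[]{}/\\,;:=*'\"`+-<> ";
    return id.find_first_of(special) != std::string::npos;
}

// Remove one pair of enclosing single quotes, if present; otherwise
// return the identifier unchanged. Embedded quotes are left untouched.
std::string stripEnclosingQuotes(const std::string& id) {
    if (id.size() >= 2 && id.front() == '\'' && id.back() == '\'') {
        return id.substr(1, id.size() - 2);
    }
    return id;
}
```

Note that an underscore is not in the list, so a label such as A_B needs no quoting while A B does, which matches the observation in the thread that underscores are commonly used to avoid quoting.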

