No subject


Thu Apr 29 12:28:57 CEST 2010


Mark then Peter

>> The main potential problems that I see with the ways that phylobase is using NCL now are:
>> 1. in NCLInterface.cpp there are lots of calls to RemoveUnderscoresAndSpaces to get rid of spaces and _ in names.  That makes names easier to deal with, but at some point will bite you (somebody will have a dataset with a taxon labelled "AB" and another with "A B", after transformation there will be a name clash).
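
The clash described in point 1 can be reproduced in a few lines. This is only a sketch in Python, not the NCLInterface.cpp code: `remove_underscores_and_spaces` is a hypothetical stand-in that is assumed simply to delete every space and underscore, and the real C++ routine may behave differently.

```python
from collections import defaultdict


def remove_underscores_and_spaces(label):
    """Hypothetical stand-in: delete every space and underscore from a label."""
    return label.replace(" ", "").replace("_", "")


def find_clashes(labels):
    """Group the original labels by their stripped form.

    Returns only the groups where two or more distinct original labels
    collapse onto the same stripped name.
    """
    groups = defaultdict(list)
    for label in labels:
        groups[remove_underscores_and_spaces(label)].append(label)
    return {stripped: originals
            for stripped, originals in groups.items()
            if len(originals) > 1}


# Illustrative labels, including the "AB" / "A B" pair from the text.
taxa = ["AB", "A B", "A_B", "C D"]
clashes = find_clashes(taxa)
```

Running a check like `find_clashes` before any stripping would at least let the reader fail with a clear message instead of silently merging taxa.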

> I agree that this is something to address.  Not only might there be clashes, but changing names will be annoying to users.  Brian or Derrick could answer better, but I assume this is because some of the code used to parse the tree string can't handle the underscores and spaces.

Has just bitten me! There is a deeper problem here in that readNexus uses the NCLInterface code to get the data frame as parsable R code - with stripped spaces and underscores - but the tree block is passed over as a block of raw text from the file. These names _aren't_ then stripped of underscores and spaces by read.nexustreestring() and so the name checking throws an error.
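
A rough Python sketch of that mismatch, assuming the data-side labels have spaces and underscores deleted while the tree-side labels arrive untouched. The function names and taxon labels are made up for illustration; they are not phylobase or NCL API.

```python
def strip_label(label):
    """Assumed data-side transformation: drop spaces and underscores."""
    return label.replace(" ", "").replace("_", "")


def unmatched_labels(data_labels, tree_labels):
    """Return the tree labels that have no exact match among the data labels."""
    data = set(data_labels)
    return sorted(t for t in tree_labels if t not in data)


# Illustrative labels as they might appear in a Nexus file.
raw = ["Homo_sapiens", "Pan_troglodytes", "Gorilla"]

data_labels = [strip_label(x) for x in raw]  # data block: stripped
tree_labels = list(raw)                      # tree block: raw text

# Different handling on the two sides: two labels fail to match.
missing = unmatched_labels(data_labels, tree_labels)

# Same handling on both sides: everything matches again.
consistent = unmatched_labels(data_labels,
                              [strip_label(x) for x in tree_labels])
```

Whichever transformation is chosen, applying it identically to both the data labels and the tree labels is what removes this particular error.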

Obviously there is an ongoing deeper discussion about how to handle passing the tree from NCL and how to handle the dismayingly wide range of official valid PAUP identifiers using regex but currently we've got a simpler problem of different handling. Underscores in names are very commonly used to avoid the quoting problem with spaces so I think this current problem will come up a lot.

Cheers,
David






_______________________________________________
Phylobase-devl mailing list
Phylobase-devl at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl




Sorry - I take it back about the 'not problematic in a character string' - the quotes and the backslash are of course fairly problematic.

Cheers
David

On 29 Apr 2010, at 16:43, Orme, David wrote:

> Hi,
>
> I know this isn't the bible - but the PAUP manual specifies the following:
>
>   "Identifiers" are simply names given to taxa, characters, and other PAUP input elements such as character-sets, taxon-sets, and exclusion-sets. They may include any combination of upper- and lower-case alphabetic characters, digits, and punctuation. If the identifier contains any of the following characters:
>   ( ) [ ] { } / \ , ; : = * ' "` + - < >
>   or a blank, the entire identifier must be enclosed in single quotes.
>
> They're going to be rare but they will happen. Any of those are problematic in a valid R name - although not in a character string. The taxon identifiers could come into R as character vectors just as they appear in the Nexus file (possibly stripping the enclosing single quotes). The next question is then whether we need them to be valid R names rather than character strings - in which case make.names() could be invoked.
>
> Cheers,
> David
>
> On 28 Apr 2010, at 22:30, François Michonneau wrote:
>
>> Hi all,
>>
>>   Sorry if this is a dumb question, but why do we need to remove spaces and underscore from the species names when building the data frame? The only character that I can think of that could be an issue is ", and I don't think that it's allowed by software using NEXUS/used.
>>
>>   In other words, do we really need to use RemoveUnderscoresAndSpaces in NCLInterface.cpp?
>>
>>   Thanks,
>>   -- François
>>
>> On Mon, 2010-04-26 at 12:11 +0100, Orme, David wrote:
>>> I'd guess we want the names to be syntactically valid R names - and ideally that would be through running make.names() across them. The problem is then that the NCLInterface can easily pass the raw PAUP identifiers for the data (which we can then make.names()) but that the tree input is currently via a text string. Again, probably easy enough to have the raw PAUP names in the string but these would be horrible to extract with regex. Is there any way that NCLInterface can pass the tree using numeric symbols and then pass a translate block as a vector? Then make.names() could be run easily on both the data names and the tree names...
>>>
>>> Cheers,
>>> David
>>>
>>> On 23 Apr 2010, at 14:30, François Michonneau wrote:
>>>
>>>> Hi,
>>>>
>>>>  Ouch! We need to fix this.
>>>>
>>>>  There might be some hope if we use Rcpp to build the data frame instead of building and parsing a string.
>>>>
>>>>  Let me talk to Dirk about it and see what we can do.
>>>>
>>>>  Cheers,
>>>>  -- François
>>>>
>>>> On Fri, 2010-04-23 at 13:59 +0100, Orme, David wrote:
>>>>> Hi all,
>>>>>
>>>>> From an e-mail on 03/03/10:
>>>>>
>>>>> Mark then Peter
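
The quoting rule from the PAUP manual passage quoted in this thread can be written down directly. The Python sketch below only encodes the character list as it appears in that passage, plus the suggestion to strip the enclosing single quotes on the way into R; the function names are hypothetical and nothing here is NCL or phylobase code.

```python
# Characters that, per the PAUP manual passage quoted in the thread,
# force an identifier to be enclosed in single quotes.
SPECIAL = set("()[]{}/\\,;:=*'\"`+-<>")


def needs_quotes(identifier):
    """True if the identifier contains a listed character or a blank."""
    return any(ch in SPECIAL or ch == " " for ch in identifier)


def strip_enclosing_quotes(token):
    """Drop one pair of enclosing single quotes, if present.

    Anything inside the quotes is left untouched, so the label
    survives as an ordinary character string.
    """
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return token[1:-1]
    return token
```

Note that the underscore is not in the list, so a label such as `Homo_sapiens` needs no quotes, whereas `Homo sapiens` and `A-B` do.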

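
The translate-block idea raised in the thread (a tree written with numeric tokens plus a separate vector of names) could look roughly like this. It is a Python sketch under loud assumptions: `sanitize` is only a crude analogue of R's make.names(), not the same algorithm, it does not make names unique, and the tree string and labels are invented for illustration.

```python
import re


def sanitize(name):
    """Crude analogue of R's make.names(); NOT the same algorithm.

    Replaces anything other than letters, digits, '.' and '_' with '.',
    and prefixes 'X' when the result does not start with a letter or '.'.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._]", ".", name)
    if not re.match(r"[A-Za-z.]", cleaned):
        cleaned = "X" + cleaned
    return cleaned


def relabel(newick_numeric, translate):
    """Swap 1-based numeric tip tokens for sanitized labels.

    Only digit runs directly after '(' or ',' are treated as tip
    tokens, so branch lengths after ':' are left alone.
    """
    labels = [sanitize(n) for n in translate]
    return re.sub(r"(?<=[(,])\d+",
                  lambda m: labels[int(m.group()) - 1],
                  newick_numeric)


tree = "((1:0.1,2:0.2):0.05,3:0.3);"
translate = ["Homo sapiens", "Pan_troglodytes", "A-B"]
relabelled = relabel(tree, translate)
```

Because the names travel as a plain vector, the same sanitizing step can be applied to the data labels and the tree labels without ever parsing quoted identifiers out of the tree string.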
