No subject
Thu Apr 29 12:28:57 CEST 2010
Mark then Peter
The main potential problems that I see with the ways that phylobase is using NCL now are:
1. In NCLInterface.cpp there are lots of calls to RemoveUnderscoresAndSpaces to get rid of spaces and _ in names. That makes names easier to deal with, but at some point it will bite you (somebody will have a dataset with a taxon labelled "AB" and another labelled "A B"; after transformation there will be a name clash).
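The clash described in point 1 can be reproduced with a small sketch. The function `stripSpacesAndUnderscores` below is a hypothetical stand-in written for this illustration, not NCL's actual RemoveUnderscoresAndSpaces, and `findClashes` is likewise an assumed helper name.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for RemoveUnderscoresAndSpaces (not NCL's code):
// drops every space and underscore from a label.
std::string stripSpacesAndUnderscores(std::string name) {
    name.erase(std::remove_if(name.begin(), name.end(),
                              [](char c) { return c == ' ' || c == '_'; }),
               name.end());
    return name;
}

// Groups the original labels by their stripped form and keeps only the
// groups with more than one member, i.e. the labels that would clash.
std::map<std::string, std::vector<std::string>>
findClashes(const std::vector<std::string>& labels) {
    std::map<std::string, std::vector<std::string>> groups;
    for (const std::string& label : labels) {
        groups[stripSpacesAndUnderscores(label)].push_back(label);
    }
    for (auto it = groups.begin(); it != groups.end();) {
        if (it->second.size() < 2) {
            it = groups.erase(it);  // unique after stripping: no clash
        } else {
            ++it;
        }
    }
    return groups;
}
```

Calling `findClashes` on the labels "AB", "A B" and "Homo_sapiens" should report a single clash under the key "AB", holding the first two labels.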
I agree that this is something to address. Not only might there be clashes, but changing names will be annoying to users. Brian or Derrick could answer better, but I assume this is because some of the code used to parse the tree string can't handle the underscores and spaces.
Has just bitten me! There is a deeper problem here: readNexus uses the NCLInterface code to get the data frame as parsable R code, with spaces and underscores stripped, but the tree block is passed over as a block of raw text from the file. These names _aren't_ then stripped of underscores and spaces by read.nexustreestring(), and so the name checking throws an error.
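A minimal sketch of that mismatch, assuming the data names arrive already stripped while the tree labels arrive as raw text. Both `normalizeLabel` and `unmatchedTreeLabels` are hypothetical names used only for illustration; they are not part of phylobase or NCL.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Hypothetical normalisation mirroring what the thread says happens to the
// data-frame names: spaces and underscores are removed.
std::string normalizeLabel(std::string label) {
    label.erase(std::remove_if(label.begin(), label.end(),
                               [](char c) { return c == ' ' || c == '_'; }),
                label.end());
    return label;
}

// Returns the tree labels that have no counterpart among the data names.
// With normalizeTree == false the tree labels are compared as raw text,
// which is the situation described in the message above.
std::vector<std::string> unmatchedTreeLabels(
        const std::vector<std::string>& dataNames,
        const std::vector<std::string>& treeLabels,
        bool normalizeTree) {
    const std::set<std::string> known(dataNames.begin(), dataNames.end());
    std::vector<std::string> missing;
    for (const std::string& label : treeLabels) {
        const std::string key = normalizeTree ? normalizeLabel(label) : label;
        if (known.count(key) == 0) {
            missing.push_back(label);
        }
    }
    return missing;
}
```

With data names "Homosapiens" and "Pantroglodytes" and tree labels "Homo_sapiens" and "Pan_troglodytes", the raw comparison leaves both tree labels unmatched, while treating both sides the same way leaves none. This only illustrates why consistent handling matters; it does not address the clash problem that stripping itself introduces.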
Obviously there is an ongoing deeper discussion about how to handle passing the tree from NCL and how to handle the dismayingly wide range of officially valid PAUP identifiers using regex, but currently we've got a simpler problem: the data and the tree are handled differently. Underscores in names are very commonly used to avoid the quoting problem with spaces, so I think this current problem will come up a lot.
Cheers,
David
_______________________________________________
Phylobase-devl mailing list
Phylobase-devl at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/phylobase-devl
Sorry - I take it back about the 'not problematic in a character string' - the quotes and the backslash are of course fairly problematic.
Cheers
David
On 29 Apr 2010, at 16:43, Orme, David wrote:
> Hi,
> I know this isn't the bible - but the PAUP manual specifies the following:
> "Identifiers" are simply names given to taxa, characters, and other PAUP input elements such as character-sets, taxon-sets, and exclusion-sets. They may include any combination of upper- and lower-case alphabetic characters, digits, and punctuation. If the identifier contains any of the following characters:
> ( ) [ ] { } / \ , ; : = * ' " ` + - < >
> or a blank, the entire identifier must be enclosed in single quotes.
> They're going to be rare but they will happen. Any of those are problematic in a valid R name - although not in a character string. The taxon identifiers could come into R as character vectors just as they appear in the Nexus file (possibly stripping the enclosing single quotes). The next question is then whether we need them to be valid R names rather than character strings - in which case make.names() could be invoked.
> Cheers,
> David
> On 28 Apr 2010, at 22:30, François Michonneau wrote:
>> Hi all,
>> Sorry if this is a dumb question, but why do we need to remove spaces and underscore from the species names when building the data frame? The only character that I can think of that could be an issue is ", and I don't think that it's allowed by software using NEXUS/used.
>> In other words, do we really need to use RemoveUnderscoresAndSpaces in NCLInterface.cpp?
>> Thanks,
>> -- François
>> On Mon, 2010-04-26 at 12:11 +0100, Orme, David wrote:
>>> I'd guess we want the names to be syntactically valid R names - and ideally that would be through running make.names() across them. The problem is then that the NCLInterface can easily pass the raw PAUP identifiers for the data (which we can then make.names()) but that the tree input is currently via a text string. Again, probably easy enough to have the raw PAUP names in the string but these would be horrible to extract with regex. Is there any way that NCLInterface can pass the tree using numeric symbols and then pass a translate block as a vector? Then make.names() could be run easily on both the data names and the tree names...
>>> Cheers,
>>> David
>>> On 23 Apr 2010, at 14:30, François Michonneau wrote:
>>>> Hi,
>>>> Ouch! We need to fix this.
>>>> There might be some hope if we use Rcpp to build the data frame instead of building and parsing a string.
>>>> Let me talk to Dirk about it and see what we can do.
>>>> Cheers,
>>>> -- François
>>>> On Fri, 2010-04-23 at 13:59 +0100, Orme, David wrote:
>>>>> Hi all,
>>>>> From an e-mail on 03/03/10:
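The quoting rule from the PAUP manual excerpt in this thread can be written down as a small check. This is only a sketch: `needsSingleQuotes` and `stripEnclosingQuotes` are hypothetical helper names, "blank" is assumed to mean a plain space, and escaped quotes inside a quoted identifier are not handled.

```cpp
#include <string>

// True if the identifier contains a blank or one of the punctuation
// characters listed in the PAUP manual excerpt, in which case the whole
// identifier must be enclosed in single quotes.
bool needsSingleQuotes(const std::string& identifier) {
    // Assumption: "blank" means the space character only.
    static const std::string special = "()[]{}/\\,;:=*'\"`+-<> ";
    return identifier.find_first_of(special) != std::string::npos;
}

// Removes one pair of enclosing single quotes, if present; any other token
// is returned unchanged. Quotes embedded inside the token are left alone.
std::string stripEnclosingQuotes(const std::string& token) {
    if (token.size() >= 2 && token.front() == '\'' && token.back() == '\'') {
        return token.substr(1, token.size() - 2);
    }
    return token;
}
```

Under this rule "Homo_sapiens" needs no quotes, which is why underscores are so commonly used in place of spaces, whereas "Homo sapiens" and "A-B" would both have to be quoted.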
More information about the Phylobase-devl
mailing list