[datatable-help] can I count on data.table supporting syntactically invalid column names?

Kaupas, George George.Kaupas at spansion.com
Thu Aug 9 21:45:31 CEST 2012


Good catch. I did indeed grab 1.8.2 from R-Forge, because I needed something in that version and didn't see it on CRAN at the time. It never occurred to me that the code could change while the version number stayed the same.

I uninstalled it and installed the CRAN version, and test.data.table() now reports 717 tests. However, the merge behavior is unchanged: in one direction it succeeds but changes the column names, and in the other it fails in setcolorder.

So I should open bug reports, then, eh?

> test.data.table()
Running .../tests.Rraw
x =  10,000 sample from 100 strings (quick test to save load on CRAN servers where tests run every day. In dev we increase n and m a lot for meaningful times.
0.001 : f=factor(x) [high up front cost, plus storage and maintenance of levels]
0.000 : sort.list(,'radix') on f
0.000 : u=unique(x)
0.000 : .Internal(order(u))
0.001 : sort.list(,'radix') on fsorted
-vs-
0.000 : char group on x (ad hoc by)  [slower than radix on f but without up front cost]
0.000 : char sort on x (setkey)  [lower up front cost than factor(x)]
0.000 : char group on xsorted (keyed by)  [faster than sort.list(,'radix') on fsorted, same result]
All 717 tests in test.data.table() completed ok in 15.697sec

> DT1 = data.table(a=letters[1:5], "Illegal(name%)"=1:5, key="a")
> DT2 = data.table(a=letters[1:5], b=6L, key="a")
> merge(DT1,DT2)
   a Illegal.name.. b
1: a              1 6
2: b              2 6
3: c              3 6
4: d              4 6
5: e              5 6
> merge(DT2,DT1)
Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
  neworder is length 4 but x has 3 columns.
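
A possible stopgap until this is fixed is to swap in syntactically valid names for the duration of the merge and restore the originals afterwards. This is only a sketch and has not been checked against 1.8.2 itself. merge_keep_names is a hypothetical helper, not part of data.table, and it assumes the two tables share no column names other than the key, so that merge() adds no suffixes.

```r
library(data.table)

# Hypothetical helper (not part of data.table): merge two keyed data.tables
# while preserving syntactically invalid column names.
# Assumption: apart from the shared key columns, x and y have no column
# names in common, so merge() does not append .x/.y suffixes.
merge_keep_names <- function(x, y, ...) {
  x <- copy(x); y <- copy(y)            # leave the callers' tables untouched
  orig <- unique(c(names(x), names(y)))
  safe <- make.names(orig, unique = TRUE)
  # Rename both inputs to their syntactically valid equivalents.
  setnames(x, names(x), safe[match(names(x), orig)])
  setnames(y, names(y), safe[match(names(y), orig)])
  res <- merge(x, y, ...)
  # Map every safe name in the result back to its original spelling.
  setnames(res, names(res), orig[match(names(res), safe)])
  res
}

DT1 <- data.table(a = letters[1:5], "Illegal(name%)" = 1:5, key = "a")
DT2 <- data.table(a = letters[1:5], b = 6L, key = "a")
res <- merge_keep_names(DT2, DT1)
print(names(res))
```

Working on copies costs memory for large tables, but it avoids leaving the inputs renamed if the merge stops with an error.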

-----Original Message-----
From: Matthew Dowle [mailto:mdowle at mdowle.plus.com] 
Sent: Wednesday, August 08, 2012 6:17 PM
To: Kaupas, George
Cc: datatable-help at lists.r-forge.r-project.org
Subject: RE: [datatable-help] can I count on data.table supporting syntactically invalid column names?


Then somehow you don't have the CRAN version of v1.8.2 installed. By any chance did you install 1.8.2 from R-Forge during the few days when a slightly earlier version of 1.8.2 existed there? R-Forge also happened to be stale in that time window. The first submission of 1.8.2 to CRAN was reverted due to some difficulties, so it needed a second attempt and took longer than usual.

Please uninstall data.table and reinstall from any CRAN mirror (not R-Forge) to make sure. A difference between 714 and 717 indicates a problem with the data.table installation, not with R itself. test.data.table() in v1.8.2 must return exactly 717.

Another way would be to include the SVN revision number in the package version, but I haven't found a way to do that for packages yet. R itself does that, of course, but I don't know how to do it for packages. Since all changes in data.table are accompanied by new tests, the current approach is to use the number of tests. And actually running all the tests on your hardware etc. is a stronger check that everything is working as intended.
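
As a complement to counting tests, the installed DESCRIPTION can be inspected to see where a build claims to come from. The sketch below is only an illustration: install_source is a hypothetical helper, and the "Repository/R-Forge/Revision" field name is an assumption about what an R-Forge build writes into DESCRIPTION. Fields that are absent come back as NA.

```r
# Hypothetical helper: report where an installed package claims to come from.
# Only Version is guaranteed to exist; the other fields depend on how the
# package was built, so missing ones are returned as NA.
install_source <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  field <- function(name) {
    value <- desc[[name]]
    if (is.null(value)) NA_character_ else as.character(value)
  }
  list(
    version    = field("Version"),
    repository = field("Repository"),                  # e.g. "CRAN"
    revision   = field("Repository/R-Forge/Revision")  # assumed R-Forge-only
  )
}

str(install_source("data.table"))
```

This does not replace running test.data.table(), which also exercises the code on the local machine.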


> The test.data.table() routine returns 714, not 717.
>
> I'm running data.table 1.8.2.
>
> The only thing not bleeding edge (I think) is R itself which is at 2.15.0.
>
> A search for "merge" on r-forge gets two hits, neither are related; a 
> search for setcolorder gets no hits. Should I file a bug report (or two)?
>
> Here's my output from test.data.table() and sessionInfo():
>
>> test.data.table()
> Running .../tests.Rraw
> Loading required package: hexbin
> Loading required package: grid
> Loading required package: lattice
> x =  10,000 sample from 100 strings (quick test to save load on CRAN servers where tests run every day. In dev we increase n and m a lot for meaningful times.
> 0.002 : f=factor(x) [high up front cost, plus storage and maintenance of levels]
> 0.000 : sort.list(,'radix') on f
> 0.000 : u=unique(x)
> 0.000 : .Internal(order(u))
> 0.000 : sort.list(,'radix') on fsorted
> -vs-
> 0.000 : char group on x (ad hoc by)  [slower than radix on f but without up front cost]
> 0.000 : char sort on x (setkey)  [lower up front cost than factor(x)]
> 0.000 : char group on xsorted (keyed by)  [faster than sort.list(,'radix') on fsorted, same result]
> All 714 tests in test.data.table() completed ok in 15.272sec
>
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] hexbin_1.26.0    lattice_0.20-6   nlme_3.1-103     ggplot2_0.9.1
> [5] reshape_0.8.4    plyr_1.7.1       data.table_1.8.2
>
> loaded via a namespace (and not attached):
>  [1] colorspace_1.1-1   dichromat_1.2-4    digest_0.5.2       labeling_0.1
>  [5] MASS_7.3-17        memoise_0.1        munsell_0.3        proto_0.3-9.2
>  [9] RColorBrewer_1.0-5 reshape2_1.2.1     scales_0.2.1       stringr_0.6.1
>
> -----Original Message-----
> From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
> Sent: Wednesday, August 08, 2012 4:49 AM
> To: Kaupas, George
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] can I count on data.table supporting 
> syntactically invalid column names?
>
>
> Meant to write 2nd paragraph as follows :
>
>>
>> Hi. Yes you should be able to rely on that. It's useful to have 
>> special characters in column names for latex formatting, and spaces 
>> are allowed too. There are tests for these things. If you need to 
>> refer to such column names as variables, then it's up to you to wrap 
>> with ``; e.g., by=`Illegal(name%)`+1.
>>
>> So yes, if you find problems with special characters, please report 
>> them as bugs. Suggestions on where the documentation needs improving 
>> would be great too.
>>
>> I seem to remember a bug fix in this regard, and in particular in 
>> merge (so my first thought is to ask you if you've recently upgraded 
>> to 1.8.2 and if test.data.table returns 717), but as you say R-Forge 
>> is currently down for maintenance...
>>
>> That neworder error looks familiar too. Are you sure you have 1.8.2 
>> running in memory? (Run test.data.table() to see if it returns 717).
>>
>> Matthew
>>
>>> I'm taking advantage of a feature in data.table which lets me get 
>>> away with naming columns with characters that would not survive a 
>>> call to make.names(), e.g.:
>>>
>>>> DT1 = data.table(a=letters[1:5], "Illegal(name%)"=1:5, key="a")
>>>> DT1
>>>    a Illegal(name%)
>>> 1: a              1
>>> 2: b              2
>>> 3: c              3
>>> 4: d              4
>>> 5: e              5
>>>
>>> (The dcast function from the reshape2 package will also create 
>>> columns named "illegally".)
>>>
>>> But when using merge.data.table, I get one of two side effects: either 
>>> the merge works but the column names appear to be run through 
>>> make.names(), or the merge fails in setcolorder():
>>>
>>>> DT1 = data.table(a=letters[1:5], "Illegal(name%)"=1:5, key="a")
>>>> DT2 = data.table(a=letters[1:5], b=6L, key="a")
>>>
>>>> merge(DT1,DT2)
>>>    a Illegal.name.. b
>>> 1: a              1 6
>>> 2: b              2 6
>>> 3: c              3 6
>>> 4: d              4 6
>>> 5: e              5 6
>>>
>>>> merge(DT2,DT1)
>>> Error in setcolorder(dt, c(setdiff(names(dt), end), end)) :
>>>   neworder is length 4 but x has 3 columns.
>>>
>>> I can't get to datatable.r-forge.r-project.org - getting a 504.
>>>
>>> So... should I NOT rely on being able to use special characters in 
>>> column names?
>>>
>>> Thanks
>>> George
>>>
>>>> sessionInfo()
>>> R version 2.15.0 (2012-03-30)
>>> Platform: x86_64-unknown-linux-gnu (64-bit) [1] data.table_1.8.2
>
>



