[datatable-help] by=".Col" produces NA column names
Ricardo Saporta
saporta at scarletmail.rutgers.edu
Sun Sep 22 03:44:29 CEST 2013
I submitted the below as bug 4927.
I believe the fix is a simple regex modification, but I don't want to mess
with the regex too hastily and possibly break something. Would someone
care to double-check this?
---------------
Issue:
----
Given a data.table with a dot in the column name, using that column name as
an argument to `by=` produces different results when the column name is
quoted than when it is not.
e.g.:
DT
   .Col val
1:    A   1
2:    B   2
identical(DT[, sum(val), by=.Col],
DT[, sum(val), by=".Col"] )
# [1] FALSE
Specifically, if quotes are used, NA is produced in place of the column
name.
Examples follow at the bottom of this email. I believe the issue is in the
regex pattern in a call to `grep` in "[.data.table".
The line is copied and pasted here
(currently line 743 in "data.table.r", which is inside "if (any(bynames=="")){..}").
## ORIGINAL
tt = grep("^eval|^[^[:alpha:] ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
## SHOULD (I believe) BE CHANGED TO
tt = grep("^eval|^[^(\\.|[:alpha:]) ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
## ... to allow for the name to start with a period.
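
As a quick sanity check outside data.table, the sketch below runs the name inference through base R `grep` on what `all.vars()` returns. Two things here are assumptions rather than data.table code: `infer_name` is a hypothetical helper that only mirrors the flagged line, and `alt_pat` is a narrower spelling of the proposed change (inside a bracket expression a bare `.` is literal, so no escaping or grouping should be needed). It is meant as a way to double-check the idea, not as the definitive patch.

```r
# Hypothetical helper mirroring the flagged line: drop names that start with
# "eval" or whose first character matches the negated class, then take the
# first survivor.
infer_name <- function(expr, pattern) {
  grep(pattern, all.vars(expr, functions = TRUE), invert = TRUE, value = TRUE)[1L]
}

orig_pat <- "^eval|^[^[:alpha:] ]"    # pattern quoted above as ORIGINAL
alt_pat  <- "^eval|^[^.[:alpha:] ]"   # assumed variant: also allow a leading "."

res <- list(
  dot_orig   = infer_name(quote(.Col), orig_pat),        # NA: ".Col" is filtered out
  dot_alt    = infer_name(quote(.Col), alt_pat),         # ".Col"
  month_alt  = infer_name(quote(month(date)), alt_pat),  # "month", as in the source comment
  modulo_alt = infer_name(quote(a %% 2), alt_pat)        # "a", as in the source comment
)
str(res)
```

The last two entries check that the existing behaviour described in the source comment (by=month(date) and by=a%%2) is unchanged by the variant pattern.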
## CONTEXT:
if (any(bynames=="")) {
    if (length(bysubl)<2) stop("When 'by' or 'keyby' is list() we expect something inside the brackets")
    for (jj in seq_along(bynames)) {
        if (bynames[jj]=="") {
            # Best guess. Use "month" in the case of by=month(date), use "a" in the case of by=a%%2
~~~~ THIS LINE ~~~> tt = grep("^eval|^[^[:alpha:] ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
            if (!length(tt)) tt = all.vars(bysubl[[jj+1L]])[1L]
            bynames[jj] = tt
            # if user doesn't like this inferred name, user has to use by=list() to name the column
        }
    }
}
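
One detail in the context above may explain why the fallback on the next line never rescues the name: subsetting an empty character vector with `[1L]` returns `NA` of length 1, so `!length(tt)` is FALSE. The base R sketch below illustrates this for `.Col`; the `is.na()` guard at the end is only an illustration of the mechanism, not a proposed or actual data.table change.

```r
vars <- all.vars(quote(.Col), functions = TRUE)   # ".Col"

# Same filter as the flagged line, split into two steps
kept <- grep("^eval|^[^[:alpha:] ]", vars, invert = TRUE, value = TRUE)
tt   <- kept[1L]

length(kept)   # 0: ".Col" was filtered out
length(tt)     # 1: character(0)[1L] is NA_character_, not an empty vector
is.na(tt)      # TRUE, so the `if (!length(tt))` fallback is skipped

# Illustrative guard only: testing for NA would reach the fallback
tt_fixed <- if (is.na(tt)) all.vars(quote(.Col))[1L] else tt
tt_fixed
```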
---------------------------------------------------
EXAMPLE:
DT <- data.table(.Col = LETTERS[c(1:3, 1:3)], val=1:6)
identical(DT[, sum(val), by=.Col],
DT[, sum(val), by=".Col"] )
# [1] FALSE
## This works as expected
DT[, sum(val), by=.Col]
   .Col V1
1:    A  5
2:    B  7
3:    C  9
## Putting the column name within quotes
## produces NA in the column names
DT[, sum(val), by=c(".Col")]
DT[, sum(val), by=".Col"] # both lines, same output
   NA V1   <~~~ NOTICE
1:  A  5
2:  B  7
3:  C  9
# notice that if we try to use `keyby` we get the following error
DT[, sum(val), keyby=".Col"]
# Error in setkeyv(ans, names(ans)[seq_along(byval)]) :
# Column 'NA' is type 'NULL' which is not (currently) allowed as a key column type.
## and this works correctly too
DT[, sum(val), by=list(.Col=.Col)]
   .Col V1
1:    A  5
2:    B  7
3:    C  9
---------------------------------------------------
Only happens with a dot at the start of the name
## Appears to be an issue only when there is a leading dot;
## a trailing dot works as expected
DT2 <- data.table(Col. = LETTERS[c(1:3, 1:3)], val=1:6)
DT2[, sum(val), by=Col.]
DT2[, sum(val), by=c("Col.")]
   Col. V1   <~~~ As expected
1:    A  5
2:    B  7
3:    C  9
--
Ricardo Saporta
Rutgers University, New Jersey
e: saporta at rutgers.edu