[datatable-help] by=".Col" produces NA column names

Sun Sep 22 03:44:29 CEST 2013

I submitted the report below as bug 4927.

I believe the fix is a simple regex modification, but I don't want to change
the regex too hastily and possibly break something.  Would someone
care to double-check this?

---------------

Issue:
----
Given a data.table with a dot in the column name, using that column name as
an argument to `by=` produces different results when the column name is
quoted than when it is not.

e.g.:

  DT
     .Col val
  1:    A   1
  2:    B   2

  identical(DT[, sum(val), by=.Col],
            DT[, sum(val), by=".Col"] )
  # [1] FALSE

Specifically, when the name is quoted, NA is produced in place of the column
name.

Examples follow at the bottom of this email.  I believe the issue is in the
regex pattern in a call to `grep` in "[.data.table".

The line is copied here
(currently line 743 in "data.table.r", inside "if (any(bynames=="")){..}"):

## ORIGINAL
tt = grep("^eval|^[^[:alpha:] ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]

## SHOULD (I believe) BE CHANGED TO
tt = grep("^eval|^[^(\\.|[:alpha:]) ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
## ... to allow for the name to start with a period.
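
One way to double-check the proposed pattern is to run it in plain R against
the names that `all.vars(..., functions=TRUE)` returns.  The sketch below is
not data.table code: it assumes the bracket expression in the current pattern
ends in a single space, `first_name` is a hypothetical helper that mirrors the
grep(...)[1L] call, and `alt` is a hypothetical alternative that simply lists
the dot inside the bracket.

```r
# Hypothetical helper mirroring the grep(...)[1L] call quoted above.
first_name <- function(pattern, expr) {
  grep(pattern, all.vars(expr, functions = TRUE), invert = TRUE, value = TRUE)[1L]
}

patterns <- c(
  orig     = "^eval|^[^[:alpha:] ]",        # assumed form of the current pattern
  proposed = "^eval|^[^(\\.|[:alpha:]) ]",  # pattern proposed in this email
  alt      = "^eval|^[^[:alpha:]. ]"        # assumption: dot listed directly in the bracket
)

# A name with a leading dot: dropped by `orig` (NA), kept by the other two.
dot <- sapply(patterns, first_name, expr = quote(.Col))
print(dot)

# Inside [...] the characters ( | ) are literals, not grouping or alternation,
# so `proposed` also stops filtering out function names such as "(".
paren <- sapply(patterns, first_name, expr = quote((a)))
print(paren)
```

If this reading is right, the proposed pattern fixes the leading-dot case but
would infer "(" as the column name for something like by=(a), whereas listing
the dot directly in the bracket keeps the existing filtering.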

## CONTEXT:
            if (any(bynames=="")) {
                if (length(bysubl)<2) stop("When 'by' or 'keyby' is list() we expect something inside the brackets")
                for (jj in seq_along(bynames)) {
                    if (bynames[jj]=="") {
                        # Best guess. Use "month" in the case of by=month(date), use "a" in the case of by=a%%2
~~~~ THIS LINE ~~~>     tt = grep("^eval|^[^[:alpha:] ]",all.vars(bysubl[[jj+1L]],functions=TRUE),invert=TRUE,value=TRUE)[1L]
                        if (!length(tt)) tt = all.vars(bysubl[[jj+1L]])[1L]
                        bynames[jj] = tt
                        # if user doesn't like this inferred name, user has to use by=list() to name the column
                    }
                }
            }
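
To see how the loop above can end up with NA instead of reaching its fallback,
the inference step can be replayed in plain R without data.table.  This is
only a sketch: it assumes the bracket expression in the pattern ends in a
space, and `infer_name` and `infer_name_guarded` are hypothetical stand-ins
for the lines inside the loop, not functions from the package.

```r
# Hypothetical stand-in for the name-inference lines quoted above.
infer_name <- function(expr, pattern = "^eval|^[^[:alpha:] ]") {
  tt <- grep(pattern, all.vars(expr, functions = TRUE), invert = TRUE, value = TRUE)[1L]
  # Subsetting an empty character vector with [1L] yields NA_character_,
  # which has length 1, so this length-based fallback is never taken.
  if (!length(tt)) tt <- all.vars(expr)[1L]
  tt
}

infer_name(quote(month(date)))  # "month"
infer_name(quote(a %% 2))       # "a"
infer_name(quote(.Col))         # NA

# One possible guard (an assumption, not the package's code): test for NA.
infer_name_guarded <- function(expr, pattern = "^eval|^[^[:alpha:] ]") {
  tt <- grep(pattern, all.vars(expr, functions = TRUE), invert = TRUE, value = TRUE)[1L]
  if (is.na(tt)) tt <- all.vars(expr)[1L]
  tt
}

infer_name_guarded(quote(.Col))  # ".Col"
```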

---------------------------------------------------

EXAMPLE:

DT <- data.table(.Col = LETTERS[c(1:3, 1:3)], val=1:6)

identical(DT[, sum(val), by=.Col],
          DT[, sum(val), by=".Col"] )
# [1] FALSE

## This works as expected
DT[, sum(val), by=.Col]

   .Col V1
1:    A  5
2:    B  7
3:    C  9

## Putting the column name within quotes
##   produces NA in the column names
DT[, sum(val), by=c(".Col")]
DT[, sum(val), by=".Col"]  # both lines, same output

   NA V1  <~~~  NOTICE
1:  A  5
2:  B  7
3:  C  9

# notice if we try to use `keyby` we get the following error
DT[, sum(val), keyby=".Col"]
# Error in setkeyv(ans, names(ans)[seq_along(byval)]) :
#   Column 'NA' is type 'NULL' which is not (currently) allowed as a key column type.

## and this works correctly too
DT[, sum(val), by=list(.Col=.Col)]
   .Col V1
1:    A  5
2:    B  7
3:    C  9

---------------------------------------------------

This only happens with a dot at the start of the name.

## A dot at the end of the name does not trigger the problem
DT2 <- data.table(Col. = LETTERS[c(1:3, 1:3)], val=1:6)

DT2[, sum(val), by=Col.]
DT2[, sum(val), by=c("Col.")]

   Col. V1    <~~~ As expected
1:    A  5
2:    B  7
3:    C  9
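
The difference between `.Col` and `Col.` is consistent with the pattern being
anchored at the start of the name, so only the first character matters.  A
quick check in plain R, again assuming the bracket expression in the pattern
ends in a space:

```r
pattern <- "^eval|^[^[:alpha:] ]"   # assumed form of the pattern in "[.data.table"
nms <- c(".Col", "Col.", "C.ol")

# TRUE means grep(..., invert=TRUE) would discard the name.
rejected <- grepl(pattern, nms)
names(rejected) <- nms
print(rejected)
# Only ".Col" is rejected; "Col." and "C.ol" are kept.
```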

--
Ricardo Saporta
Rutgers University, New Jersey
e: saporta at rutgers.edu