[datatable-help] Fwd: Comments on data.table

Tim Hesterberg rocket at google.com
Tue May 22 19:38:05 CEST 2012


Resending, now that I'm on this list.

---------- Forwarded message ----------
From: Tim Hesterberg <rocket at google.com>
Date: Tue, May 22, 2012 at 9:58 AM
Subject: Comments on data.table
To: Matthew Dowle <mdowle at mdowle.plus.com>
Cc: datatable-help at lists.r-forge.r-project.org, Chris Neff <caneff at google.com>


Hi Matthew,

Here are some comments I had about data.table.
In the next message I'll forward Chris Neff's response.

Tim Hesterberg

---------- Forwarded message ----------
From: Tim Hesterberg <rocket at google.com>
Date: Sun, Mar 25, 2012 at 10:13 PM
Subject: Re: [R-users] faster aggregate
To: Chris Neff <caneff at google.com>


Hi Chris,

Old thread, just responding to you.  I finally started looking seriously
at data.table, in response to your posts.  I'm thinking about supporting
data.table in the aggregate package, and about incorporating one of the
nice features you've mentioned into aggregate, namely making it easier to
get results for some columns of an existing data.frame (or data.table)
without copying.

My preliminary impression is a combination of
(a) Cool!
(b) Nicely implemented; I did benchmarks of memory allocations for regular
data.frame code, my dataframe package, data.table, and the combination of
dataframe and data.table.  dataframe is dramatically better than regular R,
data.table is substantially better yet, and the combination of both is
slightly better yet.
(c) Sheer horror and frustration.  Horror at one dangerous design decision.
Frustration that some relatively small changes in the package would make
the learning curve much shallower, so this package could be used more
widely, and make its use safer.

Take this with a grain of salt.  I haven't used the package enough yet,
and I may change my mind about these points.  But I'll share them with
you now.  I'll give this some time to settle, and try the package more,
before sharing these with the author.

(1) Using the second argument to [.data.table for calculating expressions
instead of subscripting.
The inconsistency between [.data.table and [.data.frame increases the
learning curve dramatically, and makes for bugs.
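
A minimal sketch of the inconsistency, assuming a current data.table
release is installed and that no variable named b exists in the calling
environment.  The names DF and DT are illustrative only:

```r
library(data.table)

DF <- data.frame(a = c(1, 1, 2), b = c(3, 4, 4))
DT <- as.data.table(DF)

# [.data.frame treats the second argument as a column subscript, so an
# expression such as sum(b) is evaluated in the calling environment and
# fails when no variable named b exists there.
df_result <- tryCatch(DF[, sum(b)], error = function(e) "error")

# [.data.table evaluates the second argument with the columns in scope.
dt_result <- DT[, sum(b)]
```

The same-looking call therefore means two different things depending on
the class of the object.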

The first argument is also unusual, but in a way that I think makes
more sense.

I suggest using a different function for evaluating expressions,
in particular,
   with.data.table(x, expr, additional arguments)
Then syntax would be:
  Current                   Using with.data.table
  DT[, expr]                with(DT, expr)
  DT[K, expr]               with(DT[K,], expr) or with(DT, expr, subset=K)
  DT[, expr, by=foo]        with(DT, expr, by=foo)
  DT[, list(expr1,expr2)]   with(DT, J(expr1, expr2))
  ?not possible now?        with(DT, list(mean(x), quantile(x)))
Note that I would not use list(expr1, expr2), but rather explicitly
use data.table(expr1, expr2) or J(expr1, expr2), when someone wants
a data.table returned.  This makes it easier to look at the code
and see what is to be returned.
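
Part of this proposal can already be approximated with base R's with(),
which evaluates an expression with the columns of a data.table in scope
and returns an ordinary list unchanged.  A minimal sketch, assuming a
current data.table release; the by= and subset= arguments in the table
above remain hypothetical and are not shown:

```r
library(data.table)

DT <- data.table(x = c(1, 2, 3, 4), g = c("a", "a", "b", "b"))

# Expression evaluated through [.data.table
r1 <- DT[, mean(x)]

# Same expression through base with(); no data.table-specific method needed
r2 <- with(DT, mean(x))

# A plain list with elements of different lengths comes back as a list
stats <- with(DT, list(mean = mean(x), quartiles = quantile(x)))
```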

The inconsistency with normal usage of [ in R also raises questions for
[<-.data.table.  Is this consistent with [.data.table or [<-.data.frame?
(I haven't explored this yet.  [<-.data.table is not documented.)


(2) Having setkey modify the object in place.  This means that one cannot
look for <- (or =) to determine when an object is modified.
It would be safer to instead do
  key(x) <- character vector of key names

As implemented, if you pass a dt to a function and modify the key
there, the modification also affects the original object.
And, you end up with two names for the same object,
and modifying one changes the other.

# Test if setkey called within a function causes problems.
library(data.table)
x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")
foo <- function(y){
  setkey(y, b)
  y
}
z <- foo(x)
tables()
# now x also has key b, not a.
setkey(z, "a")
tables()
# z and x both have key a

# Even copying the object without calling a function makes two pointers
# to the same object
x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")
y <- x
setkey(y, b)
tables()
# both x and y have key "b"
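
One way to avoid the aliasing shown above, assuming a data.table release
that provides copy(), is to copy explicitly before setting the key.  A
minimal sketch:

```r
library(data.table)

x <- data.table(a = c(1, 1, 2), b = c(3, 4, 4), key = "a")

# copy() makes an independent object, so setkey on y leaves x alone
y <- copy(x)
setkey(y, b)

key_x <- key(x)
key_y <- key(y)
```

Here x keeps key "a" while y has key "b".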

(3) Expecting unquoted names where people would normally expect to give
quoted names, like setkey.
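
For comparison, a sketch assuming a data.table release that provides
setkeyv(), which takes the key columns as a character vector.  The
variable cols is illustrative only:

```r
library(data.table)

x <- data.table(a = c(1, 1, 2), b = c(3, 4, 4))

# Unquoted column name
setkey(x, a)
key_unquoted <- key(x)

# Character vector of column names, convenient when the names are
# computed by a program rather than typed
cols <- "b"
setkeyv(x, cols)
key_quoted <- key(x)
```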

(4) Not allowing character data to remain character.

(I deleted earlier messages on the thread.  Some of that is relevant, but
some of it may be confidential.)