[datatable-help] Fwd: Comments on data.table

Tim Hesterberg rocket at google.com
Tue May 22 19:38:37 CEST 2012


Resending second message, now that I'm on this list.

---------- Forwarded message ----------
From: Tim Hesterberg <rocket at google.com>
Date: Tue, May 22, 2012 at 10:00 AM
Subject: Re: Comments on data.table
To: Matthew Dowle <mdowle at mdowle.plus.com>
Cc: datatable-help at lists.r-forge.r-project.org, Chris Neff <
caneff at google.com>


Here is Chris Neff's response to my previous message.  (He says nice things
about you, at the end.)

---------- Forwarded message ----------
From: Chris Neff <caneff at google.com>
Date: Mon, Mar 26, 2012 at 4:10 AM
Subject: Re: [R-users] faster aggregate
To: Tim Hesterberg <rocket at google.com>


Comments inline

> (c) Sheer horror and frustration.  Horror at one dangerous design decision.
> Frustration that some relatively small changes in the package would make
> the learning curve much shallower, so this package could be used more
> widely, and make its use safer.
>
>

I'll be up front and honest here: I am still sometimes surprised at where R
makes copies, but from what I understand, much of this implementation was
designed to avoid unnecessary copies of things.  Maybe some of what you
describe already avoids them well enough; I just don't have R's pass-by-value
model 100% solid in my head.  If you have a solid reference page for really
understanding this model, I would appreciate it.


> Take this with a grain of salt - I haven't used the package enough yet,
> maybe I would change my mind about these points.  But I'll share them with
> you now.  I'll give this some time to settle,
> and try the package more, before sharing these with the author.
>
> (1) Using the second argument to [.data.table for calculating expressions
> instead of subscripting.
> The inconsistency between [.data.table and [.data.frame increases the
> learning curve dramatically, and makes for bugs.
>
> The first argument is also unusual, but in a way that I think makes
> more sense.
>
> I suggest using a different function for evaluating expressions,
> in particular,
>    with.data.table(x, expr, additional arguments)
> Then syntax would be:
>   Current                   Using with.data.table
>   DT[, expr]                with(DT, expr)
>   DT[K, expr]               with(DT[K,], expr) or with(DT, expr, subset=K)
>   DT[, expr, by=foo]        with(DT, expr, by=foo)
>   DT[, list(expr1,expr2)]   with(DT, J(expr1, expr2))
>   ?not possible now?        with(DT, list(mean(x), quantile(x)))
>
Firstly, since data.tables (and data.frames) are lists, as.list(DT[,
list(mean(x), quantile(x))]) would do what you want.  I think it is an
explicit goal that DT[...] always returns another data.table.


> Note that I would not use list(expr1, expr2), but rather explicitly
> use data.table(expr1, expr2) or J(expr1, expr2), when someone wants
> a data.table returned.  This makes it easier to look at the code
> and see what is to be returned.
>

A data.table operation always returns a data.table.  Besides the simple
case of a single column being collapsed to a vector, I don't think there
are exceptions to that rule.
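A quick toy illustration of that rule (hypothetical data, just to show the
return types):

  library(data.table)
  DT <- data.table(x = c(1, 2, 3), g = c("a", "a", "b"))

  DT[, sum(x)]                     # single expression: returns a plain value
  DT[, list(s = sum(x))]           # wrapped in list(): returns a data.table
  DT[, list(s = sum(x)), by = g]   # grouped: always a data.table, one row per group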

I can't speak to whether there are clear performance reasons why with() would
be disfavored (there may be; I just can't think of them), but as a daily user
of data.table, I would be really, really sad to see the current syntax go.
After the initial learning curve it feels instinctual and natural.  Also,
chaining isn't something that looks good with with().  Compare the following:

DT[, list(y=sum(y), y.mean=mean(y)), by=key1][ y.mean > 10, list(key1, y) ]

vs.

with(with(DT, data.table(y=sum(y), y.mean=mean(y)), by=key1),
data.table(key1, y), subset=y.mean > 10)

Admittedly this is a contrived example, but I do similar things in many
places, and the chaining in [.data.table is just more direct to me.  I'm
sure the author has many other pointers here.



> The inconsistency with normal usage of [ in R also raises questions for
> [<-.data.table.  Is this consistent with [.data.table or [<-.data.frame?
> (I haven't explored this yet.  [<-.data.table is not documented.)
>

No clue here, just my own experience that every non-data.table-aware
package I have ever used has never had an issue.


>
> (2) Having setkey modify the object in place.  This means that one cannot
> look for <- (or =) to determine when an object is modified.
> It would be safe to instead do
>   key(x) <- character vector of key names
>

The big issue here is that key(x) <- c("a","b","c") makes a copy of x
(according to the help page for setkeyv).  For small use cases that is fine,
and key<- is supported, just with warnings about memory allocation.  But
for some of the really large data sets I work with, having to make a copy
of a 22-million-row data.table just to change the key is pretty
ridiculous and inefficient.
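A sketch of the difference on a toy table (the copy/no-copy behaviour is as
described in the setkeyv help page):

  library(data.table)
  DT <- data.table(a = c(2, 1), b = c(3, 4))

  key(DT) <- "a"     # supported, but copies the whole table before reassigning
  setkey(DT, a)      # sorts and sets the key by reference: no copy made
  setkeyv(DT, "a")   # same, but takes the key as a character vector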


>
> As implemented, if you pass a dt to a function and modify the key
> there, the modification also affects the original object.
> And, you end up with two copies of the object with different names,
> and modifying one changes the other.
>
> # Test if setkey called within a function causes problems.
> x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")
> foo <- function(y){
>   setkey(y, b)
>   y
> }
> z <- foo(x)
> tables()
> # now x also has key b, not a.
> setkey(z, "a")
> tables()
> # z and x both have key a
>
> # Even copying the object without calling a function makes two pointers
> x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")
> y <- x
> setkey(y, b)
> tables()
> # both x and y have key "b"
>

This is a big contention point, but once again data.table is focused on
large-scale data processing, and the preference is that if you really mean
to make a copy, you should be explicit about it.  So:

y <- x

becomes

y <- copy(x)

I know this breaks the normal R model, but this is once again a case where
data.table has always behaved how I wanted it to, and I've never been
bitten.  This point of view exposes a lot of latent inefficiencies people
introduce in their base R code without even realizing they are making them.

> (3) Expecting unquoted names where people would normally expect to give
> quoted names, like setkey.
>

This is updated in the latest versions: setkeyv behaves as you would expect
(and maybe setkey will be changed in the future, but it is kept as-is for
backward compatibility, I would think).  There are also setattr, setnames,
setcolorder, and set, all of which get around the extra copying of <-.
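For example, on a hypothetical table, each of these modifies DT by reference
rather than copying it the way names<- or [<- would:

  library(data.table)
  DT <- data.table(a = 1:3, b = 4:6)

  setnames(DT, "b", "b2")                # rename a column by reference
  setcolorder(DT, c("b2", "a"))          # reorder columns by reference
  set(DT, i = 1L, j = "a", value = 99L)  # assign into one cell by reference
  setattr(DT, "myattr", "x")             # set an attribute without copying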



> (4) Not allowing character data to remain character.
>
>
Fixed in the current devel version (1.8.0).  Character columns are now fully
supported, and for the case of many unique levels (like customer IDs) they
are faster than factors were.  Also fixed is the annoying issue I had of
ordered factors losing their ordering.  I've been waiting for this to go
to CRAN before updating third_party, but the third_party version is far
enough behind that maybe I should just do it anyway....


All that being said, I understand a lot of your points, and I will admit
there is a bit of a learning curve.  But data.table has changed my day-to-day
R experience more than any other package out there.

The package creator is receptive and responsive to criticism, so if I have
still left many of your questions unanswered, feel free to ping the mailing
list.