Resending, now that I'm on this list.<br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Tim Hesterberg</b> <span dir="ltr"><<a href="mailto:rocket@google.com">rocket@google.com</a>></span><br>

Date: Tue, May 22, 2012 at 9:58 AM<br>Subject: Comments on data.table<br>To: Matthew Dowle <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>><br>Cc: <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a>, Chris Neff <<a href="mailto:caneff@google.com">caneff@google.com</a>><br>

<br><br>Hi Matthew,<div><br></div><div>Here are some comments I had about data.table.</div><div>In the next message I'll forward Chris Neff's response.</div><div><br></div><div>Tim Hesterberg<br><br><div class="gmail_quote">


---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Tim Hesterberg</b> <span dir="ltr"><<a href="mailto:rocket@google.com" target="_blank">rocket@google.com</a>></span><br>Date: Sun, Mar 25, 2012 at 10:13 PM<br>


Subject: Re: [R-users] faster aggregate<br>To: Chris Neff <<a href="mailto:caneff@google.com" target="_blank">caneff@google.com</a>><br><br><br>Hi Chris,<div><br></div><div>Old thread, just responding to you.  I finally started looking seriously at data.table, in response</div>


<div>to your posts.  I'm thinking about supporting data.table in the aggregate package, and about</div>

<div>incorporating one of the nice features you've mentioned into aggregate, namely making it easier to get results for some columns of an existing data.frame (or data.table) without copying.</div><div><br></div><div>


My preliminary impression is a combination of</div><div>(a) Cool!</div><div>(b) Nicely implemented; I did benchmarks of memory allocations for regular data.frame code, my dataframe package, data.table, and the combination of dataframe and data.table.  dataframe</div>


<div>is dramatically better than regular R, data.table is substantially better yet, and the combination of both is slightly better yet.</div><div>(c) Sheer horror and frustration.  Horror at one dangerous design decision.  Frustration that some</div>


<div>relatively small changes in the package would make the learning curve much shallower, so this</div><div>package could be used more widely, and would make its use safer.</div><div><br></div><div>Take this with a grain of salt: I haven't used the package enough yet, and I may change my mind about these points.  But I'll share them with you now.  I'll give this some time to settle,</div>


<div>and try the package more, before sharing these with the author.</div><div><br></div><div><div>(1) Using the second argument to [.data.table for calculating expressions</div><div>instead of subscripting.</div><div>The inconsistency between [.data.table and [.data.frame increases the</div>


<div>learning curve dramatically, and makes for bugs.</div><div><br></div><div>The first argument is also unusual, but in a way that I think makes</div><div>more sense.</div><div><br></div><div>I suggest using a different function for evaluating expressions,</div>


<div>in particular,</div><div>   with.data.table(x, expr, additional arguments)</div><div>Then syntax would be:</div><div><font face="'courier new', monospace">  Current<span style="white-space:pre-wrap">          </span>    Using with.data.table</font></div>


<div><font face="'courier new', monospace">  DT[, expr]                with(DT, expr)</font></div><div><font face="'courier new', monospace">  DT[K, expr]               with(DT[K,], expr) or with(DT, expr, subset=K)</font></div>


<div><font face="'courier new', monospace">  DT[, expr, by=foo]        with(DT, expr, by=foo)</font></div><div><font face="'courier new', monospace">  DT[, list(expr1,expr2)]   with(DT, J(expr1, expr2))</font></div>


<div><font face="'courier new', monospace">  ?not possible now?        with(DT, list(mean(x), quantile(x)))</font></div><div>Note that I would not use list(expr1, expr2), but rather explicitly</div><div>use data.table(expr1, expr2) or J(expr1, expr2), when someone wants</div>


<div>a data.table returned.  This makes it easier to look at the code</div><div>and see what is to be returned.</div><div><br></div><div>The inconsistency with normal usage of [ in R also raises questions for</div><div>[<-.data.table.  Is this consistent with [.data.table or [<-.data.frame?</div>


<div>(I haven't explored this yet.  [<-.data.table is not documented.)</div><div><br></div></div><div><br></div><div>(2) Having setkey modify the object in place.  This means that one cannot</div><div>look for <- (or =) to determine when an object is modified.</div>


<div>It would be safer to instead do</div><div>  key(x) <- character vector of key names</div><div><br></div><div>As implemented, if you pass a data.table to a function and modify the key</div><div>there, the modification also affects the original object.</div>


<div>You also end up with two names for the same object,</div><div>so modifying one changes the other.</div><div><br></div><div># Test whether calling setkey within a function causes problems.</div><div>library(data.table)</div><div>x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")</div>


<div>foo <- function(y){</div><div>  setkey(y, b)</div><div>  y</div><div>}</div><div>z <- foo(x)</div><div>tables()</div><div># now x also has key b, not a.</div><div>setkey(z, "a")</div><div>tables()</div>


<div># z and x both have key a</div><div><br></div><div># Even copying the object without calling a function makes two pointers</div><div>x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")</div><div>y <- x</div>


<div>setkey(y, b)</div><div>tables()</div><div># both x and y have key "b"</div><div><br></div><div>(3) Expecting unquoted names where people would normally expect to give</div><div>quoted names, as setkey does.</div>


<div><br></div><div>(4) Not allowing character data to remain character.</div><div><div><div><br></div><div>(I deleted earlier messages on the thread.  Some of that is relevant, but some of it may be confidential.)</div>


</div></div></div><br></div>

</div><br>