Resending second message, now that I'm on this list.<br><br><div class="gmail_quote">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Tim Hesterberg</b> <span dir="ltr"><<a href="mailto:rocket@google.com">rocket@google.com</a>></span><br>

Date: Tue, May 22, 2012 at 10:00 AM<br>Subject: Re: Comments on data.table<br>To: Matthew Dowle <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>><br>Cc: <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a>, Chris Neff <<a href="mailto:caneff@google.com">caneff@google.com</a>><br>

<br><br>Here is Chris Neff's response to my previous message.  (He says nice things about you, at the end.)<br><br><div class="gmail_quote"><div class="im">---------- Forwarded message ----------<br>From: <b class="gmail_sendername">Chris Neff</b> <span dir="ltr"><<a href="mailto:caneff@google.com" target="_blank">caneff@google.com</a>></span><br>


Date: Mon, Mar 26, 2012 at 4:10 AM<br>Subject: Re: [R-users] faster aggregate<br></div>To: Tim Hesterberg <<a href="mailto:rocket@google.com" target="_blank">rocket@google.com</a>><br><br><br>Comments inline<br><br>

<div class="gmail_quote"><div class="im">

<div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>(c) Sheer horror and frustration.  Horror at one dangerous design decision.  Frustration that some</div>


<div>relatively small changes in the package would make the learning curve much shallower, so this</div><div>package could be used more widely, and make its use safer.</div></blockquote><div><br></div></div></div><div>I'll be up front and honest here: I am still sometimes surprised at where R makes copies, but from what I understand, much of this implementation exists to avoid unnecessary copies.  Maybe some of what you describe already avoids them well enough; I just don't have the R pass-by-value model 100% solid in my head.  If you have a solid reference page for really understanding this model, I would appreciate it.</div>

<div class="im">

<div>


<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>Take this with a grain of salt - I haven't used the package enough yet, maybe I would change my mind about these points.  But I'll share them with you now.  I'll give this some time to settle,</div>


<div>and try the package more, before sharing these with the author.</div><div><br></div><div><div>(1) Using the second argument to [.data.table for calculating expressions</div><div>instead of subscripting.</div><div>The inconsistency between [.data.table and [.data.frame increases the</div>


<div>learning curve dramatically, and makes for bugs.</div><div><br></div><div>The first argument is also unusual, but in a way that I think makes</div><div>more sense.</div><div><br></div><div>I suggest using a different function for evaluating expressions,</div>


<div>in particular,</div><div>   with.data.table(x, expr, additional arguments)</div><div>Then syntax would be:</div><div><font face="'courier new', monospace">  Current<span style="white-space:pre-wrap">          </span>    Using with.data.table</font></div>


<div><font face="'courier new', monospace">  DT[, expr]                with(DT, expr)</font></div><div><font face="'courier new', monospace">  DT[K, expr]               with(DT[K,], expr) or with(DT, expr, subset=K)</font></div>


<div><font face="'courier new', monospace">  DT[, expr, by=foo]        with(DT, expr, by=foo)</font></div><div><font face="'courier new', monospace">  DT[, list(expr1,expr2)]   with(DT, J(expr1, expr2))</font></div>


<div><font face="'courier new', monospace">  ?not possible now?        with(DT, list(mean(x), quantile(x)))</font></div></div></blockquote></div></div><div>Firstly, since data.tables (and data.frames) are lists, as.list(DT[, list(mean(x), quantile(x))]) would do what you want.   I think it is an explicit goal that DT[...] always returns another data.table.</div>

<div class="im">

<div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div>Note that I would not use list(expr1, expr2), but rather explicitly</div><div>use data.table(expr1, expr2) or J(expr1, expr2), when someone wants</div>


<div>a data.table returned.  This makes it easier to look at the code</div><div>and see what is to be returned.</div></div></blockquote><div><br></div></div></div><div>A data.table operation always returns a data.table.  Besides the simple case of a single column being collapsed to a vector, I don't think there are exceptions to that rule. </div>


<div><br></div><div>I can't speak to whether there are clear performance reasons why with() would be disfavored (there may be; I just can't think of them), but as a daily user of data.table, I would be really, really sad to see the current syntax go.  After the initial learning curve it feels instinctive and natural.  Also, chaining isn't something that looks good with with(). Compare the following:</div>


<div><br></div><div>DT[, list(y=sum(y), y.mean=mean(y)), by=key1][ y.mean > 10, list(key1, y) ]</div><div><br></div><div>vs.</div><div><br></div><div>with(with(DT, data.table(y=sum(y), y.mean=mean(y)), by=key1), data.table(key1, y), subset=y.mean > 10)</div>


<div><br></div><div>Admittedly this is a contrived example, but I do similar things in many places, and the chaining in [.data.table is just more direct to me.  I'm sure the author has many other pointers here.</div>
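<div><br></div><div>A minimal sketch of the chained form above on toy data, assuming a reasonably recent data.table. The column values are made up for illustration; only key1, y and y.mean come from the example above.</div>

```r
library(data.table)

# Toy data (hypothetical values): two groups with different means of y.
DT <- data.table(key1 = c("a", "a", "b", "b"), y = c(5, 25, 1, 3))

# Step by step: aggregate per group, then filter the aggregated table.
agg <- DT[, list(y = sum(y), y.mean = mean(y)), by = key1]
res <- agg[y.mean > 10, list(key1, y)]

# The same thing chained: the second [ ] operates on the result of the first.
chained <- DT[, list(y = sum(y), y.mean = mean(y)), by = key1][y.mean > 10, list(key1, y)]

print(chained)
```

<div>Each [ ] returns a data.table, which is what lets the next [ ] be applied directly to it.</div>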

<div class="im">

<div>

<div>

<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div><br></div><div>The inconsistency with normal usage of [ in R also raises questions for</div>


<div>[<-.data.table.  Is this consistent with [.data.table or [<-.data.frame?</div>

<div>(I haven't explored this yet.  [<-.data.table is not documented.)</div></div></blockquote><div><br></div></div></div><div>No clue here, just my own experience that every non-data.table-aware package I have ever used has never had an issue. </div>

<div class="im">

<div>


<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div><br></div></div><div><br></div><div>(2) Having setkey modify the object in place.  This means that one cannot</div>


<div>look for <- (or =) to determine when an object is modified.</div>

<div>It would be safe to instead do</div><div>  key(x) <- character vector of key names</div></blockquote><div><br></div></div></div><div>The big issue here is that key(x) <- c("a","b","c")  makes a copy of x (according to the help page for setkeyv). For small use cases that is fine, and key<- is supported, just with warnings about memory allocation.  But for some of the really large data sets I work with, having to make a copy of a 22 million frame data.table just to change the column names is pretty ridiculous and inefficient.</div>

<div class="im">

<div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><br></div><div>As implemented, if you pass a dt to a function and modify the key</div><div>there, the modification also affects the original object.</div>


<div>And, you end up with two copies of the object with different names,</div><div>and modifying one changes the other.</div><div><br></div><div># Test if setkey called within a function causes problems.</div><div>x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")</div>


<div>foo <- function(y){</div><div>  setkey(y, b)</div><div>  y</div><div>}</div><div>z <- foo(x)</div><div>tables()</div><div># now x also has key b, not a.</div><div>setkey(z, "a")</div><div>tables()</div>


<div># z and x both have key a</div><div><br></div><div># Even copying the object without calling a function makes two pointers</div><div>x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")</div><div>y <- x</div>


<div>setkey(y, b)</div><div>tables()</div><div># both x and y have key "b"</div></blockquote><div><br></div></div></div><div>This is a big point of contention, but once again data.table is focused on large-scale data processing, and the preference is that if you really mean to make a copy, you should be explicit about it.  So:</div>


<div><br></div><div>y <- x</div><div><br></div><div>becomes</div><div><br></div><div>y <- copy(x)</div><div><br></div><div>I know this breaks the normal R model, but this is once again a case where data.table has always behaved how I wanted it to and I've never been bitten.  This sort of POV exposes a lot of latent inefficiencies people introduce in their base R code without even realizing it. </div>
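<div><br></div><div>To make the difference concrete, here is a small sketch reusing the x from the quoted example, assuming a reasonably recent data.table. The variable names alias, indep and the two key_* results are just for illustration.</div>

```r
library(data.table)

# Plain assignment: both names refer to the same data.table.
x <- data.table(a = c(1, 1, 2), b = c(3, 4, 4), key = "a")
alias <- x
setkey(alias, b)                 # re-keys by reference
key_of_x_after_alias <- key(x)   # x sees the new key as well

# Explicit copy: the two objects are independent.
x <- data.table(a = c(1, 1, 2), b = c(3, 4, 4), key = "a")
indep <- copy(x)
setkey(indep, b)                 # only the copy is re-keyed
key_of_x_after_copy <- key(x)    # x keeps its original key

print(key_of_x_after_alias)
print(key_of_x_after_copy)
```

<div>The same reasoning applies inside a function: calling copy() on the argument first keeps setkey() from affecting the caller's table.</div>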

<div class="im">

<div>


<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div>(3) Expecting unquoted names where people would normally expect to give</div><div>quoted names, like setkey.</div>


</blockquote><div><br></div></div></div><div>This is updated in the latest versions, setkeyv behaves as you would expect (and maybe setkey will be changed in the future but this is for compatibility I would think).  There is also setattr, setnames, setcolorder, and set, all of which get around the extra copying of <-.</div>
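<div><br></div><div>A rough sketch of how those set* functions look in use, assuming a reasonably recent data.table. The table and column names here are hypothetical.</div>

```r
library(data.table)

# Hypothetical toy table.
DT <- data.table(id = c("b", "a", "c"), value = c(2, 1, 3))

# setkeyv takes the key as a character vector, so it can be held in a variable.
key_cols <- c("id")
setkeyv(DT, key_cols)

# Rename a column by reference (old name, new name).
setnames(DT, "value", "amount")

# Reorder columns by reference.
setcolorder(DT, c("amount", "id"))

# Assign into row 1 of column "amount" by reference.
set(DT, i = 1L, j = "amount", value = 10)

print(DT)
```

<div>None of these calls uses <-, so DT is updated in place rather than being reassigned.</div>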

<div class="im">

<div>


<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div><br></div><div>(4) Not allowing character data to remain character.</div><div><div><div><br></div></div></div></blockquote><div><br></div></div></div><div>Fixed in the current devel version (1.8.0).  Characters are now fully supported, and for the case of many unique levels (like customer ID would be) it is faster than factors were.  Also fixed is the annoying issue I had of ordered factors losing their ordering. I've been waiting for this to go to CRAN before updating third_party, but the third_party version is far enough behind that maybe I should just do it anyway....</div>


<div><br></div><div><br></div><div>All that being said,  I understand a lot of your points, and I will admit there is a bit of a learning curve.  But data.table has changed my day to day R experience more than any other package out there.  </div>


<div><br></div><div>The package creator is receptive and responsive to criticism, so if I still leave many unanswered questions feel free to ping the mailing list.</div><div><div><div><br></div><div><br></div>

</div></div></div></div>

</div><br>