<br><br><div class="gmail_quote">On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle <span dir="ltr">&lt;<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div class="im">On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:<br>

&gt; On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle<br>

&gt; &lt;<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>&gt; wrote:<br>

&gt;<br>

&gt;         On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:<br>

&gt;         &gt; From the user&#39;s perspective, DT2 &lt;- DT should either be a<br>

&gt;         new copy or<br>

&gt;         &gt; a new reference.  Anything in between is confusing.<br>

&gt;<br>

&gt;<br>

&gt;         Agreed. With picky caveat: even in base it&#39;s not at this point<br>

&gt;         the copy<br>

&gt;         is taken. It&#39;s later: copy-on-write. It&#39;s setkey and := that<br>

&gt;         don&#39;t copy<br>

&gt;         on write, not the (earlier) &lt;-.<br>

&gt;<br>

&gt;<br>

&gt; Hmm, I would prefer for these to have the same behavior.<br>

<br>

</div>Not sure I follow, please expand.<br></blockquote><div><br></div><div>I would like DT[, x := foo] and DT$x &lt;- foo to behave the same way: if one preserves the reference, so should the other.</div><div>


 </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="HOEnZb"><div class="h5"><br>

&gt;<br>

&gt;<br>

&gt;         &gt; How about this - add a new argument to data.table(), say<br>

&gt;         max.cols.<br>

&gt;         &gt; max.cols defaults to a couple orders of magnitude above the<br>

&gt;         initial<br>

&gt;         &gt; number of columns.  data.table allocates enough memory for<br>

&gt;         max.cols<br>

&gt;         &gt; column pointers.  If you try to add more than max.cols<br>

&gt;         columns, it is<br>

&gt;         &gt; either an error, or it creates a copy and produces a<br>

&gt;         warning.<br>

&gt;<br>

&gt;<br>

&gt;         Very nice idea. To over allocate by default so that := can add<br>

&gt;         columns<br>

&gt;         fully by reference most of the time seems good to me since<br>

&gt;         there&#39;s a<br>

&gt;         very low cost to over allocating the vector of column<br>

&gt;         pointers. Create<br>

&gt;         the (shallow copy) and issue a warning, I&#39;m thinking, not<br>

&gt;         error. The<br>

&gt;         &quot;max.cols&quot; name seems a bit absolute, could it be<br>

&gt;         &quot;alloc.cols&quot;?  We<br>

&gt;         could have alloc(DT,2,ncol) or rowalloc(DT,n) and<br>

&gt;         colalloc(DT,n), or<br>

&gt;         realloc(...) so users can over alloc themselves before a loop<br>

&gt;         that adds<br>

&gt;         columns or inserts rows.  tables() could also report truenrow,<br>

&gt;         and<br>

&gt;         truencol as well as nrow and ncol.  What should alloc.cols be,<br>

&gt;         by<br>

&gt;         default? How about:  max(100,2*ncol)<br>

&gt;<br>

&gt;<br>

&gt; Fine with me.<br>

&gt;<br>

&gt;         What about as.data.table.data.frame()?  Should that<br>

&gt;         over-allocate, too,<br>

&gt;         or for speed just change the class attribute as it does now?<br>

&gt;<br>

&gt;<br>

&gt; Yeah, I think any method of creating a data table should<br>

&gt; over-allocate.  If people want the speed gains, they can<br>

&gt; explicitly set alloc.cols.<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;         Maybe checking NAMED would work, in addition. If NAMED was 0,<br>

&gt;         no need to<br>

&gt;         warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) -<br>

&gt;         would the<br>

&gt;         warning be necessary.<br>

&gt;<br>

&gt;<br>

&gt;         &gt;<br>

&gt;         &gt; On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle<br>

&gt;         &gt; &lt;<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>&gt; wrote:<br>

&gt;         &gt;         Interesting one. Adding columns is a bit different<br>

&gt;         to deleting<br>

&gt;         &gt;         and<br>

&gt;         &gt;         modifying columns. Here&#39;s how it works. Could make<br>

&gt;         changes,<br>

&gt;         &gt;         could<br>

&gt;         &gt;         document it, or both, what do people think?<br>

&gt;         &gt;<br>

&gt;         &gt;         Just like data.frame there is a list vector holding<br>

&gt;         pointers<br>

&gt;         &gt;         to the<br>

&gt;         &gt;         column vectors. A delete column op is done with a<br>

&gt;         memmove to<br>

&gt;         &gt;         budge up<br>

&gt;         &gt;         the column pointers above the column by one place.<br>

&gt;         That leaves<br>

&gt;         &gt;         a gap at<br>

&gt;         &gt;         the end. The length attribute of that vector<br>

&gt;         (ncol(DT)) is<br>

&gt;         &gt;         then<br>

&gt;         &gt;         decremented and the spare 4 bytes (or 8 on 64bit)<br>

&gt;         are left<br>

&gt;         &gt;         unused at the<br>

&gt;         &gt;         end.<br>

&gt;         &gt;<br>

&gt;         &gt;         An add column can&#39;t be fully by reference because<br>

&gt;         the list<br>

&gt;         &gt;         vector is<br>

&gt;         &gt;         full. A new list vector has to be allocated, one<br>

&gt;         slot larger,<br>

&gt;         &gt;         the old<br>

&gt;         &gt;         pointers memcpy&#39;d over, and the last spot assigned<br>

&gt;         the pointer<br>

&gt;         &gt;         to the<br>

&gt;         &gt;         new column vector.  This copying is negligible<br>

&gt;         because it&#39;s a<br>

&gt;         &gt;         small list<br>

&gt;         &gt;         of pointers fitting well within one page. [Unless,<br>

&gt;         there are<br>

&gt;         &gt;         many 1000&#39;s<br>

&gt;         &gt;         of columns, which is why it&#39;s done as efficiently as<br>

&gt;         possible<br>

&gt;         &gt;         using<br>

&gt;         &gt;         memcpy].<br>

&gt;         &gt;<br>

&gt;         &gt;         Aside: There is a little-known (I guess) distinction<br>

&gt;         between<br>

&gt;         &gt;         length and<br>

&gt;         &gt;         truelength in R internals. Base R doesn&#39;t use it,<br>

&gt;         but we could<br>

&gt;         &gt;         in<br>

&gt;         &gt;         data.table. A delete column sets length but leaves<br>

&gt;         truelength<br>

&gt;         &gt;         one<br>

&gt;         &gt;         larger. When the next add column comes along, it<br>

&gt;         could just do<br>

&gt;         &gt;         the budge<br>

&gt;         &gt;         up and insert the column. That may not be so<br>

&gt;         advantageous for<br>

&gt;         &gt;         (a small<br>

&gt;         &gt;         number) of columns,  but the same logic could work<br>

&gt;         for<br>

&gt;         &gt;         insert() and<br>

&gt;         &gt;         delete()ing rows.  Of course, this would mean<br>

&gt;         whether a<br>

&gt;         &gt;         visible copy or<br>

&gt;         &gt;         not is taken depends on what happened previously,<br>

&gt;         rather than<br>

&gt;         &gt;         the<br>

&gt;         &gt;         syntax. That&#39;s something we&#39;ve disliked before, in<br>

&gt;         the same<br>

&gt;         &gt;         way we<br>

&gt;         &gt;         dislike drop=TRUE behaviour and so dropped drop. One<br>

&gt;         way to<br>

&gt;         &gt;         approach<br>

&gt;         &gt;         this might be to advise &quot;:= add *may* not copy. Best<br>

&gt;         to assume<br>

&gt;         &gt;         it<br>

&gt;         &gt;         doesn&#39;t; use copy()&quot;. If you get in the habit of<br>

&gt;         &gt;         &quot;DT2=copy(DT)&quot; then<br>

&gt;         &gt;         that&#39;ll take a deep copy at the time and you&#39;re<br>

&gt;         safe.<br>

&gt;         &gt;<br>

&gt;         &gt;         To illustrate the partial (maybe shallow copy is<br>

&gt;         better word),<br>

&gt;         &gt;         consider<br>

&gt;         &gt;         the following :<br>

&gt;         &gt;<br>

&gt;         &gt;         &gt; DT = data.table(1:2,3:4)<br>

&gt;         &gt;         &gt; DT2=DT<br>

&gt;         &gt;         &gt; DT2[,y:=10L]<br>

&gt;         &gt;             V1 V2  y<br>

&gt;         &gt;         [1,]  1  3 10<br>

&gt;         &gt;         [2,]  2  4 10<br>

&gt;         &gt;         &gt; DT<br>

&gt;         &gt;             V1 V2<br>

&gt;         &gt;         [1,]  1  3<br>

&gt;         &gt;         [2,]  2  4<br>

&gt;         &gt;         &gt; DT2<br>

&gt;         &gt;             V1 V2  y<br>

&gt;         &gt;         [1,]  1  3 10<br>

&gt;         &gt;         [2,]  2  4 10<br>

&gt;         &gt;         &gt; DT2[1,V1:=99L]<br>

&gt;         &gt;             V1 V2  y<br>

&gt;         &gt;         [1,] 99  3 10<br>

&gt;         &gt;         [2,]  2  4 10<br>

&gt;         &gt;         &gt; DT<br>

&gt;         &gt;             V1 V2<br>

&gt;         &gt;         [1,] 99  3<br>

&gt;         &gt;         [2,]  2  4<br>

&gt;         &gt;         &gt;<br>

&gt;         &gt;<br>

&gt;         &gt;         Matthew<br>

&gt;         &gt;<br>

&gt;         &gt;<br>

&gt;         &gt;         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji<br>

&gt;         wrote:<br>

&gt;         &gt;         &gt; I think this is a bug.  DT.2 &lt;- DT.1 doesn&#39;t seem<br>

&gt;         to make a<br>

&gt;         &gt;         copy in<br>

&gt;         &gt;         &gt; all cases.<br>

&gt;         &gt;         &gt;<br>

&gt;         &gt;         &gt;<br>

&gt;         &gt;         &gt; &gt; DT.1 &lt;- data.table(x=1, y=1)<br>

&gt;         &gt;         &gt; &gt; DT.2 &lt;- DT.1<br>

&gt;         &gt;         &gt; &gt;<br>

&gt;         &gt;         &gt; &gt; # Both DT.1 and DT.2 are changed.<br>

&gt;         &gt;         &gt; &gt; DT.2[, y := NULL]<br>

&gt;         &gt;         &gt;      x<br>

&gt;         &gt;         &gt; [1,] 1<br>

&gt;         &gt;         &gt; &gt; DT.1<br>

&gt;         &gt;         &gt;      x<br>

&gt;         &gt;         &gt; [1,] 1<br>

&gt;         &gt;         &gt; &gt; DT.2<br>

&gt;         &gt;         &gt;      x<br>

&gt;         &gt;         &gt; [1,] 1<br>

&gt;         &gt;         &gt; &gt;<br>

&gt;         &gt;         &gt; &gt; # Only DT.2 is changed<br>

&gt;         &gt;         &gt; &gt; DT.2[, y := x]<br>

&gt;         &gt;         &gt;      x y<br>

&gt;         &gt;         &gt; [1,] 1 1<br>

&gt;         &gt;         &gt; &gt; DT.1<br>

&gt;         &gt;         &gt;      x<br>

&gt;         &gt;         &gt; [1,] 1<br>

&gt;         &gt;         &gt; &gt; DT.2<br>

&gt;         &gt;         &gt;      x y<br>

&gt;         &gt;         &gt; [1,] 1 1<br>

&gt;         &gt;         &gt;<br>

&gt;         &gt;         &gt;<br>

&gt;         &gt;<br>

&gt;         &gt;         &gt; _______________________________________________<br>

&gt;         &gt;         &gt; datatable-help mailing list<br>

&gt;         &gt;         &gt; <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>

&gt;         &gt;         &gt;<br>

&gt;         &gt;<br>

&gt;         <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>

&gt;         &gt;<br>

&gt;         &gt;<br>

&gt;         &gt;<br>

&gt;         &gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt;<br>

<br>

<br>

</div></div></blockquote></div><br>