<div class="gmail_quote">On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im"><br>
On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:<br>
> From the user's perspective, DT2 <- DT should either be a new copy or<br>
> a new reference. Anything in between is confusing.<br>
<br>
</div>Agreed, with one picky caveat: even in base R the copy isn't taken at<br>
this point. It's taken later: copy-on-write. It's setkey and := that don't copy<br>
on write, not the (earlier) <-.<br></blockquote><div><br></div><div>Hmm, I would prefer for these to have the same behavior.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im"><br>
> How about this - add a new argument to data.table(), say max.cols.<br>
> max.cols defaults to a couple orders of magnitude above the initial<br>
> number of columns. data.table allocates enough memory for max.cols<br>
> column pointers. If you try to add more than max.cols columns, it is<br>
> either an error, or it creates a copy and produces a warning.<br>
<br>
</div>Very nice idea. To over allocate by default so that := can add columns<br>
fully by reference most of the time seems good to me since there's a<br>
very low cost to over allocating the vector of column pointers. Create<br>
the shallow copy and issue a warning, I'm thinking, not an error. The<br>
"max.cols" name seems a bit absolute; could it be "alloc.cols"? We<br>
could have alloc(DT,2,ncol) or rowalloc(DT,n) and colalloc(DT,n), or<br>
realloc(...) so users can over-allocate themselves before a loop that adds<br>
columns or inserts rows. tables() could also report truenrow and<br>
truencol as well as nrow and ncol. What should alloc.cols be, by<br>
default? How about: max(100,2*ncol)<br></blockquote><div><br></div><div>Fine with me. </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
What about as.data.table.data.frame()? Should that over-allocate, too,<br>
or for speed just change the class attribute as it does now?<br></blockquote><div><br></div><div>Yeah, I think any method of creating a data table should over-allocate. If people want the speed gains, they can explicitly set alloc.cols.</div>
<div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<br>
Maybe checking NAMED would work, in addition. If NAMED was 0, no need to<br>
warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the<br>
warning be necessary.<br>
<div class="HOEnZb"><div class="h5"><br>
<br>
><br>
> On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle<br>
> <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>> wrote:<br>
> Interesting one. Adding columns is a bit different to deleting and<br>
> modifying columns. Here's how it works. Could make changes, could<br>
> document it, or both, what do people think?<br>
><br>
> Just like data.frame there is a list vector holding pointers to the<br>
> column vectors. A delete column op is done with a memmove to budge up<br>
> the column pointers above the column by one place. That leaves a gap at<br>
> the end. The length attribute of that vector (ncol(DT)) is then<br>
> decremented and the spare 4 bytes (or 8 on 64bit) are left unused at the<br>
> end.<br>
><br>
> An add column can't be fully by reference because the list vector is<br>
> full. A new list vector has to be allocated, one slot larger, the old<br>
> pointers memcpy'd over, and the last spot assigned the pointer to the<br>
> new column vector. This copying is negligible because it's a small list<br>
> of pointers fitting well within one page. [Unless, there are many 1000's<br>
> of columns, which is why it's done as efficiently as possible using<br>
> memcpy].<br>
><br>
> Aside: There is a little-known (I guess) distinction between length and<br>
> truelength in R internals. Base R doesn't use it, but we could in<br>
> data.table. A delete column sets length but leaves truelength one<br>
> larger. When the next add column comes along, it could just do the budge<br>
> up and insert the column. That may not be so advantageous for a small<br>
> number of columns, but the same logic could work for insert() and<br>
> delete()ing rows. Of course, this would mean whether a visible copy<br>
> is taken or not depends on what happened previously, rather than the<br>
> syntax. That's something we've disliked before, in the same way we<br>
> dislike drop=TRUE behaviour and so dropped drop. One way to approach<br>
> this might be to advise ":= add *may* not copy. Best to assume it<br>
> doesn't; use copy()". If you get in the habit of "DT2=copy(DT)" then<br>
> that'll take a deep copy at the time and you're safe.<br>
><br>
> To illustrate the partial copy (maybe shallow copy is a better term),<br>
> consider the following:<br>
><br>
> > DT = data.table(1:2,3:4)<br>
> > DT2=DT<br>
> > DT2[,y:=10L]<br>
> V1 V2 y<br>
> [1,] 1 3 10<br>
> [2,] 2 4 10<br>
> > DT<br>
> V1 V2<br>
> [1,] 1 3<br>
> [2,] 2 4<br>
> > DT2<br>
> V1 V2 y<br>
> [1,] 1 3 10<br>
> [2,] 2 4 10<br>
> > DT2[1,V1:=99L]<br>
> V1 V2 y<br>
> [1,] 99 3 10<br>
> [2,] 2 4 10<br>
> > DT<br>
> V1 V2<br>
> [1,] 99 3<br>
> [2,] 2 4<br>
> ><br>
><br>
> Matthew<br>
><br>
><br>
> On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:<br>
> > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy in<br>
> > all cases.<br>
> ><br>
> ><br>
> > > DT.1 <- data.table(x=1, y=1)<br>
> > > DT.2 <- DT.1<br>
> > ><br>
> > > # Both DT.1 and DT.2 are changed.<br>
> > > DT.2[, y := NULL]<br>
> > x<br>
> > [1,] 1<br>
> > > DT.1<br>
> > x<br>
> > [1,] 1<br>
> > > DT.2<br>
> > x<br>
> > [1,] 1<br>
> > ><br>
> > > # Only DT.2 is changed<br>
> > > DT.2[, y := x]<br>
> > x y<br>
> > [1,] 1 1<br>
> > > DT.1<br>
> > x<br>
> > [1,] 1<br>
> > > DT.2<br>
> > x y<br>
> > [1,] 1 1<br>
> ><br>
> ><br>
><br>
> > _______________________________________________<br>
> > datatable-help mailing list<br>
> > <a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
> ><br>
> <a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
<br>
<br>
</div></div></blockquote></div><br>