<br><br><div class="gmail_quote">On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle <span dir="ltr"><<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:<br>
> On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle<br>
> <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>> wrote:<br>
><br>
> On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:<br>
> > From the user's perspective, DT2 <- DT should either be a<br>
> new copy or<br>
> > a new reference. Anything in between is confusing.<br>
><br>
><br>
> Agreed. With picky caveat: even in base it's not at this point<br>
> the copy<br>
> is taken. It's later: copy-on-write. It's setkey and := that<br>
> don't copy<br>
> on write, not the (earlier) <-.<br>
><br>
><br>
> Hmm, I would prefer for these to have the same behavior.<br>
<br>
</div>Not sure I follow, please expand.<br></blockquote><div><br></div><div>I would like DT[, x := foo] and DT$x <- foo to have the same behavior; i.e., if one preserves the reference, so should the other.</div><div>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="HOEnZb"><div class="h5"><br>
><br>
><br>
> > How about this - add a new argument to data.table(), say<br>
> max.cols.<br>
> > max.cols defaults to a couple orders of magnitude above the<br>
> initial<br>
> > number of columns. data.table allocates enough memory for<br>
> max.cols<br>
> > column pointers. If you try to add more than max.cols<br>
> columns, it is<br>
> > either an error, or it creates a copy and produces a<br>
> warning.<br>
><br>
><br>
> Very nice idea. To over allocate by default so that := can add<br>
> columns<br>
> fully by reference most of the time seems good to me since<br>
> there's a<br>
> very low cost to over allocating the vector of column<br>
> pointers. Create<br>
> the (shallow copy) and issue a warning, I'm thinking, not<br>
> error. The<br>
> "max.cols" name seems a bit absolute, could it be<br>
> "alloc.cols"? We<br>
> could have alloc(DT,2,ncol) or rowalloc(DT,n) and<br>
> colalloc(DT,n), or<br>
> realloc(...) so users can over alloc themselves before a loop<br>
> that adds<br>
> columns or inserts rows. tables() could also report truenrow,<br>
> and<br>
> truencol as well as nrow and ncol. What should alloc.cols be,<br>
> by<br>
> default? How about: max(100,2*ncol)<br>
><br>
><br>
> Fine with me.<br>
><br>
> What about as.data.table.data.frame()? Should that<br>
> over-allocate, too,<br>
> or for speed just change the class attribute as it does now?<br>
><br>
><br>
> Yeah, I think any method of creating a data table should<br>
> over-allocate. If people want the speed gains, they can<br>
> explicitly set alloc.cols.<br>
><br>
><br>
><br>
> Maybe checking NAMED would work, in addition. If NAMED was 0,<br>
> no need to<br>
> warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) -<br>
> would the<br>
> warning be necessary.<br>
><br>
><br>
> ><br>
> > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle<br>
> > <<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>> wrote:<br>
> > Interesting one. Adding columns is a bit different<br>
> to deleting<br>
> > and<br>
> > modifying columns. Here's how it works. Could make<br>
> changes,<br>
> > could<br>
> > document it, or both, what do people think?<br>
> ><br>
> > Just like data.frame there is a list vector holding<br>
> pointers<br>
> > to the<br>
> > column vectors. A delete column op is done with a<br>
> memmove to<br>
> > budge up<br>
> > the column pointers above the column by one place.<br>
> That leaves<br>
> > a gap at<br>
> > the end. The length attribute of that vector<br>
> (ncol(DT)) is<br>
> > then<br>
> > decremented and the spare 4 bytes (or 8 on 64bit)<br>
> are left<br>
> > unused at the<br>
> > end.<br>
> ><br>
> > An add column can't be fully by reference because<br>
> the list<br>
> > vector is<br>
> > full. A new list vector has to be allocated, one<br>
> slot larger,<br>
> > the old<br>
> > pointers memcpy'd over, and the last spot assigned<br>
> the pointer<br>
> > to the<br>
> > new column vector. This copying is negligible<br>
> because it's a<br>
> > small list<br>
> > of pointers fitting well within one page. [Unless,<br>
> there are<br>
> > many 1000's<br>
> > of columns, which is why it's done as efficiently as<br>
> possible<br>
> > using<br>
> > memcpy].<br>
> ><br>
> > Aside: There is a little-known (I guess) distinction<br>
> between<br>
> > length and<br>
> > truelength in R internals. Base R doesn't use it,<br>
> but we could<br>
> > in<br>
> > data.table. A delete column sets length but leaves<br>
> truelength<br>
> > one<br>
> > larger. When the next add column comes along, it<br>
> could just do<br>
> > the budge<br>
> > up and insert the column. That may not be so<br>
> advantageous for<br>
> > (a small<br>
> > number of) columns, but the same logic could work<br>
> for<br>
> > insert() and<br>
> > delete()ing rows. Of course, this would mean<br>
> whether a<br>
> > visible copy or<br>
> > not is taken depends on what happened previously,<br>
> rather than<br>
> > the<br>
> > syntax. That's something we've disliked before, in<br>
> the same<br>
> > way we<br>
> > dislike drop=TRUE behaviour and so dropped drop. One<br>
> way to<br>
> > approach<br>
> > this might be to advise ":= add *may* not copy. Best<br>
> to assume<br>
> > it<br>
> > doesn't; use copy()". If you get in the habit of<br>
> > "DT2=copy(DT)" then<br>
> > that'll take a deep copy at the time and you're<br>
> safe.<br>
> ><br>
> > To illustrate the partial (maybe shallow copy is<br>
> a better word),<br>
> > consider<br>
> > the following :<br>
> ><br>
> > > DT = data.table(1:2,3:4)<br>
> > > DT2=DT<br>
> > > DT2[,y:=10L]<br>
> > V1 V2 y<br>
> > [1,] 1 3 10<br>
> > [2,] 2 4 10<br>
> > > DT<br>
> > V1 V2<br>
> > [1,] 1 3<br>
> > [2,] 2 4<br>
> > > DT2<br>
> > V1 V2 y<br>
> > [1,] 1 3 10<br>
> > [2,] 2 4 10<br>
> > > DT2[1,V1:=99L]<br>
> > V1 V2 y<br>
> > [1,] 99 3 10<br>
> > [2,] 2 4 10<br>
> > > DT<br>
> > V1 V2<br>
> > [1,] 99 3<br>
> > [2,] 2 4<br>
> > ><br>
> ><br>
> > Matthew<br>
> ><br>
> ><br>
> > On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji<br>
> wrote:<br>
> > > I think this is a bug. DT.2 <- DT.1 doesn't seem<br>
> to make a<br>
> > copy in<br>
> > > all cases.<br>
> > ><br>
> > ><br>
> > > > DT.1 <- data.table(x=1, y=1)<br>
> > > > DT.2 <- DT.1<br>
> > > ><br>
> > > > # Both DT.1 and DT.2 are changed.<br>
> > > > DT.2[, y := NULL]<br>
> > > x<br>
> > > [1,] 1<br>
> > > > DT.1<br>
> > > x<br>
> > > [1,] 1<br>
> > > > DT.2<br>
> > > x<br>
> > > [1,] 1<br>
> > > ><br>
> > > > # Only DT.2 is changed<br>
> > > > DT.2[, y := x]<br>
> > > x y<br>
> > > [1,] 1 1<br>
> > > > DT.1<br>
> > > x<br>
> > > [1,] 1<br>
> > > > DT.2<br>
> > > x y<br>
> > > [1,] 1 1<br>
> > ><br>
> > ><br>
> ><br>
<br>
<br>
</div></div></blockquote></div><br>
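<div><br></div><div>For anyone who wants to experiment with the mechanics discussed in the quoted messages (a list vector of column pointers, deletes that budge the pointers up, adds that need a free slot), here is a toy model in Python. It is only a sketch and not data.table's actual C code: the names ColVec, make_table, add_column and delete_column are made up for illustration, Python lists stand in for R vectors, and only default_alloc follows the max(100, 2*ncol) default suggested in the thread.</div>

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ColVec:
    """Hypothetical stand-in for the list vector holding column pointers."""
    slots: List[Optional[list]]   # allocated slots, used or spare
    names: List[Optional[str]]    # column names, parallel to slots
    length: int                   # slots in use, i.e. ncol

    @property
    def truelength(self) -> int:
        # Number of allocated slots, which may exceed length.
        return len(self.slots)

    def columns(self) -> Dict[str, list]:
        # Only the first `length` slots are visible to the user.
        return dict(zip(self.names[:self.length], self.slots[:self.length]))


def default_alloc(ncol: int) -> int:
    # Default over-allocation suggested in the thread.
    return max(100, 2 * ncol)


def make_table(columns: Dict[str, list], alloc_cols: int) -> ColVec:
    ncol = len(columns)
    pad = [None] * (max(alloc_cols, ncol) - ncol)
    return ColVec(list(columns.values()) + pad, list(columns.keys()) + pad, ncol)


def delete_column(vec: ColVec, name: str) -> ColVec:
    # Budge up the pointers above the deleted column, then shrink length.
    # The freed slot stays allocated at the end (truelength is unchanged).
    i = vec.names.index(name, 0, vec.length)
    vec.slots[i:vec.length - 1] = vec.slots[i + 1:vec.length]
    vec.names[i:vec.length - 1] = vec.names[i + 1:vec.length]
    vec.length -= 1
    vec.slots[vec.length] = None
    vec.names[vec.length] = None
    return vec


def add_column(vec: ColVec, name: str, col: list) -> ColVec:
    if vec.length < vec.truelength:
        # Spare slot available: add fully by reference.
        vec.slots[vec.length] = col
        vec.names[vec.length] = name
        vec.length += 1
        return vec
    # Pointer vector is full: allocate one that is a slot larger and copy the
    # pointers over. This is a shallow copy, so the column vectors themselves
    # are still shared. The proposal in the thread would warn at this point.
    return ColVec(vec.slots[:vec.length] + [col],
                  vec.names[:vec.length] + [name],
                  vec.length + 1)


# Mirrors the DT/DT2 session in the thread, with no spare slots allocated.
DT = make_table({"V1": [1, 2], "V2": [3, 4]}, alloc_cols=2)
DT2 = DT
DT2 = add_column(DT2, "y", [10, 10])   # full, so DT2 gets a new pointer vector
DT2.slots[0][0] = 99                   # modify V1 in place through DT2
print(DT.columns())    # {'V1': [99, 2], 'V2': [3, 4]}
print(DT2.columns())   # {'V1': [99, 2], 'V2': [3, 4], 'y': [10, 10]}
```

<div>With alloc_cols=default_alloc(2) instead of 2, the same add would happen in place and both names would see the new column, which is the consistent "new reference" behavior asked for above.</div>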