[datatable-help] Copy on assign broken in some cases
Muhammad Waliji
mhwaliji at google.com
Sat Oct 29 02:59:01 CEST 2011
On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:
> > On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle
> > <mdowle at mdowle.plus.com> wrote:
> >
> > On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> > > From the user's perspective, DT2 <- DT should either be a new copy or
> > > a new reference. Anything in between is confusing.
> >
> >
> > > Agreed. With picky caveat: even in base it's not at this point the copy
> > > is taken. It's later: copy-on-write. It's setkey and := that don't copy
> > > on write, not the (earlier) <-.
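For reference, the base R behaviour described above can be seen from the user's side with a plain data.frame. This is only a sketch in base R (no data.table involved), and the object names are made up for illustration.

```r
# Base R only: assignment binds a second name; the copy happens on modification.
df1 <- data.frame(x = 1:2, y = 3:4)
df2 <- df1            # at this point nothing distinguishes a copy from a reference

df2$y <- c(10L, 10L)  # modifying df2 is what triggers the copy (copy-on-write)

# df1 is unaffected by the change made through df2
stopifnot(identical(df1$y, 3:4))
stopifnot(identical(df2$y, c(10L, 10L)))
```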
> >
> >
> > Hmm, I would prefer for these to have the same behavior.
>
> Not sure I follow, please expand.
>
I would like for DT[, x := foo] and DT$x <- foo to have the same behavior.
i.e. if one preserves the reference, so should the other.
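To make the difference concrete, here is a small sketch of what I mean. It assumes the data.table package is installed, and it relies on := modifying an existing column in place, as in the V1 := 99L example quoted further down. The object names are only for illustration.

```r
library(data.table)

DT  <- data.table(x = 1:2, y = 3:4)
DT2 <- DT                  # second name for the same object

DT2[1, x := 99L]           # := updates the existing column by reference
DT$x                       # the change is visible through DT as well

DT2$y <- c(0L, 0L)         # $<- follows base R copy-on-write semantics
DT$y                       # DT keeps its original y
```

So the two spellings of "assign a column" give different answers to "did DT change?", which is the inconsistency I would like to avoid.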
>
> >
> >
> > > > How about this - add a new argument to data.table(), say max.cols.
> > > > max.cols defaults to a couple orders of magnitude above the initial
> > > > number of columns. data.table allocates enough memory for max.cols
> > > > column pointers. If you try to add more than max.cols columns, it is
> > > > either an error, or it creates a copy and produces a warning.
> >
> >
> > > Very nice idea. To over allocate by default so that := can add columns
> > > fully by reference most of the time seems good to me since there's a
> > > very low cost to over allocating the vector of column pointers. Create
> > > the (shallow copy) and issue a warning, I'm thinking, not error. The
> > > "max.cols" name seems a bit absolute, could it be "alloc.cols"? We
> > > could have alloc(DT,2,ncol) or rowalloc(DT,n) and colalloc(DT,n), or
> > > realloc(...) so users can over alloc themselves before a loop that adds
> > > columns or inserts rows. tables() could also report truenrow and
> > > truencol as well as nrow and ncol. What should alloc.cols be by
> > > default? How about: max(100,2*ncol)
> >
> >
> > Fine with me.
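The over-allocation idea discussed above can be sketched as a toy model in base R. This is not how data.table is implemented: an environment stands in for "by reference", R may still copy the list internally, and new_table(), add_col() and the alloc.cols argument are hypothetical names. Only the default max(100, 2*ncol) is taken from the suggestion above.

```r
# Toy model in base R (not data.table internals): a table whose vector of
# column slots is over-allocated so that columns can be added without
# handing the caller a new object.
new_table <- function(cols, alloc.cols = max(100L, 2L * length(cols))) {
  stopifnot(alloc.cols >= length(cols))
  tbl <- new.env()                         # environments have reference semantics
  tbl$slots <- vector("list", alloc.cols)  # spare slots beyond the used ones
  tbl$slots[seq_along(cols)] <- cols
  tbl$names <- names(cols)
  tbl$ncol <- length(cols)                 # slots in use, as opposed to allocated
  tbl
}

add_col <- function(tbl, name, value) {
  if (tbl$ncol == length(tbl$slots)) {
    # Out of spare slots: warn, then grow the slot vector (a reallocation)
    warning("slot vector full; reallocating")
    tbl$slots <- c(tbl$slots, vector("list", max(1L, length(tbl$slots))))
  }
  tbl$ncol <- tbl$ncol + 1L
  tbl$slots[[tbl$ncol]] <- value
  tbl$names[tbl$ncol] <- name
  invisible(tbl)
}

t1 <- new_table(list(V1 = 1:2, V2 = 3:4), alloc.cols = 3L)
t2 <- t1                        # same environment, i.e. a reference
add_col(t2, "y", c(10L, 10L))   # fits in the spare slot, no warning
t1$ncol                         # both names see the new column
```

With alloc.cols = 3L the next add_col() call would hit the warning branch, which corresponds to the "creates a copy and produces a warning" case.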
> >
> > > What about as.data.table.data.frame()? Should that over-allocate, too,
> > > or for speed just change the class attribute as it does now?
> >
> >
> > > Yeah, I think any method of creating a data table should
> > > over-allocate. If people want the speed gains, they can
> > > explicitly set alloc.cols.
> >
> >
> >
> > > Maybe checking NAMED would work, in addition. If NAMED was 0, no need to
> > > warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the
> > > warning be necessary.
> >
> >
> > >
> > > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> > > <mdowle at mdowle.plus.com> wrote:
> > > > Interesting one. Adding columns is a bit different to deleting and
> > > > modifying columns. Here's how it works. Could make changes, could
> > > > document it, or both, what do people think?
> > >
> > > > Just like data.frame there is a list vector holding pointers to the
> > > > column vectors. A delete column op is done with a memmove to budge up
> > > > the column pointers above the column by one place. That leaves a gap at
> > > > the end. The length attribute of that vector (ncol(DT)) is then
> > > > decremented and the spare 4 bytes (or 8 on 64bit) are left unused at the
> > > > end.
> > >
> > > > An add column can't be fully by reference because the list vector is
> > > > full. A new list vector has to be allocated, one slot larger, the old
> > > > pointers memcpy'd over, and the last spot assigned the pointer to the
> > > > new column vector. This copying is negligible because it's a small list
> > > > of pointers fitting well within one page. [Unless, there are many 1000's
> > > > of columns, which is why it's done as efficiently as possible using
> > > > memcpy].
> > >
> > > > Aside : There is a little-known (I guess) distinction between length and
> > > > truelength in R internals. Base R doesn't use it, but we could in
> > > > data.table. A delete column sets length but leaves truelength one
> > > > larger. When the next add column comes along, it could just do the budge
> > > > up and insert the column. That may not be so advantageous for (a small
> > > > number of) columns, but the same logic could work for insert() and
> > > > delete()ing rows. Of course, this would mean whether a visible copy or
> > > > not is taken depends on what happened previously, rather than the
> > > > syntax. That's something we've disliked before, in the same way we
> > > > dislike drop=TRUE behaviour and so dropped drop. One way to approach
> > > > this might be to advise ":= add *may* not copy. Best to assume it
> > > > doesn't; use copy()". If you get in the habit of "DT2=copy(DT)" then
> > > > that'll take a deep copy at the time and you're safe.
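A minimal sketch of the copy() habit suggested above, assuming the data.table package is installed. The column names mirror the example quoted below, with the default names V1 and V2 written out explicitly.

```r
library(data.table)

DT  <- data.table(V1 = 1:2, V2 = 3:4)
DT2 <- copy(DT)            # deep copy taken here, up front

DT2[1, V1 := 99L]          # modifies DT2 only
DT2[, y := 10L]            # adds a column to DT2 only

DT$V1                      # still 1 2
names(DT)                  # still "V1" "V2"
```

With the explicit copy(), it no longer matters whether a later := happens to copy or not.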
> > >
> > > > To illustrate the partial (maybe shallow copy is a better word), consider
> > > > the following :
> > >
> > > > > DT = data.table(1:2,3:4)
> > > > > DT2=DT
> > > > > DT2[,y:=10L]
> > > >      V1 V2  y
> > > > [1,]  1  3 10
> > > > [2,]  2  4 10
> > > > > DT
> > > >      V1 V2
> > > > [1,]  1  3
> > > > [2,]  2  4
> > > > > DT2
> > > >      V1 V2  y
> > > > [1,]  1  3 10
> > > > [2,]  2  4 10
> > > > > DT2[1,V1:=99L]
> > > >      V1 V2  y
> > > > [1,] 99  3 10
> > > > [2,]  2  4 10
> > > > > DT
> > > >      V1 V2
> > > > [1,] 99  3
> > > > [2,]  2  4
> > > > >
> > >
> > > Matthew
> > >
> > >
> > > > On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
> > > > > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy in
> > > > > all cases.
> > > >
> > > >
> > > > > > DT.1 <- data.table(x=1, y=1)
> > > > > > DT.2 <- DT.1
> > > > > >
> > > > > > # Both DT.1 and DT.2 are changed.
> > > > > > DT.2[, y := NULL]
> > > > >      x
> > > > > [1,] 1
> > > > > > DT.1
> > > > >      x
> > > > > [1,] 1
> > > > > > DT.2
> > > > >      x
> > > > > [1,] 1
> > > > > >
> > > > > > # Only DT.2 is changed
> > > > > > DT.2[, y := x]
> > > > >      x y
> > > > > [1,] 1 1
> > > > > > DT.1
> > > > >      x
> > > > > [1,] 1
> > > > > > DT.2
> > > > >      x y
> > > > > [1,] 1 1
> > > >
> > > >
> > >
> > > > > _______________________________________________
> > > > > datatable-help mailing list
> > > > > datatable-help at lists.r-forge.r-project.org
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> > >
> > >
> > >
> >
> >
> >
> >
>
>
>