[datatable-help] Copy on assign broken in some cases
Matthew Dowle
mdowle at mdowle.plus.com
Sat Oct 29 02:32:01 CEST 2011
On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> From the user's perspective, DT2 <- DT should either be a new copy or
> a new reference. Anything in between is confusing.
Agreed. With a picky caveat: even in base R the copy isn't taken at this
point. It's taken later: copy-on-write. It's setkey and := that don't
copy on write, not the (earlier) <-.
> How about this - add a new argument to data.table(), say max.cols.
> max.cols defaults to a couple orders of magnitude above the initial
> number of columns. data.table allocates enough memory for max.cols
> column pointers. If you try to add more than max.cols columns, it is
> either an error, or it creates a copy and produces a warning.
Very nice idea. Over-allocating by default, so that := can add columns
fully by reference most of the time, seems good to me since there's a
very low cost to over-allocating the vector of column pointers. If more
columns are added than were allocated, I'm thinking create the (shallow)
copy and issue a warning, not an error. The name "max.cols" seems a bit
absolute; could it be "alloc.cols"? We could have alloc(DT,2,ncol) or
rowalloc(DT,n) and colalloc(DT,n), or realloc(...), so users can
over-allocate themselves before a loop that adds columns or inserts
rows. tables() could also report truenrow and truencol as well as nrow
and ncol. What should alloc.cols be by default? How about:
max(100,2*ncol)
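
To make the over-allocation idea concrete, here is a rough C sketch. It
is not data.table's actual code; ColVec, colvec_init, colvec_add,
colvec_delete and default_alloc_cols are hypothetical names made up for
illustration, and columns are plain int arrays. It assumes a single
owner of the pointer vector, so the old vector is freed when it grows.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the list vector of column pointers. */
typedef struct {
    int **cols;        /* pointers to the column vectors */
    size_t length;     /* slots in use, i.e. ncol */
    size_t truelength; /* slots allocated, always >= length */
} ColVec;

/* Default floated in the text: max(100, 2*ncol). */
size_t default_alloc_cols(size_t ncol) {
    return (2 * ncol > 100) ? 2 * ncol : 100;
}

/* Start an empty vector with room for alloc_cols pointers. 0 on success. */
int colvec_init(ColVec *v, size_t alloc_cols) {
    if (alloc_cols == 0) alloc_cols = 1;
    v->cols = calloc(alloc_cols, sizeof *v->cols);
    if (v->cols == NULL) return -1;
    v->length = 0;
    v->truelength = alloc_cols;
    return 0;
}

/* Add a column. Returns 0 if it went into a spare slot (fully by
   reference), 1 if the pointer vector had to be reallocated (the
   shallow-copy case, with a warning), -1 on allocation failure. */
int colvec_add(ColVec *v, int *col) {
    int reallocated = 0;
    if (v->length == v->truelength) {
        size_t newlen = default_alloc_cols(v->length + 1);
        int **p = malloc(newlen * sizeof *p);
        if (p == NULL) return -1;
        memcpy(p, v->cols, v->length * sizeof *p); /* copy pointers only */
        free(v->cols); /* assumption: nobody else holds the old vector */
        v->cols = p;
        v->truelength = newlen;
        reallocated = 1;
        fprintf(stderr, "warning: column pointer vector was full; "
                        "reallocated to %zu slots\n", newlen);
    }
    v->cols[v->length++] = col;
    return reallocated;
}

/* Delete column i: budge the pointers above it down one place with
   memmove. truelength is unchanged, so a spare slot is left at the end.
   Returns the removed column pointer, or NULL if i is out of range. */
int *colvec_delete(ColVec *v, size_t i) {
    if (i >= v->length) return NULL;
    int *removed = v->cols[i];
    memmove(&v->cols[i], &v->cols[i + 1],
            (v->length - i - 1) * sizeof *v->cols);
    v->length--;
    return removed;
}
```

With spare slots available, colvec_add never moves the pointer vector,
so every reference to the table sees the new column. Only the overflow
case reallocates, and that is the one place a warning would be issued.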
What about as.data.table.data.frame()? Should that over-allocate, too,
or for speed just change the class attribute as it does now?
Maybe checking NAMED would work, in addition. If NAMED was 0, no need to
warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the
warning be necessary.
>
> On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
> Interesting one. Adding columns is a bit different to deleting and
> modifying columns. Here's how it works. Could make changes, could
> document it, or both, what do people think?
>
> Just like data.frame there is a list vector holding pointers to the
> column vectors. A delete column op is done with a memmove to budge up
> the column pointers above the column by one place. That leaves a gap
> at the end. The length attribute of that vector (ncol(DT)) is then
> decremented and the spare 4 bytes (or 8 on 64bit) are left unused at
> the end.
>
> An add column can't be fully by reference because the list vector is
> full. A new list vector has to be allocated, one slot larger, the old
> pointers memcpy'd over, and the last spot assigned the pointer to the
> new column vector. This copying is negligible because it's a small
> list of pointers fitting well within one page. [Unless, there are many
> 1000's of columns, which is why it's done as efficiently as possible
> using memcpy].
>
> Aside: There is a little known (I guess) distinction between length
> and truelength in R internals. Base R doesn't use it, but we could in
> data.table. A delete column sets length but leaves truelength one
> larger. When the next add column comes along, it could just do the
> budge up and insert the column. That may not be so advantageous for
> (a small number of) columns, but the same logic could work for
> insert() and delete()ing rows. Of course, this would mean whether a
> visible copy is taken or not depends on what happened previously,
> rather than the syntax. That's something we've disliked before, in the
> same way we dislike drop=TRUE behaviour and so dropped drop. One way
> to approach this might be to advise ":= add *may* not copy. Best to
> assume it doesn't; use copy()". If you get in the habit of
> "DT2=copy(DT)" then that'll take a deep copy at the time and you're
> safe.
>
> To illustrate the partial copy (maybe shallow copy is a better term),
> consider the following:
>
> > DT = data.table(1:2,3:4)
> > DT2=DT
> > DT2[,y:=10L]
> V1 V2 y
> [1,] 1 3 10
> [2,] 2 4 10
> > DT
> V1 V2
> [1,] 1 3
> [2,] 2 4
> > DT2
> V1 V2 y
> [1,] 1 3 10
> [2,] 2 4 10
> > DT2[1,V1:=99L]
> V1 V2 y
> [1,] 99 3 10
> [2,] 2 4 10
> > DT
> V1 V2
> [1,] 99 3
> [2,] 2 4
> >
>
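
The quoted session can be mimicked with a small C sketch of the add
step as described: a new pointer vector one slot larger, the old
pointers memcpy'd over, and the last slot pointing at the new column.
Table and table_add_column are hypothetical names for illustration
only, not data.table's internals, and columns are plain int arrays.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a table: a vector of pointers to columns. */
typedef struct {
    int **cols;  /* pointers to the column vectors */
    size_t ncol; /* number of columns */
} Table;

/* Shallow copy on add: allocate a pointer vector one slot larger, copy
   the old pointers (not the column data), then assign the last slot.
   The result shares every existing column vector with the old table.
   On allocation failure the result has cols == NULL and ncol == 0. */
Table table_add_column(const Table *old, int *newcol) {
    Table t;
    t.ncol = old->ncol + 1;
    t.cols = malloc(t.ncol * sizeof *t.cols);
    if (t.cols == NULL) {
        t.ncol = 0;
        return t;
    }
    if (old->ncol > 0)
        memcpy(t.cols, old->cols, old->ncol * sizeof *t.cols);
    t.cols[old->ncol] = newcol;
    return t;
}
```

Under these assumptions the old table does not gain the new column,
since it still has its own, shorter pointer vector. A write through the
new table into an existing column does show up in the old one, because
both pointer vectors refer to the same column data. That matches the
session above: y appears only in DT2, while V1 := 99L appears in both.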
> Matthew
>
>
> On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
> > I think this is a bug. DT.2 <- DT.1 doesn't seem to make a copy in
> > all cases.
> >
> >
> > > DT.1 <- data.table(x=1, y=1)
> > > DT.2 <- DT.1
> > >
> > > # Both DT.1 and DT.2 are changed.
> > > DT.2[, y := NULL]
> > x
> > [1,] 1
> > > DT.1
> > x
> > [1,] 1
> > > DT.2
> > x
> > [1,] 1
> > >
> > > # Only DT.2 is changed
> > > DT.2[, y := x]
> > x y
> > [1,] 1 1
> > > DT.1
> > x
> > [1,] 1
> > > DT.2
> > x y
> > [1,] 1 1
> >
> >
>