[datatable-help] Copy on assign broken in some cases

Matthew Dowle mdowle at mdowle.plus.com
Sat Oct 29 02:32:01 CEST 2011


On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> From the user's perspective, DT2 <- DT should either be a new copy or
> a new reference.  Anything in between is confusing.

Agreed, with one picky caveat: even in base R the copy isn't taken at
this point. It's taken later, on the first write (copy-on-write). It's
setkey and := that don't copy on write, not the (earlier) <-.
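
A minimal sketch of that caveat using base R only (no data.table needed).
The variable names are just for illustration. The assignment binds a second
name to the same object, and base R takes the copy only when one of the
names is written to:

```r
x <- c(1, 2, 3)
y <- x          # second name for the same vector; nothing is copied yet
y[1] <- 99      # the write is what triggers the copy
x               # 1 2 3   (unchanged)
y               # 99 2 3

DF  <- data.frame(V1 = 1:2, V2 = 3:4)
DF2 <- DF       # again no copy at this point
DF2$y <- 10L    # adding a column writes to DF2 only
names(DF)       # "V1" "V2"
names(DF2)      # "V1" "V2" "y"
```

setkey and := skip that copy-on-write step, which is why a write through
one name can show up under the other.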

> How about this - add a new argument to data.table(), say max.cols.
> max.cols defaults to a couple orders of magnitude above the initial
> number of columns.  data.table allocates enough memory for max.cols
> column pointers.  If you try to add more than max.cols columns, it is
> either an error, or it creates a copy and produces a warning.

Very nice idea. Over-allocating by default, so that := can add columns
fully by reference most of the time, seems good to me since there's a
very low cost to over-allocating the vector of column pointers. Create
the (shallow) copy and issue a warning, I'm thinking, not an error. The
"max.cols" name seems a bit absolute, could it be "alloc.cols"?  We
could have alloc(DT,2,ncol) or rowalloc(DT,n) and colalloc(DT,n), or
realloc(...) so users can over-allocate themselves before a loop that
adds columns or inserts rows.  tables() could also report truenrow and
truencol as well as nrow and ncol.  What should alloc.cols be, by
default? How about:  max(100,2*ncol)
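
A rough model of the over-allocation idea in base R, to make the intended
behaviour concrete. Everything here is hypothetical: new_cols() and
add_col() are made-up names, not data.table functions, and an environment
stands in for the table so that two names can refer to the same object.
Only alloc.cols and its max(100,2*ncol) default come from the discussion
above.

```r
# Hypothetical sketch; none of these functions exist in data.table.
new_cols <- function(cols, alloc.cols = max(100L, 2L * length(cols))) {
  stopifnot(is.list(cols), !is.null(names(cols)),
            alloc.cols >= length(cols))
  e <- new.env()
  e$slots <- vector("list", alloc.cols)  # over-allocated vector of column slots
  e$slots[seq_along(cols)] <- cols       # fill the first ncol slots
  e$names <- names(cols)
  e$n <- length(cols)                    # "ncol"; length(e$slots) is "truencol"
  e
}

add_col <- function(e, name, value) {
  if (e$n == length(e$slots)) {
    # No spare slot left: take a shallow copy with fresh spare slots and warn.
    warning("column slots exhausted; adding to a shallow copy instead")
    e <- new_cols(setNames(e$slots, e$names))
  }
  e$n <- e$n + 1L
  e$slots[[e$n]] <- value                # fill the next spare slot in place
  e$names[e$n] <- name
  e
}

DT  <- new_cols(list(V1 = 1:2, V2 = 3:4), alloc.cols = 3L)
DT2 <- DT                                # second name, same object
DT2 <- add_col(DT2, "y", c(10L, 10L))    # fits in the one spare slot
identical(DT, DT2)                       # TRUE: DT sees y as well
DT2 <- add_col(DT2, "z", c(0L, 0L))      # warns: no slot left
identical(DT, DT2)                       # FALSE: DT2 is now a separate object
DT$names                                 # "V1" "V2" "y"
DT2$names                                # "V1" "V2" "y" "z"
```

With the default alloc.cols the warning branch would be rare, which is the
point of over-allocating: adds are by reference until the slots run out.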

What about as.data.table.data.frame()?  Should that over-allocate, too,
or for speed just change the class attribute as it does now?

Maybe checking NAMED would work, in addition. If NAMED was 0, no need to
warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) - would the
warning be necessary.


> 
> On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>         Interesting one. Adding columns is a bit different to deleting and
>         modifying columns. Here's how it works. Could make changes, could
>         document it, or both, what do people think?
>         
>         Just like data.frame there is a list vector holding pointers to the
>         column vectors. A delete column op is done with a memmove to budge up
>         the column pointers above the column by one place. That leaves a gap at
>         the end. The length attribute of that vector (ncol(DT)) is then
>         decremented and the spare 4 bytes (or 8 on 64bit) are left unused at the
>         end.
>         
>         An add column can't be fully by reference because the list vector is
>         full. A new list vector has to be allocated, one slot larger, the old
>         pointers memcpy'd over, and the last spot assigned the pointer to the
>         new column vector.  This copying is negligible because it's a small list
>         of pointers fitting well within one page. [Unless, there are many 1000's
>         of columns, which is why it's done as efficiently as possible using
>         memcpy].
>         
>         Aside: there is a little-known (I guess) distinction between length and
>         truelength in R internals. Base R doesn't use it, but we could in
>         data.table. A delete column sets length but leaves truelength one
>         larger. When the next add column comes along, it could just do the budge
>         up and insert the column. That may not be so advantageous for a small
>         number of columns, but the same logic could work for insert() and
>         delete()ing rows.  Of course, this would mean whether a visible copy or
>         not is taken depends on what happened previously, rather than the
>         syntax. That's something we've disliked before, in the same way we
>         dislike drop=TRUE behaviour and so dropped drop. One way to approach
>         this might be to advise ":= add *may* not copy. Best to assume it
>         doesn't; use copy()". If you get in the habit of "DT2=copy(DT)" then
>         that'll take a deep copy at the time and you're safe.
>         
>         To illustrate the partial copy (maybe shallow copy is a better word),
>         consider the following:
>         
>         > DT = data.table(1:2,3:4)
>         > DT2=DT
>         > DT2[,y:=10L]
>             V1 V2  y
>         [1,]  1  3 10
>         [2,]  2  4 10
>         > DT
>             V1 V2
>         [1,]  1  3
>         [2,]  2  4
>         > DT2
>             V1 V2  y
>         [1,]  1  3 10
>         [2,]  2  4 10
>         > DT2[1,V1:=99L]
>             V1 V2  y
>         [1,] 99  3 10
>         [2,]  2  4 10
>         > DT
>             V1 V2
>         [1,] 99  3
>         [2,]  2  4
>         >
>         
>         Matthew
>         
>         
>         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
>         > I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a copy in
>         > all cases.
>         >
>         >
>         > > DT.1 <- data.table(x=1, y=1)
>         > > DT.2 <- DT.1
>         > >
>         > > # Both DT.1 and DT.2 are changed.
>         > > DT.2[, y := NULL]
>         >      x
>         > [1,] 1
>         > > DT.1
>         >      x
>         > [1,] 1
>         > > DT.2
>         >      x
>         > [1,] 1
>         > >
>         > > # Only DT.2 is changed
>         > > DT.2[, y := x]
>         >      x y
>         > [1,] 1 1
>         > > DT.1
>         >      x
>         > [1,] 1
>         > > DT.2
>         >      x y
>         > [1,] 1 1
>         >
>         >
>         