[datatable-help] Copy on assign broken in some cases

Matthew Dowle mdowle at mdowle.plus.com
Fri Oct 28 10:10:23 CEST 2011


Interesting one. Adding columns is a bit different to deleting and
modifying columns. Here's how it works. We could make changes, document
it, or both; what do people think?

Just like data.frame, there is a list vector holding pointers to the
column vectors. A delete column op is done with a memmove to budge up
the column pointers above the deleted column by one place. That leaves a
gap at the end. The length attribute of that vector (ncol(DT)) is then
decremented and the spare 4 bytes (or 8 on 64-bit) are left unused at the
end.
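
A minimal C sketch of that delete, assuming a toy ColList struct in place
of R's real list vector. The names ColList and delete_column are made up
for illustration; this is not data.table's actual source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the list vector: an array of pointers to
   column vectors. Not R's SEXP and not data.table's source. */
typedef struct {
    int  **cols;     /* pointers to the column vectors */
    size_t length;   /* number of columns in use, i.e. ncol(DT) */
} ColList;

/* Delete column j (0-based): budge the pointers above j down one place,
   then decrement length. The last slot stays allocated but unused. */
void delete_column(ColList *dt, size_t j)
{
    if (j >= dt->length) return;   /* no such column, nothing to do */
    memmove(&dt->cols[j], &dt->cols[j + 1],
            (dt->length - j - 1) * sizeof dt->cols[0]);
    dt->length--;
}
```

Nothing is allocated or freed here, so every variable that refers to the
same list vector sees the column disappear.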

An add column can't be done fully by reference because the list vector is
full. A new list vector has to be allocated, one slot larger, the old
pointers memcpy'd over, and the last slot assigned the pointer to the
new column vector. This copying is negligible because it's a small list
of pointers fitting well within one page (unless there are many thousands
of columns, which is why it's done as efficiently as possible using
memcpy).
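
Roughly, in C, under the same toy assumptions (ColList and add_column are
hypothetical names, not data.table's source; the old list vector is left
alone here, where in R it would be garbage collected once nothing refers
to it):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the list vector of column pointers. */
typedef struct {
    int  **cols;     /* pointers to the column vectors */
    size_t length;   /* number of columns in use */
} ColList;

/* Build a new list vector one slot larger than the old one. Only the
   pointers are copied; the column vectors themselves are shared.
   On allocation failure the result has cols == NULL and length == 0. */
ColList add_column(const ColList *old, int *newcol)
{
    ColList out = {NULL, 0};
    int **slots = malloc((old->length + 1) * sizeof *slots);
    if (slots == NULL) return out;
    if (old->length > 0)
        memcpy(slots, old->cols, old->length * sizeof *slots);
    slots[old->length] = newcol;   /* last slot gets the new column */
    out.cols = slots;
    out.length = old->length + 1;
    return out;
}
```

The caller ends up with two list vectors: the old one, unchanged, and the
new one with the extra column, both pointing at the same existing columns.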

Aside: there is a little-known (I guess) distinction between length and
truelength in R internals. Base R doesn't use it, but we could in
data.table. A delete column sets length but leaves truelength one
larger. When the next add column comes along, it could just do the budge
up and insert the column. That may not be so advantageous for a small
number of columns, but the same logic could work for insert() and
delete()ing rows. Of course, this would mean that whether a visible copy
is taken or not depends on what happened previously, rather than on the
syntax. That's something we've disliked before, in the same way we
dislike drop=TRUE behaviour and so dropped drop. One way to approach
this might be to advise ":= add *may* not copy. Best to assume it
doesn't; use copy()". If you get in the habit of "DT2=copy(DT)" then
that'll take a deep copy at the time and you're safe.
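
A sketch of the truelength idea in C, to show why the outcome would depend
on history. It is an assumption about how this could look, not existing
data.table code; ColList, delete_column and add_column_in_place are
hypothetical names, and the new column is simply appended in the spare
slot at the end.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical list vector that tracks spare capacity. */
typedef struct {
    int  **cols;        /* pointers to the column vectors */
    size_t length;      /* slots in use, i.e. ncol(DT) */
    size_t truelength;  /* slots actually allocated */
} ColList;

/* Delete column j (0-based): length drops by one, truelength does not. */
void delete_column(ColList *dt, size_t j)
{
    if (j >= dt->length) return;
    memmove(&dt->cols[j], &dt->cols[j + 1],
            (dt->length - j - 1) * sizeof dt->cols[0]);
    dt->length--;
}

/* Try to add a column without allocating. Returns 1 if a spare slot left
   by an earlier delete was reused (fully by reference), or 0 if the list
   vector is full and the caller must fall back to allocate-and-copy. */
int add_column_in_place(ColList *dt, int *newcol)
{
    if (dt->length >= dt->truelength) return 0;
    dt->cols[dt->length++] = newcol;
    return 1;
}
```

The same call returns 0 on a full list vector and 1 straight after a
delete, which is exactly the "depends on what happened previously"
behaviour described above.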

To illustrate the partial copy (maybe "shallow copy" is a better term),
consider the following:

> DT = data.table(1:2,3:4)
> DT2=DT
> DT2[,y:=10L]
     V1 V2  y
[1,]  1  3 10
[2,]  2  4 10
> DT
     V1 V2
[1,]  1  3
[2,]  2  4
> DT2
     V1 V2  y
[1,]  1  3 10
[2,]  2  4 10
> DT2[1,V1:=99L]
     V1 V2  y
[1,] 99  3 10
[2,]  2  4 10
> DT
     V1 V2
[1,] 99  3
[2,]  2  4
> 
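
The same effect can be modelled in C. This is a toy model only: ColList,
shallow_copy and deep_copy are hypothetical names, columns are plain int
arrays, and at least one column and one row are assumed. shallow_copy
mirrors what happened to DT2 above (new list vector, shared columns);
deep_copy mirrors what copy(DT) is described as doing.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a data.table: a list of int columns. */
typedef struct {
    int  **cols;     /* pointers to the column vectors */
    size_t length;   /* number of columns */
    size_t nrow;     /* rows in every column */
} ColList;

/* New list vector, same column vectors. Returns cols == NULL on failure. */
ColList shallow_copy(const ColList *src)
{
    ColList out = {NULL, 0, src->nrow};
    int **slots = malloc(src->length * sizeof *slots);
    if (slots == NULL) return out;
    memcpy(slots, src->cols, src->length * sizeof *slots);
    out.cols = slots;
    out.length = src->length;
    return out;
}

/* New list vector and new column vectors. Returns cols == NULL on failure. */
ColList deep_copy(const ColList *src)
{
    ColList out = {NULL, 0, src->nrow};
    int **slots = malloc(src->length * sizeof *slots);
    if (slots == NULL) return out;
    for (size_t j = 0; j < src->length; j++) {
        slots[j] = malloc(src->nrow * sizeof **slots);
        if (slots[j] == NULL) {            /* undo the partial copy */
            while (j > 0) free(slots[--j]);
            free(slots);
            return out;
        }
        memcpy(slots[j], src->cols[j], src->nrow * sizeof **slots);
    }
    out.cols = slots;
    out.length = src->length;
    return out;
}
```

Writing through the shallow copy's first column changes the original too,
as with DT2[1,V1:=99L] above; writing through the deep copy does not.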

Matthew


On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji wrote:
> I think this is a bug.  DT.2 <- DT.1 doesn't seem to make a copy in
> all cases.
> 
> 
> > DT.1 <- data.table(x=1, y=1)
> > DT.2 <- DT.1
> > 
> > # Both DT.1 and DT.2 are changed.
> > DT.2[, y := NULL]
>      x
> [1,] 1
> > DT.1
>      x
> [1,] 1
> > DT.2
>      x
> [1,] 1
> > 
> > # Only DT.2 is changed
> > DT.2[, y := x]
>      x y
> [1,] 1 1
> > DT.1
>      x
> [1,] 1
> > DT.2
>      x y
> [1,] 1 1
> 
> 



