[datatable-help] Copy on assign broken in some cases

Muhammad Waliji mhwaliji at google.com
Sat Oct 29 02:59:01 CEST 2011


On Fri, Oct 28, 2011 at 5:57 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> On Fri, 2011-10-28 at 17:42 -0700, Muhammad Waliji wrote:
> > On Fri, Oct 28, 2011 at 5:32 PM, Matthew Dowle
> > <mdowle at mdowle.plus.com> wrote:
> >
> >         On Fri, 2011-10-28 at 09:52 -0700, Muhammad Waliji wrote:
> >         > From the user's perspective, DT2 <- DT should either be a
> >         new copy or
> >         > a new reference.  Anything in between is confusing.
> >
> >
> >         Agreed. With picky caveat: even in base it's not at this point
> >         the copy
> >         is taken. It's later: copy-on-write. It's setkey and := that
> >         don't copy
> >         on write, not the (earlier) <-.
> >
> >
> > Hmm, I would prefer for these to have the same behavior.
>
> Not sure I follow, please expand.
>

I would like DT[, x := foo] and DT$x <- foo to have the same behavior.
That is, if one preserves the reference, so should the other.
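
As a minimal sketch of the consistency I have in mind (assuming a data.table
version that provides copy(), as suggested earlier in the thread; the column
names are only illustrative): after an explicit copy(), both assignment forms
leave the original untouched. Without the copy, what happens to DT may depend
on the form used and on the data.table version, so no output is claimed for
that case here.

```r
library(data.table)

DT <- data.table(V1 = 1:2, V2 = 3:4)

# Take explicit deep copies so neither assignment form can touch DT.
DT2 <- copy(DT)
DT3 <- copy(DT)

DT2[, y := 10L]        # := form, applied to one copy
DT3$y <- c(10L, 10L)   # $<- form, applied to the other copy

# Both copies gained the column; the original did not.
stopifnot(identical(names(DT), c("V1", "V2")))
stopifnot(identical(DT2$y, DT3$y))
```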


>
> >
> >
> >         > How about this - add a new argument to data.table(), say
> >         max.cols.
> >         > max.cols defaults to a couple orders of magnitude above the
> >         initial
> >         > number of columns.  data.table allocates enough memory for
> >         max.cols
> >         > column pointers.  If you try to add more than max.cols
> >         columns, it is
> >         > either an error, or it creates a copy and produces a
> >         warning.
> >
> >
> >         Very nice idea. To over allocate by default so that := can add
> >         columns
> >         fully by reference most of the time seems good to me since
> >         there's a
> >         very low cost to over allocating the vector of column
> >         pointers. Create
> >         the (shallow copy) and issue a warning, I'm thinking, not
> >         error. The
> >         "max.cols" name seems a bit absolute, could it be
> >         "alloc.cols"?  We
> >         could have alloc(DT,2,ncol) or rowalloc(DT,n) and
> >         colalloc(DT,n), or
> >         realloc(...) so users can over alloc themselves before a loop
> >         that adds
> >         columns or inserts rows.  tables() could also report truenrow,
> >         and
> >         truencol as well as nrow and ncol.  What should alloc.cols be,
> >         by
> >         default? How about:  max(100,2*ncol)
> >
> >
> > Fine with me.
> >
> >         What about as.data.table.data.frame()?  Should that
> >         over-allocate, too,
> >         or for speed just change the class attribute as it does now.
> >
> >
> > Yeah, I think any method of creating a data table should
> > over-allocate.  If people want the speed gains, they can
> > explicitly set alloc.cols.
> >
> >
> >
> >         Maybe checking NAMED would work, in addition. If NAMED was 0,
> >         no need to
> >         warn. Only when NAMED was 1 (or 2) - (not too hot on NAMED) -
> >         would the
> >         warning be necessary.
> >
> >
> >         >
> >         > On Fri, Oct 28, 2011 at 1:10 AM, Matthew Dowle
> >         > <mdowle at mdowle.plus.com> wrote:
> >         >         Interesting one. Adding columns is a bit different
> >         to deleting
> >         >         and
> >         >         modifying columns. Here's how it works. Could make
> >         changes,
> >         >         could
> >         >         document it, or both, what do people think?
> >         >
> >         >         Just like data.frame there is a list vector holding
> >         pointers
> >         >         to the
> >         >         column vectors. A delete column op is done with a
> >         memmove to
> >         >         budge up
> >         >         the column pointers above the column by one place.
> >         That leaves
> >         >         a gap at
> >         >         the end. The length attribute of that vector
> >         (ncol(DT)) is
> >         >         then
> >         >         decremented and the spare 4 bytes (or 8 on 64bit)
> >         are left
> >         >         unused at the
> >         >         end.
> >         >
> >         >         An add column can't be fully by reference because
> >         the list
> >         >         vector is
> >         >         full. A new list vector has to be allocated, one
> >         slot larger,
> >         >         the old
> >         >         pointers memcpy'd over, and the last spot assigned
> >         the pointer
> >         >         to the
> >         >         new column vector.  This copying is negligible
> >         because it's a
> >         >         small list
> >         >         of pointers fitting well within one page. [Unless,
> >         there are
> >         >         many 1000's
> >         >         of columns, which is why it's done as efficiently as
> >         possible
> >         >         using
> >         >         memcpy].
> >         >
> >         >         Aside: There is a little-known (I guess) distinction
> >         between
> >         >         length and
> >         >         truelength in R internals. Base R doesn't use it,
> >         but we could
> >         >         in
> >         >         data.table. A delete column sets length but leaves
> >         truelength
> >         >         one
> >         >         larger. When the next add column comes along, it
> >         could just do
> >         >         the budge
> >         >         up and insert the column. That may not be so
> >         advantageous for
> >         >         (a small
> >         >         number) of columns,  but the same logic could work
> >         for
> >         >         insert() and
> >         >         delete()ing rows.  Of course, this would mean
> >         whether a
> >         >         visible copy or
> >         >         not is taken depends on what happened previously,
> >         rather than
> >         >         the
> >         >         syntax. That's something we've disliked before, in
> >         the same
> >         >         way we
> >         >         dislike drop=TRUE behaviour and so dropped drop. One
> >         way to
> >         >         approach
> >         >         this might be to advise ":= add *may* not copy. Best
> >         to assume
> >         >         it
> >         >         doesn't; use copy()". If you get in the habit of
> >         >         "DT2=copy(DT)" then
> >         >         that'll take a deep copy at the time and you're
> >         safe.
> >         >
> >         >         To illustrate the partial copy (maybe shallow copy is
> >         a better term),
> >         >         consider
> >         >         the following :
> >         >
> >         >         > DT = data.table(1:2,3:4)
> >         >         > DT2=DT
> >         >         > DT2[,y:=10L]
> >         >             V1 V2  y
> >         >         [1,]  1  3 10
> >         >         [2,]  2  4 10
> >         >         > DT
> >         >             V1 V2
> >         >         [1,]  1  3
> >         >         [2,]  2  4
> >         >         > DT2
> >         >             V1 V2  y
> >         >         [1,]  1  3 10
> >         >         [2,]  2  4 10
> >         >         > DT2[1,V1:=99L]
> >         >             V1 V2  y
> >         >         [1,] 99  3 10
> >         >         [2,]  2  4 10
> >         >         > DT
> >         >             V1 V2
> >         >         [1,] 99  3
> >         >         [2,]  2  4
> >         >         >
> >         >
> >         >         Matthew
> >         >
> >         >
> >         >         On Thu, 2011-10-27 at 11:46 -0700, Muhammad Waliji
> >         wrote:
> >         >         > I think this is a bug.  DT.2 <- DT.1 doesn't seem
> >         to make a
> >         >         copy in
> >         >         > all cases.
> >         >         >
> >         >         >
> >         >         > > DT.1 <- data.table(x=1, y=1)
> >         >         > > DT.2 <- DT.1
> >         >         > >
> >         >         > > # Both DT.1 and DT.2 are changed.
> >         >         > > DT.2[, y := NULL]
> >         >         >      x
> >         >         > [1,] 1
> >         >         > > DT.1
> >         >         >      x
> >         >         > [1,] 1
> >         >         > > DT.2
> >         >         >      x
> >         >         > [1,] 1
> >         >         > >
> >         >         > > # Only DT.2 is changed
> >         >         > > DT.2[, y := x]
> >         >         >      x y
> >         >         > [1,] 1 1
> >         >         > > DT.1
> >         >         >      x
> >         >         > [1,] 1
> >         >         > > DT.2
> >         >         >      x y
> >         >         > [1,] 1 1
> >         >         >
> >         >         >
> >         >
> >         >
> >         >
> >         >
> >         >
> >
> >
> >
> >
>
>
>
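
The over-allocation proposal quoted above can be sketched as a toy model in
base R. This is only an illustration of the bookkeeping, not data.table's
implementation: new_table(), add_col(), and the slots/names fields are
hypothetical names, an environment stands in for "by reference", and the
default follows the max(100, 2*ncol) suggestion from the thread.

```r
# Toy model: a table is an environment holding over-allocated column slots.
new_table <- function(cols, alloc.cols = max(100, 2 * length(cols))) {
  tbl <- new.env()
  tbl$slots <- vector("list", alloc.cols)   # spare slots allocated up front
  tbl$slots[seq_along(cols)] <- cols        # fill the first ncol slots
  tbl$names <- names(cols)                  # length(names) plays the role of ncol
  tbl
}

add_col <- function(tbl, name, value) {
  n <- length(tbl$names)
  if (n == length(tbl$slots)) {
    # Out of spare slots: warn and reallocate, rather than raise an error.
    warning("column slots exhausted; reallocating")
    tbl$slots <- c(tbl$slots, vector("list", max(n, 1L)))
  }
  tbl$slots[[n + 1L]] <- value              # use the next spare slot
  tbl$names <- c(tbl$names, name)
  invisible(tbl)
}

T1 <- new_table(list(V1 = 1:2, V2 = 3:4))
T2 <- T1                                    # new reference, not a copy
add_col(T2, "y", c(10L, 10L))

# The added column is visible through both names.
stopifnot(identical(T1$names, c("V1", "V2", "y")))
```

In this model, adding a column never depends on what happened previously: it
is always by reference, with a warning only once the allocated slots run out.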