[datatable-help] data.table BUG : data.table assignment

Matthew Dowle mdowle at mdowle.plus.com
Fri Oct 5 00:14:25 CEST 2012


> One thing that would be super-awesome is a lazy copy-on-write mechanism.
> By which I mean table2 would initially point to the same memory location
> as table1, but as soon as it's modified, the portions being modified would
> be copied to a new location.

Good point. Ok, time to reveal data.table:::shallow. It's most of the way
there. Adding a column changes just the shallow copy. A column plonk changes
just the shallow copy. But subassigning by reference isn't clever enough (yet)
to know that a copy-on-write at column level needs to be done, which is
one reason shallow() hasn't been exported yet. If you're careful, you could
plonk the column first and then subassign, though.

library(data.table)
shallow = data.table:::shallow  # self export (we're experts here)
DT = data.table(a=1:3,b=4:6)
DT2 = shallow(DT)   # instant, relative to deep copy()
DT2[,c:=7:9]
DT2
   a b c
1: 1 4 7
2: 2 5 8
3: 3 6 9
DT   # The add only changed DT2
   a b
1: 1 4
2: 2 5
3: 3 6
DT2[,b:=10:12]  # "plonk" b changes just DT2
DT2
   a  b c
1: 1 10 7
2: 2 11 8
3: 3 12 9
DT
   a b
1: 1 4
2: 2 5
3: 3 6
DT2[2,a:=13L]  # subassign by reference; column a is still shared with DT
DT2
    a  b c
1:  1 10 7
2: 13 11 8
3:  3 12 9
DT    # surprise, DT$a changed too.
    a b
1:  1 4
2: 13 5
3:  3 6
DT2[2,b:=14L]  # subassign by reference into b, which was plonked earlier
DT2
    a  b c
1:  1 10 7
2: 13 14 8
3:  3 12 9
DT    # DT$b unchanged this time, because the earlier plonk gave DT2 its own b
    a b
1:  1 4
2: 13 5
3:  3 6

This is one reason the term 'plonk' was introduced: to give a name to this
special operation that replaces the whole column with a new vector and
breaks any shallow reference. (It is also the way to change a column's
type.)
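
As a rough sketch of the "plonk first, then subassign" workaround mentioned
above: this assumes the unexported data.table:::shallow is present in your
installed version, and it uses copy() on the column as one possible way to
get a fresh vector to plonk. Treat it as an illustration, not a supported
idiom.

```r
library(data.table)

# Assumption: shallow() is internal and unexported, so it may change.
shallow <- data.table:::shallow

DT  <- data.table(a = 1:3, b = 4:6)
DT2 <- shallow(DT)

# Plonk: replace the whole column with a fresh vector, so that DT2's
# column a no longer shares memory with DT's column a.
DT2[, a := copy(a)]

# Subassign by reference; after the plonk this only touches DT2's column.
DT2[2, a := 13L]

DT2$a   # 1 13 3
DT$a    # 1 2 3
```

Without the plonk line, the subassign would also change DT$a, as in the
transcript above.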

Copy-on-:= at column level, after a shallow(), is definitely on the to-do
list, along with over-allocation of columns for fast insert() and delete()
of rows.

In the meantime, shallow() is used internally quite a lot and seems
stable. Feel free to use it.  I doubt the name or arguments will change in
future.

>
> It's a pretty rare use case to make a copy of a data structure without
> intending to modify it.  The only instance I can really think of is
> passing arguments to a function, which IIUC is already copy-on-write in R?

Yes. It's the operators you use on DT that determine whether a copy is
taken.  := modifies by reference and is not copy-on-write, even within a
function, but <- is.
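
A small sketch of that difference inside functions. The two helper names
are made up for illustration; nothing here is specific to shallow().

```r
library(data.table)

# Hypothetical helpers, for illustration only.
add_by_reference <- function(dt) {
  dt[, y := sum(x)]   # := changes the caller's table in place
  invisible(NULL)
}

add_with_arrow <- function(dt) {
  dt$z <- 0           # <- works on a copy that is local to the function
  invisible(NULL)
}

DT <- data.table(id = 1:3, x = c(1, 2, 3))
add_by_reference(DT)
add_with_arrow(DT)

names(DT)   # "id" "x" "y"
```

The caller sees y, added with :=, but not z, added with <-.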

> Also, I do think Natus' example is a little different from your example on
> SO.  In his, a new column is being added, but in yours, an existing column
> is being modified.

Good point. The example above clears this up hopefully.
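
To put the two cases side by side without shallow(), here is a minimal
sketch. The names same_table and snapshot are just illustrative; plain <-
gives another name for the same table, while copy() gives an independent
one.

```r
library(data.table)

table1     <- data.table(id = c(1, 2, 3), x = c(1, 2, 3))
same_table <- table1         # <- alone: another name for the same table
snapshot   <- copy(table1)   # copy(): an independent deep copy

table1[, y := sum(x)]        # add a new column by reference
table1[2, x := 20]           # modify an existing column by reference

names(same_table)   # "id" "x" "y"
same_table$x        # 1 20 3
names(snapshot)     # "id" "x"
snapshot$x          # 1 2 3
```

Both the added column and the modified value show up under the second
name, and neither reaches the copy() result.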

> Is there a doc reference showing which circumstances make `<-` do a
> copy-by-reference, and which do a deep copy?  For example, if I do `table2
> <- table1[x>1, list(id)]` , it seems to do a deep copy:

Yes that's a deep copy currently. When copy-on-:= at column level is
implemented, it'll be a shallow copy.

In terms of documentation improvements, please make very detailed
suggestions; e.g., which paragraphs should be placed in which files, and
where. I'm so close to it, it seems crystal clear to my eyes!

>
>> table1<-data.table(id=c(1,2,3),x=c(1,2,3))
>> table2 <- table1[x>1, list(id)]
>> table2[, id := 3:4]
>> table1
>    id x
> 1:  1 1
> 2:  2 2
> 3:  3 3
>
> Sorry if this has already been hashed out a million times, I'm pretty new
> to data.table.

Not at all, great questions.

Matthew

>
> -Ken
>
> From: datatable-help-bounces at lists.r-forge.r-project.org
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] On Behalf Of
> Christoph Jäckel
> Sent: Thursday, October 04, 2012 7:07 AM
> To: natus
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] data.table BUG : data.table assignment
>
> This is actually intended behaviour and I had the problem once as well.
> Here is my question and the solution to it:
>
> http://stackoverflow.com/questions/8030452/pass-by-reference-the-operator-in-the-data-table-package
>
> In a nutshell: Use copy() if you don't want table2 to have y as well.
>
> I hope this helps,
>
> Christoph
> On Thu, Oct 4, 2012 at 1:56 PM, natus
> <niparisco at gmail.com> wrote:
> Hello,
>
> see this example :
>
> require(data.table)
>
> table1<-data.table(id=c(1,2,3),x=c(1,2,3))
> table2<-table1
> table1[,y:=sum(x)]
> table1
> table2
>
> The problem? Both table1 and table2 have the variable 'y', BUT only
> table1 should.
>
> Thx
>
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/data-table-BUG-data-table-assignment-tp4644988.html
> Sent from the datatable-help mailing list archive at Nabble.com.
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help



