[datatable-help] unique.data.frame should create a copy, right?

Ricardo Saporta saporta at scarletmail.rutgers.edu
Wed Jul 31 18:04:38 CEST 2013


Hey Arun,

Great call on using `alloc.col()`. I would not have thought of that.

Since we were previously talking about updates to common functions in the
package, I wouldn't mind seeing an argument added to `unique.data.table`
along the lines of `useKey=FALSE` (perhaps better named). Thoughts?
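
Until something like that exists, one possible workaround is sketched below.
It is only an illustration: `unique_all_cols` is a made-up name, not a function
in the package, and the sketch assumes that `setkey(DT, NULL)` drops the key
and that `unique()` on an unkeyed data.table compares every column.

```r
library(data.table)

# Hypothetical helper (not part of data.table): unique rows over all
# columns, regardless of any key set on the input.
unique_all_cols <- function(DT) {
  tmp <- copy(DT)      # deep copy, so the caller's key is left untouched
  setkey(tmp, NULL)    # drop the key on the copy
  unique(tmp)          # with no key, every column is compared
}

DT <- data.table(id = c(1, 1, 1, 2), val = c("a", "a", "b", "c"), key = "id")
res <- unique_all_cols(DT)
res
key(DT)                # the original table keeps its key
```

The `copy()` costs a full copy of the data, which is the price of leaving the
original table and its key alone.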

Rick


On Wed, Jul 31, 2013 at 11:06 AM, Arunkumar Srinivasan <
aragorn168b at gmail.com> wrote:

>  Ricardo,
>
> Yes, I was also thinking of this, because of precisely the issue you
> mention. In this case, I'd do `invisible(alloc.col(DT2))` before assigning
> by reference. The typical way of converting from a data.frame to a
> data.table (without complete copy or rather with a "shallow" copy) is:
>
> DF <- data.frame(x=1:5, y=6:10)
> tracemem(DF)
> [1] "<0x100f08678>"
>
> setattr(DF, 'class', c('data.table', 'data.frame'))
> data.table:::settruelength(DF, 0)
> invisible(alloc.col(DF))
> tracemem(DF)
> [1] "<0x103c23b30>"
>
> DF[, z := 1]
>
> Even though a copy happens here, it is, as I understand it, a "shallow"
> copy (only the references/pointers are copied, not the entire data), so it
> should take almost negligible time. Now, if you look at the
> `settruelength` line, it first sets the "truelength" attribute to 0
> (which is set to NULL for a data.frame, if you look at the
> as.data.frame.data.table function). Then the columns are allocated with
> "alloc.col". So,
>
> DT1 <- data.table(1)
> DT2 <- unique.data.frame(DT1) # <~~~ your true length is screwed up
> truelength(DT2)
> # [1] 0
>
> invisible(alloc.col(DT2))
> truelength(DT2)
> # [1] 100
>
> DT2[, w := 2]
> # no warning and no full copy.
>
> So, Frank, I guess this is an alternate way if you don't want the
> warning/full copy, but you want to specifically use `unique.data.frame`.
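
The steps above could be wrapped into a small helper. This is only a sketch
under stated assumptions: `unique_df_realloc` is a hypothetical name, not part
of data.table, and it assumes `alloc.col()` returns the re-allocated table in
the installed version.

```r
library(data.table)

# Hypothetical helper, not part of data.table: all-column unique via the
# base data.frame method, followed by re-allocation of the column slots.
unique_df_realloc <- function(DT) {
  out <- unique.data.frame(DT)   # base method, compares all columns
  out <- alloc.col(out)          # restore the over-allocated column pointers
  out
}

DT1 <- data.table(x = c(1, 1, 2), y = c("a", "a", "b"))
DT2 <- unique_df_realloc(DT1)
truelength(DT2) > length(DT2)    # spare slots should be available again
DT2[, w := 2]                    # add a column by reference
```

Assigning the result of `alloc.col()` back to a name, rather than relying on
`invisible(alloc.col(...))`, keeps the helper independent of how the function
updates its argument.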
>
> Thanks for bringing it up, Ricardo. If I've gotten something wrong, feel
> free to correct me.
>
> Arun
>
> On Wednesday, July 31, 2013 at 3:49 PM, Ricardo Saporta wrote:
>
> Arun, just to comment on this part:
>
> <<The answer to your problem is that you should be using `unique(DT1)`
> instead of `unique.data.frame(DT1)` because `unique` will call the
> "correct" `unique.data.table` method on DT1. >>
>
> I use `unique.data.frame(DT)` all the time.
> The reason being that I often have data with multiple rows per key. If I
> want all unique rows, `unique.data.table` gives me a result other than
> what I need. Any thoughts on a better way?
>
> On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:
>
> Frank,
>
> The answer to your problem is that you should be using `unique(DT1)`
> instead of `unique.data.frame(DT1)` because `unique` will call the
> "correct" `unique.data.table` method on DT1.
>
> Now, as to why this is happening… You should know that data.table
> over-allocates a list of column pointers in order to add columns by
> reference (you can read more about this, if you wish, by looking at
> ?`:=`). That is, if you do:
>
> DT1 <- data.table(1)
>
> You've created 1 column. But you've (or data.table has) allocated a vector
> of 100 column pointers (by default). You can see this by using the
> function `truelength`.
>
> truelength(DT1)
> > 100
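
To see why the spare slots matter, one can check that adding a column with
`:=` leaves the table at the same memory address. A small sketch, assuming
`address()` from data.table is available in the installed version (the exact
`truelength` default may differ from the 100 shown above):

```r
library(data.table)

DT1 <- data.table(1)
length(DT1)                 # columns currently in use
truelength(DT1)             # slots allocated; the default differs by version

before <- address(DT1)
DT1[, b := 2]               # fills one of the spare slots
after <- address(DT1)
identical(before, after)    # same object, so the table itself was not copied
```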
>
> Your problem with `unique.data.frame` is that this `truelength` is not
> maintained after doing this copy. That is:
>
> DT2 <- unique(DT1) # <~~~ correct way
> DT3 <- unique.data.frame(DT1) # <~~~ incorrect way
>
> truelength(DT2)
> > 100
> truelength(DT3)
> > 0
>
> Therefore, we've a problem now. The over-allocated memory is somehow
> "gone" after this copy. Therefore when you do a `:=` after this, we will be
> writing to a memory location which isn't allocated. And this would normally
> lead to a segmentation fault (IIUC).
>
> And this is what happened with an earlier version of data.table in a
> similar context - setting the key of data.table. In version 1.7.8, the key
> of a data.table was set by:
>
> key(DT) <- …
>
> And this resulted in a "copy" that set the truelength to 0. So assigning
> by reference after this step led to a segmentation fault. This is why we
> now have a "setkey" function, or the more general "setattr" function, to
> assign things without R's copy screwing things up.
>
> In order to catch this issue and rectify it without throwing a
> segmentation fault, the attribute ".internal.selfref" was designed.
> Basically it detects these situations and, in that case, takes a copy
> before assigning by reference. I can't find any documentation on "how"
> it's done. But the way I think of it is that when you assign by reference,
> the existing .internal.selfref attribute (which is of class externalptr)
> is compared with the actual value of your data.table, and if they match,
> then everything's good. Else, it has to make a copy and set the correct
> ptr as the attribute.
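
A rough way to poke at this from R is sketched below. It only inspects the
attribute and records whatever warning `:=` raises after a base R copy. It
does not reproduce data.table's internal check, and the message text (if any)
depends on the installed version, so none is shown here.

```r
library(data.table)

DT1 <- data.table(1)
class(attr(DT1, ".internal.selfref"))   # the attribute is an external pointer

DT3 <- unique.data.frame(DT1)           # base R method, bypasses data.table's own

# Record any warning raised by the by-reference assignment instead of
# letting it print.
msgs <- character(0)
withCallingHandlers(
  DT3[, gah := 1],
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
msgs                                     # whatever data.table reported, if anything
```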
>
> You can read about this in ?setkey. So in essence use `unique` which'll
> call the correct `unique.data.table` (hidden) function. Hope this helps. If
> there's ambiguity or I got something wrong, please point out.
>
> Arun
>
> On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
>
> I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a
> warning about pointers, so apparently it is not...?
>
> A short example:
>
> DT1 <- data.table(1)
> DT2 <- unique.data.frame(DT1)
> DT2[,gah:=1]
>
>
> An example closer to my application, undoing a cartesian/cross join:
>
> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> setkey(DT1,A)
> DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> DT2[,gah:=1] # warning: I should have made a copy, apparently
>
>
> I'm fine with explicitly making a copy, of course, and don't really know
> anything about pointers. I just thought I'd bring it up.
>
> --Frank
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
> --
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu
>
>
>
>