[datatable-help] unique.data.frame should create a copy, right?
Ricardo Saporta
saporta at scarletmail.rutgers.edu
Wed Jul 31 15:49:04 CEST 2013
Arun, just to comment on this part:
<<The answer to your problem is that you should be using `unique(DT1)`
instead of `unique.data.frame(DT1)` because `unique` will call the
"correct" `unique.data.table` method on DT1. >>
I use `unique.data.frame(DT)` all the time.
The reason is that I often have data with multiple rows per key. If I want
all unique rows, `unique.data.table` de-duplicates on the key columns only,
which is not the result I need. Any thoughts on a better way?
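
One possibility, offered as a sketch rather than a definitive answer: it assumes a data.table release whose `unique` method accepts a `by` argument (the release discussed in this thread may not), and the table `DT` with its columns `id` and `val` is made up for illustration. The second half keeps `unique.data.frame` but passes the result through `copy()`, which should hand back a properly over-allocated data.table.

```r
library(data.table)

# Hypothetical data: several rows share the same key value.
DT <- data.table(id = c(1L, 1L, 1L, 2L),
                 val = c("a", "a", "b", "c"),
                 key = "id")

# De-duplicate on every column instead of on the key columns only.
all_unique <- unique(DT, by = names(DT))

# Alternative: keep unique.data.frame(), then copy() the result so that
# data.table re-creates its over-allocated column slots.
via_df <- copy(unique.data.frame(DT))
via_df[, flag := TRUE]  # add a column by reference on the fresh copy
```

Either way the result is a data.table that `:=` can extend afterwards; the `by` form avoids the detour through the data.frame method.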
On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:
> Frank,
>
> The answer to your problem is that you should be using `unique(DT1)`
> instead of `unique.data.frame(DT1)` because `unique` will call the
> "correct" `unique.data.table` method on DT1.
>
> Now, as to why this is happening… You should know that data.table
> over-allocates a list of column pointers in order to add columns by
> reference (you can read more about this, if you wish, in ?`:=`). That is,
> if you do:
>
> DT1 <- data.table(1)
>
> You've created 1 column. But you've (or rather data.table has) allocated a
> vector of 100 column pointers (by default). You can see this by using the
> function `truelength`.
>
> truelength(DT1)
> > 100
>
> Your problem with `unique.data.frame` is that this `truelength` is not
> maintained after doing this copy. That is:
>
> DT2 <- unique(DT1) # <~~~ correct way
> DT3 <- unique.data.frame(DT1) # <~~~ incorrect way
>
> truelength(DT2)
> > 100
> truelength(DT3)
> > 0
>
> So we have a problem now. The over-allocated memory is somehow "gone"
> after this copy. When you do a `:=` after this, we would be writing to a
> memory location which isn't allocated, and this would normally lead to a
> segmentation fault (IIUC).
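
To check the over-allocation point on your own installation, a small sketch like the following compares the columns in use with the allocated slots for both results. The column name `x` is made up, and the numbers printed depend on the data.table version and its default over-allocation, so no expected output is shown.

```r
library(data.table)

DT1 <- data.table(x = 1)

DT2 <- unique(DT1)             # dispatches to the data.table method
DT3 <- unique.data.frame(DT1)  # calls the base data.frame method directly

# Columns in use versus column slots allocated, for each result.
print(c(used = length(DT2), allocated = truelength(DT2)))
print(c(used = length(DT3), allocated = truelength(DT3)))
```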
>
> And this is what happened with an earlier version of data.table in a
> similar context - setting the key of data.table. In version 1.7.8, the key
> of a data.table was set by:
>
> key(DT) <- …
>
> And this resulted in a "copy" that set the true length to 0. So assigning
> by reference after this step led to a segmentation fault. This is why we
> now have a "setkey" function, or the more general "setattr" function, to
> assign things without R's copy screwing things up.
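
As a rough illustration of the set* style mentioned here (a sketch only; the table, its columns and the attribute name "note" are invented for the example), the key and an attribute are set in place, and `address()` is used to see whether the object was moved along the way:

```r
library(data.table)

DT <- data.table(id = c(2L, 1L), val = c("b", "a"))
addr_before <- address(DT)

setkey(DT, id)                       # sorts the rows and sets the key in place
setattr(DT, "note", "illustrative")  # "note" is a made-up attribute name
DT[, extra := 1L]                    # add a column by reference

print(key(DT))
print(identical(addr_before, address(DT)))
```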
>
> In order to catch this issue and rectify it without throwing a
> segmentation fault, the attribute ".internal.selfref" was designed.
> Basically it detects these situations and, in that case, takes a copy before
> assigning by reference. I can't find any documentation on "how" it's done.
> But the way I think of it is that when you assign by reference the existing
> .internal.selfref attribute (which is of class externalptr) is compared
> with the actual value of your data.table and if they match, then
> everything's good. Else, it has to make a copy and set the correct ptr as
> the attribute.
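
To see that safeguard from the user's side, here is a hedged sketch based on Frank's short example. It collects any warnings raised by `:=` instead of printing them, and assigns the result back so the new column is kept either way. Whether a warning is raised at all, and its wording, may differ between data.table versions, which is why the code prints what it finds rather than stating an expected output.

```r
library(data.table)

DT1 <- data.table(1)
DT2 <- unique.data.frame(DT1)  # result produced by the base method

# Collect warnings raised by := rather than letting them print.
msgs <- character()
DT2 <- withCallingHandlers(
  DT2[, gah := 1],
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(length(msgs))  # number of warnings captured, if any
print(names(DT2))
```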
>
> You can read about this in ?setkey. So in essence use `unique` which'll
> call the correct `unique.data.table` (hidden) function. Hope this helps. If
> there's ambiguity or I got something wrong, please point out.
>
> Arun
>
> On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
>
> I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a
> warning about pointers, so apparently it is not...?
>
> A short example:
>
> DT1 <- data.table(1)
> DT2 <- unique.data.frame(DT1)
> DT2[,gah:=1]
>
>
> An example closer to my application, undoing a cartesian/cross join:
>
> DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> setkey(DT1,A)
> DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> DT2[,gah:=1] # warning: I should have made a copy, apparently
>
>
> I'm fine with explicitly making a copy, of course, and don't really know
> anything about pointers. I just thought I'd bring it up.
>
> --Frank
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
--
Ricardo Saporta
Graduate Student, Data Analytics
Rutgers University, New Jersey
e: saporta at rutgers.edu