[datatable-help] unique.data.frame should create a copy, right?

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Jul 31 12:10:38 CEST 2013


Frank,  

The answer to your problem is that you should be using `unique(DT1)` instead of `unique.data.frame(DT1)` because `unique` will call the "correct" `unique.data.table` method on DT1.  

Now, as to why this is happening… You should know that data.table over-allocates a list of column pointers so that columns can be added by reference (you can read more about this, if you wish, in ?`:=`). That is, if you do:

DT1 <- data.table(1)

You've created 1 column, but data.table has allocated a vector of 100 column pointers (by default). You can see this with the function `truelength`.

truelength(DT1)
> 100

Your problem with `unique.data.frame` is that this `truelength` is not preserved in the copy it returns. That is:

DT2 <- unique(DT1) # <~~~ correct way
DT3 <- unique.data.frame(DT1) # <~~~ incorrect way

truelength(DT2)
> 100
truelength(DT3)
> 0
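
For anyone who wants to check this on their own installation, here is a small sketch. The table contents and column names are made up for illustration, and the numbers reported by `truelength` depend on the data.table version and on the `datatable.alloccol` option, so no particular output is claimed here.

```r
library(data.table)

# Hypothetical example table with duplicated rows
DT1 <- data.table(a = c(1, 1, 2), b = c("x", "x", "y"))

DT2 <- unique(DT1)             # dispatches to the data.table method
DT3 <- unique.data.frame(DT1)  # bypasses it and goes through base R

# Number of visible columns vs. number of allocated column slots
c(columns = length(DT2), allocated = truelength(DT2))
c(columns = length(DT3), allocated = truelength(DT3))
```

If the allocated count for DT3 is smaller than its number of columns (for example 0), the over-allocation has been lost.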

So now we have a problem: the over-allocated memory is "gone" after this copy. If you then use `:=`, data.table would be writing to a memory location that isn't allocated, which would normally lead to a segmentation fault (IIUC).  

This is what happened with an earlier version of data.table in a similar context: setting the key of a data.table. In version 1.7.8, the key of a data.table was set by:

key(DT) <- …

This resulted in a "copy" that set the truelength to 0, so assigning by reference after this step led to a segmentation fault. This is why we now have the `setkey` function and the more general `setattr` function, which assign things without R's copy getting in the way.
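
To make the by-reference setters concrete, here is a minimal sketch. The table, its columns and the attribute name "note" are hypothetical. It only shows `setkey` and `setattr` modifying the table in place, after which it should still be over-allocated.

```r
library(data.table)

# Hypothetical example table
DT <- data.table(id = c(2L, 1L, 3L), val = c("b", "a", "c"))

setkey(DT, id)                       # sorts DT by id and sets the key, in place
setattr(DT, "note", "hypothetical")  # sets an attribute, in place

key(DT)
# Visible columns vs. allocated column slots after the two calls
c(columns = length(DT), allocated = truelength(DT))
```

Neither call uses `<-`, so R never gets a chance to copy the table.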

To catch this issue and rectify it without a segmentation fault, the attribute ".internal.selfref" was designed. Basically, it detects these situations and makes a copy before assigning by reference. I can't find any documentation on "how" it's done, but the way I think of it is this: when you assign by reference, the existing .internal.selfref attribute (which is of class externalptr) is compared with the actual value of your data.table. If they match, everything's good. Otherwise, a copy has to be made and the correct pointer set as the attribute.
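
If you do end up with a table that came out of a base R function, one way to stay clear of the warning is to take an explicit copy before using `:=`, as you suggested. A minimal sketch, assuming (as I believe is the case) that `copy()` returns a properly over-allocated table; the table and the column name `gah` are just for illustration:

```r
library(data.table)

# Hypothetical example table with duplicated rows
DT1 <- data.table(a = c(1, 1, 2), b = c("x", "x", "y"))

# Result of the base R method; its over-allocation may have been lost
DT3 <- unique.data.frame(DT1)

# Explicit copy made by data.table itself
DT3 <- copy(DT3)

# Add a column by reference on the fresh copy
DT3[, gah := 1]
names(DT3)
```

DT1 is left untouched, since the new column is only added to the copy.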

You can read about this in ?setkey. So, in essence, use `unique`, which will call the correct (hidden) `unique.data.table` method. Hope this helps. If anything is ambiguous or I got something wrong, please point it out.

Arun


On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:

> I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a warning about pointers, so apparently it is not...?  
>  
> A short example:
>  
> > DT1 <- data.table(1)
> > DT2 <- unique.data.frame(DT1)
> >  
> > DT2[,gah:=1]
> >  
>  
>  
> An example closer to my application, undoing a cartesian/cross join:  
>  
> > DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> > setkey(DT1,A)
> >  
> > DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> >  
> > DT2[,gah:=1] # warning: I should have made a copy, apparently
> >  
>  
>  
> I'm fine with explicitly making a copy, of course, and don't really know anything about pointers. I just thought I'd bring it up.  
>  
> --Frank  

