[datatable-help] unique.data.frame should create a copy, right?

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Jul 31 18:09:58 CEST 2013


Ricardo,

You read my mind :) I was thinking the same. It would be interesting to see whether the community agrees. It could save the trouble of calling `alloc.col` manually.
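
For reference, here is a minimal sketch of what the manual route looks like when wrapped in a helper. `unique_rows` is a made-up name for illustration, not a data.table function, and the sketch assumes `alloc.col()` behaves as described in the quoted messages below:

```r
library(data.table)

# Hypothetical helper (not part of data.table): de-duplicate on all columns
# via unique.data.frame(), then re-allocate the column-pointer slots so that
# `:=` can add columns by reference afterwards.
unique_rows <- function(DT) {
  res <- unique.data.frame(DT)  # the result comes back without over-allocation
  alloc.col(res)                # restore the over-allocation and return the table
}

DT1 <- data.table(a = c(1, 1, 2), b = c(3, 3, 4))
DT2 <- unique_rows(DT1)
DT2[, w := 2]                   # add a column by reference
```

An argument on `unique.data.table` itself would make a wrapper like this unnecessary.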


Arun


On Wednesday, July 31, 2013 at 6:04 PM, Ricardo Saporta wrote:

> Hey Arun,  
>  
> Great call on using `alloc.col()`. I would not have thought of that.
>  
> Since we were previously talking about updates to common functions in the package, I wouldn't mind seeing an argument added to `unique.data.table` along the lines of `useKey=FALSE` (perhaps better named). Thoughts?
>  
> Rick  
>  
> On Wed, Jul 31, 2013 at 11:06 AM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
> > Ricardo,  
> >  
> > Yes, I was also thinking of this, because of precisely the issue you mention. In this case, I'd do `invisible(alloc.col(DT2))` before assigning by reference. The typical way of converting from a data.frame to a data.table (without complete copy or rather with a "shallow" copy) is:  
> >  
> > DF <- data.frame(x=1:5, y=6:10)
> > tracemem(DF)
> > # [1] "<0x100f08678>"
> >  
> > setattr(DF, 'class', c('data.table', 'data.frame'))  # change the class by reference
> > data.table:::settruelength(DF, 0)                    # unexported function: set truelength to 0
> > invisible(alloc.col(DF))                             # over-allocate the column pointers
> > tracemem(DF)
> > # [1] "<0x103c23b30>"
> >  
> > DF[, z := 1]
> >  
> > Even though there's a copy happening, this, as I understand it, is a "shallow" copy (copying only references/pointers and not the entire data) and therefore should take almost negligible time. Now, if you look at the `settruelength` line, it first sets the truelength to 0 (which is set to NULL for a data.frame, if you look at the `as.data.frame.data.table` function). Then `alloc.col` allocates the columns. So,
> >  
> > DT1 <- data.table(1)
> > DT2 <- unique.data.frame(DT1) # <~~~ your true length is screwed up
> > truelength(DT2)
> > # [1] 0
> >  
> > invisible(alloc.col(DT2))  
> > truelength(DT2)
> > # [1] 100
> >  
> > DT2[, w := 2]
> > # no warning, no full copy
> >  
> > So, Frank, I guess this is an alternate way if you don't want the warning/full copy, but you want to specifically use `unique.data.frame`.
> >  
> > Thanks for bringing it up Ricardo. If I've gotten something wrong, feel free to correct me..  
> >  
> > Arun
> >  
> >  
> > On Wednesday, July 31, 2013 at 3:49 PM, Ricardo Saporta wrote:
> >  
> > > Arun, just to comment on this part:  
> > >  
> > > <<The answer to your problem is that you should be using `unique(DT1)` instead of `unique.data.frame(DT1)` because `unique` will call the "correct" `unique.data.table` method on DT1. >>  
> > >  
> > > I use `unique.data.frame(DT)` all the time.  
> > > The reason is that I often have data with multiple rows per key. If I want all unique rows, `unique.data.table` de-duplicates on the key only, which is not the result I need. Any thoughts on a better way?
> > >  
> > > On Wednesday, July 31, 2013, Arunkumar Srinivasan wrote:
> > > > Frank,  
> > > >  
> > > > The answer to your problem is that you should be using `unique(DT1)` instead of `unique.data.frame(DT1)` because `unique` will call the "correct" `unique.data.table` method on DT1.   
> > > >  
> > > > Now, as to why this is happening: you should know that data.table over-allocates a list of column pointers in order to add columns by reference (you can read more about this, if you wish, in ?`:=`). That is, if you do:
> > > >  
> > > > DT1 <- data.table(1)
> > > >  
> > > > You've created 1 column, but you (or rather data.table) have allocated a vector of 100 column pointers (by default). You can see this by using the function `truelength`.
> > > >  
> > > > truelength(DT1)
> > > > # [1] 100
> > > >  
> > > > Your problem with `unique.data.frame` is that this `truelength` is not maintained after doing this copy. That is:
> > > >  
> > > > DT2 <- unique(DT1) # <~~~ correct way  
> > > > DT3 <- unique.data.frame(DT1) # <~~~ incorrect way
> > > >  
> > > > truelength(DT2)
> > > > # [1] 100
> > > > truelength(DT3)
> > > > # [1] 0
> > > >  
> > > > So we have a problem now. The over-allocated memory is somehow "gone" after this copy. When you do a `:=` after this, we would be writing to a memory location that isn't allocated, and this would normally lead to a segmentation fault (IIUC).
> > > >  
> > > > And this is what happened with an earlier version of data.table in a similar context: setting the key of a data.table. In version 1.7.8, the key of a data.table was set by:
> > > >  
> > > > key(DT) <- …  
> > > >  
> > > > And this resulted in a "copy" that set the truelength to 0, so assigning by reference after this step led to a segmentation fault. This is why we now have a "setkey" function, or the more general "setattr" function, to assign things without R's copy screwing things up.
> > > >  
> > > > In order to catch this issue and rectify it without a segmentation fault, the attribute ".internal.selfref" was designed. Basically, it detects these situations and makes a copy before assigning by reference. I can't find any documentation on how it's done, but the way I think of it is this: when you assign by reference, the existing .internal.selfref attribute (which is of class externalptr) is compared with the actual value of your data.table. If they match, everything's good. Otherwise, it has to make a copy and set the correct pointer as the attribute.
> > > >  
> > > > You can read about this in ?setkey. So, in essence, use `unique`, which will call the correct (hidden) `unique.data.table` method. Hope this helps. If anything is ambiguous or I got something wrong, please point it out.
> > > >  
> > > > Arun
> > > >  
> > > >  
> > > > On Wednesday, July 31, 2013 at 12:07 AM, Frank Erickson wrote:
> > > >  
> > > > > I expect DT2 <- unique.data.frame(DT1) to be a new object, but get a warning about pointers, so apparently it is not...?  
> > > > >  
> > > > > A short example:
> > > > >  
> > > > > > DT1 <- data.table(1)
> > > > > > DT2 <- unique.data.frame(DT1)
> > > > > >  
> > > > > > DT2[,gah:=1]
> > > > > >  
> > > > >  
> > > > >  
> > > > > An example closer to my application, undoing a cartesian/cross join:  
> > > > >  
> > > > > > DT1 <- CJ(A=0:1,B=1:6,D0=0:1,D=0:1)[D>=D0]
> > > > > > setkey(DT1,A)
> > > > > >  
> > > > > > DT2 <- unique.data.frame(DT1[,-which(names(DT1)%in%'B'),with=FALSE])
> > > > > >  
> > > > > > DT2[,gah:=1] # warning: I should have made a copy, apparently
> > > > > >  
> > > > >  
> > > > >  
> > > > > I'm fine with explicitly making a copy, of course, and don't really know anything about pointers. I just thought I'd bring it up.  
> > > > >  
> > > > > --Frank  
> > > > > _______________________________________________
> > > > > datatable-help mailing list
> > > > > datatable-help at lists.r-forge.r-project.org
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > >  
> > > > >  
> > > > >  
> > > >  
> > > >  
> > >  
> > >  
> > > --  
> > > Ricardo Saporta
> > > Graduate Student, Data Analytics
> > > Rutgers University, New Jersey
> > > e: saporta at rutgers.edu
> > >  
> > >  
> >  
>  
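
On Ricardo's question above about getting all unique rows from a keyed table: a small sketch, assuming a data.table release newer than the one discussed in this thread, in which `unique()` on a data.table takes a `by` argument. The table and its column names are made up for illustration.

```r
library(data.table)

# Toy keyed table with several rows per key value
DT <- data.table(k = c(1L, 1L, 1L, 2L), v = c("a", "a", "b", "c"), key = "k")

# Uniqueness judged on every column, regardless of the key.
# Assumption: this data.table version's unique() method accepts `by`.
all_unique <- unique(DT, by = names(DT))

# Uniqueness judged on the key column only
key_unique <- unique(DT, by = key(DT))

# The result is a regular data.table, so columns can be added by reference
all_unique[, w := 2]
```

Because the result comes from the data.table method, no `alloc.col()` step should be needed before `:=`.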
