[datatable-help] Are you aware of this?

Arunkumar Srinivasan aragorn168b at gmail.com
Sat Jun 14 07:35:16 CEST 2014


Jeremiah,

Thanks. Just a few hours ago, I answered a similar question in reply to a post from Ron (pasted below):

`data.table` is designed with *really large* data sets in mind (over 100 or 200 GB in memory, even). As a design feature, it therefore trades "referential transparency" for manipulating data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time the two go hand in hand).

This is perhaps the biggest design choice one needs to be aware of when choosing or working with data.tables. It is possible to modify objects by reference using data.table: all the functions that begin with "set*" modify objects by reference. The only function outside the "set*" family that does so is the `:=` operator.
There’s a pending feature request to add this point (the need for an explicit copy) to the FAQs, which we have not gotten to yet.
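
As a minimal sketch of what an explicit copy looks like in practice (the names `alias` and `snap` are made up for illustration, and this assumes the data.table package is installed):

```r
library(data.table)

dt <- data.table(x = 1:10, y = letters[10:1])

alias <- dt        # plain assignment: no copy, both names refer to the same table
snap  <- copy(dt)  # explicit copy: independent of later by-reference changes

setkeyv(dt, "y")   # sorts dt by y, by reference
dt[, z := x * 2L]  # adds column z, by reference

# alias follows dt, because it is the same object under another name
alias_sees_changes <- identical(key(alias), "y") && "z" %in% names(alias)

# snap keeps the original row order, has no key, and has no z column
snap_unchanged <- identical(snap$x, 1:10) && is.null(key(snap)) &&
  !("z" %in% names(snap))

print(alias_sees_changes)
print(snap_unchanged)
```

Both checks should come out TRUE. The same approach works for a single column: taking `copy(dt$x)` before a "set*" call should give a vector that the later call cannot reorder.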

To our knowledge, people get used to this difference quite quickly.

It’s not necessary to know about pointers to understand that the object gets modified in place. I’m not a Python user at all, but I recently learned that Python has this feature too: https://docs.python.org/2/library/copy.html

But point taken. A note that an explicit copy is required will be added to the FAQs.


Arun

From: jeremiah rounds roundsjeremiah at gmail.com
Reply: jeremiah rounds roundsjeremiah at gmail.com
Date: June 14, 2014 at 7:23:22 AM
To: datatable-help at lists.r-forge.r-project.org datatable-help at lists.r-forge.r-project.org
Subject:  [datatable-help] Are you aware of this?  

As a fan of your work, I have always been curious whether you are aware of this. I find it causes new users to make mistakes.


> dt = list()
> dt$x = 1:10
> dt$y = letters[10:1]
> dt = as.data.table(as.data.frame(dt))
> dt
     x y
 1:  1 j
 2:  2 i
 3:  3 h
 4:  4 g
 5:  5 f
 6:  6 e
 7:  7 d
 8:  8 c
 9:  9 b
10: 10 a
> x0 = dt$x
> x1 = dt$x
> x0[1] = 11
> setkeyv(dt,"y")
> x0
 [1] 11  2  3  4  5  6  7  8  9 10
> x1
 [1] 10  9  8  7  6  5  4  3  2  1
> x1 == x0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


x0 and x1 are assigned at exactly the same time, and since R data.frames will not do this, it lures people into thinking the two are identical yet distinct copies, as they would be with data.frames.  My theory is that they are not actually copied: they are promised.  When x0 has its first element changed, that induces a copy distinct from dt$x, but x1 has had no operation on it, so it still refers to dt$x through its promise. Setting the key on dt reorders it, and since x1 still hasn't been evaluated, it now matches the new order of dt.

I found new users getting unpredictable results because they would try to use a data.table as a data.frame and trigger this with sorts.  If you thought you had copied something in a particular order from dt by doing the assignment ahead of the setkeyv, you have made a mistake.  You don't really expect x1, assigned maybe a page of code above, to have its order changed by a setkeyv.  You do if you think in terms of C pointers and references, but in R you really don't think that way.  Many R users don't even know what a pointer is.


Thanks,
Jeremiah

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] locfit_1.5-9.1       edgeR_3.4.2          limma_3.18.13       
[4] data.table_1.9.2     GenomicRanges_1.14.4 XVector_0.2.0       
[7] IRanges_1.20.7       BiocGenerics_0.8.0  

loaded via a namespace (and not attached):
[1] grid_3.0.1      lattice_0.20-15 plyr_1.8.1      Rcpp_0.11.1    
[5] reshape2_1.4    stats4_3.0.1    stringr_0.6.2   tools_3.0.1    


