[datatable-help] Are you aware of this?

jeremiah rounds roundsjeremiah at gmail.com
Sat Jun 14 07:23:10 CEST 2014


As a fan of your work, I have always been curious whether you are aware of
this.  I find it causes new users to make mistakes.


> dt = list()
> dt$x = 1:10
> dt$y = letters[10:1]
> dt = as.data.table(as.data.frame(dt))
> dt
     x y
 1:  1 j
 2:  2 i
 3:  3 h
 4:  4 g
 5:  5 f
 6:  6 e
 7:  7 d
 8:  8 c
 9:  9 b
10: 10 a
> x0 = dt$x
> x1 = dt$x
> x0[1] = 11
> setkeyv(dt,"y")
> x0
 [1] 11  2  3  4  5  6  7  8  9 10
> x1
 [1] 10  9  8  7  6  5  4  3  2  1
> x1 == x0
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


x0 and x1 are assigned at exactly the same point, and because R data.frames
never behave this way, people are lured into thinking the two are identical,
independent copies, as they would be with a data.frame.  My theory is that
they are not actually copied: they are promised.  Changing index 1 of x0
induces a copy distinct from dt$x, but x1 has had no operation on it, so it
still refers to dt$x through its promise.  Setting the key on dt reorders it,
and since x1 still hasn't been evaluated, it now matches the new order of dt.
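
One way to probe this theory is to ask where each vector lives in memory.
The sketch below is only an illustration: it builds the table with
data.table() directly rather than through as.data.frame(), and it relies on
address(), which data.table exports.  If x1 really does still refer to the
column, it should report the same address as dt$x, while x0 should not.

```r
library(data.table)

dt <- data.table(x = 1:10, y = letters[10:1])

x0 <- dt$x
x1 <- dt$x
x0[1] <- 11L   # modifying x0 forces R to give it memory of its own

# Compare the memory address of each vector with that of the column.
shares_x0 <- identical(address(x0), address(dt$x))
shares_x1 <- identical(address(x1), address(dt$x))

print(c(x0 = shares_x0, x1 = shares_x1))
```

The variable names shares_x0 and shares_x1 are made up for this example.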

I have seen new users get unpredictable results because they try to use a
data.table as a data.frame and trigger this with sorts.  If you think you
captured a column in a particular order by assigning it before the setkeyv,
you have made a mistake.  You don't really expect x1, assigned maybe a page
of code earlier, to have its order changed by a setkeyv.  You do if you think
in terms of C pointers and references, but in R you really don't think that
way.  Many R users don't even know what a pointer is.
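
A possible safeguard, offered as a sketch rather than as an answer from the
list: take the snapshot with data.table's copy(), which makes an explicit
copy, before calling setkeyv.  The names x_alias and x_snap are hypothetical,
and the table is again built with data.table() directly.

```r
library(data.table)

dt <- data.table(x = 1:10, y = letters[10:1])

x_alias <- dt$x         # plain assignment, as in the transcript above
x_snap  <- copy(dt$x)   # explicit copy, independent of dt

setkeyv(dt, "y")        # reorders dt by reference

print(x_alias)          # may follow the new row order, as x1 did above
print(x_snap)           # keeps the order from before the setkeyv
```

The cost is one extra copy of the column, which is exactly what the plain
assignment appeared to promise in the first place.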


Thanks,
Jeremiah

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] locfit_1.5-9.1       edgeR_3.4.2          limma_3.18.13
[4] data.table_1.9.2     GenomicRanges_1.14.4 XVector_0.2.0
[7] IRanges_1.20.7       BiocGenerics_0.8.0

loaded via a namespace (and not attached):
[1] grid_3.0.1      lattice_0.20-15 plyr_1.8.1      Rcpp_0.11.1
[5] reshape2_1.4    stats4_3.0.1    stringr_0.6.2   tools_3.0.1