[datatable-help] 'by' on a numeric column produces inconsistent output

Arunkumar Srinivasan aragorn168b at gmail.com
Thu Dec 19 12:56:01 CET 2013


Just tested this on the devel version (today's), and yes, the issue happens. But I'm not sure this is an issue with 'data.table' per se:

On a clean session, if you do this:

require(data.table)
set.seed(32)
n <- 3
dt <- data.table(y=rnorm(n), by=round( rnorm(n), 1))

ll <- list(dt$by)
yy <- ll[[1L]]
address(dt$by) # [1] "0x7fad3c524a40"
address(ll[[1L]]) # [1] "0x7fad3c524a40"
address(yy) # [1] "0x7fad3c524a40"


You can see that all three point to the same address. That is why the result is wrong: internally, "yy" gets changed by reference during "fastorder", and because it shares memory with dt$by, the change shows through there as well. dt$by is *not* supposed to point to the same memory as "yy"; a copy should have been made.
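
A minimal defensive sketch of the point above, using data.table's exported copy() and address(): taking an explicit copy of the grouping column guarantees that nothing done to it by reference can reach the column inside 'dt'. The variable names (shared, copy_is_separate, same_values) are only for illustration, and this is not necessarily how data.table itself handles it internally.

```r
library(data.table)

set.seed(32)
n <- 3
dt <- data.table(y = rnorm(n), by = round(rnorm(n), 1))

# Whether list() hands back the very same vector depends on the R version,
# so 'shared' may be TRUE or FALSE.
ll <- list(dt$by)
shared <- identical(address(ll[[1L]]), address(dt$by))

# copy() duplicates the vector, so 'byval$by' can be reordered or otherwise
# modified by reference without touching the column stored in 'dt'.
byval <- list(by = copy(dt$by))
copy_is_separate <- !identical(address(byval$by), address(dt$by))
same_values <- identical(byval$by, dt$by)
```

If 'shared' comes out TRUE on a given R build, any in-place reordering of ll[[1L]] would also reorder dt$by, whereas 'byval' stays independent either way.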

After the first run, the pointers go back to behaving as they do in R-stable. Not sure if this is desirable. It should probably be reported on R-devel.

On R-3.0.2, the same commands as above on a clean session:

require(data.table)
set.seed(32)
n <- 3
dt <- data.table(y=rnorm(n), by=round( rnorm(n), 1))

ll <- list(dt$by)
yy <- ll[[1L]]
address(dt$by) # [1] "0x7fc35b640408"
address(ll[[1L]]) # [1] "0x7fc35a0ec838"
address(yy) # [1] "0x7fc35a0ec838"
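
One way to check a grouped result independently of these internals is to compare it against base R, which does not modify its inputs by reference. The sketch below uses made-up 'y' values with the same grouping pattern as the example (one 0.7 followed by two 0.4s), so the numbers are hypothetical and not the ones from this thread.

```r
library(data.table)

# Hypothetical fixed data; only the grouping pattern mirrors the example.
dt <- data.table(y = c(0.1, 0.5, 0.9), by = c(0.7, 0.4, 0.4))

# Run the same grouped query twice, as in the original report.
run1 <- dt[, list(max = max(y, na.rm = TRUE)), by = list(by)]
run2 <- dt[, list(max = max(y, na.rm = TRUE)), by = list(by)]

# Base R reference: tapply() never alters 'dt'.
ref <- tapply(dt$y, dt$by, max, na.rm = TRUE)

# Compare group maxima by key (ignoring row order), and check that the
# two data.table runs agree with each other.
matches_ref <- all(run1$max == ref[as.character(run1$by)])
stable <- isTRUE(all.equal(run1, run2))
```

If 'stable' is FALSE, or 'matches_ref' is FALSE on the first run only, that would point to the by-reference problem described above rather than to the data.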





Arun


On Thursday, December 19, 2013 at 9:43 AM, Arunkumar Srinivasan wrote:

> Simon, 
> 
> Thanks. One more towards my way :). I think we've nailed down the problem to R-devel version. I'll write again once I discuss it over with Kevin. 
> 
> Arun
> 
> 
> On Thursday, December 19, 2013 at 9:26 AM, Simon Zehnder wrote:
> 
> > Hi Arun,
> > 
> > here the results on Mac OS X Mavericks with gcc 4.8.2
> > 
> > data.table 1.8.10:
> > 
> > > set.seed(32)
> > > n <- 3
> > > dt <- data.table(
> > + y=rnorm(n),
> > + by=round( rnorm(n), 1)
> > + )
> > > 
> > > dt[,
> > + list(max=max(y, na.rm=TRUE)),
> > + by=list(by)
> > + ]
> >     by        max
> > 1: 0.7 0.01464054
> > 2: 0.4 0.87328871
> > > 
> > > dt[,
> > + list(max=max(y, na.rm=TRUE)),
> > + by=list(by)
> > + ]
> >     by        max
> > 1: 0.7 0.01464054
> > 2: 0.4 0.87328871
> > 
> > data.table 1.8.11:
> > 
> > > set.seed(32)
> > > n <- 3
> > > dt <- data.table(
> > + y=rnorm(n),
> > + by=round( rnorm(n), 1)
> > + )
> > > 
> > > dt[,
> > + list(max=max(y, na.rm=TRUE)),
> > + by=list(by)
> > + ]
> >     by        max
> > 1: 0.7 0.01464054
> > 2: 0.4 0.87328871
> > > 
> > > dt[,
> > + list(max=max(y, na.rm=TRUE)),
> > + by=list(by)
> > + ]
> >     by        max
> > 1: 0.7 0.01464054
> > 2: 0.4 0.87328871
> > 
> > Best
> > 
> > Simon
> > 
> > 
> > On 19 Dec 2013, at 09:05, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
> > 
> > > Simon, sure.
> > > 
> > > set.seed(32)
> > > n <- 3
> > > dt <- data.table(
> > > y=rnorm(n),
> > > by=round( rnorm(n), 1)
> > > )
> > > 
> > > dt[,
> > > list(max=max(y, na.rm=TRUE)),
> > > by=list(by)
> > > ]
> > > 
> > > dt[,
> > > list(max=max(y, na.rm=TRUE)),
> > > by=list(by)
> > > ]
> > > 
> > > 
> > > 
> > > Arun
> > > 
> > > On Thursday, December 19, 2013 at 8:49 AM, Simon Zehnder wrote:
> > > 
> > > > Arun,
> > > > 
> > > > if you could send me the reproducible code in copyable form I can as well try it on Mac OS X Mavericks with gcc 4.8.
> > > > 
> > > > Best
> > > > 
> > > > Simon
> > > > 
> > > > On 19 Dec 2013, at 08:44, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
> > > > 
> > > > > Aha, the issue seems to be with 'uniqlist'. Not sure why
> > > > > > (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
> > > > > gives 1,2,3 for you and 1,3 consistently for me. I'll revert this back to `duplist` for now. Not sure how to solve this though. I've tried it so far on 3 machines:
> > > > > 
> > > > > 1) OS X 10.8.5 + libvm (gcc)
> > > > > 2) OS X Mavericks + Clang
> > > > > 3) Debian Wheezy + gcc
> > > > > 
> > > > > All of them give consistent output. Man this is such a drag.
> > > > > 
> > > > > Arun
> > > > > 
> > > > > On Thursday, December 19, 2013 at 8:37 AM, Kevin Ushey wrote:
> > > > > 
> > > > > > Hi Arun,
> > > > > > 
> > > > > > Here's the output on my machine -- other information missing from
> > > > > > before; it's with OSX Mavericks, with R and data.table compiled with
> > > > > > Apple clang.
> > > > > > 
> > > > > > ---
> > > > > > 
> > > > > > > library(data.table, lib="/Users/kevinushey/Library/R/3.1/library")
> > > > > > > set.seed(32)
> > > > > > > n <- 3
> > > > > > > > dt <- data.table(
> > > > > > > + y=rnorm(n),
> > > > > > > + by=round( rnorm(n), 1)
> > > > > > > + )
> > > > > > > ## run one
> > > > > > > > byval <- list(by=dt$by)
> > > > > > > > (o__ <- data.table:::fastorder(byval)) # 2,3,1
> > > > > > > [1] 2 3 1
> > > > > > > > (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
> > > > > > > [1] 1 2 3
> > > > > > > > (len__ = data.table:::uniqlengths(f__, nrow(dt))) # 2,1
> > > > > > > [1] 1 1 1
> > > > > > > > (firstofeachgroup = o__[f__]) # 2,1
> > > > > > > [1] 2 3 1
> > > > > > > > (origorder = data.table:::iradixorder(firstofeachgroup)) # 2,1
> > > > > > > [1] 3 1 2
> > > > > > > > (f__ = f__[origorder]) # 3,1
> > > > > > > [1] 3 1 2
> > > > > > > > (len__ = len__[origorder]) # 2,1
> > > > > > > [1] 1 1 1
> > > > > > 
> > > > > > ## run two
> > > > > > > > (o__ <- data.table:::fastorder(byval)) # 2,3,1
> > > > > > > [1] 1 2 3
> > > > > > > > (f__ = data.table:::uniqlist(byval, order=o__)) # 1,3
> > > > > > > [1] 1 3
> > > > > > > > (len__ = data.table:::uniqlengths(f__, nrow(dt))) # 2,1
> > > > > > > [1] 2 1
> > > > > > > > (firstofeachgroup = o__[f__]) # 2,1
> > > > > > > [1] 1 3
> > > > > > > > (origorder = data.table:::iradixorder(firstofeachgroup)) # 2,1
> > > > > > > [1] 1 2
> > > > > > > > (f__ = f__[origorder]) # 3,1
> > > > > > > [1] 1 3
> > > > > > > > (len__ = len__[origorder]) # 2,1
> > > > > > > [1] 2 1
> > > > > > 
> > > > > > On Wed, Dec 18, 2013 at 11:22 PM, Arunkumar Srinivasan
> > > > > > > <aragorn168b at gmail.com> wrote:
> > > > > > > Not sure how to debug without being able to reproduce. Tried on Mac OS X
> > > > > > > 10.8.5 and Debian GNU/Linux 7 (wheezy). I don't have access to a windows
> > > > > > > > machine. It consistently gives me this:
> > > > > > > 
> > > > > > > > dt[,
> > > > > > > + list(max=max(y, na.rm=TRUE)),
> > > > > > > + by=list(by)
> > > > > > > + ]
> > > > > > > >     by        max
> > > > > > > > 1: 0.7 0.01464054
> > > > > > > > 2: 0.4 0.87328871
> > > > > > > > 
> > > > > > > > dt[,
> > > > > > > + list(max=max(y, na.rm=TRUE)),
> > > > > > > + by=list(by)
> > > > > > > + ]
> > > > > > > >     by        max
> > > > > > > > 1: 0.7 0.01464054
> > > > > > > > 2: 0.4 0.87328871
> > > > > > > 
> > > > > > > Can either of you provide me with the output of these steps in cases where
> > > > > > > there's an error? I've commented the output I get for each step.
> > > > > > > 
> > > > > > > byval <- list(by=dt$by)
> > > > > > > o__ <- data.table:::fastorder(byval) # 2,3,1
> > > > > > > f__ = data.table:::uniqlist(byval, order=o__) # 1,3
> > > > > > > len__ = data.table:::uniqlengths(f__, nrow(dt)) # 2,1
> > > > > > > firstofeachgroup = o__[f__] # 2,1
> > > > > > > origorder = data.table:::iradixorder(firstofeachgroup) # 2,1
> > > > > > > f__ = f__[origorder] # 3,1
> > > > > > > len__ = len__[origorder] # 2,1
> > > > > > > 
> > > > > > > 
> > > > > > > Arun
> > > > > > > 
> > > > > > > <...snip...>
> > > > > 
> > > > > _______________________________________________
> > > > > datatable-help mailing list
> > > > > datatable-help at lists.r-forge.r-project.org
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 
