[datatable-help] data.table is asking for help

Ron Hylton rhylton at verizon.net
Sat Jun 14 02:51:24 CEST 2014


I suspected it was something like this. One clarification: setkey(test,id) is called before any setkey(.SD). If setkey(test,id) is changed to setkey(test), so that all columns are in the original data.table's key, the warning goes away.

 

However, there’s another aspect. While I’m relatively new to R, my understanding is that a function argument should be modifiable within the function body without affecting the caller, which perhaps conflicts with the behavior of .SD.
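
The contrast described above can be sketched with a toy table. The names dt, key_copy and key_in_place are made up for illustration and are not from the thread. Ordinary R assignment inside a function works on a private copy, whereas data.table's set* functions, including setkey(), modify the object they are given:

```r
library(data.table)

dt <- data.table(a = c(3, 1, 2))

# Keys a private copy; the caller's table is left alone.
key_copy <- function(x) {
  x <- copy(x)
  setkey(x, a)
  x
}

# Keys the argument itself; setkey() works by reference.
key_in_place <- function(x) {
  setkey(x, a)
  invisible(x)
}

sorted_copy <- key_copy(dt)
key_after_copy <- key(dt)       # NULL: dt is untouched

key_in_place(dt)
key_after_in_place <- key(dt)   # "a": dt was keyed and reordered
```

So the usual copy-on-modify expectation holds for plain R assignment, but not for functions that are documented to update by reference.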

 

From: Arunkumar Srinivasan [mailto:aragorn168b at gmail.com] 
Sent: Friday, June 13, 2014 8:23 PM
To: Ron Hylton; datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] data.table is asking for help

 

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you’re setting a key on .SD, which should normally not be allowed. When you set the key the first time, no key is set (here), and therefore the key is set on all the columns: x1, x2 and x3.

Now, when the next group (in the by=) is passed to your function, it already has the key set to x1,x2,x3 (because setkey modifies the object by reference), but .SD has obtained new data corresponding to this group. data.table then sorts this data, knowing that the key is already set. But if the key is set, the order must already be 1:n. It isn’t, because this data isn’t sorted. data.table warns in those scenarios, and that’s why you get the warning.

To verify this, you can try:

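conflictsTable1 <- function(f) {
  # key (and sort) this group's rows on all columns, then deduplicate
  u <- unique(setkey(f))
  # drop the key marker so the next group's .SD does not appear keyed
  setattr(f, 'sorted', NULL)
  if (nrow(u) == 1) return(NULL)
  u
}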

Basically, we reset the key of f (which is the same object as .SD, as it’s only modified by reference) to NULL every time afterwards, so that .SD for the new group will not have the key set.
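
As a small illustration of what clearing the 'sorted' attribute does, here is a sketch on a toy table (dt is a made-up name, not from the thread). key() reads that attribute, so removing it makes the table look unkeyed while leaving the rows where they are:

```r
library(data.table)

dt <- data.table(a = c(2, 1, 3))

setkey(dt, a)
key_before <- key(dt)          # "a"

# Remove only the key marker; the row order is not changed.
setattr(dt, "sorted", NULL)
key_after <- key(dt)           # NULL
```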

The ideal scenario here, IIUC, is that setkey(.SD), or setkey on things pointing to .SD, should not be possible (locking the binding doesn’t seem to affect things done by reference). .SD, however, should retain the key of the data.table, if a key was set, wherever possible.
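
One way to avoid touching .SD at all is to key a copy of it. This is a sketch under assumptions rather than something proposed in the thread: conflictsTableCopy and toy are hypothetical names, and copying every group costs extra time and memory, so it is best treated as a last resort.

```r
library(data.table)

# Same logic as conflictsTable1, but setkey() is applied to a copy,
# so .SD itself is never keyed or reordered.
conflictsTableCopy <- function(f) {
  u <- unique(setkey(copy(f)))
  if (nrow(u) == 1) return(NULL)
  u
}

toy <- data.table(
  id = c("a", "a", "b", "b", "c"),
  x1 = c(1, 1, 3, 2, 4),
  x2 = c(5, 5, 6, 6, 7)
)
setkey(toy, id)

# Only id "b" has more than one distinct (x1, x2) combination.
res <- toy[, conflictsTableCopy(.SD), by = id]
```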

 

Arun


From: Ron Hylton rhylton at verizon.net
Reply: Ron Hylton rhylton at verizon.net
Date: June 14, 2014 at 1:55:53 AM
To: datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] data.table is asking for help





The code below generates the warning:

 

In setkeyv(x, cols, verbose = verbose) :
  Already keyed by this key but had invalid row order, key rebuilt. If you didn't go under the hood please let datatable-help know so the root cause can be fixed.

 

This is my first attempt at using data.table, so I probably did something dumb, but maybe that’s useful for someone. The first case is the one that gives the warnings.

 

I’m also surprised at the timings. I wrote the original algorithm using data.frame and ddply, and I expected data.table to be substantially faster; the opposite is true.

 

The algorithm does the following: certain columns in the table are keys and others are values, in the sense that each row with the same set of keys should have the same set of values. Find all the key sets for which this is not true and return those key sets plus their conflicting value sets.
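
The algorithm described above can also be sketched without calling an R function once per group. This is not the code from this post: toy, u and conflicts are illustrative names. The idea is to deduplicate the whole table once, then keep only the ids that still have more than one row. Whether the per-group function calls account for the timings reported here is an assumption.

```r
library(data.table)

toy <- data.table(
  id = c("a", "a", "b", "b", "c"),
  x1 = c(1, 1, 2, 3, 4),
  x2 = c(5, 5, 6, 6, 7)
)

# Deduplicate on all columns explicitly, so the result does not
# depend on whether the table happens to be keyed.
u <- unique(toy, by = names(toy))

# An id with more than one distinct value row is a conflict.
conflicts <- u[, if (.N > 1L) .SD, by = id]
```

On the toy data only id "b" maps to two different value rows, so conflicts holds those two rows.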

 

Insight into the performance would be appreciated.

 

Regards,

Ron

 

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)), x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))

print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))

print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
