[datatable-help] data.table is asking for help

Wed Jun 18 02:34:14 CEST 2014

Hi Matt,

There was recently another discussion on using setkey on .SD here:

  http://r.789695.n4.nabble.com/setkey-on-SD-td4690283.html

So the following code won't work any more in the current 1.9.3 dev
version. I think the idea of using setkey in a "chain" of data.tables
was nice, since it allows the key to be set temporarily.

The basic idea is taken from the comment here:

http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917

A <-
  data.table(
    x = c(1, 2, 3, 4, 5),
    y = letters[1:5])
B <-
  data.table(
    x = c(1, 2, 3, 1, 4),
    f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
    z = 101:105)
B[, setkey(.SD, x)][
  , .SD[A, roll = TRUE, rollends = FALSE], by = f][
    , setkey(.SD, x)]
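
For comparison, here is a minimal sketch of one way to get the same kind of result without calling setkey on .SD at all. It assumes a data.table version that supports the on= argument of [.data.table (1.9.6 or later, so not the 1.9.3 dev version discussed here). The name res is just a placeholder, and the final sort uses order() rather than a key:

```r
library(data.table)

A <- data.table(x = c(1, 2, 3, 4, 5), y = letters[1:5])
B <- data.table(
  x = c(1, 2, 3, 1, 4),
  f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
  z = 101:105)

# Rolling join of each f-group of B against A. The join column is named
# via on=, so no key has to be set on B or on .SD.
# The trailing [order(x)] sorts the combined result by x without keying it.
res <- B[, .SD[A, on = "x", roll = TRUE, rollends = FALSE], by = f][order(x)]

print(res)
```

Unlike the chained setkey version, this leaves B unkeyed and the result unkeyed; if a key on the result is wanted, setkey(res, x) can be called afterwards on the ordinary (unlocked) data.table.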

Thanks,

M

On 06/18/2014 01:03 AM, Matt Dowle wrote:
> 
> Hi Ron,
> 
> Thanks for highlighting this.  Two changes now in v1.9.3 on GitHub:
> 
>   * |setkey| on |.SD| is now an error, rather than warnings for each
>     group about rebuilding the key. The new error is similar to when
>     attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using
>     set*() functions on .SD is reserved for possible future use; a
>     tortuously flexible way to modify the original data by
>     group."| Thanks to Ron Hylton for highlighting the issue on
>     datatable-help here
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>
>   * Looping calls to |unique(DT)| such as in
>     |DT[,unique(.SD),by=group]| are now faster by avoiding the internal
>     overhead of calling |[.data.table|. Thanks again to Ron Hylton for
>     highlighting in the same thread
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>     His example is reduced from 28 sec to 9 sec, with identical results.
> 
> 
> I now get the following (on my slow netbook) with no changes to your code.
> 
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))   # were warnings, now error
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))   # was 28s, now 9s
> print(system.time(uf <- ddply(test, .(id), conflictsFrame)))     # 13s
> 
> This just fixes the surprises, basically.   Clearly Arun uses data.table
> in a better way which is orders of magnitude faster.
> 
> Matt
> 
> 
> On 14/06/14 03:58, Ron Hylton wrote:
>>
>> Thanks, that’s very helpful.
>>
>>  
>>
>> *From:*Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>> *Sent:* Friday, June 13, 2014 10:46 PM
>> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>  
>>
>> Sorry. But we can simplify it even further:
>>
>> The first step is just |unique(test)|. So, we can do:
>>
>> |system.time({|
>> |ans = unique(test)|
>> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>> |})|
>> |#  0.016   0.000   0.016  |
>>
>> Identical?
>>
>> |setkey(ans)|
>> |setkey(ut1)|
>> |identical(ans, ut1) # [1] TRUE|
>>
>>  
>>
>> Arun
>>
>>
>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Date: June 14, 2014 at 4:42:31 AM
>> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>     A slightly simpler version of the 2nd solution is:
>>
>>     |system.time({|
>>
>>     |ans = test[, .N, by=names(test)]|
>>
>>     |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>>
>>     |})|
>>
>>     |#  0.019   0.000   0.019   |
>>
>>      
>>
>>     The answers are identical, you can check this by doing:
>>
>>     |ans[, N := NULL]|
>>
>>     |setkey(ans)|
>>
>>     |setkey(ut1)|
>>
>>     |identical(ans, ut1) # [1] TRUE|
>>
>>      
>>
>>      
>>
>>     Arun
>>
>>
>>     From: Arunkumar Srinivasan aragorn168b at gmail.com
>>     <mailto:aragorn168b at gmail.com>
>>     Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>>     <mailto:aragorn168b at gmail.com>
>>     Date: June 14, 2014 at 4:34:15 AM
>>     To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>>     datatable-help at lists.r-forge.r-project.org
>>     <mailto:datatable-help at lists.r-forge.r-project.org>
>>     Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>         The j-expression is evaluated from within C for each group
>>         (unless they’re optimised with GForce - a new initiative in
>>         data.table). And |eval(.SD)| or |eval(anything(.SD))| is costly.
>>
>>         You can get around it by listing the columns by yourself and
>>         using |.I| instead, as follows:
>>
>>         |test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]|
>>
>>         |#  0.140   0.001   0.142    |
>>
>>          
>>
>>          
>>
>>         Takes about 0.14 seconds.
>>
>>         ------------------------------------------------------------------------
>>
>>         An even faster way is:
>>
>>         |system.time({|
>>
>>         |ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)    |
>>
>>         |ans = ans[, .N, by=names(ans)]                  # (2)    |
>>
>>         |ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)|
>>
>>         |})|
>>
>>         | |
>>
>>         |#  0.026   0.000   0.027    |
>>
>>          
>>
>>          
>>
>>         The idea for the second case is:
>>
>>         (1) remove all entries where there’s just 1 row corresponding
>>         to that |id|.
>>         (2) Aggregate this result by all the columns now and get the
>>         number of rows in the column |N| (we won’t have to use this
>>         column though).
>>         (3) Now, if we aggregate by |id| and any |id| has just 1 row,
>>         it means that this |id| originally had more than 1 row (the
>>         filtering in step (1) ensures this), but all of them are the
>>         same and we don’t need them. So we keep only those where
>>         .N > 1L.
>>
>>         HTH
>>
>>          
>>
>>         Arun
>>
>>
>>         From: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>>         Reply: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>>         Date: June 14, 2014 at 3:30:55 AM
>>         To: datatable-help at lists.r-forge.r-project.org
>>         <mailto:datatable-help at lists.r-forge.r-project.org>
>>         Subject:  Re: [datatable-help] data.table is asking for help
>>
>>
>>
>>             The performance is what puzzles me; the results are
>>             correct so the warnings don’t matter, and not all the
>>             variations I’ve tried have warnings.  On the real dataset
>>             (~800,000 rows) datatable takes about 1.5 times longer
>>             than dataframe + ddply.  I expected it to be substantially
>>             faster.
>>
>>              
>>
>>             *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>>             *Sent:* Friday, June 13, 2014 8:57 PM
>>             *To:* Ron Hylton;
>>             datatable-help at lists.r-forge.r-project.org
>>             <mailto:datatable-help at lists.r-forge.r-project.org>
>>             *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>              
>>
>>                 However there’s another aspect.  While I’m relatively
>>                 new to R my understanding is that a function argument
>>                 should be modifiable within the function body without
>>                 affecting the caller, which perhaps conflicts with the
>>                 behavior of .SD.
>>
>>             `data.table` is designed with *really large* data sets
>>             in mind (> 100 or 200 GB in memory even). Therefore, as
>>             a design feature, it trades away "referential
>>             transparency" in order to manipulate data objects *as
>>             efficiently as possible* in terms of both *speed* and
>>             *memory usage* (most of the time they go hand-in-hand).
>>
>>             This is perhaps the biggest design choice one needs to be
>>             aware of when choosing or working with data.tables. It is
>>             possible to modify objects by reference using data.table:
>>             all the functions that begin with "set*" modify objects by
>>             reference. The only other one, outside the "set*" family,
>>             is the `:=` operator.
>>
>>              
>>
>>             HTH
>>
>>             Arun
>>
>>
>>             From: Ron Hylton rhylton at verizon.net
>>             <mailto:rhylton at verizon.net>
>>             Reply: Ron Hylton rhylton at verizon.net
>>             <mailto:rhylton at verizon.net>
>>             Date: June 14, 2014 at 2:52:04 AM
>>             To: datatable-help at lists.r-forge.r-project.org
>>             <mailto:datatable-help at lists.r-forge.r-project.org>
>>             Subject:  Re: [datatable-help] data.table is asking for help
>>
>>              
>>
>>                 I suspected it was something like this.  As one
>>                 clarification, there is a setkey(test,id) before any
>>                 setkey(.SD).   If setkey(test,id) is changed to
>>                 setkey(test) so all columns are in the original
>>                 datatable key then the warning goes away.
>>
>>                  
>>
>>                 However there’s another aspect.  While I’m relatively
>>                 new to R my understanding is that a function argument
>>                 should be modifiable within the function body without
>>                 affecting the caller, which perhaps conflicts with the
>>                 behavior of .SD.
>>
>>                  
>>
>>                 *From:* Arunkumar Srinivasan
>>                 [mailto:aragorn168b at gmail.com]
>>                 *Sent:* Friday, June 13, 2014 8:23 PM
>>                 *To:* Ron Hylton;
>>                 datatable-help at lists.r-forge.r-project.org
>>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>>                 *Subject:* Re: [datatable-help] data.table is asking
>>                 for help
>>
>>                  
>>
>>                 Nicely reproducible post. Reproducible in v1.9.3
>>                 (latest commit) as well.
>>
>>                 This is a tricky one. It happens because you’re
>>                 setting the key on |.SD|, which should normally not be
>>                 allowed. What happens is that when you set the key the
>>                 first time, there’s no key set (here), and therefore the
>>                 key is set on all the columns |x1|, |x2| and |x3|.
>>
>>                 Now, when the next group (in the |by=.|) is passed to
>>                 your function, it will already have its |key| set to
>>                 |x1,x2,x3| (because |setkey| modifies the object by
>>                 reference), but |.SD| now holds *new* data
>>                 corresponding to /this/ group. |data.table| sorts this
>>                 data believing that the key is already set, in which
>>                 case the order should be 1:n. It isn’t, because this
>>                 data isn’t sorted. |data.table| warns in those
>>                 scenarios, and that’s why you get the warning.
>>
>>                 To verify this, you can try:
>>
>>                 |conflictsTable1 <- function(f, address) {|
>>
>>                 |  u <- unique(setkey(f))|
>>
>>                 |  setattr(f, 'sorted', NULL)|
>>
>>                 |  if (nrow(u) == 1) return(NULL)|
>>
>>                 |  u|
>>
>>                 |}|
>>
>>                 Basically, we set the key of |f| (which is equal to
>>                 |.SD| as it’s only modified by reference) to |NULL|
>>                 every time afterwards, so that |.SD| for the new group
>>                 will not have the key set.
>>
>>                 The ideal scenario here, IIUC, is that |setkey(.SD)|
>>                 or things pointing to |.SD| should not be possible
>>                 (locking binding doesn’t seem to affect things done by
>>                 reference..). |.SD| however should retain the key of
>>                 the data.table, if a key was set, wherever possible.
>>
>>                  
>>
>>                 Arun
>>
>>
>>                 From: Ron Hylton rhylton at verizon.net
>>                 <mailto:rhylton at verizon.net>
>>                 Reply: Ron Hylton rhylton at verizon.net
>>                 <mailto:rhylton at verizon.net>
>>                 Date: June 14, 2014 at 1:55:53 AM
>>                 To: datatable-help at lists.r-forge.r-project.org
>>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>>                 Subject:  [datatable-help] data.table is asking for help
>>
>>                  
>>
>>                     The code below generates the warning:
>>
>>                      
>>
>>                     In setkeyv(x, cols, verbose = verbose) :
>>
>>                       Already keyed by this key but had invalid row
>>                     order, key rebuilt. If you didn't go under the
>>                     hood please let datatable-help know so the root
>>                     cause can be fixed.
>>
>>                      
>>
>>                     This is my first attempt at using datatable so I
>>                     probably did something dumb, but maybe that’s
>>                     useful for someone.  The first case is the one
>>                     that gives the warnings.
>>
>>                      
>>
>>                     I’m also surprised at the timings.  I wrote the
>>                     original algorithm using dataframe & ddply and I
>>                     expected datatable to be substantially faster; the
>>                     opposite is true.
>>
>>                      
>>
>>                     The algorithm does the following:  Certain columns
>>                     in the table are keys and others are values in the
>>                     sense that each row with the same set of keys
>>                     should have the same set of values.  Find all the
>>                     key sets for which this is not true and return the
>>                     key sets + conflicting value sets.
>>
>>                      
>>
>>                     Insight into the performance would be appreciated.
>>
>>                      
>>
>>                     Regards,
>>
>>                     Ron
>>
>>                      
>>
>>                     library(data.table)
>>                     library(plyr)
>>
>>                     conflictsTable1 <- function(f) {
>>                       u <- unique(setkey(f))
>>                       if (nrow(u) == 1) return(NULL)
>>                       u
>>                     }
>>
>>                     conflictsTable2 <- function(f) {
>>                       u <- unique(f)
>>                       if (nrow(u) == 1) return(NULL)
>>                       u
>>                     }
>>
>>                     conflictsFrame <- function(f) {
>>                       u <- unique(f)
>>                       if (nrow(u) == 1) return(NULL)
>>                       u
>>                     }
>>
>>                     N <- 10000
>>                     test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>>                                        x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>>
>>                     setkey(test,id)
>>
>>                     print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
>>                     print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
>>                     print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
>>
>>                     _______________________________________________
>>                     datatable-help mailing list
>>                     datatable-help at lists.r-forge.r-project.org
>>                     <mailto:datatable-help at lists.r-forge.r-project.org>
>>                     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help