[datatable-help] data.table is asking for help

Matt Dowle mdowle at mdowle.plus.com
Tue Jun 17 19:03:09 CEST 2014


Hi Ron,

Thanks for highlighting this.  Two changes now in v1.9.3 on GitHub:

  * |setkey| on |.SD| is now an error, rather than warnings for each group
    about rebuilding the key. The new error is similar to when
    attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using set*()
    functions on .SD is reserved for possible future use; a tortuously
    flexible way to modify the original data by group."| Thanks to Ron
    Hylton for highlighting the issue on datatable-help here
    <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.

  * Looping calls to |unique(DT)| such as in |DT[,unique(.SD),by=group]| are
    now faster by avoiding internal overhead of calling |[.data.table|.
    Thanks again to Ron Hylton for highlighting in the same thread
    <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
    His example is reduced from 28 sec to 9 sec, with identical results.
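
As a minimal sketch of what the first change means for user code: a helper that receives .SD can call unique() on it directly, and if it really needs a key it can work on a copy. The toy table and the helper name conflicts_in_group are made up for illustration and are not from the thread; the copy(.SD) route is assumed to behave as in recent data.table releases.

```r
library(data.table)

# Toy data (hypothetical): id "a" has conflicting values, id "b" has
# identical duplicate rows, id "c" has a single row.
DT <- data.table(id = c("a", "a", "b", "b", "c"),
                 x1 = c(1, 2, 3, 3, 4),
                 x2 = c(1, 1, 5, 5, 6))

# Hypothetical helper: returns the distinct rows of one group, or NULL when
# the group has only one distinct row. It never calls set*() on its argument.
conflicts_in_group <- function(sd) {
  u <- unique(sd, by = names(sd))
  if (nrow(u) == 1L) return(NULL)
  u
}

res <- DT[, conflicts_in_group(.SD), by = id]
print(res)

# If a key is really needed inside j, set it on a copy so that .SD itself
# is left untouched (slower, because every group is copied).
res_keyed <- DT[, unique(setkey(copy(.SD))), by = id]
print(res_keyed)
```

The copy costs time and memory for every group, so it is only worth it when the helper genuinely needs sorted data.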


I now get the following (on my slow netbook) with no changes to your code.

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))   # were warnings, now error
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))   # was 28s, now 9s
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))     # 13s

This basically just fixes the surprises.  Arun's approach clearly uses
data.table in a better way, and it is orders of magnitude faster.
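
For reference, here is a self-contained sketch of the whole-table approach from the quoted replies below, checked against a by-group version on toy data. The data and the name by_group are invented for illustration. The `by` argument of unique() is spelled out because its default for keyed tables has differed between data.table versions.

```r
library(data.table)

# Toy data (hypothetical): "a" and "d" have conflicting rows, "b" has only
# identical duplicates, "c" has a single row.
test <- data.table(id = c("a", "a", "b", "b", "c", "d", "d", "d"),
                   x1 = c(1, 2, 3, 3, 4, 7, 7, 8),
                   x2 = c(1, 1, 5, 5, 6, 9, 9, 9))

# Step 1: drop exact duplicate rows across all columns in a single call.
ans <- unique(test, by = names(test))

# Step 2: keep only the ids that still have more than one distinct row.
# .I holds row numbers of `ans`; the inner query returns them in column V1.
ans <- ans[ans[, .I[.N > 1L], by = id]$V1]
print(ans)

# By-group version for comparison: one unique() call per group.
by_group <- test[, {
  u <- unique(.SD)
  if (nrow(u) > 1L) u   # groups with one distinct row yield NULL and drop out
}, by = id]
print(by_group)
```

The first version touches the whole table twice instead of evaluating j once per id, which is where the speed difference comes from.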

Matt


On 14/06/14 03:58, Ron Hylton wrote:
>
> Thanks, that's very helpful.
>
> *From:*Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 10:46 PM
> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> Sorry. But we can simplify it even further:
>
> The first step is just |unique(test)|. So, we can do:
>
> |system.time({|
> |ans = unique(test)|
> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
> |})|
> |#  0.016   0.000   0.016|
>
> Identical?
>
> |setkey(ans)|
> |setkey(ut1)|
> |identical(ans, ut1) # [1] TRUE|
>
> Arun
>
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> Date: June 14, 2014 at 4:42:31 AM
> To: Ron Hylton rhylton at verizon.net,
> datatable-help at lists.r-forge.r-project.org
> Subject:  Re: [datatable-help] data.table is asking for help
>
>
>
>     A slightly simpler version of the 2nd solution is:
>
>     |system.time({|
>     |ans = test[, .N, by=names(test)]|
>     |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>     |})|
>     |#  0.019   0.000   0.019|
>
>     The answers are identical, you can check this by doing:
>
>     |ans[, N := NULL]|
>     |setkey(ans)|
>     |setkey(ut1)|
>     |identical(ans, ut1) # [1] TRUE|
>
>     Arun
>
>
>     From: Arunkumar Srinivasan aragorn168b at gmail.com
>     Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>     Date: June 14, 2014 at 4:34:15 AM
>     To: Ron Hylton rhylton at verizon.net,
>     datatable-help at lists.r-forge.r-project.org
>     Subject:  Re: [datatable-help] data.table is asking for help
>
>
>
>         The j-expression is evaluated from within C for each group
>         (unless they're optimised with GForce - a new initiative in
>         data.table). And |eval(.SD)| or |eval(anything(.SD))| is costly.
>
>         You can get around it by listing the columns by yourself and
>         using |.I| instead, as follows:
>
>         |test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]|
>         |#  0.140   0.001   0.142|
>
>         Takes about 0.14 seconds.
>
>         ------------------------------------------------------------------------
>
>         An even faster way is:
>
>         |system.time({|
>         |ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)|
>         |ans = ans[, .N, by=names(ans)]                  # (2)|
>         |ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)|
>         |})|
>         |#  0.026   0.000   0.027|
>
>         The idea for the second case is:
>
>         (1) Remove all entries where there's just 1 row corresponding
>         to that |id|.
>         (2) Aggregate this result by all the columns and get the
>         number of rows in the column |N| (we won't have to use this
>         column though).
>         (3) Now aggregate by |id|. If any |id| has just 1 row left,
>         it means that |id| originally had more than 1 row (the
>         filtering in step (1) ensures this) but all of them were
>         identical, so we don't need it. We therefore keep only
>         those groups where |.N > 1L|.
>
>         HTH
>
>         Arun
>
>
>         From: Ron Hylton rhylton at verizon.net
>         Reply: Ron Hylton rhylton at verizon.net
>         Date: June 14, 2014 at 3:30:55 AM
>         To: datatable-help at lists.r-forge.r-project.org
>         Subject:  Re: [datatable-help] data.table is asking for help
>
>
>
>             The performance is what puzzles me; the results are
>             correct so the warnings don't matter, and not all the
>             variations I've tried have warnings.  On the real dataset
>             (~800,000 rows) datatable takes about 1.5 times longer
>             than dataframe + ddply.  I expected it to be substantially
>             faster.
>
>             *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>             *Sent:* Friday, June 13, 2014 8:57 PM
>             *To:* Ron Hylton;
>             datatable-help at lists.r-forge.r-project.org
>             <mailto:datatable-help at lists.r-forge.r-project.org>
>             *Subject:* Re: [datatable-help] data.table is asking for help
>
>                 However there's another aspect.  While I'm relatively
>                 new to R my understanding is that a function argument
>                 should be modifiable within the function body without
>                 affecting the caller, which perhaps conflicts with the
>                 behavior of .SD.
>
>             `data.table` is designed with *really large* data sets in
>             mind (> 100 or 200 GB in memory even). Therefore, as a
>             design feature, it trades in "referential transparency"
>             for manipulating data objects *as efficiently as
>             possible* in terms of both *speed* and *memory usage*
>             (most of the time they go hand-in-hand).
>
>             This is perhaps the biggest design choice one needs to be
>             aware of when choosing or working with data.tables. It is
>             possible to modify objects by reference using data.table:
>             all the functions whose names begin with "set" modify
>             objects by reference. The only other function that does so
>             is the `:=` operator.
>
>             HTH
>
>             Arun
>
>
>             From: Ron Hylton rhylton at verizon.net
>             Reply: Ron Hylton rhylton at verizon.net
>             Date: June 14, 2014 at 2:52:04 AM
>             To: datatable-help at lists.r-forge.r-project.org
>             Subject:  Re: [datatable-help] data.table is asking for help
>
>                 I suspected it was something like this.  As one
>                 clarification, there is a setkey(test,id) before any
>                 setkey(.SD).   If setkey(test,id) is changed to
>                 setkey(test) so all columns are in the original
>                 datatable key then the warning goes away.
>
>                 However there's another aspect.  While I'm relatively
>                 new to R my understanding is that a function argument
>                 should be modifiable within the function body without
>                 affecting the caller, which perhaps conflicts with the
>                 behavior of .SD.
>
>                 *From:* Arunkumar Srinivasan
>                 [mailto:aragorn168b at gmail.com]
>                 *Sent:* Friday, June 13, 2014 8:23 PM
>                 *To:* Ron Hylton;
>                 datatable-help at lists.r-forge.r-project.org
>                 <mailto:datatable-help at lists.r-forge.r-project.org>
>                 *Subject:* Re: [datatable-help] data.table is asking
>                 for help
>
>                 Nicely reproducible post. Reproducible in v1.9.3
>                 (latest commit) as well.
>
>                 This is a tricky one. It happens because you're
>                 setting a key on |.SD|, which should normally not be
>                 allowed. When you set the key the first time, there's
>                 no key set (here), and therefore the key is set on all
>                 the columns |x1|, |x2| and |x3|.
>
>                 When the next group (in the |by=.|) is passed to your
>                 function, it already has its |key| set to
>                 |x1,x2,x3| (because |setkey| modifies the object by
>                 reference), but |.SD| has obtained *new* data
>                 corresponding to /this/ group. |data.table| goes to
>                 sort this data and sees that the key is already set;
>                 but if the key is set then the order must be 1:n, and
>                 it isn't, as this data isn't sorted. |data.table|
>                 warns in those scenarios, and that's why you get the
>                 warning.
>
>                 To verify this, you can try:
>
>                 |conflictsTable1 <- function(f, address) {|
>                 |   u <- unique(setkey(f))|
>                 |   setattr(f, 'sorted', NULL)|
>                 |   if (nrow(u) == 1) return(NULL)|
>                 |   u|
>                 |}|
>
>                 Basically, we reset the key of |f| (which is the same
>                 object as |.SD|, as it's only modified by reference) to
>                 |NULL| every time afterwards, so that |.SD| for the new
>                 group will not have the key set.
>
>                 The ideal scenario here, IIUC, is that |setkey(.SD)|
>                 or things pointing to |.SD| should not be possible
>                 (locking binding doesn't seem to affect things done by
>                 reference..). |.SD| however should retain the key of
>                 the data.table, if a key was set, wherever possible.
>
>                 Arun
>
>
>                 From: Ron Hylton rhylton at verizon.net
>                 Reply: Ron Hylton rhylton at verizon.net
>                 Date: June 14, 2014 at 1:55:53 AM
>                 To: datatable-help at lists.r-forge.r-project.org
>                 Subject:  [datatable-help] data.table is asking for help
>
>                     The code below generates the warning:
>
>                     In setkeyv(x, cols, verbose = verbose) :
>
>                       Already keyed by this key but had invalid row
>                     order, key rebuilt. If you didn't go under the
>                     hood please let datatable-help know so the root
>                     cause can be fixed.
>
>                     This is my first attempt at using datatable so I
>                     probably did something dumb, but maybe that's
>                     useful for someone.  The first case is the one
>                     that gives the warnings.
>
>                     I'm also surprised at the timings.  I wrote the
>                     original algorithm using dataframe & ddply and I
>                     expected datatable to be substantially faster; the
>                     opposite is true.
>
>                     The algorithm does the following:  Certain columns
>                     in the table are keys and others are values in the
>                     sense that each row with the same set of keys
>                     should have the same set of values.  Find all the
>                     key sets for which this is not true and return the
>                     keys sets + conflicting value sets.
>
>                     Insight into the performance would be appreciated.
>
>                     Regards,
>
>                     Ron
>
>                     library(data.table)
>                     library(plyr)
>
>                     conflictsTable1 <- function(f) {
>                       u <- unique(setkey(f))
>                       if (nrow(u) == 1) return(NULL)
>                       u
>                     }
>
>                     conflictsTable2 <- function(f) {
>                       u <- unique(f)
>                       if (nrow(u) == 1) return(NULL)
>                       u
>                     }
>
>                     conflictsFrame <- function(f) {
>                       u <- unique(f)
>                       if (nrow(u) == 1) return(NULL)
>                       u
>                     }
>
>                     N <- 10000
>
>                     test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>                                        x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>
>                     setkey(test,id)
>
>                     print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
>                     print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
>                     print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
>
>                     _______________________________________________
>                     datatable-help mailing list
>                     datatable-help at lists.r-forge.r-project.org
>                     <mailto:datatable-help at lists.r-forge.r-project.org>
>                     https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
