[datatable-help] data.table is asking for help
Matt Dowle
mdowle at mdowle.plus.com
Tue Jun 17 19:03:09 CEST 2014
Hi Ron,
Thanks for highlighting this. Two changes now in v1.9.3 on GitHub:
  * `setkey` on `.SD` is now an error, rather than warnings for each group
    about rebuilding the key. The new error is similar to when
    attempting to use `:=` in a `.SD` subquery: ".SD is locked. Using set*()
    functions on .SD is reserved for possible future use; a tortuously
    flexible way to modify the original data by group." Thanks to Ron
    Hylton for highlighting the issue on datatable-help here
    <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
  * Looping calls to `unique(DT)` such as in `DT[,unique(.SD),by=group]` are
    now faster by avoiding internal overhead of calling `[.data.table`.
    Thanks again to Ron Hylton for highlighting in the same thread
    <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
    His example is reduced from 28 sec to 9 sec, with identical results.
I now get the following (on my slow netbook) with no changes to your code.
print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))  # were warnings, now error
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))  # was 28s, now 9s
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))    # 13s
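If a function genuinely needs a keyed table for each group, one way to stay clear of the new error is to key a copy of .SD rather than .SD itself. This is only a minimal sketch and not part of the change described above: the table `toy` and the helper `conflictsCopy` are made up for illustration, and copying every group adds overhead, so it is not meant as a fast solution.

```r
library(data.table)

# Hypothetical toy data: id "a" has conflicting values,
# id "b" repeats the same row, id "c" has a single row.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4),
                  x2 = c(5, 6, 7, 7, 8))
setkey(toy, id)

# Hypothetical helper: key a copy of the group so that .SD itself
# is never modified by reference.
conflictsCopy <- function(f) {
  u <- unique(setkey(copy(f)))
  if (nrow(u) == 1L) return(NULL)
  u
}

res <- toy[, conflictsCopy(.SD), by = id]
print(res)
```

Only the two rows for id "a" should come back, since "b" collapses to one distinct row and "c" has just one row to begin with.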
This just fixes the surprises, basically. Clearly Arun uses data.table
in a better way which is orders of magnitude faster.
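To see what Arun's approach (quoted below) does, here is a small sketch on made-up data. The table `toy` and the variable `keep` are assumptions for illustration only; the two steps are the ones from his simplified version.

```r
library(data.table)

# Hypothetical toy data: id "a" has conflicting values,
# id "b" repeats the same row, id "c" has a single row.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4),
                  x2 = c(5, 6, 7, 7, 8))

# Step 1: drop exact duplicate rows (toy is unkeyed, so all columns are compared).
ans <- unique(toy)

# Step 2: .I holds row numbers within `ans`; keep the rows of every id
# that still has more than one distinct row after step 1.
keep <- ans[, .I[.N > 1L], by = id]$V1
ans <- ans[keep]
print(ans)
```

The whole table is processed in two grouped passes instead of calling an R function on .SD once per group, which is where the speed difference comes from.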
Matt
On 14/06/14 03:58, Ron Hylton wrote:
>
> Thanks, that's very helpful.
>
> *From:*Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 10:46 PM
> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> Sorry. But we can simplify it even further:
>
> The first step is just `unique(test)`. So, we can do:
>
> system.time({
>   ans = unique(test)
>   ans = ans[ans[, .I[.N > 1L], by=id]$V1]
> })
> # 0.016 0.000 0.016
>
> Identical?
>
> setkey(ans)
> setkey(ut1)
> identical(ans, ut1) # [1] TRUE
>
> Arun
>
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> <mailto:aragorn168b at gmail.com>
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> <mailto:aragorn168b at gmail.com>
> Date: June 14, 2014 at 4:42:31 AM
> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] data.table is asking for help
>
>
>
> A slightly simpler version of the 2nd solution is:
>
> system.time({
>   ans = test[, .N, by=names(test)]
>   ans = ans[ans[, .I[.N > 1L], by=id]$V1]
> })
> # 0.019 0.000 0.019
>
> The answers are identical; you can check this by doing:
>
> ans[, N := NULL]
> setkey(ans)
> setkey(ut1)
> identical(ans, ut1) # [1] TRUE
>
>
>
> Arun
>
>
> From: Arunkumar Srinivasan aragorn168b at gmail.com
> <mailto:aragorn168b at gmail.com>
> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
> <mailto:aragorn168b at gmail.com>
> Date: June 14, 2014 at 4:34:15 AM
> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] data.table is asking for help
>
>
>
> The j-expression is evaluated from within C for each group
> (unless it's optimised with GForce - a new initiative in
> data.table). And `eval(.SD)` or `eval(anything(.SD))` is costly.
>
> You can get around it by listing the columns yourself and
> using `.I` instead, as follows:
>
> test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
> # 0.140 0.001 0.142
>
> Takes about 0.14 seconds.
>
> ------------------------------------------------------------------------
>
> An even faster way is:
>
> system.time({
>   ans = test[test[, .I[.N > 1], by=id]$V1]  # (1)
>   ans = ans[, .N, by=names(ans)]            # (2)
>   ans = ans[ans[, .I[.N > 1L], by=id]$V1]   # (3)
> })
> # 0.026 0.000 0.027
>
> The idea for the second case is:
>
> (1) Remove all entries where there's just 1 row corresponding
>     to that `id`.
> (2) Aggregate this result by all the columns and get the
>     number of rows in the column `N` (we won't have to use this
>     column though).
> (3) Now, if we aggregate by `id` and any id has just 1 row,
>     it means that this `id` originally had more than 1 row (the
>     step (1) filtering ensures this), but all of them are the same
>     and we don't need them. So we just filter for those where
>     .N > 1L.
>
> HTH
>
> Arun
>
>
> From: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
> Reply: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
> Date: June 14, 2014 at 3:30:55 AM
> To: datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] data.table is asking for help
>
>
>
> The performance is what puzzles me; the results are
> correct so the warnings don't matter, and not all the
> variations I've tried have warnings. On the real dataset
> (~800,000 rows) datatable takes about 1.5 times longer
> than dataframe + ddply. I expected it to be substantially
> faster.
>
> *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 8:57 PM
> *To:* Ron Hylton;
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> *Subject:* Re: [datatable-help] data.table is asking for help
>
> However there's another aspect. While I'm relatively
> new to R my understanding is that a function argument
> should be modifiable within the function body without
> affecting the caller, which perhaps conflicts with the
> behavior of .SD.
>
> `data.table` is designed with *really large* data sets in
> mind (> 100 or 200 GB in memory even). Therefore, as a design
> feature, it trades in "referential transparency" for
> manipulating data objects *as efficiently as possible* in
> terms of both *speed* and *memory usage* (most of the time
> they go hand-in-hand).
>
> This is perhaps the biggest design choice one needs to be
> aware of when choosing or working with data.tables. It is
> possible to modify objects by reference using data.table: all
> the functions that begin with "set*" modify objects by
> reference. The only other one, not named "set*", is the `:=`
> operator.
>
> HTH
>
> Arun
>
>
> From: Ron Hylton rhylton at verizon.net
> <mailto:rhylton at verizon.net>
> Reply: Ron Hylton rhylton at verizon.net
> <mailto:rhylton at verizon.net>
> Date: June 14, 2014 at 2:52:04 AM
> To: datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] data.table is asking for help
>
> I suspected it was something like this. As one
> clarification, there is a setkey(test,id) before any
> setkey(.SD). If setkey(test,id) is changed to
> setkey(test) so all columns are in the original
> datatable key then the warning goes away.
>
> However there's another aspect. While I'm relatively
> new to R my understanding is that a function argument
> should be modifiable within the function body without
> affecting the caller, which perhaps conflicts with the
> behavior of .SD.
>
> *From:* Arunkumar Srinivasan
> [mailto:aragorn168b at gmail.com]
> *Sent:* Friday, June 13, 2014 8:23 PM
> *To:* Ron Hylton;
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> *Subject:* Re: [datatable-help] data.table is asking
> for help
>
> Nicely reproducible post. Reproducible in v1.9.3
> (latest commit) as well.
>
> This is a tricky one. It happens because you're
> setting key on `.SD`, which should normally not be
> allowed. What happens is, when you set key the first
> time, there's no key set (here) and therefore key is
> set on all the columns `x1`, `x2` and `x3`.
>
> Now, when the next group (in the `by=`) is passed to your
> function, it'll have the key already set to
> `x1,x2,x3` (because `setkey` modifies the object by
> reference), but `.SD` has obtained *new* data
> corresponding to *this* group. `data.table` sorts
> this data knowing that it already has the key set, and
> if the key is set then the order must be 1:n. But it
> isn't, as this new data isn't sorted. `data.table`
> warns in those scenarios, and that's why you get the
> warning.
>
> To verify this, you can try:
>
> conflictsTable1 <- function(f) {
>   u <- unique(setkey(f))
>   setattr(f, 'sorted', NULL)  # drop the key from f (and so .SD) again
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> Basically, we set the key of `f` (which is the same object as
> `.SD`, as it's only modified by reference) to `NULL`
> every time afterwards, so that `.SD` for the new group will
> not have the key set.
>
> The ideal scenario here, IIUC, is that `setkey(.SD)`
> or things pointing to `.SD` should not be possible
> (locking binding doesn't seem to affect things done by
> reference..). `.SD` however should retain the key of
> the data.table, if a key was set, wherever possible.
>
> Arun
>
>
> From: Ron Hylton rhylton at verizon.net
> <mailto:rhylton at verizon.net>
> Reply: Ron Hylton rhylton at verizon.net
> <mailto:rhylton at verizon.net>
> Date: June 14, 2014 at 1:55:53 AM
> To: datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> Subject: [datatable-help] data.table is asking for help
>
> The code below generates the warning:
>
> In setkeyv(x, cols, verbose = verbose) :
>
> Already keyed by this key but had invalid row
> order, key rebuilt. If you didn't go under the
> hood please let datatable-help know so the root
> cause can be fixed.
>
> This is my first attempt at using datatable so I
> probably did something dumb, but maybe that's
> useful for someone. The first case is the one
> that gives the warnings.
>
> I'm also surprised at the timings. I wrote the
> original algorithm using dataframe & ddply and I
> expected datatable to be substantially faster; the
> opposite is true.
>
> The algorithm does the following: Certain columns
> in the table are keys and others are values in the
> sense that each row with the same set of keys
> should have the same set of values. Find all the
> key sets for which this is not true and return the
> keys sets + conflicting value sets.
>
> Insight into the performance would be appreciated.
>
> Regards,
>
> Ron
>
> library(data.table)
> library(plyr)
>
> conflictsTable1 <- function(f) {
>   u <- unique(setkey(f))
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> conflictsTable2 <- function(f) {
>   u <- unique(f)
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> conflictsFrame <- function(f) {
>   u <- unique(f)
>   if (nrow(u) == 1) return(NULL)
>   u
> }
>
> N <- 10000
> test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>                    x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
> setkey(test,id)
>
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
> print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> <mailto:datatable-help at lists.r-forge.r-project.org>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>