[datatable-help] data.table is asking for help
Michael Smith
my.r.help at gmail.com
Wed Jun 18 02:34:14 CEST 2014
Hi Matt,
There was recently another discussion on using setkey on .SD here:
http://r.789695.n4.nabble.com/setkey-on-SD-td4690283.html
So the following code no longer works in the current 1.9.3 dev
version. I think the idea of using setkey in a "chain" of data.tables
was nice, since it allows the key to be set temporarily.
The basic idea is taken from the comment here:
http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917
A <- data.table(
  x = c(1, 2, 3, 4, 5),
  y = letters[1:5])
B <- data.table(
  x = c(1, 2, 3, 1, 4),
  f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
  z = 101:105)
B[, setkey(.SD, x)][
  , .SD[A, roll = TRUE, rollends = FALSE], by = f][
  , setkey(.SD, x)]
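
One possible way to get the same per-group rolling join without keying .SD is sketched below. It assumes a data.table release that supports the `on` argument of `[.data.table`, which was added after the 1.9.3 dev version discussed here, so treat it as an illustration rather than a drop-in replacement. The name `res` is only for this example.

```r
library(data.table)

A <- data.table(x = c(1, 2, 3, 4, 5), y = letters[1:5])
B <- data.table(
  x = c(1, 2, 3, 1, 4),
  f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
  z = 101:105)

# Rolling join of A onto each group of B, matching on x.
# No setkey() is called on .SD; `on` names the join column instead
# (assumption: the installed data.table version supports `on`).
res <- B[, .SD[A, on = "x", roll = TRUE, rollends = FALSE], by = f]
```

Because nothing is keyed, neither B nor .SD is modified, which is the "temporary key" effect the chain above was after.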
Thanks,
M
On 06/18/2014 01:03 AM, Matt Dowle wrote:
>
> Hi Ron,
>
> Thanks for highlighting this. Two changes now in v1.9.3 on GitHub:
>
> * |setkey| on |.SD| is now an error, rather than warnings for each
>   group about rebuilding the key. The new error is similar to when
>   attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using
>   set*() functions on .SD is reserved for possible future use; a
>   tortuously flexible way to modify the original data by
>   group."| Thanks to Ron Hylton for highlighting the issue on
>   datatable-help here
>   <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>
> * Looping calls to |unique(DT)| such as
>   in |DT[,unique(.SD),by=group]| are now faster by avoiding internal
>   overhead of calling |[.data.table|. Thanks again to Ron Hylton for
>   highlighting in the same thread
>   <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>   His example is reduced from 28 sec to 9 sec, with identical results.
>
>
> I now get the following (on my slow netbook) with no changes to your code.
>
> print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id])) # were warnings, now error
> print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id])) # was 28s, now 9s
> print(system.time(uf <- ddply(test, .(id), conflictsFrame))) # 13s
>
> This just fixes the surprises, basically. Clearly Arun uses data.table
> in a better way which is orders of magnitude faster.
>
> Matt
>
>
> On 14/06/14 03:58, Ron Hylton wrote:
>>
>> Thanks, that's very helpful.
>>
>>
>>
>> *From:*Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>> *Sent:* Friday, June 13, 2014 10:46 PM
>> *To:* Ron Hylton; datatable-help at lists.r-forge.r-project.org
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>
>>
>> Sorry. But we can simplify it even further:
>>
>> The first step is just |unique(test)|. So, we can do:
>>
>> |system.time({|
>> |ans = unique(test)|
>> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>> |})|
>> |# 0.016 0.000 0.016 |
>>
>> Identical?
>>
>> |setkey(ans)|
>> |setkey(ut1)|
>> |identical(ans, ut1) # [1] TRUE|
>>
>>
>>
>> Arun
>>
>>
>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Date: June 14, 2014 at 4:42:31 AM
>> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>>
>>
>> A slightly simpler version of the 2nd solution is:
>>
>> |system.time({|
>>
>> |ans = test[, .N, by=names(test)]|
>>
>> |ans = ans[ans[, .I[.N > 1L], by=id]$V1]|
>>
>> |})|
>>
>> |# 0.019 0.000 0.019 |
>>
>>
>>
>> The answers are identical, you can check this by doing:
>>
>> |ans[, N := NULL]|
>>
>> |setkey(ans)|
>>
>> |setkey(ut1)|
>>
>> |identical(ans, ut1) # [1] TRUE|
>>
>>
>>
>>
>>
>> Arun
>>
>>
>> From: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Reply: Arunkumar Srinivasan aragorn168b at gmail.com
>> <mailto:aragorn168b at gmail.com>
>> Date: June 14, 2014 at 4:34:15 AM
>> To: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>,
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>>
>>
>> The j-expression is evaluated from within C for each group
>> (unless they’re optimised with GForce - a new initiative in
>> data.table). And |eval(.SD)| or |eval(anything(.SD))| is costly.
>>
>> You can get around it by listing the columns by yourself and
>> using |.I| instead, as follows:
>>
>> |test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]|
>>
>> |# 0.140 0.001 0.142 |
>>
>>
>>
>>
>>
>> Takes about 0.14 seconds.
>>
>> ------------------------------------------------------------------------
>>
>> An even faster way is:
>>
>> |system.time({|
>>
>> |ans = test[test[, .I[.N > 1], by=id]$V1] # (1) |
>>
>> |ans = ans[, .N, by=names(ans)] # (2) |
>>
>> |ans = ans[ans[, .I[.N > 1L], by=id]$V1] # (3)|
>>
>> |})|
>>
>> | |
>>
>> |# 0.026 0.000 0.027 |
>>
>>
>>
>>
>>
>> The idea for the second case is:
>>
>> (1) remove all entries where there’s just 1 row corresponding
>> to that |id|.
>> (2) Aggregate this result by all the columns now and get the
>> number of rows in the column |N| (we won’t have to use this
>> column though).
>> (3) Now, if we aggregate by |id| and any id has just 1 row,
>> then that |id| had more than 1 row (the step
>> (1) filtering ensures this), but all of them are the same and we
>> don’t need them. So we just filter for those where .N > 1L.
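
The three steps can be traced on a small table. The following is a minimal sketch with made-up data (`toy` and its values are hypothetical, not from the thread): id "a" has conflicting values, id "b" has two identical rows, and id "c" has a single row, so only "a" should survive.

```r
library(data.table)

# Hypothetical toy data for illustration only.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4))

ans <- toy[toy[, .I[.N > 1L], by = id]$V1]   # (1) drop ids that have a single row
ans <- ans[, .N, by = names(ans)]            # (2) collapse identical rows, count in N
ans <- ans[ans[, .I[.N > 1L], by = id]$V1]   # (3) keep ids that still have > 1 row
ans[, N := NULL]                             # the count column is not needed
```

After step (2), id "b" is down to one row, so step (3) removes it; id "c" was already removed in step (1).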
>>
>> HTH
>>
>>
>>
>> Arun
>>
>>
>> From: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>> Reply: Ron Hylton rhylton at verizon.net <mailto:rhylton at verizon.net>
>> Date: June 14, 2014 at 3:30:55 AM
>> To: datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>>
>>
>> The performance is what puzzles me; the results are
>> correct so the warnings don’t matter, and not all the
>> variations I’ve tried have warnings. On the real dataset
>> (~800,000 rows) datatable takes about 1.5 times longer
>> than dataframe + ddply. I expected it to be substantially
>> faster.
>>
>>
>>
>> *From:* Arunkumar Srinivasan [mailto:aragorn168b at gmail.com]
>> *Sent:* Friday, June 13, 2014 8:57 PM
>> *To:* Ron Hylton;
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>>
>>
>> However there’s another aspect. While I’m relatively
>> new to R my understanding is that a function argument
>> should be modifiable within the function body without
>> affecting the caller, which perhaps conflicts with the
>> behavior of .SD.
>>
>> `data.table` is designed with *really large*
>> data sets in mind (> 100 or 200 GB in memory even).
>> Therefore, as a design feature, it trades in "referential
>> transparency" for manipulating data objects *as efficiently
>> as possible* in terms of both *speed* and *memory usage*
>> (most of the time they go hand-in-hand).
>>
>> This is perhaps the biggest design choice one needs to be
>> aware of when choosing and working with data.tables. It is possible
>> to modify objects by reference using data.table: all the
>> functions that begin with "set*" modify objects by
>> reference. The only other function that does so is the `:=`
>> operator.
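
To make the by-reference behaviour concrete, here is a small sketch. The function names `add_col` and `keyed_copy` are hypothetical and only for illustration; `copy()`, `setkey()`, `key()` and `:=` are data.table's own.

```r
library(data.table)

DT <- data.table(a = c(3L, 1L, 2L))

# `:=` inside a function modifies the caller's table; no copy is made.
add_col <- function(d) d[, b := a * 2L]
add_col(DT)
has_b <- "b" %in% names(DT)      # the caller's DT now has column b

# An explicit copy() restores the usual R semantics: the argument can be
# keyed (and therefore reordered) without touching the caller's object.
keyed_copy <- function(d) {
  d <- copy(d)
  setkey(d, a)
  d
}
sorted <- keyed_copy(DT)
orig_key <- key(DT)              # DT itself is still unkeyed and unsorted
```

The cost of `copy()` is the memory and time for a full duplicate, which is the overhead the by-reference design avoids.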
>>
>>
>>
>> HTH
>>
>> Arun
>>
>>
>> From: Ron Hylton rhylton at verizon.net
>> <mailto:rhylton at verizon.net>
>> Reply: Ron Hylton rhylton at verizon.net
>> <mailto:rhylton at verizon.net>
>> Date: June 14, 2014 at 2:52:04 AM
>> To: datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>>
>>
>> I suspected it was something like this. As one
>> clarification, there is a setkey(test,id) before any
>> setkey(.SD). If setkey(test,id) is changed to
>> setkey(test) so all columns are in the original
>> datatable key then the warning goes away.
>>
>>
>>
>> However there’s another aspect. While I’m relatively
>> new to R my understanding is that a function argument
>> should be modifiable within the function body without
>> affecting the caller, which perhaps conflicts with the
>> behavior of .SD.
>>
>>
>>
>> *From:* Arunkumar Srinivasan
>> [mailto:aragorn168b at gmail.com]
>> *Sent:* Friday, June 13, 2014 8:23 PM
>> *To:* Ron Hylton;
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> *Subject:* Re: [datatable-help] data.table is asking
>> for help
>>
>>
>>
>> Nicely reproducible post. Reproducible in v1.9.3
>> (latest commit) as well.
>>
>> This is a tricky one. It happens because you’re
>> setting key on |.SD| which should normally not be
>> allowed. What happens is, when you set key the first
>> time, there’s no key set (here) and therefore key is
>> set on all the columns |x1|, |x2| and |x3|.
>>
>> Now, when the next group (in the |by=.|) is passed to your
>> function, it’ll have the |key| already set to
>> |x1,x2,x3| (because |setkey| modifies the object by
>> reference), but |.SD| has obtained *new* data
>> corresponding to /this/ group. |data.table| sorts
>> this data, knowing that it already has the key set. But
>> if the key is set then the order must already be 1:n, and it
>> isn’t, as this data isn’t sorted. |data.table|
>> warns in those scenarios, and that’s why you get the
>> warning.
>>
>> To verify this, you can try:
>>
>> |conflictsTable1 <- function(f, address) {|
>>
>> | u <- unique(setkey(f))|
>>
>> | setattr(f, 'sorted', NULL)|
>>
>> | if (nrow(u) == 1) return(NULL)|
>>
>> | u|
>>
>> |}|
>>
>> Basically, we set the key of |f| (which is equal to
>> |.SD| as it’s only modified by reference) to |NULL|
>> every time afterwards, so that |.SD| for the new group will
>> not have the key set.
>>
>> The ideal scenario here, IIUC, is that |setkey(.SD)|
>> or things pointing to |.SD| should not be possible
>> (locking the binding doesn’t seem to affect things done by
>> reference). |.SD| however should retain the key of
>> the data.table, if a key was set, wherever possible.
>>
>>
>>
>> Arun
>>
>>
>> From: Ron Hylton rhylton at verizon.net
>> <mailto:rhylton at verizon.net>
>> Reply: Ron Hylton rhylton at verizon.net
>> <mailto:rhylton at verizon.net>
>> Date: June 14, 2014 at 1:55:53 AM
>> To: datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> Subject: [datatable-help] data.table is asking for help
>>
>>
>>
>> The code below generates the warning:
>>
>>
>>
>> In setkeyv(x, cols, verbose = verbose) :
>>
>> Already keyed by this key but had invalid row
>> order, key rebuilt. If you didn't go under the
>> hood please let datatable-help know so the root
>> cause can be fixed.
>>
>>
>>
>> This is my first attempt at using datatable so I
>> probably did something dumb, but maybe that’s
>> useful for someone. The first case is the one
>> that gives the warnings.
>>
>>
>>
>> I’m also surprised at the timings. I wrote the
>> original algorithm using dataframe & ddply and I
>> expected datatable to be substantially faster; the
>> opposite is true.
>>
>>
>>
>> The algorithm does the following: Certain columns
>> in the table are keys and others are values in the
>> sense that each row with the same set of keys
>> should have the same set of values. Find all the
>> key sets for which this is not true and return the
>> key sets + conflicting value sets.
>>
>>
>>
>> Insight into the performance would be appreciated.
>>
>>
>>
>> Regards,
>>
>> Ron
>>
>>
>>
>> library(data.table)
>>
>> library(plyr)
>>
>>
>>
>> conflictsTable1 <- function(f) {
>>
>> u <- unique(setkey(f))
>>
>> if (nrow(u) == 1) return(NULL)
>>
>> u
>>
>> }
>>
>>
>>
>> conflictsTable2 <- function(f) {
>>
>> u <- unique(f)
>>
>> if (nrow(u) == 1) return(NULL)
>>
>> u
>>
>> }
>>
>>
>>
>> conflictsFrame <- function(f) {
>>
>> u <- unique(f)
>>
>> if (nrow(u) == 1) return(NULL)
>>
>> u
>>
>> }
>>
>>
>>
>> N <- 10000
>>
>> test <-
>> data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>> x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>>
>>
>>
>> setkey(test,id)
>>
>>
>>
>> print(system.time(ut1 <- test[,
>> conflictsTable1(.SD), by=id]))
>>
>>
>>
>> print(system.time(ut2 <- test[,
>> conflictsTable2(.SD), by=id]))
>>
>>
>>
>> print(system.time(uf <- ddply(test, .(id),
>> conflictsFrame)))
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> <mailto:datatable-help at lists.r-forge.r-project.org>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help