[datatable-help] Update table from other table

Arunkumar Srinivasan aragorn168b at gmail.com
Thu Sep 18 00:04:21 CEST 2014


Arun, sometimes it helps to have nomatch = 0 when using :=

dta = data.table(idx = c(1,1), key = "idx")
dtb = data.table(idx = c(1,2), val = c("a", "b"), key = "idx")

This fails because of cartesian join not allowed by default

That's because of a bug. `allow.cartesian` error shouldn't occur with `:=` at all, as the number of rows will *never* exceed `x`. IIRC, the allow.cartesian bugs are scheduled to be fixed for the next-next release (after 1.9.4).

Arun

From: Juan Manuel Truppia <jmtruppia at gmail.com>
Reply: Juan Manuel Truppia <jmtruppia at gmail.com>>
Date: September 15, 2014 at 7:52:26 PM
To: Arunkumar Srinivasan <aragorn168b at gmail.com>>
Cc: datatable-help at lists.r-forge.r-project.org <datatable-help at lists.r-forge.r-project.org>>
Subject:  Re: [datatable-help] Update table from other table  

Arun, sometimes it helps to have nomatch = 0 when using :=

dta = data.table(idx = c(1,1), key = "idx")
dtb = data.table(idx = c(1,2), val = c("a", "b"), key = "idx")

This fails because of cartesian join not allowed by default

dta[dtb, val := i.val]

but this doesn't

dta[dtb, val := i.val, nomatch = 0]

This is the same as doing

dta[dtb, val := i.val, allow.cartesian = TRUE]

On Fri, Sep 12, 2014 at 8:08 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
Glad it helped.
Always welcome "pull requests" :).

Arun

From: Juan Manuel Truppia <jmtruppia at gmail.com>
Reply: Juan Manuel Truppia <jmtruppia at gmail.com>>
Date: September 12, 2014 at 5:46:59 PM
To: Arunkumar Srinivasan <aragorn168b at gmail.com>>
Cc: datatable-help at lists.r-forge.r-project.org <datatable-help at lists.r-forge.r-project.org>>
Subject:  Re: [datatable-help] Update table from other table

Great! (sorry, .EACHI = TRUE was an old definition).
It's good to know also that nomatch = 0 is irrelevant when using :=, I always used is to avoid the rows in dtb creeping in dtb as NAs.
Also, it's really useful to know that by = EACHI should be used when the calculations you are perfoming depend on the group or not. This came in really in handy yesterday, and should be emphasized in .EACHI description. Should I perform a pull request?

On Fri, Sep 12, 2014 at 12:14 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
I think you mean:

dta[dtb, b:=b, by=.EACHI]

and not .EACHI = TRUE. Not sure what’s the use of nomatch=0L along with :=.

by=.EACHI does exactly what it means, really. It evaluates j for each i match. Let’s first see the matches:

dta[dtb, which=TRUE]
# [1] 1 1 3

So, first row of dtb matches with first of dta. The second of dtb matches with 1st of dta and so on.

When you add by=.EACHI, as shown on the top, j-expression is evaluated on each of these matches. So, it’ll be evaluated 3-times here. On the other hand, without it, j is evaluated once. In this case, it doesn’t make a difference either way. So you should avoid by=.EACHI, as it’ll be slower with it.

It’s particularly useful when you’d like to perform operations in j, that depends on the values in j on that group. For example, consider these data.tables dt1 and dt2:

dt1 = data.table(x=rep(1:4, each=2), y=1:8, key="x")
dt2 = data.table(x=3:5, z=10, key="x")

And, you’d like to get sum(y)*z while joining.. If not for the by=.EACHI feature.. you’d approach the problem like this:

dt1[dt2][, list(agg = sum(y)*z[1]), by=x]

With by=.EACHI, this is simply:

dt1[dt2, list(agg=sum(y)*z), by=.EACHI]

Here, your expression is evaluated on each i.

Another interesting use case is, say, you’d like to create a lagged vector of y:

dt1[dt2, list(y=y, lagy = c(NA, head(y,-1)), z=z), by=.EACHI]

It’s that simple.. really. Basically, as long as the operation you’re performing in j affects it depending on whether j is executed for that group or as a whole, then you’re most likely looking for by=.EACHI. If not, by=.EACHI has no effect, and therefore you’re wanting to use a normal join there..

This is not a text book definition, rather my understanding of this awesome feature!

Hope this helps.


Arun

From: Juan Manuel Truppia <jmtruppia at gmail.com>
Reply: Juan Manuel Truppia <jmtruppia at gmail.com>>
Date: September 11, 2014 at 10:16:41 PM
To: datatable-help at lists.r-forge.r-project.org <datatable-help at lists.r-forge.r-project.org>>
Subject:  [datatable-help] Update table from other table

What is the best data.table way of doing something similar to UPDATE FROM in SQL?

I used to do something like

dta = data.table(idx = c(1, 2, 3), a = runif(3), key = "idx")
dtb = data.table(idx = c(1, 3), b = runif(3), key = "idx")
dta[dtb, b := b]

However, after the 1.9.3 and the explicit .EACHI, it fails sometimes, but I can't determine when.

So, just to be sure, I do 

dta[dtb, b := b, .EACHI = TRUE, nomatch = 0]

Is the .EACHI and the nomatch necessary?

In this case, I want the row with idx 1 and 3 (the matching ones) to end with a b value from the matching b column in dtb, and the row with idx 2 (the one that isn't in dtb) to end up with NA in column b.


_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20140918/82ff4ee1/attachment.html>


More information about the datatable-help mailing list