[datatable-help] changing data.table by-without-by syntax to require a "by"

Eduard Antonyan eduard.antonyan at gmail.com
Thu May 2 01:27:58 CEST 2013


I mean I find it a little easier to read when joining with each.i=TRUE
*and* there is by=b - this is an extra operation that I don't believe has
an analog in current syntax (but I haven't thought about this too much).

On May 1, 2013, at 6:18 PM, Arunkumar Srinivasan <aragorn168b at gmail.com>
wrote:

 Eduard,

What do you mean here: `at least when by is not there`. The "cross.apply"
or ".join" or "each.i" was supposedly an option when "i" argument is a
`data.table`, right? I can find a reason why there would be a `by` there…
(I mean an explicit by). Do you mean the implicit by when it's true? if
not, could you elaborate (maybe with an example)?


Arun

On Thursday, May 2, 2013 at 12:47 AM, Eduard Antonyan wrote:

yeah, I think cross.apply is pretty clear as well, at least when an extra
'by' is not there, but I like each.i when there is a 'by'. Either way this
is a pretty small consideration for me and I'd be perfectly happy with
either.


On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com>wrote:

 In retrospect, `.join` is also confusing/untrue (as the data.table join is
still being done). I find `cross.apply` clearer.

Arun

On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:

 Eduard,

Yes, that clears it up. If `.join` if FALSE, then there's no
`by-without-by`, basically. `drop` really serves another purpose.

Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of
the intended purposes of this post to begin with) to mean to apply to *any*
`i` operation. Unless this is true, I'd like to stick to `.join` as it's
what we are setting to FALSE/TRUE here.

Thanks for the patient clarifications.

Arun

On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:

Arun, from my previous email:

"Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
  dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by
= b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a
join in some cases but not others

Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by
(will do cross-apply only when 'i' is a join):
  dt[i, j, each.i = TRUE] <-> dt[i, j]"

Together with the default being each.i=FALSE, you can see that the answer
to your question will be:

DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2,
allow.cartesian=TRUE][, sum(y)], i.e.
[1] 21

and
DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2,
sum(y), allow.cartesian=TRUE], i.e.
   x V1
1: 1  6
2: 2  9
3: 1  6



On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com>wrote:

eddi,

sorry again, I am confused a bit now.

DT1 <- data.table(x=c(1,1,1,2,2), y=1:5))
DT2 <- data.table(x=c(1,2,1))
setkey(DT1, "x")

What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE,
.join = FALSE]` ? c(6,9,6) or 21?


Arun

On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:

 Sorry the proposed result was a wrong paste in the last message:

    # proposed way and the result:
    DT1[DT2, sum(y), .join = FALSE]
    [1] 6 9 6

And the last part that it *should* be a data.table is quite obvious then.

Arun

On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:

 Eduard,

Great. That explains me the difference between `drop` and `.join` here.
Even though I don't *need* this feature (I can't recall the last time when
I use a `data.table` for `i` and had to reduce the function, say, sum).
But, I think it can only better the usage.

However, there's one point *I think* would still disagree with @eddi here,
not sure.

    DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
    DT2 <- data.table(x=c(1,2,1))
    setkey(DT1, "x")

    # proposed way and the result:
    DT1[DT2, sum(y), .join = FALSE]
    [1] 21


So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]`
*should* result in a `data.table` output as follows (it's even more clearer
now that .join is set to TRUE, meaning it's a data.table join):

   x V1
1: 1  6
2: 2  9
3: 1  6

Basically, `.join = TRUE` is the current functionality unchanged and nice
to be default (as Matthew hinted).

Arun

On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:

Arun,

Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y] does
currently.
No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is
literally a 'by' by each of the rows of DT2 that are in the join (thus
each.i! - the operation 'y' will be performed for each of the rows of 'i'
and then combined and returned). There is no efficiency issue here that I
can see, but Matthew can correct me on this. As far as I understand the
efficiency comes into play when e.g. the rows of 'i' are unique, and after
the join you'd like to do a 'by' by those, then DT1[DT2][, j, by =
key(DT1)] would be less efficient since the 'by' could've already been done
while joining.

DT1[DT2, .JOIN=FALSE] would be equivalent to both current and future
DT1[DT2] - in this expression there is no by-without-by happening in either
case.

The purpose of this is NOT for j just being a column or an expression that
gets evaluated into a signal column. It applies to any j. The extra
'by-without-by' column is currently output independently of how many
columns you output in your j-expression, the behavior is very similar as to
when you specify a by=., except that the 'by' happens by a very special
expression, that only exists when joining two data-tables and that
generally doesn't exist before or after the join.

Hope this answers your questions.


On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

 Eduard, thanks for your reply. But somethings are unclear to me still.
I'll try to explain them below.

First I prefer .JOIN (or cross.apply) just because `each.i` seems general
(that it is applicable to *every* i operation, which as of now seems
untrue). .JOIN is specific to data.table type for `i`.

>From what I understand from your reply, if (.JOIN = FALSE), then,

    DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y]

Is this right? It's a bit confusing because I think you're okay with
"by-without-by" and I got the impression from Sadao that he finds the
syntax of "by-without-by" unaccessible/advanced for basic users. So, just
to clarify, here the DT1[DT2, y, .JOIN=FALSE] will still do the
"by-without-by" and then result in a "vector", right?

Matthew explains in the current documentation that DT1[DT2][, y] would
"join" all columns of DT1 and DT2 and then subset. I assume the
implementation underneath is *not* DT1[DT2][, y] rather the result is an
efficient equivalence. Then, that of course seems alright to me.

If what I've told so far is right, then the syntax `DT1[DT2, .JOIN=FALSE]`
doesn't make sense/has no purpose to me. At least I can't think of any at
the moment.

To conclude, IMHO, if the purpose of `.JOIN` is to provide the same as
DT1[i, j] for DT1[DT2, j] (j being a column or an expression that results
in getting evaluated as a scalar for every group in the current
by-without-by syntax), then, I find this is covered in `drop = TRUE/FALSE`.
Correct me if I am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]`
instead of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of
DT1[i, list(x,y)].

If you/anyone believes it's wrong, I'd be all ears to clarify as to what's
the purpose of `drop` then (and also how it *doesn't* suit here as compared
to .JOIN).

Arun

On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:

Arun,

If the new boolean is false, the result would be the same as without it and
would be equal to current behavior of d[i][, j]. If it's true, it will only
have an effect if i is a join (I think each.i= fits slightly better for
this description than .join=) - this will replicate current underlying
behavior. If you think the cross-apply is something that could work not
just for i being a data-table but other things as well, then it would make
perfect sense to implement that action too when the bool is true.

On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan <aragorn168b at gmail.com>
wrote:

(The earlier message was too long and was rejected.)
So, from the discussion so far, I see that Matthew is nice enough to
implement `.JOIN` or `cross.apply`. I've a couple of questions. Suppose,

    DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
    setkey(DT1, "x")
    DT2 <- data.table(x=1)
    DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this. I
expect here the same output as current DT1[DT2, y]

The above syntax seems "okay". But my first question is what is
`.JOIN=FALSE` supposed to do under these two circumstances? Suppose,

    DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
    setkey(DT1, "x")
    DT2 <- data.table(x=c(1,2,1), w=c(11:13))
    # what's the output supposed to be for?
    DT1[DT2, y, .JOIN=FALSE]
    DT1[DT2, .JOIN = FALSE]

Depending on this I'd have to think about `drop = TRUE/FALSE`. Also, how
does it work with `subset`?

    DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored?
 Is this supposed to also do a "cross-apply" on the logical subset? I guess
not. So, .JOIN is an "extra" parameter that comes into play *only* when `i`
is a `data.table`?

I'd love to have some replies to these questions for me to take a stance on
`.JOIN`. Thank you.

Best,
Arun.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130501/36238253/attachment-0001.html>


More information about the datatable-help mailing list