[datatable-help] changing data.table by-without-by syntax to require a "by"

Wed May 1 17:43:21 CEST 2013

Sure, here's a recap. The most succinct way of putting it: the meaning of
d[i, j, by = b] is complicated and unintuitive right now because of hidden
by's in some cases, and that statement can be made much more readable by
making by-without-by explicit. The longer version follows.

First let's go over what is done currently, and in particular what exactly
by-without-by is. The following example, adapted from Matthew's examples,
illustrates the current behavior:

> X = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = "a")
> Y = data.table(a = c(1,2,1), key = "a")
> X[Y]
   a b
1: 1 1
2: 1 2
3: 1 1
4: 1 2
5: 2 3
6: 2 4
> X[Y, sum(b)]
   a V1
1: 1  3
2: 1  3
3: 2  7

What's happening here is that j = sum(b) is evaluated once for each row of
Y (or rather for each 'a' in Y), as if there were a 'by' over the rows of Y.
Had Y had unique 'a' values only, this would have been equivalent to doing
'by = a' after the merge, but there is a difference when Y$a has
duplicates: above, a = 1 appears twice in Y and so yields two rows of 3,
whereas 'by = a' after the merge would yield a single row of 6.
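To make that difference concrete, here is a minimal sketch in base R that
mimics the two groupings by hand. It does not use data.table at all, so it
only emulates the semantics described above and says nothing about how
data.table computes them. The names Xd, Yd, each_row and after_merge are
made up for the illustration.

```r
# Plain data frames standing in for X and Y above (Y already sorted by 'a').
Xd <- data.frame(a = c(1, 1, 2, 2, 3, 3), b = 1:6)
Yd <- data.frame(a = c(1, 1, 2))

# by-without-by: evaluate sum(b) once per row of Y.
each_row <- sapply(Yd$a, function(k) sum(Xd$b[Xd$a == k]))

# Merge first, then group by 'a': the duplicated key collapses into one group.
joined <- do.call(rbind, lapply(Yd$a, function(k) Xd[Xd$a == k, ]))
after_merge <- tapply(joined$b, joined$a, sum)

print(each_row)     # 3 3 7
print(after_merge)  # a = 1 gives 6, a = 2 gives 7
```

The first result has one entry per row of Y; the second has one entry per
distinct 'a', which is why the two only agree when Y$a is unique.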

This is interesting behavior that can be used in a variety of situations.
It also has a performance angle: if Y$a *is* unique and you'd like to do
'by = a' after the merge, it's computationally cheaper to do the 'by'
*during* the merge rather than after. However, it interferes with the
naturally established meaning of d[i, j], where for any other kind of i
this would simply evaluate 'j', without doing an extra hidden 'by'.

The proposal is thus to do the above special 'by' only when explicitly
asked to, e.g. by adding a new boolean argument 'each.i' that defaults to
FALSE and has to be set to TRUE to get it. This would make the syntax much
more readable and user-friendly, would eliminate a few FAQ points, and
would also allow a new kind of action that afaik is not possible with the
current syntax.

Here are some correspondences (new syntax on the left, old syntax on the
right):

Take 'dt' and apply 'i' (where 'i' is anything, including a join):
  dt[i] <-> dt[i]

Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
  dt[i, j, by = b] <-> dt[i][, j, by = b] in general; also
  dt[i, j, by = b] if 'i' is not a join, and dt[i, j, by = b] for some
  joins but not others

Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by
(will do cross-apply only when 'i' is a join):
  dt[i, j, each.i = TRUE] <-> dt[i, j]

Take 'dt' and apply 'i', return j over *both* the cross-apply/by-without-by
(for 'i' being a join only) and another specified 'by'; think of this as
doing by = list(b, rows of Y):
  dt[i, j, by = b, each.i = TRUE] <-> afaik there is no direct
  correspondence in current behavior
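
For what it's worth, the proposed semantics can be emulated by hand in base
R. This is only a sketch: 'each.i' is the proposed argument and does not
exist, the extra grouping column 'g' is hypothetical (the X above has no
second column worth grouping on), and the helper names are made up.

```r
# Stand-ins for X and Y; 'g' is a hypothetical extra grouping column.
Xd <- data.frame(a = c(1, 1, 2, 2, 3, 3), b = 1:6,
                 g = c("u", "v", "u", "u", "v", "v"))
Yd <- data.frame(a = c(1, 1, 2))

# Join by hand, remembering which row of Y produced each result row.
joined <- do.call(rbind, lapply(seq_len(nrow(Yd)), function(r) {
  cbind(y_row = r, Xd[Xd$a == Yd$a[r], ])
}))

# Proposed X[Y, sum(b)] with each.i = FALSE: plain j over the joined rows.
plain <- sum(joined$b)

# Proposed X[Y, sum(b), each.i = TRUE]: one group per row of Y.
per_y_row <- aggregate(b ~ y_row, data = joined, FUN = sum)

# Proposed X[Y, sum(b), by = g, each.i = TRUE]: group by (row of Y, g).
per_y_row_and_g <- aggregate(b ~ y_row + g, data = joined, FUN = sum)

print(plain)
print(per_y_row)
print(per_y_row_and_g)
```

The last call is the case with no direct counterpart in the current syntax:
the groups are pairs of (row of Y, value of g), so the duplicated a = 1 in Y
still produces separate groups.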

On Tuesday, April 30, 2013, Ricardo Saporta wrote:

> Eddi,
>
> Perhaps you could summarize succinctly, now after a good bit of
> discussion, what your proposed change is.
>
> -Rick
>
>
> On Tue, Apr 30, 2013 at 7:10 PM, statquant3 <statquant at outlook.com> wrote:
>
>> Hi, I read the 30 posts and I have to confess that I still do not
>> understand the point of the changes...
>> Could anyone kindly write an example of the current behaviour and what the
>> new option will bring to the table?
>> Sorry...