[datatable-help] changing data.table by-without-by syntax to require a "by"

Matthew Dowle mdowle at mdowle.plus.com
Fri Apr 26 13:14:02 CEST 2013


 

I didn't get any feedback off list on this one. But I'm coming round to
the idea.

What about by=.JOIN (is that what you were thinking .J stood for?) Other
possibilities: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN. Just to
brainstorm it.

by=.JOIN could be added anyway with no backwards compatibility issues,
so that those who wish to be explicit now can be.

I'm also coming round to changing the default for X[Y, j]. It might help
in a few related areas, e.g. X[Y][,j] (which isn't great right now,
agreed). We have successfully made non-backwards-compatible changes in
the past by introducing a global option which we slowly migrate to. If
datatable.bywithoutby was added it could take values
TRUE|"warning"|FALSE from day one, with default TRUE. That allows those
who wish for explicit by to migrate straight away by changing the option
to FALSE. Existing users could set it to "warning" to see how many
implicit by-without-by calls they have. Those calls can gradually be
changed to by=.JOIN, and in that way both implicit and explicit work at
the same time, for say a year, with full backwards compatibility by
default. This approach allows a slow and flexible migration path on a
per feature basis. Then the default could be changed to "warning" before
finally FALSE. Depending on how it goes, the option could be left there
to allow TRUE if anyone wanted it, or removed (maybe after two years).
This is similar to the removal of J() outside DT[...], i.e. users can
still very easily write J=data.table in their .Rprofile if they wish,
for backwards compatibility.
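
To make the three-valued option concrete, here is a minimal sketch in
plain R of how such a setting might be consulted. The option name
datatable.bywithoutby is the one proposed above; the helper name
check_bywithoutby and the warning text are made up for illustration, and
none of this is existing data.table code.

```r
# Sketch only: 'datatable.bywithoutby' is the option proposed in this message,
# not an option data.table defines. 'check_bywithoutby' is a hypothetical helper.
# It returns TRUE when a join should implicitly group by each row of i.
check_bywithoutby <- function(by_missing) {
  # An explicit by= was supplied, so there is nothing implicit to decide.
  if (!by_missing) return(FALSE)
  mode <- getOption("datatable.bywithoutby", TRUE)   # proposed default: TRUE
  if (identical(mode, "warning")) {
    # Migration mode: keep the old behaviour but flag the implicit call.
    warning("implicit by-without-by used; consider by=.JOIN", call. = FALSE)
    return(TRUE)
  }
  if (identical(mode, TRUE)) return(TRUE)
  if (identical(mode, FALSE)) return(FALSE)
  stop("datatable.bywithoutby must be TRUE, \"warning\" or FALSE")
}

options(datatable.bywithoutby = TRUE)
check_bywithoutby(by_missing = TRUE)    # implicit grouping stays on

options(datatable.bywithoutby = FALSE)
check_bywithoutby(by_missing = TRUE)    # implicit grouping switched off
```

Setting the option to "warning" keeps the implicit behaviour but raises a
warning on each implicit call, which is the "see how many you have" step
of the migration described above.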

Or ... instead of:

 X[Y, j, by=.JOIN]

what about:

 X[by=Y, j]

Matthew 

On 25.04.2013 16:32, Matthew Dowle wrote:

> I'd appreciate some input from others whether they agree or not. If
> you have a view perhaps let me know off list, or on list, whichever
> you prefer.
>
> Thanks,
>
> Matthew
>
> On 25.04.2013 13:45, Eduard Antonyan wrote:
> 
>> Well, so can .I or .N or .GRP or .BY, yet those are used as special
>> names, which is exactly why I suggested .J.
>> The problem with using 'missingness' is that it already means
>> something very different when i is not a join/cross: it means *don't*
>> do a by, thus introducing the whole case analysis one has to go
>> through in their head every time, as in the OP (which of course
>> becomes automatic after a while, but it's a cost nonetheless, and a
>> particularly high one for new people). So I see absence of 'by' as an
>> already taken and used signal, and thus something else has to be used
>> for the new signal of cross apply (it doesn't have to be the specific
>> option I mentioned above). This is exactly why I find optionally
>> turning off this behavior unsatisfactory, and I don't see that as a
>> solution to this at all.
>> I think in the x+y context the appropriate analog is: what if that
>> added x and y normally, but when x and y were data.frames it did
>> element by element multiplication instead? Yes, that's possible to
>> do, and possible to document, but it's not a good idea, because it
>> takes the place of adding them element by element. The recycling
>> behavior doesn't do that. What it does is say that it doesn't really
>> make sense to add them as is, but we can do that after recycling, so
>> let's recycle. It doesn't take the place of another existing way of
>> adding vectors.
>>
>> On Apr 25, 2013, at 4:28 AM, Matthew Dowle <mdowle at mdowle.plus.com [15]> wrote:
>> 
>>> I see what you're getting at. But .J may be a column name, which is
>>> the current meaning of by = single symbol. And why .J? If not .J, or
>>> any single symbol what else instead? A character value such as
>>> by="irows" is taken to mean the "irows" column currently (for
>>> consistency with by="colA,colB,colC"). But some signal needs to be
>>> passed to by=, then (you're suggesting), to trigger the cross apply
>>> by each i row. Currently, that signal is missingness (which I like,
>>> rely on, and use with join inherited scope).
>>>
>>> As I wrote in the S.O. thread, I'm happy to make it optional (i.e. an
>>> option to turn off by-without-by), since there is no downside. But
>>> you've continued to argue for a change to the default, iiuc.
>>>
>>> Maybe it helps to consider :
>>>
>>> x+y
>>>
>>> Fundamentally in R this depends on what x and y are. Most of us
>>> probably assume (as a first thought) that x and y are vectors and
>>> know that this will apply "+" elementwise, recycling y if necessary.
>>> In R we like and write code like this all the time. I think of
>>> X[Y, j] in the same way: j is the operation (like +) which is applied
>>> for each row of Y. If you need j for the entire set that Y joins to,
>>> then like a FAQ says, make j missing too and it's X[Y][,j]. But
>>> providing a way to make X[Y,j] do the same as X[Y][,j] would be nice
>>> and is on the list: drop=TRUE would do that (as someone mentioned on
>>> the S.O. thread). So maybe the new option would be datatable.drop
>>> (but with default FALSE not TRUE). If you wanted to turn off
>>> by-without-by you might set options(datatable.drop=TRUE). Then you
>>> can use data.table how you prefer (explicit by) and I can use it how
>>> I prefer.
>>>
>>> I'm happy to add the argument to [.data.table, and make its default
>>> changeable via a global option in the usual way.
>>>
>>> Matthew
>>>
>>> On 25.04.2013 05:16, Eduard Antonyan wrote:
>>> 
>>>> That's really interesting, I can't currently think of another way
>>>> of doing that, as after X[Y] is done the necessary information is
>>>> lost.
>>>> To retain that functionality and achieve better readability, as in
>>>> the OP, I think something along the lines of
>>>> X[Y, head(.SD, i.top), by=.J] would be a good replacement for the
>>>> current syntax.
>>>>
>>>> On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <eduard.antonyan at gmail.com [14]> wrote:
>>>>
>>>>> that's an interesting example - I didn't realize current behavior
>>>>> would do that, I'm not at a PC anymore but I'll definitely think
>>>>> about it and report back, as it's not immediately obvious to me
>>>>>
>>>>> On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <mdowle at mdowle.plus.com [13]> wrote:
>>>>> 
>>>>>> i. prefix is just a robust way to reference join inherited
>>>>>> columns: the 'top' column in the i table. Like table aliases in
>>>>>> SQL.
>>>>>>
>>>>>> What about this? :
>>>>>> 1> X = data.table(a=1:3,b=1:15, key="a")
>>>>>> 1> X
>>>>>>     a  b
>>>>>>  1: 1  1
>>>>>>  2: 1  4
>>>>>>  3: 1  7
>>>>>>  4: 1 10
>>>>>>  5: 1 13
>>>>>>  6: 2  2
>>>>>>  7: 2  5
>>>>>>  8: 2  8
>>>>>>  9: 2 11
>>>>>> 10: 2 14
>>>>>> 11: 3  3
>>>>>> 12: 3  6
>>>>>> 13: 3  9
>>>>>> 14: 3 12
>>>>>> 15: 3 15
>>>>>>
>>>>>> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2))
>>>>>> 1> Y
>>>>>>    a top
>>>>>> 1: 1   3
>>>>>> 2: 2   4
>>>>>> 3: 1   2
>>>>>> 1> X[Y, head(.SD,i.top)]
>>>>>>    a  b
>>>>>> 1: 1  1
>>>>>> 2: 1  4
>>>>>> 3: 1  7
>>>>>> 4: 2  2
>>>>>> 5: 2  5
>>>>>> 6: 2  8
>>>>>> 7: 2 11
>>>>>> 8: 1  1
>>>>>> 9: 1  4
>>>>>> 1>
>>>>>>
>>>>>> On 24.04.2013 23:43, Eduard Antonyan wrote:
>>>>>> 
>>>>>>> I assumed they meant create a table :)
>>>>>>> that looks cool, what's i.top ? I can get a result very similar
>>>>>>> to yours by writing:
>>>>>>> X[Y][, head(.SD, top[1]), by = a]
>>>>>>> and I probably would want the following to produce your result
>>>>>>> (this might depend a little on what exactly i.top is):
>>>>>>> X[Y, head(.SD, i.top), by = a]
>>>>>>>
>>>>>>> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle <mdowle at mdowle.plus.com [12]> wrote:
>>>>>>> 
>>>>>>>> That sentence on that linked webpage seems incorrect English,
>>>>>>>> since table is a noun not a verb. Should "table" be "join"
>>>>>>>> perhaps?
>>>>>>>>
>>>>>>>> Anyway, by-without-by is often used with join inherited scope
>>>>>>>> (JIS). For example, translating their example :
>>>>>>>>
>>>>>>>> 1> X = data.table(a=1:3,b=1:15, key="a")
>>>>>>>> 1> X
>>>>>>>>     a  b
>>>>>>>>  1: 1  1
>>>>>>>>  2: 1  4
>>>>>>>>  3: 1  7
>>>>>>>>  4: 1 10
>>>>>>>>  5: 1 13
>>>>>>>>  6: 2  2
>>>>>>>>  7: 2  5
>>>>>>>>  8: 2  8
>>>>>>>>  9: 2 11
>>>>>>>> 10: 2 14
>>>>>>>> 11: 3  3
>>>>>>>> 12: 3  6
>>>>>>>> 13: 3  9
>>>>>>>> 14: 3 12
>>>>>>>> 15: 3 15
>>>>>>>> 1> Y = data.table(a=c(1,2), top=c(3,4))
>>>>>>>> 1> Y
>>>>>>>>    a top
>>>>>>>> 1: 1   3
>>>>>>>> 2: 2   4
>>>>>>>> 1> X[Y, head(.SD,i.top)]
>>>>>>>>    a  b
>>>>>>>> 1: 1  1
>>>>>>>> 2: 1  4
>>>>>>>> 3: 1  7
>>>>>>>> 4: 2  2
>>>>>>>> 5: 2  5
>>>>>>>> 6: 2  8
>>>>>>>> 7: 2 11
>>>>>>>> 1>
>>>>>>>>
>>>>>>>> If there was no by-without-by (analogous to CROSS BY), then how
>>>>>>>> would that be done?
>>>>>>>>
>>>>>>>> On 24.04.2013 22:22, Eduard Antonyan wrote:

>>>>>>>> 
>>>>>>>>> By that you mean current behavior? You'd get current behavior
>>>>>>>>> by explicitly specifying the appropriate "by" (i.e. "by" equal
>>>>>>>>> to the key).
>>>>>>>>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using
>>>>>>>>> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/ [9],
>>>>>>>>> and I can't figure out how by-without-by (or with by-with-by
>>>>>>>>> for that matter:) ) helps with e.g. the first example there:
>>>>>>>>> "We table table1 and table2. table1 has a column called
>>>>>>>>> rowcount.
>>>>>>>>>
>>>>>>>>> For each row from table1 we need to select first rowcount rows
>>>>>>>>> from table2, ordered by table2.id [10]"
>>>>>>>>>
>>>>>>>>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle <mdowle at mdowle.plus.com [11]> wrote:
>>>>>>>>> 
>>>>>>>>>> But then what would be analogous to CROSS APPLY in SQL?
>>>>>>>>>>
>>>>>>>>>> > I'd agree with Eduard, although it's probably too late to change behavior
>>>>>>>>>> > now. Maybe for data.table.2? Eduard's proposal seems more closely
>>>>>>>>>> > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if
>>>>>>>>>> > requested).
>>>>>>>>>> >
>>>>>>>>>> > S.
>>>>>>>>>> >
>>>>>>>>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700
>>>>>>>>>> >> From: eduard.antonyan at gmail.com [1]
>>>>>>>>>> >> To: datatable-help at lists.r-forge.r-project.org [2]
>>>>>>>>>> >> Subject: Re: [datatable-help] changing data.table by-without-by
>>>>>>>>>> >> syntax to require a "by"
>>>>>>>>>> >>
>>>>>>>>>> >> I think you're missing the point Michael. Just because it's possible to
>>>>>>>>>> >> do it the way it's done now, doesn't mean that's the best way, as I've
>>>>>>>>>> >> tried to argue in the OP. I don't think you've addressed the issue of
>>>>>>>>>> >> unnecessary complexity pointed out in OP.
>>>>>>>>>> >>
>>>>>>>>>> >> --
>>>>>>>>>> >> View this message in context:
>>>>>>>>>> >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html [3]
>>>>>>>>>> >> Sent from the datatable-help mailing list archive at Nabble.com [4].
>>>>>>>>>> >> _______________________________________________
>>>>>>>>>> >> datatable-help mailing list
>>>>>>>>>> >> datatable-help at lists.r-forge.r-project.org [5]
>>>>>>>>>> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6]
>>>>>>>>>> > _______________________________________________
>>>>>>>>>> > datatable-help mailing list
>>>>>>>>>> > datatable-help at lists.r-forge.r-project.org [7]
>>>>>>>>>> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [8]

 

Links:
------
[1] mailto:eduard.antonyan at gmail.com
[2] mailto:datatable-help at lists.r-forge.r-project.org
[3] http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html
[4] http://Nabble.com
[5] mailto:datatable-help at lists.r-forge.r-project.org
[6] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[7] mailto:datatable-help at lists.r-forge.r-project.org
[8] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[9] http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/
[10] http://table2.id
[11] mailto:mdowle at mdowle.plus.com
[12] mailto:mdowle at mdowle.plus.com
[13] mailto:mdowle at mdowle.plus.com
[14] mailto:eduard.antonyan at gmail.com
[15] mailto:mdowle at mdowle.plus.com