[datatable-help] changing data.table by-without-by syntax to require a "by"

Eduard Antonyan eduard.antonyan at gmail.com
Thu Apr 25 14:45:45 CEST 2013


Well, so can .I or .N or .GRP or .BY, yet those are used as special names,
which is exactly why I suggested .J.

The problem with using 'missingness' is that it already means something very
different when i is not a join/cross: it means *don't* do a by, thus
introducing the whole case analysis one has to go through in their head every
time, as in the OP (which of course becomes automatic after a while, but it's a
cost nonetheless, and a particularly high one for new people). So I see
absence of 'by' as an already taken and used signal and thus something else
has to be used for the new signal of cross apply (it doesn't have to be the
specific option I mentioned above). This is exactly why I find optional
turning off of this behavior unsatisfactory, and I don't see that as a
solution to this at all.

I think in the x+y context the appropriate analog is - what if that added x
and y normally, but when x and y were data.frames it did element by element
multiplication instead? Yes, that's possible to do, and possible to
document, but it's not a good idea, because it takes the place of adding them
element by element. The recycling behavior doesn't do that - what that does
is it says it doesn't really make sense to add them as is, but we can do
that after recycling, so let's recycle. It doesn't take the place of
another existing way of adding vectors.
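
To make the recycling case concrete, here is a minimal base R sketch (the
vector values are made up for illustration and do not come from the thread):

```r
# Illustrative values only; not taken from the discussion.
x <- c(1, 2, 3, 4)
y <- c(10, 20)

# y is shorter than x, so R recycles it to c(10, 20, 10, 20) before adding.
z <- x + y
print(z)
```

Recycling only kicks in when the lengths differ, so it leaves the ordinary
equal-length addition untouched.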

On Apr 25, 2013, at 4:28 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:



I see what you're getting at. But .J may be a column name, which is
the current meaning of by = single symbol. And why .J?  If not .J, or
any single symbol what else instead?  A character value such as
by="irows" is taken to mean the "irows" column currently (for
consistency with by="colA,colB,colC").  But some signal needs to be
passed to by=, then (you're suggesting), to trigger the cross apply by
each i row.  Currently, that signal is missingness  (which I like,
rely on, and use with join inherited scope).

As I wrote in the S.O. thread,  I'm happy to make it optional (i.e. an
option to turn off by-without-by), since there is no downside.   But
you've continued to argue for a change to the default, iiuc.

Maybe it helps to consider :

     x+y

Fundamentally in R this depends on what x and y are.  Most of us
probably assume (as a first thought) that x and y are vectors and know
that this will apply "+" elementwise,  recycling y if necessary.  In R
we like and write code like this all the time.   I think of X[Y, j] in
the same way: j is the operation (like +) which is applied for each
row of Y.   If you need j for the entire set that Y joins to,  then
like a FAQ says,  make j missing too and it's X[Y][,j]. But providing
a way to make X[Y,j] do the same as X[Y][,j] would be nice and is on
the list:  drop=TRUE would do that (as someone mentioned on the S.O.
thread).  So maybe the new option would be datatable.drop (but with
default FALSE not TRUE).  If you wanted to turn off by-without-by you
might set options(datatable.drop=TRUE). Then you can use data.table
how you prefer (explicit by) and I can use it how I prefer.
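
A rough sketch of the two evaluation modes being contrasted, for readers who
want to run it. It assumes a data.table release that has by=.EACHI (1.9.4 or
later) and on= (1.9.6 or later), both of which postdate this thread; in those
releases the per-row evaluation has to be requested explicitly. The column
name `total` is just an illustrative choice.

```r
library(data.table)

# Same shape as the X and Y used elsewhere in this thread.
X <- data.table(a = rep(1:3, 5), b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L), top = c(3L, 4L))

# j evaluated once for each row of Y (what this thread calls by-without-by).
# Assumption: spelled with by = .EACHI, as in later data.table releases.
per_row <- X[Y, on = "a", .(total = sum(b)), by = .EACHI]

# j evaluated once over the whole set of rows that Y joins to.
whole <- X[Y, on = "a"][, .(total = sum(b))]

print(per_row)
print(whole)
```

The first form returns one row per row of Y; the second collapses the joined
rows into a single result.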



I'm happy to add the argument to [.data.table,  and make its default
changeable via a global option in the usual way.

Matthew



On 25.04.2013 05:16, Eduard Antonyan wrote:

That's really interesting. I can't currently think of another way of doing
that, as after X[Y] is done the necessary information is lost.
To retain that functionality and achieve better readability, as in the OP, I
think something along the lines of X[Y, head(.SD, i.top), by=.J] would be a
good replacement for the current syntax.

On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <eduard.antonyan at gmail.com>
wrote:

 that's an interesting example - I didn't realize current behavior would do
that, I'm not at a PC anymore but I'll definitely think about it and report
back, as it's not immediately obvious to me


On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
>
> i. prefix is just a robust way to reference join inherited columns:   the
> 'top' column in the i table.   Like table aliases in SQL.
>
> What about this? :
> 1> X = data.table(a=1:3,b=1:15, key="a")
> 1> X
>     a  b
>  1: 1  1
>  2: 1  4
>  3: 1  7
>  4: 1 10
>  5: 1 13
>  6: 2  2
>  7: 2  5
>  8: 2  8
>  9: 2 11
> 10: 2 14
> 11: 3  3
> 12: 3  6
> 13: 3  9
> 14: 3 12
> 15: 3 15
>
> 1> Y = data.table(a=c(1,2,1), top=c(3,4,2))
> 1> Y
>    a top
> 1: 1   3
> 2: 2   4
> 3: 1   2
> 1> X[Y, head(.SD,i.top)]
>    a  b
> 1: 1  1
> 2: 1  4
> 3: 1  7
> 4: 2  2
> 5: 2  5
> 6: 2  8
> 7: 2 11
> 8: 1  1
> 9: 1  4
> 1>
>
>
>
> On 24.04.2013 23:43, Eduard Antonyan wrote:
>
> I assumed they meant create a table :)
> that looks cool, what's i.top ? I can get a result very similar to yours
> by writing:
> X[Y][, head(.SD, top[1]), by = a]
> and I probably would want the following to produce your result (this might
> depend a little on what exactly i.top is):
> X[Y, head(.SD, i.top), by = a]
>
>
> On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>> That sentence on that linked webpage seems to be incorrect English, since
>> table is a noun not a verb.  Should "table" be "join" perhaps?
>>
>> Anyway, by-without-by is often used with join inherited scope (JIS).  For
>> example, translating their example :
>>
>> 1> X = data.table(a=1:3,b=1:15, key="a")
>> 1> X
>>     a  b
>>  1: 1  1
>>  2: 1  4
>>  3: 1  7
>>  4: 1 10
>>  5: 1 13
>>  6: 2  2
>>  7: 2  5
>>  8: 2  8
>>  9: 2 11
>> 10: 2 14
>> 11: 3  3
>> 12: 3  6
>> 13: 3  9
>> 14: 3 12
>> 15: 3 15
>> 1> Y = data.table(a=c(1,2), top=c(3,4))
>> 1> Y
>>    a top
>> 1: 1   3
>> 2: 2   4
>> 1> X[Y, head(.SD,i.top)]
>>    a  b
>> 1: 1  1
>> 2: 1  4
>> 3: 1  7
>> 4: 2  2
>> 5: 2  5
>> 6: 2  8
>> 7: 2 11
>> 1>
>>
>>
>>
>> If there was no by-without-by (analogous to CROSS BY),  then how would that be done?
>>
>>
>>
>> On 24.04.2013 22:22, Eduard Antonyan wrote:
>>
>> By that you mean current behavior? You'd get current behavior by
>> explicitly specifying the appropriate "by" (i.e. "by" equal to the key).
>> Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using
>> http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I
>> can't figure out how by-without-by (or with by-with-by for that matter:) )
>> helps with e.g. the first example there:
>> "We table table1 and table2. table1 has a column called rowcount.
>>
>> For each row from table1 we need to select first rowcount rows from
>> table2, ordered by table2.id"
>>
>>
>>
>>
>> On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>> But then what would be analogous to CROSS APPLY in SQL?
>>>
>>> > I'd agree with Eduard, although it's probably too late to change
>>> > behavior now.  Maybe for data.table.2?  Eduard's proposal seems more
>>> > closely aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but
>>> > only if requested).
>>> >
>>> > S.
>>> >
>>> >> Date: Mon, 22 Apr 2013 08:17:59 -0700
>>> >> From: eduard.antonyan at gmail.com
>>> >> To: datatable-help at lists.r-forge.r-project.org
>>> >> Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"
>>> >>
>>> >> I think you're missing the point Michael. Just because it's possible to
>>> >> do it the way it's done now, doesn't mean that's the best way, as I've
>>> >> tried to argue in the OP. I don't think you've addressed the issue of
>>> >> unnecessary complexity pointed out in OP.
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >>
>>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html
>>> >> Sent from the datatable-help mailing list archive at Nabble.com.
>>> >> _______________________________________________
>>> >> datatable-help mailing list
>>> >> datatable-help at lists.r-forge.r-project.org
>>> >>
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>> >
>>>
>>>
>>>
>>
>>
>
>
>