[datatable-help] datatable-help Digest, Vol 28, Issue 2

Juliet Hannah juliet.hannah at gmail.com
Thu Jun 7 18:46:54 CEST 2012


Hi Yike,

In Matthew's email, he writes:

So in other words, when i has no key it's the first column of i to the
first column of x's key, the 2nd column of i to the 2nd column of x's key,
etc.

I think this answers my original question. In the example (which is
from the FAQ), i has no key. So it seems
the  *default* is to match the first column of i to x's key.

That's how data.table "knows". It uses the first column as default.

At least that's the way I understand things now.  Matthew, please
correct if I've misunderstood.

Regards,

Juliet

On Thu, Jun 7, 2012 at 12:10 PM, Yike Lu <yikelu.home at gmail.com> wrote:
>
>
> On 6/7/2012 11:35 AM, Matthew Dowle wrote:
>>
>>
>> Ah. See previous reply.
>
> ??? I understand that i doesn't have to be keyed, but how does data.table
> know when the names are not even the same? Sorry, I don't see the answer to
> that in your reply.
>
> By equi-join, do you mean it tests for equality on all possible combinations
> of columns between [i] and [key(X)]? This seems like it could be rather
> dangerous.
>
>
>> I tend to get confused by the words 'left' and
>> 'right'. The old SQL syntax *= and =* confused me too. One is left outer
>> and the other is right outer but which is which and left and right I
>> always had to do do a test example to check for myself.
>
> I can see where you're coming from. The system I came from had no right
> join, although it would have been trivial to define one in terms of the left
> join. So I got used to remembering "left as in the left table is the
> 'larger' or the one without a proper key".
>
>>
>> When I see X[Y] I just think, well, X subset by Y. I don't think left and
>> right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears
>> left of Y and Y left of Z in both cases, but they mean different (and each
>> useful) things.
>
> Yeah, this is where it kind of gets confusing for me. To me, the standard
> use of using the left outer is to leave NULL values in order to indicate
> missingness in the left (unkeyed) table. There are several possibilities
> though. Thinking about it in terms of Y %lj% X <-> X[Y] for a single given
> row and as a toy example, a single key, set in both tables:
> 1) key_value(X) in key_values(Y) - normal case, normal join
> 2) key_value(X) NOT in key_values(Y) - join X gets filled in with nulls in
> the join columns
> 3) key_value(Y) NOT in key_values(X) - the row  in Y gets lost.
>
> So in fact, X and Y can be supersets, subsets, completely disjoint, or
> partially overlapping of each other and there would still be a well defined
> action.
>
> So it seems both our mental models are somewhat wrong - for me, Y isn't
> necessarily the larger table, for you Y isn't necessarily the subset.
>
> It's more accurate to think that Y's rows don't get touched, or that Y is
> the table of interest, to which you are adding additional information.
>
> The reason I like thinking in joins is I prefer infix syntax. More elegant
> to me to say setkey(Y, a, b) %lj% setkey(X, a, b) than setkey(X, a,
> b)[setkey(Y, a, b)], and the chaining versus nested is easy as well...
>
> X[Y][Z] => (X %rj% Y) %rj% Z
> X[Y[Z]] => X %rj% (Y %rj% Z)
>
>>
>>
>> In X[Y] since X is being subset using Y, the result is all X's columns
>> followed by the JIS columns from Y (if any).  I'm also trying to use the
>> names "x" and "i" to refer generically to X[Y] where x=X and i=Y.
>>
>> merge() is more useful than I've admited in the past. There's a FR to add
>> by.x and by.y to that, which'll be the way to join by column name when
>> needed.
>
> I was wondering what the differences/similarities there were to merge(). The
> only place I use it is for the alias outer join: %oj% = function(x, y)
> merge(x, y)
>
>>
>>> Matt,
>>>
>>> Two things
>>> 1) I noticed that in Juliet's example, Y's first column is not named at
>>> all, so defaults to a name of "V1", yet the join carries through. How
>>> does the underlying implementation work here? I believe this was the
>>> original question, and I'm curious as well now.
>>> 2) Having re-read the paragraph you sent, I now realize that X[Y] is
>>> equivalent to X right join Y. It may be helpful to explicitly write this
>>> in. I certainly had this flipped (I thought it was X left join Y), and
>>> the column ordering is flipped from standard SQL.
>>>
>>> On 6/7/2012 6:00 AM, datatable-help-request at lists.r-forge.r-project.org
>>> wrote:
>>>>
>>>>
>>>> ----------------------------------------------------------------------
>>>>
>>>> Message: 1
>>>> Date: Wed, 6 Jun 2012 13:47:00 -0400
>>>> From: Juliet Hannah<juliet.hannah at gmail.com>
>>>> To: datatable-help at lists.r-forge.r-project.org
>>>> Subject: [datatable-help] join example from faq
>>>> Message-ID:
>>>>
>>>>  <CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoXAhXsUhcdCzBw at mail.gmail.com>
>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>
>>>> All,
>>>>
>>>> I am not understanding a few basic things. I am looking at pg 5 of the
>>>> faq.
>>>>
>>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>>> setkey(X,grp)
>>>>   Y = data.table(c("b","c"), bar=c(4,2))
>>>>   X[Y]
>>>>
>>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>>
>>>> Does this mean data.table looks up X's key using Y?
>>>>
>>>> Y has two columns. How does it know to use the first column in this
>>>> example? Y's key has not been set.
>>>>
>>>> Hope my question is not too obvious. :)
>>>>
>>>> Thanks,
>>>>
>>>> Juliet
>>>>
>>>> P.S. Thanks for the wonderful package.  I had to do some aggregations
>>>> the other day and my other solutions were
>>>> running for hours, but data.table finished in a couple of minutes!
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> Message: 2
>>>> Date: Wed, 06 Jun 2012 23:28:08 +0100
>>>> From: Matthew Dowle<mdowle at mdowle.plus.com>
>>>> To: Juliet Hannah<juliet.hannah at gmail.com>
>>>> Cc: datatable-help at lists.r-forge.r-project.org
>>>> Subject: Re: [datatable-help] join example from faq
>>>> Message-ID:<1339021688.2573.16.camel at netbook>
>>>> Content-Type: text/plain; charset="UTF-8"
>>>>
>>>> Hi,
>>>>
>>>> Hopefully this paragraph from ?data.table sheds some light :
>>>>
>>>>    "When i is a data.table, x must have a key. i is joined to x using
>>>> the
>>>> key and the rows in x that match are returned. An equi-join is performed
>>>> between each column in i to each column in x's key. The match is a
>>>> binary search in compiled C in O(log n) time. If i has less columns than
>>>> x's key then many rows of x may match to each row of i. If i has more
>>>> columns than x's key, the columns of i not involved in the join are
>>>> included in the result. If i also has a key, it is i's key columns that
>>>> are used to match to x's key columns and a binary merge of the two
>>>> tables is carried out."
>>>>
>>>> The critical sentence is "If i also has a key ..."; i.e., i doesn't have
>>>> to be keyed. Only x must have a key. It's often faster if i is keyed
>>>> too, though. There have been some speed improvements in 1.8.1, too.
>>>>
>>>> In the example you highlighted I think it goes on to show 'join
>>>> inherited scope', which is this paragraph :
>>>>
>>>>    "Advanced: In the X[Y,j] form of grouping, the j expression sees
>>>> variables in X first, then Y. We call this join inherited scope. If the
>>>> variable is not in X or Y then the calling frame is searched, its
>>>> calling frame, and so on in the usual way up to and including the global
>>>> environment."
>>>>
>>>> Encouraging to hear you've reduced hours to minutes. At least someone
>>>> knows not to use benchmark(...,replications=100) for tasks that take
>>>> under 0.01 seconds and then conclude data.table is slow!
>>>>
>>>> Matthew
>>>>
>>>> On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote:
>>>>>
>>>>> All,
>>>>>
>>>>> I am not understanding a few basic things. I am looking at pg 5 of the
>>>>> faq.
>>>>>
>>>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>>>> setkey(X,grp)
>>>>>   Y = data.table(c("b","c"), bar=c(4,2))
>>>>>   X[Y]
>>>>>
>>>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>>>
>>>>> Does this mean data.table looks up X's key using Y?
>>>>>
>>>>> Y has two columns. How does it know to use the first column in this
>>>>> example? Y's key has not been set.
>>>>>
>>>>> Hope my question is not too obvious. :)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Juliet
>>>>>
>>>>> P.S. Thanks for the wonderful package.  I had to do some aggregations
>>>>> the other day and my other solutions were
>>>>> running for hours, but data.table finished in a couple of minutes!
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>
>>>> End of datatable-help Digest, Vol 28, Issue 2
>>>> *********************************************
>>>
>>>
>>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


More information about the datatable-help mailing list