[datatable-help] datatable-help Digest, Vol 28, Issue 2

Yike Lu yikelu.home at gmail.com
Thu Jun 7 18:10:54 CEST 2012


On 6/7/2012 11:35 AM, Matthew Dowle wrote:
>
> Ah. See previous reply.
??? I understand that i doesn't have to be keyed, but how does 
data.table know when the names are not even the same? Sorry, I don't see 
the answer to that in your reply.

By equi-join, do you mean it tests for equality on all possible 
combinations of columns between [i] and [key(X)]? This seems like it 
could be rather dangerous.

> I tend to get confused by the words 'left' and
> 'right'. The old SQL syntax *= and =* confused me too. One is left outer
> and the other is right outer but which is which and left and right I
> always had to do do a test example to check for myself.
I can see where you're coming from. The system I came from had no right 
join, although it would have been trivial to define one in terms of the 
left join. So I got used to remembering "left as in the left table is 
the 'larger' or the one without a proper key".
>
> When I see X[Y] I just think, well, X subset by Y. I don't think left and
> right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears
> left of Y and Y left of Z in both cases, but they mean different (and each
> useful) things.
Yeah, this is where it kind of gets confusing for me. To me, the 
standard use of using the left outer is to leave NULL values in order to 
indicate missingness in the left (unkeyed) table. There are several 
possibilities though. Thinking about it in terms of Y %lj% X <-> X[Y] 
for a single given row and as a toy example, a single key, set in both 
tables:
1) key_value(X) in key_values(Y) - normal case, normal join
2) key_value(X) NOT in key_values(Y) - join X gets filled in with nulls 
in the join columns
3) key_value(Y) NOT in key_values(X) - the row  in Y gets lost.

So in fact, X and Y can be supersets, subsets, completely disjoint, or 
partially overlapping of each other and there would still be a well 
defined action.

So it seems both our mental models are somewhat wrong - for me, Y isn't 
necessarily the larger table, for you Y isn't necessarily the subset.

It's more accurate to think that Y's rows don't get touched, or that Y 
is the table of interest, to which you are adding additional information.

The reason I like thinking in joins is I prefer infix syntax. More 
elegant to me to say setkey(Y, a, b) %lj% setkey(X, a, b) than setkey(X, 
a, b)[setkey(Y, a, b)], and the chaining versus nested is easy as well...

X[Y][Z] => (X %rj% Y) %rj% Z
X[Y[Z]] => X %rj% (Y %rj% Z)
>
>
> In X[Y] since X is being subset using Y, the result is all X's columns
> followed by the JIS columns from Y (if any).  I'm also trying to use the
> names "x" and "i" to refer generically to X[Y] where x=X and i=Y.
>
> merge() is more useful than I've admited in the past. There's a FR to add
> by.x and by.y to that, which'll be the way to join by column name when
> needed.
I was wondering what the differences/similarities there were to merge(). 
The only place I use it is for the alias outer join: %oj% = function(x, 
y) merge(x, y)
>
>> Matt,
>>
>> Two things
>> 1) I noticed that in Juliet's example, Y's first column is not named at
>> all, so defaults to a name of "V1", yet the join carries through. How
>> does the underlying implementation work here? I believe this was the
>> original question, and I'm curious as well now.
>> 2) Having re-read the paragraph you sent, I now realize that X[Y] is
>> equivalent to X right join Y. It may be helpful to explicitly write this
>> in. I certainly had this flipped (I thought it was X left join Y), and
>> the column ordering is flipped from standard SQL.
>>
>> On 6/7/2012 6:00 AM, datatable-help-request at lists.r-forge.r-project.org
>> wrote:
>>>
>>> ----------------------------------------------------------------------
>>>
>>> Message: 1
>>> Date: Wed, 6 Jun 2012 13:47:00 -0400
>>> From: Juliet Hannah<juliet.hannah at gmail.com>
>>> To: datatable-help at lists.r-forge.r-project.org
>>> Subject: [datatable-help] join example from faq
>>> Message-ID:
>>> 	<CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoXAhXsUhcdCzBw at mail.gmail.com>
>>> Content-Type: text/plain; charset=ISO-8859-1
>>>
>>> All,
>>>
>>> I am not understanding a few basic things. I am looking at pg 5 of the
>>> faq.
>>>
>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>> setkey(X,grp)
>>>    Y = data.table(c("b","c"), bar=c(4,2))
>>>    X[Y]
>>>
>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>
>>> Does this mean data.table looks up X's key using Y?
>>>
>>> Y has two columns. How does it know to use the first column in this
>>> example? Y's key has not been set.
>>>
>>> Hope my question is not too obvious. :)
>>>
>>> Thanks,
>>>
>>> Juliet
>>>
>>> P.S. Thanks for the wonderful package.  I had to do some aggregations
>>> the other day and my other solutions were
>>> running for hours, but data.table finished in a couple of minutes!
>>>
>>>
>>> ------------------------------
>>>
>>> Message: 2
>>> Date: Wed, 06 Jun 2012 23:28:08 +0100
>>> From: Matthew Dowle<mdowle at mdowle.plus.com>
>>> To: Juliet Hannah<juliet.hannah at gmail.com>
>>> Cc: datatable-help at lists.r-forge.r-project.org
>>> Subject: Re: [datatable-help] join example from faq
>>> Message-ID:<1339021688.2573.16.camel at netbook>
>>> Content-Type: text/plain; charset="UTF-8"
>>>
>>> Hi,
>>>
>>> Hopefully this paragraph from ?data.table sheds some light :
>>>
>>>     "When i is a data.table, x must have a key. i is joined to x using
>>> the
>>> key and the rows in x that match are returned. An equi-join is performed
>>> between each column in i to each column in x's key. The match is a
>>> binary search in compiled C in O(log n) time. If i has less columns than
>>> x's key then many rows of x may match to each row of i. If i has more
>>> columns than x's key, the columns of i not involved in the join are
>>> included in the result. If i also has a key, it is i's key columns that
>>> are used to match to x's key columns and a binary merge of the two
>>> tables is carried out."
>>>
>>> The critical sentence is "If i also has a key ..."; i.e., i doesn't have
>>> to be keyed. Only x must have a key. It's often faster if i is keyed
>>> too, though. There have been some speed improvements in 1.8.1, too.
>>>
>>> In the example you highlighted I think it goes on to show 'join
>>> inherited scope', which is this paragraph :
>>>
>>>     "Advanced: In the X[Y,j] form of grouping, the j expression sees
>>> variables in X first, then Y. We call this join inherited scope. If the
>>> variable is not in X or Y then the calling frame is searched, its
>>> calling frame, and so on in the usual way up to and including the global
>>> environment."
>>>
>>> Encouraging to hear you've reduced hours to minutes. At least someone
>>> knows not to use benchmark(...,replications=100) for tasks that take
>>> under 0.01 seconds and then conclude data.table is slow!
>>>
>>> Matthew
>>>
>>> On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote:
>>>> All,
>>>>
>>>> I am not understanding a few basic things. I am looking at pg 5 of the
>>>> faq.
>>>>
>>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>>> setkey(X,grp)
>>>>    Y = data.table(c("b","c"), bar=c(4,2))
>>>>    X[Y]
>>>>
>>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>>
>>>> Does this mean data.table looks up X's key using Y?
>>>>
>>>> Y has two columns. How does it know to use the first column in this
>>>> example? Y's key has not been set.
>>>>
>>>> Hope my question is not too obvious. :)
>>>>
>>>> Thanks,
>>>>
>>>> Juliet
>>>>
>>>> P.S. Thanks for the wonderful package.  I had to do some aggregations
>>>> the other day and my other solutions were
>>>> running for hours, but data.table finished in a couple of minutes!
>>>
>>>
>>>
>>> ------------------------------
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>>> End of datatable-help Digest, Vol 28, Issue 2
>>> *********************************************
>>
>


More information about the datatable-help mailing list