[datatable-help] datatable-help Digest, Vol 28, Issue 2

Thu Jun 7 17:35:53 CEST 2012

Ah. See previous reply.  I tend to get confused by the words 'left' and
'right'. The old SQL syntax *= and =* confused me too. One is left outer
and the other is right outer but which is which and left and right I
always had to do do a test example to check for myself.

When I see X[Y] I just think, well, X subset by Y. I don't think left and
right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears
left of Y and Y left of Z in both cases, but they mean different (and each
useful) things.

In X[Y] since X is being subset using Y, the result is all X's columns
followed by the JIS columns from Y (if any).  I'm also trying to use the
names "x" and "i" to refer generically to X[Y] where x=X and i=Y.

merge() is more useful than I've admited in the past. There's a FR to add
by.x and by.y to that, which'll be the way to join by column name when
needed.

> Matt,
>
> Two things
> 1) I noticed that in Juliet's example, Y's first column is not named at
> all, so defaults to a name of "V1", yet the join carries through. How
> does the underlying implementation work here? I believe this was the
> original question, and I'm curious as well now.
> 2) Having re-read the paragraph you sent, I now realize that X[Y] is
> equivalent to X right join Y. It may be helpful to explicitly write this
> in. I certainly had this flipped (I thought it was X left join Y), and
> the column ordering is flipped from standard SQL.
>
> On 6/7/2012 6:00 AM, datatable-help-request at lists.r-forge.r-project.org
> wrote:
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Wed, 6 Jun 2012 13:47:00 -0400
>> From: Juliet Hannah<juliet.hannah at gmail.com>
>> To: datatable-help at lists.r-forge.r-project.org
>> Subject: [datatable-help] join example from faq
>> Message-ID:
>> 	<CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoXAhXsUhcdCzBw at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> All,
>>
>> I am not understanding a few basic things. I am looking at pg 5 of the
>> faq.
>>
>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>> setkey(X,grp)
>>   Y = data.table(c("b","c"), bar=c(4,2))
>>   X[Y]
>>
>> The faq says X[Y] is a join looking up X's rows using Y.
>>
>> Does this mean data.table looks up X's key using Y?
>>
>> Y has two columns. How does it know to use the first column in this
>> example? Y's key has not been set.
>>
>> Hope my question is not too obvious. :)
>>
>> Thanks,
>>
>> Juliet
>>
>> P.S. Thanks for the wonderful package.  I had to do some aggregations
>> the other day and my other solutions were
>> running for hours, but data.table finished in a couple of minutes!
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Wed, 06 Jun 2012 23:28:08 +0100
>> From: Matthew Dowle<mdowle at mdowle.plus.com>
>> To: Juliet Hannah<juliet.hannah at gmail.com>
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] join example from faq
>> Message-ID:<1339021688.2573.16.camel at netbook>
>> Content-Type: text/plain; charset="UTF-8"
>>
>> Hi,
>>
>> Hopefully this paragraph from ?data.table sheds some light :
>>
>>    "When i is a data.table, x must have a key. i is joined to x using
>> the
>> key and the rows in x that match are returned. An equi-join is performed
>> between each column in i to each column in x's key. The match is a
>> binary search in compiled C in O(log n) time. If i has less columns than
>> x's key then many rows of x may match to each row of i. If i has more
>> columns than x's key, the columns of i not involved in the join are
>> included in the result. If i also has a key, it is i's key columns that
>> are used to match to x's key columns and a binary merge of the two
>> tables is carried out."
>>
>> The critical sentence is "If i also has a key ..."; i.e., i doesn't have
>> to be keyed. Only x must have a key. It's often faster if i is keyed
>> too, though. There have been some speed improvements in 1.8.1, too.
>>
>> In the example you highlighted I think it goes on to show 'join
>> inherited scope', which is this paragraph :
>>
>>    "Advanced: In the X[Y,j] form of grouping, the j expression sees
>> variables in X first, then Y. We call this join inherited scope. If the
>> variable is not in X or Y then the calling frame is searched, its
>> calling frame, and so on in the usual way up to and including the global
>> environment."
>>
>> Encouraging to hear you've reduced hours to minutes. At least someone
>> knows not to use benchmark(...,replications=100) for tasks that take
>> under 0.01 seconds and then conclude data.table is slow!
>>
>> Matthew
>>
>> On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote:
>>> All,
>>>
>>> I am not understanding a few basic things. I am looking at pg 5 of the
>>> faq.
>>>
>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>> setkey(X,grp)
>>>   Y = data.table(c("b","c"), bar=c(4,2))
>>>   X[Y]
>>>
>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>
>>> Does this mean data.table looks up X's key using Y?
>>>
>>> Y has two columns. How does it know to use the first column in this
>>> example? Y's key has not been set.
>>>
>>> Hope my question is not too obvious. :)
>>>
>>> Thanks,
>>>
>>> Juliet
>>>
>>> P.S. Thanks for the wonderful package.  I had to do some aggregations
>>> the other day and my other solutions were
>>> running for hours, but data.table finished in a couple of minutes!
>>
>>
>>
>> ------------------------------
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>> End of datatable-help Digest, Vol 28, Issue 2
>> *********************************************
>