[datatable-help] datatable-help Digest, Vol 28, Issue 2
Matthew Dowle
mdowle at mdowle.plus.com
Thu Jun 7 19:07:35 CEST 2012
Yes, that's it.
So, when I see DT["foo"], I immediately know two things: i) that DT is
keyed and ii) that the first column of DT's key is type 'character' (or
'factor'). I'll get back all the rows for "foo" because mult is "all" by
default. If I want a particular row of the "foo" group it's just
DT[J("foo",20120331L),roll=TRUE] using the fact that DT's key is id then
date.
It's similar to v["foo"] in base R. You immediately know that v is a
vector with names set in advance. You "join" foo to the names 'key' of
that vector and return the item that matches. The only difference is that
DT can have more than 1 column in its key, and the types of each column
key can be different.
Most of the queries I do are like that. They join to the key column in
order. I find it a pain in SQL to continually have huge 'where' or 'ON'
clauses like "where a.id = b.id and a.category = b.category and a.date =
b.date", over and over again repeating the same column names. Makes my
eyes water.
> Hi Yike,
>
> In Matthew's email, he writes:
>
> So in other words, when i has no key it's the first column of i to the
> first column of x's key, the 2nd column of i to the 2nd column of x's key,
> etc.
>
> I think this answers my original question. In the example (which is
> from the FAQ), i has no key. So it seems
> the *default* is to match the first column of i to x's key.
>
> That's how data.table "knows". It uses the first column as default.
>
> At least that's the way I understand things now. Matthew, please
> correct if I've misunderstood.
>
> Regards,
>
> Juliet
>
> On Thu, Jun 7, 2012 at 12:10 PM, Yike Lu <yikelu.home at gmail.com> wrote:
>>
>>
>> On 6/7/2012 11:35 AM, Matthew Dowle wrote:
>>>
>>>
>>> Ah. See previous reply.
>>
>> ??? I understand that i doesn't have to be keyed, but how does
>> data.table
>> know when the names are not even the same? Sorry, I don't see the answer
>> to
>> that in your reply.
>>
>> By equi-join, do you mean it tests for equality on all possible
>> combinations
>> of columns between [i] and [key(X)]? This seems like it could be rather
>> dangerous.
>>
>>
>>> I tend to get confused by the words 'left' and
>>> 'right'. The old SQL syntax *= and =* confused me too. One is left
>>> outer
>>> and the other is right outer but which is which and left and right I
>>> always had to do do a test example to check for myself.
>>
>> I can see where you're coming from. The system I came from had no right
>> join, although it would have been trivial to define one in terms of the
>> left
>> join. So I got used to remembering "left as in the left table is the
>> 'larger' or the one without a proper key".
>>
>>>
>>> When I see X[Y] I just think, well, X subset by Y. I don't think left
>>> and
>>> right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears
>>> left of Y and Y left of Z in both cases, but they mean different (and
>>> each
>>> useful) things.
>>
>> Yeah, this is where it kind of gets confusing for me. To me, the
>> standard
>> use of using the left outer is to leave NULL values in order to indicate
>> missingness in the left (unkeyed) table. There are several possibilities
>> though. Thinking about it in terms of Y %lj% X <-> X[Y] for a single
>> given
>> row and as a toy example, a single key, set in both tables:
>> 1) key_value(X) in key_values(Y) - normal case, normal join
>> 2) key_value(X) NOT in key_values(Y) - join X gets filled in with nulls
>> in
>> the join columns
>> 3) key_value(Y) NOT in key_values(X) - the row in Y gets lost.
>>
>> So in fact, X and Y can be supersets, subsets, completely disjoint, or
>> partially overlapping of each other and there would still be a well
>> defined
>> action.
>>
>> So it seems both our mental models are somewhat wrong - for me, Y isn't
>> necessarily the larger table, for you Y isn't necessarily the subset.
>>
>> It's more accurate to think that Y's rows don't get touched, or that Y
>> is
>> the table of interest, to which you are adding additional information.
>>
>> The reason I like thinking in joins is I prefer infix syntax. More
>> elegant
>> to me to say setkey(Y, a, b) %lj% setkey(X, a, b) than setkey(X, a,
>> b)[setkey(Y, a, b)], and the chaining versus nested is easy as well...
>>
>> X[Y][Z] => (X %rj% Y) %rj% Z
>> X[Y[Z]] => X %rj% (Y %rj% Z)
>>
>>>
>>>
>>> In X[Y] since X is being subset using Y, the result is all X's columns
>>> followed by the JIS columns from Y (if any). I'm also trying to use
>>> the
>>> names "x" and "i" to refer generically to X[Y] where x=X and i=Y.
>>>
>>> merge() is more useful than I've admited in the past. There's a FR to
>>> add
>>> by.x and by.y to that, which'll be the way to join by column name when
>>> needed.
>>
>> I was wondering what the differences/similarities there were to merge().
>> The
>> only place I use it is for the alias outer join: %oj% = function(x, y)
>> merge(x, y)
>>
>>>
>>>> Matt,
>>>>
>>>> Two things
>>>> 1) I noticed that in Juliet's example, Y's first column is not named
>>>> at
>>>> all, so defaults to a name of "V1", yet the join carries through. How
>>>> does the underlying implementation work here? I believe this was the
>>>> original question, and I'm curious as well now.
>>>> 2) Having re-read the paragraph you sent, I now realize that X[Y] is
>>>> equivalent to X right join Y. It may be helpful to explicitly write
>>>> this
>>>> in. I certainly had this flipped (I thought it was X left join Y), and
>>>> the column ordering is flipped from standard SQL.
>>>>
>>>> On 6/7/2012 6:00 AM,
>>>> datatable-help-request at lists.r-forge.r-project.org
>>>> wrote:
>>>>>
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> Message: 1
>>>>> Date: Wed, 6 Jun 2012 13:47:00 -0400
>>>>> From: Juliet Hannah<juliet.hannah at gmail.com>
>>>>> To: datatable-help at lists.r-forge.r-project.org
>>>>> Subject: [datatable-help] join example from faq
>>>>> Message-ID:
>>>>>
>>>>> <CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoXAhXsUhcdCzBw at mail.gmail.com>
>>>>> Content-Type: text/plain; charset=ISO-8859-1
>>>>>
>>>>> All,
>>>>>
>>>>> I am not understanding a few basic things. I am looking at pg 5 of
>>>>> the
>>>>> faq.
>>>>>
>>>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>>>> setkey(X,grp)
>>>>> Y = data.table(c("b","c"), bar=c(4,2))
>>>>> X[Y]
>>>>>
>>>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>>>
>>>>> Does this mean data.table looks up X's key using Y?
>>>>>
>>>>> Y has two columns. How does it know to use the first column in this
>>>>> example? Y's key has not been set.
>>>>>
>>>>> Hope my question is not too obvious. :)
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Juliet
>>>>>
>>>>> P.S. Thanks for the wonderful package. I had to do some aggregations
>>>>> the other day and my other solutions were
>>>>> running for hours, but data.table finished in a couple of minutes!
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> Message: 2
>>>>> Date: Wed, 06 Jun 2012 23:28:08 +0100
>>>>> From: Matthew Dowle<mdowle at mdowle.plus.com>
>>>>> To: Juliet Hannah<juliet.hannah at gmail.com>
>>>>> Cc: datatable-help at lists.r-forge.r-project.org
>>>>> Subject: Re: [datatable-help] join example from faq
>>>>> Message-ID:<1339021688.2573.16.camel at netbook>
>>>>> Content-Type: text/plain; charset="UTF-8"
>>>>>
>>>>> Hi,
>>>>>
>>>>> Hopefully this paragraph from ?data.table sheds some light :
>>>>>
>>>>> "When i is a data.table, x must have a key. i is joined to x using
>>>>> the
>>>>> key and the rows in x that match are returned. An equi-join is
>>>>> performed
>>>>> between each column in i to each column in x's key. The match is a
>>>>> binary search in compiled C in O(log n) time. If i has less columns
>>>>> than
>>>>> x's key then many rows of x may match to each row of i. If i has more
>>>>> columns than x's key, the columns of i not involved in the join are
>>>>> included in the result. If i also has a key, it is i's key columns
>>>>> that
>>>>> are used to match to x's key columns and a binary merge of the two
>>>>> tables is carried out."
>>>>>
>>>>> The critical sentence is "If i also has a key ..."; i.e., i doesn't
>>>>> have
>>>>> to be keyed. Only x must have a key. It's often faster if i is keyed
>>>>> too, though. There have been some speed improvements in 1.8.1, too.
>>>>>
>>>>> In the example you highlighted I think it goes on to show 'join
>>>>> inherited scope', which is this paragraph :
>>>>>
>>>>> "Advanced: In the X[Y,j] form of grouping, the j expression sees
>>>>> variables in X first, then Y. We call this join inherited scope. If
>>>>> the
>>>>> variable is not in X or Y then the calling frame is searched, its
>>>>> calling frame, and so on in the usual way up to and including the
>>>>> global
>>>>> environment."
>>>>>
>>>>> Encouraging to hear you've reduced hours to minutes. At least someone
>>>>> knows not to use benchmark(...,replications=100) for tasks that take
>>>>> under 0.01 seconds and then conclude data.table is slow!
>>>>>
>>>>> Matthew
>>>>>
>>>>> On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote:
>>>>>>
>>>>>> All,
>>>>>>
>>>>>> I am not understanding a few basic things. I am looking at pg 5 of
>>>>>> the
>>>>>> faq.
>>>>>>
>>>>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
>>>>>> setkey(X,grp)
>>>>>> Y = data.table(c("b","c"), bar=c(4,2))
>>>>>> X[Y]
>>>>>>
>>>>>> The faq says X[Y] is a join looking up X's rows using Y.
>>>>>>
>>>>>> Does this mean data.table looks up X's key using Y?
>>>>>>
>>>>>> Y has two columns. How does it know to use the first column in this
>>>>>> example? Y's key has not been set.
>>>>>>
>>>>>> Hope my question is not too obvious. :)
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Juliet
>>>>>>
>>>>>> P.S. Thanks for the wonderful package. I had to do some
>>>>>> aggregations
>>>>>> the other day and my other solutions were
>>>>>> running for hours, but data.table finished in a couple of minutes!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>> End of datatable-help Digest, Vol 28, Issue 2
>>>>> *********************************************
>>>>
>>>>
>>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
More information about the datatable-help
mailing list