[datatable-help] Problem with FAQ 2.8
Gabor Grothendieck
ggrothendieck at gmail.com
Sat Jun 8 05:24:23 CEST 2013
1. Good point about having non-equal keys when roll=TRUE. Note that
in that case we have the following where the column labelled x1 in the
output is wrong. Either it should be labelled y1 (since 10 comes from
y1, not from x1) or else it should have the value 1 (since x1 is 1 in
the input). If both keys were output with their original column names
of disambiguating names in the case that they are the same then there
would no longer be a problem.
> X <- data.table(x1 = 1, x2 = 2, key = "x1")
> Y <- data.table(y1 = 10, y2 = 3)
> X[Y]
x1 x2 y2
1: 10 NA 3
> X[Y,,roll=TRUE]
x1 x2 y2
1: 10 2 3
2. The issues in this thread seem pretty fundamental to data.table and
in my opinion resolving them deserves a high priority.
On Fri, Jun 7, 2013 at 8:32 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Hi,
>
> Agreed. A change in the software to match the FAQ makes sense. That FAQ has
> non-join columns in mind, for which it is true I believe, but yes it should
> be true for join columns as well.
>
> The other consideration is rolling joins. With roll=TRUE it is natural to
> want to know the staleness of the data joined to. The column names usually
> match, say 'date', but the data is different. That was the primary
> motivation for i. and x. prefixes:
>
> X[Y, list(price, daysold = i.date-x.date), roll=TRUE]
>
> i. prefix is already available, but I don't think I did x. yet. Anyway, the
> 'date' in X 'should' be higher in scope, in compliance with FAQ 2.8, so that
> this should be the same (although less clear to read since it relies on the
> reader knowing FAQ 2.8) :
>
> X[Y, list(price, daysold = i.date-date), roll=TRUE]
>
> That's less useful now that roll takes a limit, although you still might
> want to know the staleness of data returned within the limit.
>
> I've added a link to this thread to FR 2693 to be addressed. Thanks all.
>
> Matthew
>
>
>
> On 07.06.2013 05:50, Gabor Grothendieck wrote:
>>
>> One correction to my post. merge() does not include both key columns
>> in its output; however, that may be less germane because unlike
>> data.table and SQL one cannot give merge an expression that refers to
>> them
>>
>>> merge(as.data.frame(d1), as.data.frame(d2), by = 1)
>>
>> id1 val val2
>> 1 1 1 11
>> 2 2 2 12
>> 3 2 3 12
>>
>> The situation with SQLite is as described in my post where both the
>> id1 and id2 columns are output:
>>
>>> library(sqldf)
>>> sqldf("select * from d1 join d2 on d1.id1 = d2.id2")
>>
>> id1 val id2 val2
>> 1 1 1 1 11
>> 2 2 2 2 12
>> 3 2 3 2 12
>>
>> and one could refer to them as id1 and id2 if they are distinct names
>> or as d1.id1 and d2.id2 in the select.
>>
>> One other possibility for data.table would be to change X[Y] so that
>> in the case of keys with different names both columns appear as in the
>> SQL example. This would presumably also ensure that both could be
>> referenced in X[Y, j]. however, if the names are the same then there
>> would be no need to output them both and it would be ok to output them
>> as a single comonly named column.
>>
>> On Fri, Jun 7, 2013 at 12:34 AM, Gabor Grothendieck
>> <ggrothendieck at gmail.com> wrote:
>>>
>>> On Thu, Jun 6, 2013 at 11:50 PM, Michael Nelson
>>> <michael.nelson at sydney.edu.au> wrote:
>>>>
>>>> This is related to
>>>>
>>>> FR 2693
>>>>
>>>>
>>>>
>>>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978
>>>>
>>>> What is happening is that the `join` columns must be referenced using
>>>> their names as defined in `i` (or Y in X[Y] syntax)
>>>>
>>>> The FAQ doesn't explicitly cover how you are supposed to reference the
>>>> columns used in the join.
>>>>
>>>> Perhaps some binding magic could be used to ensure that either column
>>>> name could be used. I don't think it is useful want both to be defined and
>>>> available as separate objects - -that would mean there were two copies of
>>>> something that are identical in value (but not name!)
>>>>
>>>
>>> Note that the FAQ says that the X variables are "in scope".and the
>>> ordinary meaning of being in scope is that such a variable can be
>>> referenced in an unqualified manner so I think it does imply that
>>> these variables can be accessed.
>>>
>>> I assume from your response that the answer to my question is that the
>>> FAQ is wrong and the behavior is as intended.
>>>
>>> If that is the case then it would be desirable that the behavior of
>>> the software be changed to make the FAQ correct. Having tiny little
>>> exceptions like this that are difficult to remember and error prone
>>> just makes the software harder to use. Another possibility would be
>>> to outlaw having keys in X and Y which have different names (although
>>> that would be drastic and inconvenient though safer and easier to
>>> learn then the current situation).
>>>
>>> For example, continuing the code in my post here is a second example
>>> consider what would happen if this were to occur:
>>>
>>>> id1 <- 1
>>>> d1[d2, sum(id1 * val)]
>>>
>>> id1 V1
>>> 1: 1 1
>>> 2: 2 5
>>> 3: 4 NA
>>>
>>> It would be difficult to realize without close examination that there
>>> is an error in this code (assuming that the writer intended id1 to be
>>> taken from d1). Here d1$id1 is not in scope (contrary to the FAQ) and
>>> so id1 in the caller is used resulting in wrong output (relative to
>>> the result intended).
>>>
>>> Here is another oddity. It seems that in the first case we cannot
>>> access id1 but if we do a join and then access the columns in a
>>> separate [] then we can.
>>>
>>> if (exists("id1")) rm(id1)
>>>>
>>>> d1[d2, id1]
>>>
>>> Error in `[.data.table`(d1, d2, id1) : object 'id1' not found
>>>>
>>>> d1[d2][, id1]
>>>
>>> [1] 1 2 2 4
>>>
>>> Note that in R's merge one can refer to both keys and in SQL when one
>>> does a join one can as well so the behavior we have been discussing
>>> here seems entirely unexpected.
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
More information about the datatable-help
mailing list