[datatable-help] misc join questions

Johann Hibschman jhibschman+r at gmail.com
Wed Jun 1 19:40:04 CEST 2011


"Matthew Dowle" <mdowle at mdowle.plus.com> writes:

> 1. See FAQ 2.12

Thanks, that helps.  That seems basic enough that I think it should go
somewhere in the documentation for data.table; anyone working with the
extract methods will want to know exactly how that works.

> 2. In practice I haven't experienced this problem but I see the concern. An 
> option could be added "checkjoinnames" (or better name) which would issue a 
> warning if the columns used in the key had different names. Perhaps it would 
> take value 0 (don't check), 1 (warning) and 2 (error). The argument to 
> [.data.table would default to getOption("datatable.checkjoinnames"), 
> permitting a global setting, or per-query setting as desired.  Would that 
> work?

That sounds reasonable enough, but I don't think that would help my
own case, since I think it's more the exception than the rule that my
"x" table would have the correct key.  Given that extra features have a
cost, I'd say "no," unless someone else chimes in.

For my own purposes, I would rather have a "nojoin" or "without_join" or
"nokeys" option that would remove the columns used for the join from
y[x,].  Then I could just write "cbind(x, y[J(x$date),nokeys=TRUE])",
which seems like a clear enough way to say what I mean.  If we want to
go to crazy magic, "y[withkey(x, date),]" is also pretty.
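
To make the "nokeys" idea concrete, here is a rough sketch of the
behaviour written as an ordinary function.  The name lookup_nokeys and
the toy tables are made up for illustration; nothing like this exists in
data.table, and the sketch assumes y is keyed on exactly the one lookup
column.

```r
library(data.table)

# Hypothetical helper, not a data.table function: look up x[[col]] in the
# keyed table y and return only y's non-key columns, in x's row order.
lookup_nokeys <- function(y, x, col) {
  stopifnot(identical(key(y), col))    # y must be keyed on the lookup column
  joined <- y[J(x[[col]])]             # join on y's key, one row per row of x
  joined[, setdiff(names(joined), col), with = FALSE]  # drop the join column
}

# Toy data: x is keyed on something other than date, y is the date lookup.
x <- data.table(id = c("a", "b", "c"), date = c(3L, 1L, 2L), key = "id")
y <- data.table(date = 1:3, val1 = c(10, 20, 30), key = "date")

res <- cbind(x, lookup_nokeys(y, x, "date"))
```

Because the lookup goes through x$date explicitly, whatever key x
happens to have does not affect the result.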

Alternatively, since I'm used to the merge.data.frame syntax, if we
could expand merge.data.table to allow "merge(x, y, by="date",
all.x=TRUE)" for arbitrary data tables, creating keys as needed, that
would also help.  Perhaps add an "autokey.x" option to enable the
creation of needed keys.
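
As a sketch of what I mean by creating keys as needed, something like
the wrapper below would do.  The name merge_autokey is hypothetical, and
this assumes a merge.data.table that accepts by= and all.x=; it keys
copies of both tables, so the caller's own keys are left alone at the
cost of the extra copies.

```r
library(data.table)

# Hypothetical wrapper, not part of data.table: key private copies of both
# tables on the join column(s), then merge them.
merge_autokey <- function(x, y, by, all.x = FALSE) {
  x <- copy(as.data.table(x))   # copy so the caller's key is not changed
  y <- copy(as.data.table(y))
  setkeyv(x, by)
  setkeyv(y, by)
  merge(x, y, by = by, all.x = all.x)
}

# Toy data: x carries an unrelated key and a date with no match in y.
x <- data.table(id = c("a", "b", "c"), date = c(3L, 1L, 4L), key = "id")
y <- data.table(date = 1:3, val1 = c(10, 20, 30), key = "date")

res <- merge_autokey(x, y, by = "date", all.x = TRUE)
```

With all.x=TRUE the unmatched date in x is kept, with NA for val1, and x
still has its original key afterwards.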

As usual, there are a lot of options.  My preference would be to add
more smarts to merge, rather than keep adding features to the extract
methods.  Extract.data.table already does an awful lot.

> 3 i) Yes, and FR#1006 is to improve that.  Nudges like this help to 
> encourage (thanks) :
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1006&group_id=240&atid=978
> 3 ii) Watch out for the 5 general examples on the wiki and fully understand 
> them.  Most of those differences are due to copying, one way or another :
> http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table
> You might also find that as.data.table() is more efficient than 
> data.table(); the former just changes the class. Also, if you know (for 
> sure) the table is already sorted,  you can set the "sorted" attribute 
> directly, which avoids the overhead of the (current) copy in setkey().

Thanks, that helps.

Regards,
Johann

>
>
> "Johann Hibschman" <jhibschman+r at gmail.com> wrote in message 
> news:u1o62oxd0f5.fsf at ld-chrate28.citadelgroup.com...
>> I've run into a few questions about joins.
>>
>> 1. In x[i, ], how does data.table decide if "i" is meant to be an
>>   expression, or an integer/logical/data.table?
>>
>> In practice, it seems to work fine, but I worry about accidental
>> maskings, like if I do:
>>
>>  filter <- x$date > 10
>>  x[filter,]
>>
>> What happens if x has a column named "filter"?  Which takes precedence?
>>
>>
>> 2. Is there an easy way to join two tables, yet be protected from
>>   unexpected keys?
>>
>> For example, I often make a date-value lookup table, like
>>
>>  y <- data.table(date=blah, val1=blah, val2=blah, key="date")
>>
>> Then I want to merge in the values with a new table, x, like:
>>
>>  (data.frame syntax) merge(x, y, by="date")
>>
>> If x has no key, and I know the first column is date, or x has a key
>> and the first column in the key is date, I can do
>>
>>  y[x]
>>
>> However, I worry that I will set a different key on x, while doing some
>> operation elsewhere, in which case y[x] will give nonsense.  To be
>> extra-safe, I can do something like
>>
>>  cbind(x, y[J(x$date),][, -1, with=FALSE])
>>
>> where the "[, -1, with=FALSE]" is to remove the date column from the
>> join result, so I don't end up with two date columns in my result.  I
>> find this very ugly, but I can't find a better way.  What would you
>> recommend?
>>
>>
>> 3. Does setting a new key on a table create a copy?
>>
>> If I do,
>>
>>  f <- function (x) {
>>    y <- create.lookup.table()
>>    setkey(x, date)
>>    y[x]
>>  }
>>
>> will I create a copy of x by setting the key?  In general, what
>> operations create copies?  Is there anything that operates on
>> references that I have to look out for?
>>
>>
>> Thanks,
>> Johann 


