[datatable-help] misc join questions
Johann Hibschman
jhibschman+r at gmail.com
Wed Jun 1 19:40:04 CEST 2011
"Matthew Dowle" <mdowle at mdowle.plus.com> writes:
> 1. See FAQ 2.12
Thanks, that helps. That seems basic enough that I think it should go
somewhere in the documentation for data.table; anyone working with the
extract methods will want to know exactly how that works.
> 2. In practice I haven't experienced this problem but I see the concern. An
> option could be added "checkjoinnames" (or better name) which would issue a
> warning if the columns used in the key had different names. Perhaps it would
> take value 0 (don't check), 1 (warning) and 2 (error). The argument to
> [.data.table would default to getOption("datatable.checkjoinnames"),
> permitting a global setting, or per-query setting as desired. Would that
> work?
That sounds reasonable enough, but I don't think it would help in my
own case, since it's more the exception than the rule that my "x" table
has the correct key. Given that extra features have a cost, I'd say
"no," unless someone else chimes in.
For my own purposes, I would rather have a "nojoin" or "without_join" or
"nokeys" option that would remove the columns used for the join from
y[x,]. Then I could just write "cbind(x, y[J(x$date),nokeys=TRUE])",
which seems like a clear enough way to say what I mean. If we want to
go to crazy magic, "y[withkey(x, date),]" is also pretty.
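For reference, here is a minimal sketch of what that "nokeys" result would look like when built by hand from existing pieces. The "nokeys" argument itself is only a proposal; the toy tables and the names "looked_up" and "res" are made up for illustration.

```r
library(data.table)

# Toy data (illustrative only): x is unkeyed, y is a lookup table keyed on date.
x <- data.table(id = c("a", "b", "c"), date = c(3L, 1L, 2L))
y <- data.table(date = 1:3, val1 = c(10, 20, 30), key = "date")

# Look up each x$date in y explicitly, so the result does not depend on
# whatever key x happens to have. Rows come back in the order of x$date.
looked_up <- y[J(x$date)]

# Drop y's key columns by name rather than by position, then bind to x.
# This is the step the proposed nokeys=TRUE would fold into the join.
res <- cbind(x, looked_up[, setdiff(names(looked_up), key(y)), with = FALSE])
```

Dropping columns by name via key(y) avoids hard-coding "-1" and keeps working if y is keyed on more than one column.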
Alternatively, since I'm used to the merge.data.frame syntax, if we
could expand merge.data.table to allow "merge(x, y, by="date",
all.x=TRUE)" for arbitrary data tables, creating keys as needed, that
would also help. Perhaps add an "autokey.x" option to enable the
creation of needed keys.
As usual, there are a lot of options. My preference would be to add
more smarts to merge, rather than keep adding features to the extract
methods. Extract.data.table already does an awful lot.
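To make the request concrete, this is the call I have in mind. It is a sketch that assumes a data.table release whose merge method accepts by= on tables without keys; that is not true of every version, so treat it as illustrative. The toy tables are made up.

```r
library(data.table)

# Toy data (illustrative only): neither table has a key set.
x <- data.table(date = c(3L, 1L, 4L), id = c("a", "b", "c"))
y <- data.table(date = 1:3, val1 = c(10, 20, 30), val2 = c(0.1, 0.2, 0.3))

# Left join on the named column: every row of x is kept, and dates missing
# from y (here date 4) get NA in val1 and val2.
merged <- merge(x, y, by = "date", all.x = TRUE)
```

Because the join column is named explicitly, the result does not depend on which key, if any, x carries at the time of the call.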
> 3 i) Yes, and FR#1006 is to improve that. Nudges like this help to
> encourage (thanks) :
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1006&group_id=240&atid=978
> 3 ii) Watch out for the 5 general examples on the wiki and fully understand
> them. Most of those differences are due to copying, one way or another :
> http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table
> You might also find that as.data.table() is more efficient than
> data.table(); the former just changes the class. Also, if you know (for
> sure) the table is already sorted, you can set the "sorted" attribute
> directly, which avoids the overhead of the (current) copy in setkey().
Thanks, that helps.
Regards,
Johann
>
>
> "Johann Hibschman" <jhibschman+r at gmail.com> wrote in message
> news:u1o62oxd0f5.fsf at ld-chrate28.citadelgroup.com...
>> I've run into a few questions about joins.
>>
>> 1. In x[i, ], how does data.table decide if "i" is meant to be an
>> expression, or an integer/logical/data.table?
>>
>> In practice, it seems to work fine, but I worry about accidental
>> maskings, like if I do:
>>
>> filter <- x$date > 10
>> x[filter,]
>>
>> What happens if x has a column named "filter"? Which takes precedence?
>>
>>
>> 2. Is there an easy way to join two tables, yet be protected from
>> unexpected keys?
>>
>> For example, I often make a date-value lookup table, like
>>
>> y <- data.table(date=blah, val1=blah, val2=blah, key="date")
>>
>> Then I want to merge in the values with a new table, x, like:
>>
>> (data.frame syntax) merge(x, y, by="date")
>>
>> If x has no key, and I know the first column is date, or x has a key
>> and the first column in the key is date, I can do
>>
>> y[x]
>>
>> However, I worry that I will set a different key on x, while doing some
>> operation elsewhere, in which case y[x] will give nonsense. To be
>> extra-safe, I can do something like
>>
>> cbind(x, y[J(x$date),][, -1, with=FALSE])
>>
>> where the "[, -1, with=FALSE]" is to remove the date column from the
>> join result, so I don't end up with two date columns in my result. I
>> find this very ugly, but I can't find a better way. What would you
>> recommend?
>>
>>
>> 3. Does setting a new key on a table create a copy?
>>
>> If I do,
>>
>> f <- function (x) {
>>   y <- create.lookup.table()
>>   setkey(x, date)
>>   y[x]
>> }
>>
>> will I create a copy of x by setting the key? In general, what
>> operations create copies? Is there anything that operates on
>> references that I have to look out for?
>>
>>
>> Thanks,
>> Johann