<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Suppose I have a data.table with a unique identifier and value and another data.table with a cross-reference to that identifier and lots of other measurement data. E.g. suppose my lookup table is "locs" and my measurement data is "obsv" as follows:<div><div><br></div><div><font class="Apple-style-span" face="'Courier New'">obsv=data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7), key='id')</font></div><div><font class="Apple-style-span" face="'Courier New'">locs=data.table(loc=c(30,20,10),name=c("foo","bar","baz"), other=letters[1:3], key='loc')</font></div></div><div><font class="Apple-style-span" face="'Courier New'"><br></font></div><div>I simply want to add the 'name' column from locs to the obsv table using :=. But this quickly becomes really complicated because (1) the keys for the two data.tables differ (appropriately), (2) the key for locs is an integer, and (3) the return columns of a join always include the matching columns.</div><div><br></div><div>First of all, the gotcha is that locs[obsv[,loc]] doesn't work. This is because obsv[,loc] returns a numeric column, which is treated as indexing the row numbers. Surprise! </div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> locs[obsv[,loc]] </font></div><div><font class="Apple-style-span" face="'Courier New'"> loc name other</font></div><div><font class="Apple-style-span" face="'Courier New'">[1,] NA <NA> <NA></font></div><div><font class="Apple-style-span" face="'Courier New'">[2,] NA <NA> <NA></font></div><div><font class="Apple-style-span" face="'Courier New'">[3,] NA <NA> <NA></font></div><div><font class="Apple-style-span" face="'Courier New'">[4,] NA <NA> <NA></font></div><div><font class="Apple-style-span" face="'Courier New'">[5,] NA <NA> <NA></font></div><div><font class="Apple-style-span" face="'Courier New'">[6,] NA <NA> <NA></font></div><div><font class="Apple-style-span" face="'Courier New'">[7,] NA <NA> <NA></font></div></div><div><br></div><div>This unexpected and silent behavior could easily cause very bad results. (When I first did this test I used 1,2 and 3 instead of 10,20 and 30 and everything seemed to work!) I think this inconsistency should be addressed. For example, consider modifying joins so that they only happen when the i argument is a data.table or a list. If it is a character, then it should fail. Part of the problem here is the inconsistency that A[,col1] returns a vector of characters, but A[,list(col1,col2)] returns a data.table. If instead, data.tables were always returned unless, say, a simplify=TRUE argument was provided, then we'd be in better shape because locs[obsv[,loc]] would always be a join and locs[obsv[,loc,simplify=TRUE]] would be a row retrieval as for data.frame.</div><div><br></div><div>Anyway, the solution to the above seems to be to create a list object for i:</div><div><br></div><div><font class="Apple-style-span" face="'Courier New'"><div>> locs[list(obsv[,loc])]</div><div>Error in `[.data.table`(locs, list(obsv[, loc])) : </div><div> typeof x.loc (integer) != typeof i.V1 (double)</div></font></div><div><br></div><div>but that doesn't work because obsv$loc is class numeric and locs$loc is class integer. This is because locs$loc is silently changed to integer when the key is set. So, to perform a lookup we need to coerce to integer as follows:</div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> locs[list(as.integer(obsv[,loc]))]</font></div><div><font class="Apple-style-span" face="'Courier New'"> loc name other</font></div><div><font class="Apple-style-span" face="'Courier New'">[1,] 1 baz c</font></div><div><font class="Apple-style-span" face="'Courier New'">[2,] 2 bar b</font></div><div><font class="Apple-style-span" face="'Courier New'">[3,] 1 baz c</font></div><div><font class="Apple-style-span" face="'Courier New'">[4,] 1 baz c</font></div><div><font class="Apple-style-span" face="'Courier New'">[5,] 3 foo a</font></div><div><font class="Apple-style-span" face="'Courier New'">[6,] 1 baz c</font></div><div><font class="Apple-style-span" face="'Courier New'">[7,] 2 bar b</font></div></div><div><br></div><div>But if I want a right-hand-side value for assignment, then I need just a single column. However, there is an inconsistency: a join adds the extra join column names even if I specify only the column I want.</div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> locs[list(as.integer(obsv[,loc])),name]</font></div><div><font class="Apple-style-span" face="'Courier New'"> loc name</font></div><div><font class="Apple-style-span" face="'Courier New'">[1,] 1 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[2,] 2 bar</font></div><div><font class="Apple-style-span" face="'Courier New'">[3,] 1 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[4,] 1 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[5,] 3 foo</font></div><div><font class="Apple-style-span" face="'Courier New'">[6,] 1 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[7,] 2 bar</font></div></div><div><br></div><div>In my opinion, the j column should always return only the columns that the user requested. If the user wants additional columns (possibly from the join) then they can be requested explicitly. Anyway, the workaround for this is to make a chain statement:</div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> locs[list(as.integer(obsv[,loc]))][,name]</font></div><div><font class="Apple-style-span" face="'Courier New'">[1] "baz" "bar" "baz" "baz" "foo" "baz" "bar"</font></div></div><div><font class="Apple-style-span" face="'Courier New'"><br></font></div><div>That looks good! Now to assign using := I write</div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> obsv[,locname:=locs[list(as.integer(obsv[,loc]))][,name]]</font></div><div><font class="Apple-style-span" face="'Courier New'"> id loc mvar locname</font></div><div><font class="Apple-style-span" face="'Courier New'">[1,] 1 1 -1.8721465 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[2,] 2 2 -0.5341284 bar</font></div><div><font class="Apple-style-span" face="'Courier New'">[3,] 3 1 -2.2243473 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[4,] 4 1 -0.3093871 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[5,] 5 3 -0.7879228 foo</font></div><div><font class="Apple-style-span" face="'Courier New'">[6,] 6 1 -0.5672113 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[7,] 7 2 1.0390285 bar</font></div></div><div><br></div><div>I would argue that this is surprisingly obtuse and not intuitive. Things would get even more complicated if locs had a 2-column key! </div><div><br></div><div>I suppose one reply is that I should just temporarily set the key for obsv and then reassign the entire obsv data.table. I.e.,</div><div><font class="Apple-style-span" face="'Courier New'"><br></font></div><div><div><font class="Apple-style-span" face="'Courier New'">> setkey(obsv,loc)</font></div><div><font class="Apple-style-span" face="'Courier New'">> obsv=locs[obsv]</font></div><div><font class="Apple-style-span" face="'Courier New'">> setkey(obsv,id)</font></div></div><div><font class="Apple-style-span" face="'Courier New'"><br></font></div><div>This works, but is somehow to my eyes particularly dissatisfying. Keys must be reset twice. Potentially large datasets must be reassigned in their entirety. Another solution that performs in-place assignment is similar:</div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> setkey(obsv,loc)</font></div><div><font class="Apple-style-span" face="'Courier New'">> obsv[,locname:=locs[obsv][,name]]</font></div><div><font class="Apple-style-span" face="'Courier New'"> id loc mvar locname</font></div><div><font class="Apple-style-span" face="'Courier New'">[1,] 1 1 -0.6648842 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[2,] 3 1 -0.4477593 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[3,] 4 1 -1.1300506 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[4,] 6 1 -0.3041305 baz</font></div><div><font class="Apple-style-span" face="'Courier New'">[5,] 2 2 -0.8239177 bar</font></div><div><font class="Apple-style-span" face="'Courier New'">[6,] 7 2 -0.3416380 bar</font></div><div><font class="Apple-style-span" face="'Courier New'">[7,] 5 3 1.2745693 foo</font></div><div><font class="Apple-style-span" face="'Courier New'">> setkey(obsv,id)</font></div></div><div><br></div><div>This is not so bad, but it would be a lot nicer to not have to set keys and to simply say:</div><div><br></div><div><div><font class="Apple-style-span" face="'Courier New'">> obsv[,locname := locs[obsv,name]]</font></div></div><div><font class="Apple-style-span" face="'Courier New'"><br></font></div><div>This could be achieved if (1) joins were performed by matching commonly named columns (like an SQL natural join) if the the two tables did not share keys of the same cardinality and types and (2) only explicitly listed columns were returned. In my opinion, this idea of "natural joins" based on column names would simplify joins a lot, while making them more generally useful and intuitive. If column names differed, then you might specify a list instead, e.g.</div><div><br></div><div><font class="Apple-style-span" face="'Courier New'">> A[list(id=B$a_id), val]</font></div><div><br></div><div>or maybe specify the mapping as an optional parameter that could be used if A and B did not have common columns and if A and B's keys differed, e.g.</div><div><font class="Apple-style-span" face="'Courier New'"><br></font></div><div><font class="Apple-style-span" face="'Courier New'">> A[B, val, map=c("id=a_id")]</font></div><div><br></div><div>If joins matched by name, then the implementation could check if the key was sufficiently satisfied to be used and otherwise it would just perform a more conventional non-key'd join.</div><div><br></div><div><br></div></body></html>