<div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

<br>

Suppose I have a data.table with a unique identifier and value and another data.table with a cross-reference to that identifier and lots of other measurement data.  E.g. suppose my lookup table is "locs" and my measurement data is "obsv" as follows:<br>


<br>

obsv=data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7), key='id')<br>

locs=data.table(loc=c(30,20,10),name=c("foo","bar","baz"), other=letters[1:3], key='loc')<br>

<br>

I simply want to add the 'name' column from locs to the obsv table using :=.  But this quickly becomes really complicated because (1) the keys for the two data.tables differ (appropriately), (2) the key for locs is an integer, and (3) the return columns of a join always include the matching columns.<br>


<br>

First of all, the gotcha is that locs[obsv[,loc]] doesn't work.  This is because obsv[,loc] returns a numeric column, which is treated as indexing the row numbers.  Surprise!<br>

<br>

> locs[obsv[,loc]]<br>

     loc name other<br>

[1,]  NA <NA>  <NA><br>

[2,]  NA <NA>  <NA><br>

[3,]  NA <NA>  <NA><br>

[4,]  NA <NA>  <NA><br>

[5,]  NA <NA>  <NA><br>

[6,]  NA <NA>  <NA><br>

[7,]  NA <NA>  <NA><br></blockquote><div><br></div><div>Actually the standard way to do this would be</div><div><br></div><div>> setkey(locs, loc)[setkey(obsv, loc)]</div><div><br></div><div>It would look nicer as 3 separate lines, but you catch my drift. This explicitly tells data.table which keys to join on.</div>
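<div><br></div><div>Spelled out as separate lines, the join above might look roughly like the sketch below. It reuses the obsv and locs tables from the quoted message; the name "joined" is just a placeholder I made up for illustration, and the last step (re-keying on id) is my assumption about what you would want afterwards.</div><div><br></div>

```r
library(data.table)

# Same toy tables as in the quoted message.
obsv = data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7), key='id')
locs = data.table(loc=c(30,20,10), name=c("foo","bar","baz"), other=letters[1:3], key='loc')

setkey(locs, loc)      # lookup table keyed on the join column
setkey(obsv, loc)      # re-key the measurements on the same column
joined = locs[obsv]    # "joined" is a hypothetical name: one row per row of obsv

setkey(joined, id)     # assumption: put the rows back in id order afterwards
```

<div><br></div><div>Note that setkey(obsv, loc) re-keys obsv by reference, so obsv itself stays keyed on loc unless you set the key back.</div>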

<div><br></div><div>If you really want to filter down and only add name...</div><div><br></div><div>> setkey(locs[, list(loc, name)], loc)[setkey(obsv, loc)]</div><div><br></div><div>For the record, I personally define infix operators to make this look nicer:</div>

<div><br></div><div><div>`%lj%`=function(x,y) y[x]</div><div>`%rj%`=function(x,y) x[y]</div><div>`%oj%`=function(x,y) merge(x,y,all=TRUE)</div><div>`%ij%`=function(x,y) x[y, nomatch=0]</div><div><br></div></div><div>I should probably do some type checking, but I preferred the one-liners. The above then becomes:</div>

<div><br></div><div>> setkey(locs[, list(loc, name)], loc) %rj% setkey(obsv, loc)</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

This unexpected and silent behavior could easily cause very bad results.  (When I first did this test I used 1,2 and 3 instead of 10,20 and 30 and everything seemed to work!)  I think this inconsistency should be addressed.  For example, consider modifying joins so that they only happen when the i argument is a data.table or a list.  If it is a character, then it should fail.  Part of the problem here is the inconsistency that A[,col1] returns a vector of characters, but A[,list(col1,col2)] returns a data.table.  If instead, data.tables were always returned unless, say, a simplify=TRUE argument was provided, then we'd be in better shape because locs[obsv[,loc]] would always be a join and locs[obsv[,loc,simplify=TRUE]] would be a row retrieval as for data.frame.<br>

</blockquote><div><br></div><div>I personally prefer that locs[, loc] returns a vector, as that's more consistent with the rest of the language. I agree that having DT[, list(col1, col2)] return a DT is kind of confusing; it would be more consistent to have it return a list, and to write DT[, data.table(col1, col2)] when you want a data.table back.</div>
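<div><br></div><div>To make the difference concrete, here is a small sketch using the locs table from the quoted message. The variable names v and dt are placeholders of mine, and the behaviour shown is what I'd expect from a current data.table; I haven't checked it against every older release.</div><div><br></div>

```r
library(data.table)

# Lookup table from the quoted message.
locs = data.table(loc=c(30,20,10), name=c("foo","bar","baz"), other=letters[1:3], key='loc')

v  = locs[, name]             # a single bare column in j: plain character vector
dt = locs[, list(loc, name)]  # list(...) in j: a data.table with two columns

is.character(v)      # v is an ordinary vector
is.data.table(dt)    # dt is still a data.table
```

<div><br></div><div>This is why locs[obsv[,loc]] ends up as row indexing: the inner call hands back a bare numeric vector, not a table to join on.</div>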

<div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Anyway, the solution to the above seems to be to create a list object for i:<br>

<br>

> locs[list(obsv[,loc])]<br>

Error in `[.data.table`(locs, list(obsv[, loc])) :<br>

  typeof x.loc (integer) != typeof i.V1 (double)<br>

<br>

but that doesn't work because obsv$loc is class numeric and locs$loc is class integer.  This is because locs$loc is silently changed to integer when the key is set.  So, to perform a lookup we need to coerce to integer as follows:<br>


<br></blockquote><div><br></div><div>Get the latest version: doubles are allowed in keys there, so there is no coercion to int.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


I suppose one reply is that I should just temporarily set the key for obsv and then reassign the entire obsv data.table.  I.e.,<br>

<br>

> setkey(obsv,loc)<br>

> obsv=locs[obsv]<br>

> setkey(obsv,id)<br>

<br>

This works, but is somehow to my eyes particularly dissatisfying.  Keys must be reset twice.  Potentially large datasets must be reassigned in their entirety.  Another solution that performs in-place assignment is similar:<br>

</blockquote><div><br></div><div>I see what you are saying here. What I typically would have done is reassign obsv to the joined version. I typically don't find that setting key is the bottleneck, and I never profiled the reassignment...</div>

<div><br></div><div>> obsv <- setkey(locs[, list(loc, name)], loc) %rj% setkey(obsv, loc)</div><div><br></div><div>But I get the feeling that this reassignment is done efficiently, since data.table warns you about lots of things that are inefficient and it doesn't warn about this. Maybe Matt can chime in here.</div>
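<div><br></div><div>If the goal is only to add the one column in place without touching either key, a plain lookup with base R's match() inside := is another option. This is a sketch, not a keyed data.table join, and I haven't profiled it against the approaches above; locname is the column name from the quoted message.</div><div><br></div>

```r
library(data.table)

# Same toy tables as in the quoted message.
obsv = data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7), key='id')
locs = data.table(loc=c(30,20,10), name=c("foo","bar","baz"), other=letters[1:3], key='loc')

# match() gives, for each obsv$loc, the row position of that value in locs$loc;
# indexing locs$name with those positions yields the name for every observation.
# := adds the column by reference, so obsv is not copied or reassigned.
obsv[, locname := locs$name[match(loc, locs$loc)]]

key(obsv)   # obsv keeps its key on id
```

<div><br></div><div>Any loc with no entry in locs would get NA for locname, much like an unmatched row in locs[obsv].</div>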

<div><br></div><div>As for the rest of your comments, I do agree that having foreign keys (that seems to be what you are asking for) would be more efficient. I'm not sure how easy or hard it would be, either implementation-wise or syntactically. A natural join would also be nice to reduce friction; maybe it could be the default when there are no keys.</div>
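<div><br></div><div>For what it's worth, joining on a named column without setting keys can be sketched as below, but only under the assumption that you are on a data.table release that supports the on= argument (1.9.6 or later, which is newer than the version discussed in this thread). The i. prefix refers to columns of the table passed as i.</div><div><br></div>

```r
library(data.table)

# Same toy tables as in the quoted message.
obsv = data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7), key='id')
locs = data.table(loc=c(30,20,10), name=c("foo","bar","baz"), other=letters[1:3], key='loc')

# Assumes data.table >= 1.9.6: join obsv to locs on the shared column "loc"
# and add locs' name column to obsv by reference as locname.
obsv[locs, on="loc", locname := i.name]

key(obsv)   # the key on id is left alone
```

<div><br></div><div>That is close to the obsv[,locname := locs[obsv,name]] form asked for, with the matching column named explicitly instead of inferred.</div>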

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

> setkey(obsv,loc)<br>

> obsv[,locname:=locs[obsv][,name]]<br>

     id loc       mvar locname<br>

[1,]  1   1 -0.6648842     baz<br>

[2,]  3   1 -0.4477593     baz<br>

[3,]  4   1 -1.1300506     baz<br>

[4,]  6   1 -0.3041305     baz<br>

[5,]  2   2 -0.8239177     bar<br>

[6,]  7   2 -0.3416380     bar<br>

[7,]  5   3  1.2745693     foo<br>

> setkey(obsv,id)<br>

<br>

This is not so bad, but it would be a lot nicer to not have to set keys and to simply say:<br>

<br>

> obsv[,locname := locs[obsv,name]]<br>

<br>

This could be achieved if (1) joins were performed by matching commonly named columns (like an SQL natural join) if the two tables did not share keys of the same cardinality and types and (2) only explicitly listed columns were returned.  In my opinion, this idea of "natural joins" based on column names would simplify joins a lot, while making them more generally useful and intuitive.  If column names differed, then you might specify a list instead, e.g.<br>


<br>

> A[list(id=B$a_id), val]<br>

<br>

or maybe specify the mapping as an optional parameter that could be used if A and B did not have common columns and if A and B's keys differed, e.g.<br>

<br>

> A[B, val, map=c("id=a_id")]<br>

<br>

If joins matched by name, then the implementation could check if the key was sufficiently satisfied to be used and otherwise it would just perform a more conventional non-key'd join.<br><br>

</blockquote></div><br>