[datatable-help] Adding a column in place and join surprises

Mon Aug 6 18:58:11 CEST 2012

For the archives, this post was answer in a different thread :

http://r.789695.n4.nabble.com/Re-datatable-help-Digest-Vol-30-Issue-1-tp4638913.html

> Suppose I have a data.table with a unique identifier and value and another
> data.table with a cross-reference to that identifier and lots of other
> measurement data.  E.g. suppose my lookup table is "locs" and my
> measurement data is "obsv" as follows:
>
> obsv=data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7),
> key='id')
> locs=data.table(loc=c(30,20,10),name=c("foo","bar","baz"),
> other=letters[1:3], key='loc')
>
> I simply want to add the 'name' column from locs to the obsv table using
> :=.  But this quickly becomes really complicated because (1) the keys for
> the two data.tables differ (appropriately), (2) the key for locs is an
> integer, and (3) the return columns of a join always include the matching
> columns.
>
> First of all, the gotcha is that locs[obsv[,loc]] doesn't work.  This is
> because obsv[,loc] returns a numeric column, which is treated as indexing
> the row numbers.  Surprise!
>
>> locs[obsv[,loc]]
>      loc name other
> [1,]  NA <NA>  <NA>
> [2,]  NA <NA>  <NA>
> [3,]  NA <NA>  <NA>
> [4,]  NA <NA>  <NA>
> [5,]  NA <NA>  <NA>
> [6,]  NA <NA>  <NA>
> [7,]  NA <NA>  <NA>
>
> This unexpected and silent behavior could easily cause very bad results.
> (When I first did this test I used 1,2 and 3 instead of 10,20 and 30 and
> everything seemed to work!)  I think this inconsistency should be
> addressed.  For example, consider modifying joins so that they only happen
> when the i argument is a data.table or a list.  If it is a character, then
> it should fail.  Part of the problem here is the inconsistency that
> A[,col1] returns a vector of characters, but A[,list(col1,col2)] returns a
> data.table.  If instead, data.tables were always returned unless, say, a
> simplify=TRUE argument was provided, then we'd be in better shape because
> locs[obsv[,loc]] would always be a join and locs[obsv[,loc,simplify=TRUE]]
> would be a row retrieval as for data.frame.
>
> Anyway, the solution to the above seems to be to create a list object for
> i:
>
>> locs[list(obsv[,loc])]
> Error in `[.data.table`(locs, list(obsv[, loc])) :
>   typeof x.loc (integer) != typeof i.V1 (double)
>
> but that doesn't work because obsv$loc is class numeric and locs$loc is
> class integer.  This is because locs$loc is silently changed to integer
> when the key is set.  So, to perform a lookup we need to coerce to integer
> as follows:
>
>> locs[list(as.integer(obsv[,loc]))]
>      loc name other
> [1,]   1  baz     c
> [2,]   2  bar     b
> [3,]   1  baz     c
> [4,]   1  baz     c
> [5,]   3  foo     a
> [6,]   1  baz     c
> [7,]   2  bar     b
>
> But if I want a right-hand-side value for assignment, then I need just a
> single column.  However, there is an inconsistency: a join adds the extra
> join column names even if I specify only the column I want.
>
>> locs[list(as.integer(obsv[,loc])),name]
>      loc name
> [1,]   1  baz
> [2,]   2  bar
> [3,]   1  baz
> [4,]   1  baz
> [5,]   3  foo
> [6,]   1  baz
> [7,]   2  bar
>
> In my opinion, the j column should always return only the columns that the
> user requested.  If the user wants additional columns (possibly from the
> join) then they can be requested explicitly.  Anyway, the workaround for
> this is to make a chain statement:
>
>> locs[list(as.integer(obsv[,loc]))][,name]
> [1] "baz" "bar" "baz" "baz" "foo" "baz" "bar"
>
> That looks good!  Now to assign using := I write
>
>> obsv[,locname:=locs[list(as.integer(obsv[,loc]))][,name]]
>      id loc       mvar locname
> [1,]  1   1 -1.8721465     baz
> [2,]  2   2 -0.5341284     bar
> [3,]  3   1 -2.2243473     baz
> [4,]  4   1 -0.3093871     baz
> [5,]  5   3 -0.7879228     foo
> [6,]  6   1 -0.5672113     baz
> [7,]  7   2  1.0390285     bar
>
> I would argue that this is surprisingly obtuse and not intuitive.  Things
> would get even more complicated if locs had a 2-column key!
>
> I suppose one reply is that I should just temporarily set the key for obsv
> and then reassign the entire obsv data.table.  I.e.,
>
>> setkey(obsv,loc)
>> obsv=locs[obsv]
>> setkey(obsv,id)
>
> This works, but is somehow to my eyes particularly dissatisfying.  Keys
> must be reset twice.  Potentially large datasets must be reassigned in
> their entirety.  Another solution that performs in-place assignment is
> similar:
>
>> setkey(obsv,loc)
>> obsv[,locname:=locs[obsv][,name]]
>      id loc       mvar locname
> [1,]  1   1 -0.6648842     baz
> [2,]  3   1 -0.4477593     baz
> [3,]  4   1 -1.1300506     baz
> [4,]  6   1 -0.3041305     baz
> [5,]  2   2 -0.8239177     bar
> [6,]  7   2 -0.3416380     bar
> [7,]  5   3  1.2745693     foo
>> setkey(obsv,id)
>
> This is not so bad, but it would be a lot nicer to not have to set keys
> and to simply say:
>
>> obsv[,locname := locs[obsv,name]]
>
> This could be achieved if (1) joins were performed by matching commonly
> named columns (like an SQL natural join) if the the two tables did not
> share keys of the same cardinality and types and (2) only explicitly
> listed columns were returned.  In my opinion, this idea of "natural joins"
> based on column names would simplify joins a lot, while making them more
> generally useful and intuitive.  If column names differed, then you might
> specify a list instead, e.g.
>
>> A[list(id=B$a_id), val]
>
> or maybe specify the mapping as an optional parameter that could be used
> if A and B did not have common columns and if A and B's keys differed,
> e.g.
>
>> A[B, val, map=c("id=a_id")]
>
> If joins matched by name, then the implementation could check if the key
> was sufficiently satisfied to be used and otherwise it would just perform
> a more conventional non-key'd join.
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help