[datatable-help] Adding a column in place and join surprises
Matthew Dowle
mdowle at mdowle.plus.com
Mon Aug 6 18:58:11 CEST 2012
For the archives, this post was answer in a different thread :
http://r.789695.n4.nabble.com/Re-datatable-help-Digest-Vol-30-Issue-1-tp4638913.html
> Suppose I have a data.table with a unique identifier and value and another
> data.table with a cross-reference to that identifier and lots of other
> measurement data. E.g. suppose my lookup table is "locs" and my
> measurement data is "obsv" as follows:
>
> obsv=data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7),
> key='id')
> locs=data.table(loc=c(30,20,10),name=c("foo","bar","baz"),
> other=letters[1:3], key='loc')
>
> I simply want to add the 'name' column from locs to the obsv table using
> :=. But this quickly becomes really complicated because (1) the keys for
> the two data.tables differ (appropriately), (2) the key for locs is an
> integer, and (3) the return columns of a join always include the matching
> columns.
>
> First of all, the gotcha is that locs[obsv[,loc]] doesn't work. This is
> because obsv[,loc] returns a numeric column, which is treated as indexing
> the row numbers. Surprise!
>
>> locs[obsv[,loc]]
> loc name other
> [1,] NA <NA> <NA>
> [2,] NA <NA> <NA>
> [3,] NA <NA> <NA>
> [4,] NA <NA> <NA>
> [5,] NA <NA> <NA>
> [6,] NA <NA> <NA>
> [7,] NA <NA> <NA>
>
> This unexpected and silent behavior could easily cause very bad results.
> (When I first did this test I used 1,2 and 3 instead of 10,20 and 30 and
> everything seemed to work!) I think this inconsistency should be
> addressed. For example, consider modifying joins so that they only happen
> when the i argument is a data.table or a list. If it is a character, then
> it should fail. Part of the problem here is the inconsistency that
> A[,col1] returns a vector of characters, but A[,list(col1,col2)] returns a
> data.table. If instead, data.tables were always returned unless, say, a
> simplify=TRUE argument was provided, then we'd be in better shape because
> locs[obsv[,loc]] would always be a join and locs[obsv[,loc,simplify=TRUE]]
> would be a row retrieval as for data.frame.
>
> Anyway, the solution to the above seems to be to create a list object for
> i:
>
>> locs[list(obsv[,loc])]
> Error in `[.data.table`(locs, list(obsv[, loc])) :
> typeof x.loc (integer) != typeof i.V1 (double)
>
> but that doesn't work because obsv$loc is class numeric and locs$loc is
> class integer. This is because locs$loc is silently changed to integer
> when the key is set. So, to perform a lookup we need to coerce to integer
> as follows:
>
>> locs[list(as.integer(obsv[,loc]))]
> loc name other
> [1,] 1 baz c
> [2,] 2 bar b
> [3,] 1 baz c
> [4,] 1 baz c
> [5,] 3 foo a
> [6,] 1 baz c
> [7,] 2 bar b
>
> But if I want a right-hand-side value for assignment, then I need just a
> single column. However, there is an inconsistency: a join adds the extra
> join column names even if I specify only the column I want.
>
>> locs[list(as.integer(obsv[,loc])),name]
> loc name
> [1,] 1 baz
> [2,] 2 bar
> [3,] 1 baz
> [4,] 1 baz
> [5,] 3 foo
> [6,] 1 baz
> [7,] 2 bar
>
> In my opinion, the j column should always return only the columns that the
> user requested. If the user wants additional columns (possibly from the
> join) then they can be requested explicitly. Anyway, the workaround for
> this is to make a chain statement:
>
>> locs[list(as.integer(obsv[,loc]))][,name]
> [1] "baz" "bar" "baz" "baz" "foo" "baz" "bar"
>
> That looks good! Now to assign using := I write
>
>> obsv[,locname:=locs[list(as.integer(obsv[,loc]))][,name]]
> id loc mvar locname
> [1,] 1 1 -1.8721465 baz
> [2,] 2 2 -0.5341284 bar
> [3,] 3 1 -2.2243473 baz
> [4,] 4 1 -0.3093871 baz
> [5,] 5 3 -0.7879228 foo
> [6,] 6 1 -0.5672113 baz
> [7,] 7 2 1.0390285 bar
>
> I would argue that this is surprisingly obtuse and not intuitive. Things
> would get even more complicated if locs had a 2-column key!
>
> I suppose one reply is that I should just temporarily set the key for obsv
> and then reassign the entire obsv data.table. I.e.,
>
>> setkey(obsv,loc)
>> obsv=locs[obsv]
>> setkey(obsv,id)
>
> This works, but is somehow to my eyes particularly dissatisfying. Keys
> must be reset twice. Potentially large datasets must be reassigned in
> their entirety. Another solution that performs in-place assignment is
> similar:
>
>> setkey(obsv,loc)
>> obsv[,locname:=locs[obsv][,name]]
> id loc mvar locname
> [1,] 1 1 -0.6648842 baz
> [2,] 3 1 -0.4477593 baz
> [3,] 4 1 -1.1300506 baz
> [4,] 6 1 -0.3041305 baz
> [5,] 2 2 -0.8239177 bar
> [6,] 7 2 -0.3416380 bar
> [7,] 5 3 1.2745693 foo
>> setkey(obsv,id)
>
> This is not so bad, but it would be a lot nicer to not have to set keys
> and to simply say:
>
>> obsv[,locname := locs[obsv,name]]
>
> This could be achieved if (1) joins were performed by matching commonly
> named columns (like an SQL natural join) if the the two tables did not
> share keys of the same cardinality and types and (2) only explicitly
> listed columns were returned. In my opinion, this idea of "natural joins"
> based on column names would simplify joins a lot, while making them more
> generally useful and intuitive. If column names differed, then you might
> specify a list instead, e.g.
>
>> A[list(id=B$a_id), val]
>
> or maybe specify the mapping as an optional parameter that could be used
> if A and B did not have common columns and if A and B's keys differed,
> e.g.
>
>> A[B, val, map=c("id=a_id")]
>
> If joins matched by name, then the implementation could check if the key
> was sufficiently satisfied to be used and otherwise it would just perform
> a more conventional non-key'd join.
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
More information about the datatable-help
mailing list