[datatable-help] datatable-help Digest, Vol 30, Issue 1
mdowle at mdowle.plus.com
Mon Aug 6 18:53:36 CEST 2012
> On Aug 2, 2012, at 1:13 PM, Yike Lu wrote:
>> David Kulp wrote :
>> Suppose I have a data.table with a unique identifier and value and
>> another data.table with a cross-reference to that identifier and lots of
>> other measurement data. E.g. suppose my lookup table is "locs" and my
>> measurement data is "obsv" as follows:
>> obsv=data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7),
>> other=letters[1:3], key='loc')
>> I simply want to add the 'name' column from locs to the obsv table using
>> :=. But this quickly becomes really complicated because (1) the keys
>> for the two data.tables differ (appropriately), (2) the key for locs is
>> an integer, and (3) the return columns of a join always include the
>> matching columns.
>> First of all, the gotcha is that locs[obsv[,loc]] doesn't work. This is
>> because obsv[,loc] returns a numeric column, which is treated as
>> indexing the row numbers. Surprise!
See FAQ 1.3:
Q: Why does DT[,region] return a vector? I'd like a 1-column data.table.
There is no drop argument like I'm used to in data.frame.
A: Try DT[,list(region)] instead.
Also ?data.table has this when describing j :
"[...] A single column or single expression returns that type, usually a
vector. See the examples."
where the examples include :
DT[,v] # v column (as vector)
DT[,list(v)] # v column (as data.table)
When a numeric vector is then passed as i, that's row lookup by row
number, so intended behaviour.
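To illustrate the two cases side by side (a small sketch; the table here is invented for the example):

```r
library(data.table)

DT <- data.table(region = c("N", "S", "E"), v = 1:3, key = "region")

DT[, v]          # single column as a plain vector
DT[, list(v)]    # same column wrapped in list() : a 1-column data.table

# A numeric vector in i is row lookup by row number, not a join :
DT[c(2, 3)]      # rows 2 and 3
```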
>> > locs[obsv[,loc]]
>> loc name other
>> [1,] NA <NA> <NA>
>> [2,] NA <NA> <NA>
>> [3,] NA <NA> <NA>
>> [4,] NA <NA> <NA>
>> [5,] NA <NA> <NA>
>> [6,] NA <NA> <NA>
>> [7,] NA <NA> <NA>
>> Actually the standard way to do this would be
>> > setkey(locs, loc)[setkey(obsv, loc)]
Yes that's one way. Another is merge.data.table, which does that, but
without modifying the keys of the objects.
merge.data.table has always had its own help page but as from 1.8.2,
typing ?merge brings up a choice of which merge help page (base or
data.table) to view. To see how merge.data.table handles the change in
keys (it's a fairly short function), take a look at its source.
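For illustration (locs is not defined anywhere in this digest, so its contents below are a guess consistent with the later output; passing by= to merge() on unkeyed tables assumes a recent data.table):

```r
library(data.table)

locs <- data.table(loc = c(10L, 20L, 30L), name = c("baz", "bar", "foo"))
obsv <- data.table(id = 1:7, loc = c(10L, 20L, 10L, 10L, 30L, 10L, 20L))

# merge() joins on the named column without re-keying either table :
m <- merge(locs, obsv, by = "loc")

# whereas the setkey() idiom changes both tables' keys by reference :
setkey(locs, loc)
setkey(obsv, loc)
j <- locs[obsv]
```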
> Thanks for this followup. I didn't realize that setkey returned the
> data.table. This makes for an interesting idiom!
> At least now I know what to do, but I stumbled a lot in the process. From
> a design philosophy perspective, I see data.table as designed around
> specific ways of doing things (e.g. assuming that keys are set for joins,
> assuming that the first columns of Y match the key of X or that Y has a
> key, etc.). If your needs don't match these design assumptions, then you
> have to make modifications. An alternative approach is to write the
> syntax for the general case, but implement important optimizations. For
> example, always perform a natural join (matching corresponding names in X
> and Y) for any X[Y], add something like X[Y, by=(Xcol1=Ycol1, ...)] and if
> the matching uses the key for X, then all the better.
True and agreed. But merge() may be that general case. This is why merge's
performance has been improved in recent versions, so its speed is
comparable to X[Y] but with more flexible capabilities.
Also see FAQ 1.12 "What is the difference between X[Y] and merge(X,Y)?"
>> It would look nicer as 3 separate lines, but you catch my drift. This
>> explicitly tells data.table which keys to join on.
>> If you really want to filter down and only add name...
>> > setkey(locs[, list(loc, name)], loc)[setkey(obsv, loc)]
>> For the record, I personally define infix operators to make this look
>> nicer:
>> `%lj%`=function(x,y) y[x]
>> `%rj%`=function(x,y) x[y]
>> `%oj%`=function(x,y) merge(x,y)
>> `%ij%`=function(x,y) x[y, nomatch=0]
>> I should probably do some type checking, but I preferred the one liners.
>> The above then becomes:
>> > setkey(locs[, list(loc, name)], loc) %rj% setkey(obsv, loc)
>> This unexpected and silent behavior could easily cause very bad results.
>> (When I first did this test I used 1,2 and 3 instead of 10,20 and 30
>> and everything seemed to work!) I think this inconsistency should be
>> addressed. For example, consider modifying joins so that they only
>> happen when the i argument is a data.table or a list.
>> If it is a character, then it should fail.
That's an interesting idea. I'm now so used to passing character vectors as i
that I wouldn't like to lose that. There isn't any ambiguity, so what else
could it mean (is the thinking)? An option could be added to turn that off
and make it work the way you want, though. How about new option
datatable.nocharacteri=TRUE? By default this would be FALSE, since code
depends on that feature. When TRUE, passing a character vector as i would
generate an error.
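To show what would be lost, here is the character-as-i feature in a small sketch (the table is invented for the example):

```r
library(data.table)

DT <- data.table(grp = c("a", "a", "b", "c"), v = 1:4, key = "grp")

# A character vector in i joins to the (character) key :
DT["b"]            # rows where grp == "b"
DT[c("a", "c")]    # groups "a" and "c"
```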
>> Part of the problem here is the
>> inconsistency that A[,col1] returns a vector of characters, but
>> A[,list(col1,col2)] returns a data.table. If instead, data.tables were
>> always returned unless, say, a simplify=TRUE argument was provided, then
>> we'd be in better shape because locs[obsv[,loc]] would always be a join
>> and locs[obsv[,loc,simplify=TRUE]] would be a row retrieval as it is now.
That would be one approach. But then there would be 4 cases :
obsv[,loc]                        # simplify at its default
obsv[,loc,simplify=TRUE]
obsv[,list(loc)]
obsv[,list(loc),simplify=TRUE]
and I'm getting confused what each should do (let alone having to remember
what the default for simplify might be). So there isn't a simplify
argument, and we just have 2 cases :
obsv[,loc] # returns vector
obsv[,list(loc)] # returns data.table
which is a little simpler. In the rare cases you really want a list and
not a data.table, then wrap the result with as.list() :
as.list(obsv[,list(loc)])
When 'by' is used too, it's very important for speed that (the iterated) j
is a (primitive) list. Knowing that, it's easier to see why wrapping with
as.list() might rarely be done (only if a raw list result is required
rather than a data.table).
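For example, using the obsv table from earlier in the thread (the mvar values will differ on each run since rnorm() is random):

```r
library(data.table)

obsv <- data.table(id = 1:7, loc = c(10, 20, 10, 10, 30, 10, 20),
                   mvar = rnorm(7))

# j evaluating to a (primitive) list is what keeps grouping fast :
res <- obsv[, list(mean.mvar = mean(mvar)), by = loc]

# and the rare raw-list case is just a wrap away :
as.list(obsv[, list(loc)])
```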
>> I personally prefer the return of vector when using locs[, loc] as it's
>> more consistent with the rest of the language. I agree that having DT[,
>> list(col1, col2)] return a DT is kind of confusing, would be more
>> consistent to have it return a list and then just have DT[,
>> data.table(col1, col2)]
Perhaps, but this design decision is for speed, and it makes a big
difference: list() is primitive. FAQ 2.6 mentions it.
FAQ 2.6 :
Q: Why has the DT() alias been removed?
A: DT() was introduced originally as a wrapper for a list of j
expressions. Since DT was an alias for data.table, this was a convenient
way to take care of silent recycling in cases where each item of the j
list evaluated to different lengths. The alias was one reason grouping
was slow, though. As of v1.3, list() should be passed instead to the j
argument. list() is a primitive and is much faster, especially when there
are many groups. [Snip...]
>> Anyway, the solution to the above seems to be to create a list object
>> for i:
>> > locs[list(obsv[,loc])]
>> Error in `[.data.table`(locs, list(obsv[, loc])) :
>> typeof x.loc (integer) != typeof i.V1 (double)
That's one way, but I usually use J() so it's clear I mean to join :
locs[J(obsv$loc)]
>> but that doesn't work because obsv$loc is class numeric and locs$loc is
>> class integer. This is because locs$loc is silently changed to integer
>> when the key is set. So, to perform a lookup we need to coerce to
>> integer as follows:
>> Get the latest version, doubles are allowed in keys - no coercion to
>> integer.
> Thank you!
>> I suppose one reply is that I should just temporarily set the key for
>> obsv and then reassign the entire obsv data.table. I.e.,
>> > setkey(obsv,loc)
>> > obsv=locs[obsv]
>> > setkey(obsv,id)
>> This works, but is somehow to my eyes particularly dissatisfying. Keys
>> must be reset twice. Potentially large datasets must be reassigned in
>> their entirety. Another solution that performs in-place assignment is
>> I see what you are saying here. What I typically would have done is
>> reassign obsv to the joined version. I typically don't find that setting
>> key is the bottleneck, and I never profiled the reassignment...
>> > obsv <- setkey(locs[, list(loc, name)], loc) %rj% setkey(obsv, loc)
>> But I get the feeling that this reassign is done efficiently, as there
>> are lots of things that data.table warns you are inefficient. Maybe Matt
>> can chime in here.
Essentially David seems to be looking for secondary keys. Secondary keys
can be done manually now without too much difficulty (search
datatable-help archive for "secondary key"). There is a feature request
(#1007) to build secondary keys into the syntax :
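A minimal sketch of the manual approach (the idx table and its column names are invented for illustration; see the archive thread for the full treatment):

```r
library(data.table)

obsv <- data.table(id = 1:7, loc = c(10L, 20L, 10L, 10L, 30L, 10L, 20L),
                   key = "loc")   # primary key is loc

# Manual secondary key on id : map each id to its row number in obsv
idx <- data.table(id = obsv$id, row = seq_len(nrow(obsv)), key = "id")

# Lookup by id without re-keying obsv : binary search idx, then row lookup
obsv[idx[J(3L), row]]
```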
>> As for the rest of your comments, I do agree that having foreign keys
>> (that seems to be what you are asking for) would be more efficient. Not
>> sure how easy or hard it would be, both implementation wise and
>> syntactically. Also natural join would be nice to reduce frictions -
>> maybe use it as a default in the case where there are no keys.
>> > setkey(obsv,loc)
>> > obsv[,locname:=locs[obsv][,name]]
>> id loc mvar locname
>> [1,] 1 1 -0.6648842 baz
>> [2,] 3 1 -0.4477593 baz
>> [3,] 4 1 -1.1300506 baz
>> [4,] 6 1 -0.3041305 baz
>> [5,] 2 2 -0.8239177 bar
>> [6,] 7 2 -0.3416380 bar
>> [7,] 5 3 1.2745693 foo
>> > setkey(obsv,id)
>> This is not so bad, but it would be a lot nicer to not have to set keys
>> and to simply say:
>> > obsv[,locname := locs[obsv,name]]
>> This could be achieved if (1) joins were performed by matching commonly
>> named columns (like an SQL natural join) if the two tables did not
>> share keys of the same cardinality and types and (2) only explicitly
>> listed columns were returned. In my opinion, this idea of "natural
>> joins" based on column names would simplify joins a lot, while making
>> them more generally useful and intuitive. If column names differed,
>> then you might specify a list instead, e.g.
>> > A[list(id=B$a_id), val]
>> or maybe specify the mapping as an optional parameter that could be used
>> if A and B did not have common columns and if A and B's keys differed,
>> > A[B, val, map=c("id=a_id")]
This is what merge.data.table already does, iiuc. Albeit with feature
request #2033 "Add by.x and by.y to merge.data.table" :
With secondary keys we may be heading for a new argument to [.data.table
just like that, yes.
>> If joins matched by name, then the implementation could check if the key
>> was sufficiently satisfied to be used and otherwise it would just
>> perform a more conventional non-key'd join.
Just to check here you know that i doesn't have to be keyed. Just x. It's
not a match by column name, but by position, though (which I prefer since
I find it onerous to make sure column names match). Note this from
?data.table :
" When i is a data.table, x must have a key. i is joined to x using x's
key and the rows in x that match are returned. An equi-join is
performed between each column in i to each column in x's key; i.e.,
column 1 of i is matched to the 1st column of x's key, column 2 to the
second, etc. "
Note that "If i also has a key" comes later in that paragraph; i.e., i
doesn't have to be keyed.
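A quick sketch of that positional match (the tables are invented for the example):

```r
library(data.table)

X <- data.table(loc = c(10L, 20L, 30L), name = c("baz", "bar", "foo"),
                key = "loc")

# i has no key and its column name doesn't match X's; column 1 of i is
# still matched to the 1st column of X's key, by position :
i <- data.table(whatever = c(10L, 30L))
X[i]
```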
However, #2175 now added ("Add natural joins i.e. X[Y] when X has no key
could use common column names"), thanks. I think it might be covered by
merge() though as that's what that does. Maybe not in combination with j
(that's the efficiency limitation of merge) :
Finally, there is a feature request #1757 "Add drop to [.data.table", for
the case you mentioned somewhere to avoid all join columns (or by columns)
being returned, indeed useful with := and other cases. Strictly speaking
drop is already an argument but deliberately doesn't do anything. The
reason it's there is documented in ?data.table. So this feature request
would start to use it, rather than add it.
In summary, hope that adds some colour to existing and deliberate
behaviour that is unlikely to change, whilst also pointing to
merge.data.table, manual secondary keys and other feature requests.