[datatable-help] Using a data.table to perform a two-way lookup

Karl Ove Hufthammer karl at huftis.org
Tue Apr 12 09:57:50 CEST 2011


Dear list members,

I’m having some problems using a data.table to perform a two-way lookup.
Here’s a simple example of what I’m trying to do:

I have a data.frame/data.table that looks like this:
x   y
1   a
1   b
2   a
3   c

For a certain value in ‘y’, I want to extract the corresponding values of
‘x’. Then I want to use these values of ‘x’ to extract the corresponding
values of ‘y’. So if I start with y=a, I will get x=1,2, which will give me
y=a,b,a (or preferably just y=a,b). (I would then iterate this until an
iteration no longer increases the length of y, so that I finally get all ‘y’
values connected to y=a.)
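
The iteration described above can be sketched in base R (deliberately not data.table, so this says nothing about the performance problem discussed further down). The helper name connected_y is hypothetical, and the stopping rule is the one stated in the text: stop once a pass no longer increases the number of ‘y’ values.

```r
# Hypothetical helper (not part of data.table): collect every 'y' value
# reachable from 'start' by alternating y -> x -> y lookups.
connected_y <- function(dat, start) {
  ys <- unique(start)
  repeat {
    xs     <- unique(dat$x[dat$y %in% ys])   # x values linked to the current y set
    new_ys <- unique(dat$y[dat$x %in% xs])   # y values linked to those x values
    if (length(new_ys) == length(ys)) break  # no growth: stop iterating
    ys <- new_ys
  }
  sort(ys)
}

# The example table from this message, with character columns.
dat <- data.frame(x = c("1", "1", "2", "3"),
                  y = c("a", "b", "a", "c"),
                  stringsAsFactors = FALSE)

connected_y(dat, "a")   # starting from y=a
connected_y(dat, "c")   # starting from y=c
```

For the example table, starting from y=a should give a and b, while y=c is connected only to itself.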

I have found one way of achieving this, creating two identical data.tables
with different keys:

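  library(data.table)
  options(stringsAsFactors=FALSE)   # keep x and y as character columns
  dat <- data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
  dat
  # Two copies of the same table: A is keyed on x, B is keyed on y
  A <- B <- data.table(dat)
  key(A)="x"
  key(B)="y"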

  A[B["a"][,x]][,y]

The problem is performance (my real-life data.table is *much* larger),
since B["a"][,x] outputs a character vector. When this is used in A[…], the
character vector is converted to a factor with appropriate levels, and it
turns out (shown using ‘Rprof’) that the majority of the running time of the
function is taken up by ‘levels<-’, i.e., creating this factor / attaching
the levels.

I believe one potential solution would be to make both ‘x’ and ‘y’
factors, so that there is no conversion to/from characters. This would
eliminate both the conversion ‘"a" to factor’ and ‘B["a"][,x] to factor’.
However, ‘data.table’ doesn’t accept a factor as ‘i’ (and if I convert
it to the internal numeric codes, it thinks I mean row numbers).
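
To make the row-number pitfall concrete, here is a small sketch. It assumes a current data.table release, so it uses setkey() and .() rather than the key(A)="x" form; the names f and codes are only illustrative.

```r
library(data.table)

A <- data.table(x = c("1", "1", "2", "3"),
                y = c("a", "b", "a", "c"))
setkey(A, x)   # assumption: modern setkey() instead of key(A)="x"

# A factor holding the x values we want to look up.
f     <- factor(c("2", "3"), levels = c("1", "2", "3"))
codes <- as.integer(f)   # internal codes: 2 and 3

by_rownum <- A[codes]                 # integer i selects rows 2 and 3
by_key    <- A[.(as.character(f))]    # keyed lookup on x = "2" and x = "3"

by_rownum$y
by_key$y
```

The two results differ: the integer codes pick rows by position, whereas the keyed lookup matches on the values of ‘x’.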

Any suggestions on how to solve this?

I also wonder if it is possible for a data.table to only return unique
values of a column? For the above example I would like the output y=a,b.
Note that for instance

  A[B["a"][,x],mult="first"][,y]

does not work, as this returns the first value of ‘y’ for each x (here
y=a,a).
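
One way to get only the distinct values is to wrap the looked-up column in unique(). The sketch below assumes a current data.table release (setkey(), copy() and .() instead of the key<- form used in this message); xs and ys are illustrative names.

```r
library(data.table)

dat <- data.table(x = c("1", "1", "2", "3"),
                  y = c("a", "b", "a", "c"))

# Two copies of the table with different keys, as in the message.
A <- copy(dat); setkey(A, x)
B <- copy(dat); setkey(B, y)

xs <- B[.("a"), x]           # x values for y = "a"
ys <- unique(A[.(xs), y])    # all matching y values, de-duplicated

ys
```

Here unique() operates on the plain character vector returned by the lookup, so it does not depend on ‘mult’.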

My last question is why

  A[B["a"][,x], y, mult="all"]

returns a two-column data.table/data.frame, while

  A[B["a"][,x], y, mult="first"]

returns a character vector. I would expect both of them to return a
character vector. Is this a feature or simply a bug?

-- 
Regards,
Karl Ove Hufthammer

