[datatable-help] Using a data.table to perform a two-way lookup
Karl Ove Hufthammer
karl at huftis.org
Tue Apr 12 09:57:50 CEST 2011
Dear list members,
I’m having some problems using a data.table to perform a two-way lookup.
Here’s a simple example of what I’m trying to do:
I have a data.frame/data.table that looks like this:
x y
1 a
1 b
2 a
3 c
For a certain value in ‘y’, I want to extract the corresponding values of
‘x’. Then I want to use these values of ‘x’ to extract the corresponding
values of ‘y’. So if I start with y=a, I will get x=1,2, which will give me
y=a,b,a (or preferably just y=a,b). (I would then iterate this until an
iteration no longer increases the length of y, so that I finally get all ‘y’
values connected to y=a.)
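To make the intended result concrete, the iteration described above can be
sketched in plain base R. This only illustrates the desired output on the toy
data and is not the data.table approach and not tuned for speed; the helper
name connected_y is hypothetical.

```r
# Toy data from the example above.
dat <- data.frame(x = c("1", "1", "2", "3"),
                  y = c("a", "b", "a", "c"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: collect every 'y' value reachable from 'start'
# by alternating y -> x and x -> y lookups until nothing new is found.
connected_y <- function(dat, start) {
  ys <- unique(start)
  repeat {
    xs <- unique(dat$x[dat$y %in% ys])        # y -> x lookup
    new_ys <- union(ys, dat$y[dat$x %in% xs]) # x -> y lookup, deduplicated
    if (length(new_ys) == length(ys)) break   # no growth: stop iterating
    ys <- new_ys
  }
  ys
}

connected_y(dat, "a")
```

Starting from y=a, this should stop after the second pass, once the set of ‘y’
values no longer grows.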
I have found one way of achieving this, creating two identical data.tables
with different keys:
options(stringsAsFactors=FALSE)
dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
dat
A <- B <- data.table(dat)
key(A)="x"
key(B)="y"
A[B["a"][,x]][,y]
The problem is performance (my real-life data.table is *much* larger),
since B["a"][,x] outputs a character vector. When this is used in A[…], the
character vector is converted to a factor with appropriate levels, and it turns
out (shown using ‘Rprof’) that the majority of the time running the
function is taken up by ‘levels<-’, i.e., creating this factor / attaching
the levels.
I believe one potential solution would be to make both ‘x’ and ‘y’
factors, so that there is no conversion to/from characters. This would
eliminate both the conversion ‘"a" to factor’ and ‘B["a"][,x] to factor’.
However, ‘data.table’ doesn’t accept ‘i’ being a factor (and if I convert
it to the internal numeric codes, it thinks I mean row numbers).
Any suggestions on how to solve this?
I also wonder if it is possible for a data.table to only return unique
values of a column? For the above example I would like the output y=a,b.
Note that for instance
A[B["a"][,x],mult="first"][,y]
does not work, as this returns the first value of ‘y’ for each x (here
y=a,a).
My last question is why
A[B["a"][,x], y, mult="all"]
returns a two-column data.table/data.frame, while
A[B["a"][,x], y, mult="first"]
returns a character vector. I would expect both of them to return a
character vector. Is this a feature or simply a bug?
--
Regards,
Karl Ove Hufthammer