[datatable-help] Using a data.table to perform a two-way lookup

Short, Tom TShort at epri.com
Tue Apr 12 12:53:45 CEST 2011


> -----Original Message-----
> From: datatable-help-bounces at r-forge.wu-wien.ac.at 
> [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On 
> Behalf Of Karl Ove Hufthammer
>
> I have found one way of achieving this, creating two 
> identical data.tables with different keys:
> 
>   options(stringsAsFactors=FALSE)
>   dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
>   dat
>   A <- B <- data.table(dat)
>   key(A)="x"
>   key(B)="y"
> 
>   A[B["a"][,x]][,y]
> 
> The problem is performance (my real-life data.table is *much* 
> larger), since B["a"][,x] outputs a character vector. When 
> this is used in A[...], the character is converted to a factor 
> with appropriate levels, and it turns out (shown using 
> 'Rprof') that the majority of the time running the function 
> is taken up by 'levels<-', i.e., creating this factor / 
> attaching the levels.
> 
> I believe one potential solution would be to have both 'x' 
> and 'y' being factors, so that there is no conversion to/from 
> characters. This would eliminate both the conversion '"a" to 
> factor' and 'B["a"][,x] to factor'.
> However, 'data.table' doesn't accept 'i' being a factor (and 
> if I convert it to the internal numeric codes, it thinks I 
> mean row numbers).
> 
> Any suggestions on how to solve this?

To answer part of your question: you can pass a factor as i by wrapping
it in J(). data.table then joins on the key instead of treating the
value as row numbers:

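options(stringsAsFactors=TRUE)   # store x and y as factors
dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
A <- B <- data.table(dat)
key(A)="x"                       # A is keyed on x
key(B)="y"                       # B is keyed on y
# inner lookup returns the x values for y == "a"; J() joins A on them
A[J(B["a"][,x])][,y]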
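
For readers on a newer data.table, here is a minimal sketch of the same
two-way lookup. It assumes a release where keys are set with setkey()
and character key columns are joined directly, so no factor conversion
is involved. The names xs and ys are only illustrative and do not come
from the thread.

```r
library(data.table)

# Same toy data as above; data.table() keeps strings as character by default.
A <- data.table(x = c("1", "1", "2", "3"),
                y = c("a", "b", "a", "c"))

# setkey() works by reference, so B must be a real copy, not an alias of A.
B <- copy(A)
setkey(A, x)   # A is keyed on x
setkey(B, y)   # B is keyed on y

# Step 1: look up y == "a" in B and collect the matching x values.
xs <- B[J("a"), x]   # "1" "2"

# Step 2: look up those x values in A and return every y that shares them.
ys <- A[J(xs), y]    # "a" "b" "a"
print(ys)
```

Note that A <- B <- data.table(dat) followed by setkey() on each name
would leave both names pointing at one table, which is why the sketch
uses copy().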

- Tom

