[datatable-help] Using a data.table to perform a two-way lookup

Thu Apr 14 10:02:08 CEST 2011

On Wed, 2011-04-13 at 04:04 -0700, Karl Ove Hufthammer wrote:
> Thank you for your detailed reply. Comments below:
> 
> On Wed, 13 Apr 2011 10:05:56 +0100, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
> >>>   options(stringsAsFactors=FALSE)
> >>>   dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
> >>>   dat
> >>>   A <- B <- data.table(dat)
> >>>   key(A)="x"
> >>>   key(B)="y"
> >>> 
> >>>   A[B["a"][,x]][,y]
> >>> 
> >>> The problem is performance (my real-life data.table is *much* 
> >>> larger), since B["a"][,x] outputs a character vector.
> > 
> > Not character, B["a"][,x] returns a factor for me.
> 
> Did you remember to run the line ‘options(stringsAsFactors=FALSE)’?
> The ‘x’ column in ‘B’ is a character vector when the data.table is
> created from a data.frame with ‘x’ as a character vector (but a
> factor if I create the data.table directly).
I just created the data.table directly. Ok, I'm with you now.

> I have about 150,000 levels in one of the keys and 30,000 in the other.
Thanks. Might be the known issue, but looking at your code it's likely
something more basic.

> I’ve tried to come up with a similar generated dataset and example code.
> The result is much faster than similar code on my real data set, but still
> shows (using ‘Rprof’) that ‘levels<-’ is the main bottle-neck, as about
> one third of the time is spent there. Here’s the code. I’ve included both
> a tiny example to show how the function is supposed to work, and a larger
> and slower example (which is still pretty fast, about thirty seconds
> on my computer):
> 
> ------------------------------------------------------------------------

I've been looking at the code for 40 minutes. It generates data and runs,
but I can't grasp the big picture. If I were doing iterative
connectedness I'd just iterate bulk joins until no new connections
turned up. I don't see why there is a processed vector as long as the
table, or why it appears to do just the first y on the first iteration
(why not all the unique y in one go on the first step?), or what the
group column added at the end represents.
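
To make "iterate bulk joins until no new connections turn up" concrete,
here is a minimal sketch. It is not the code from the original post. It
assumes a recent data.table (setkey, character key columns) and reuses
the toy data from earlier in the thread; the function name connected and
its argument start_y are made up for illustration.

```r
library(data.table)

# Toy edge list (same values as the small example earlier in the thread),
# with x kept as integer rather than character.
dat <- data.table(x = c(1L, 1L, 2L, 3L), y = c("a", "b", "a", "c"))
x2y <- copy(dat); setkey(x2y, x)   # lookup x -> y
y2x <- copy(dat); setkey(y2x, y)   # lookup y -> x

# Hypothetical helper: everything connected to one starting y value.
connected <- function(start_y) {
  ys <- start_y
  xs <- integer(0)
  repeat {
    # One bulk join per direction, using all values found so far.
    new_xs <- union(xs, y2x[J(ys), nomatch = 0L]$x)
    new_ys <- union(ys, x2y[J(new_xs), nomatch = 0L]$y)
    # The sets only ever grow, so unchanged lengths mean nothing new.
    if (length(new_xs) == length(xs) && length(new_ys) == length(ys)) break
    xs <- new_xs
    ys <- new_ys
  }
  list(x = sort(xs), y = sort(ys))
}

grp <- connected("a")
print(grp)
```

Each pass joins on the whole current set at once, so the number of
passes depends on how far the connections reach, not on the number of
rows in the table.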

In terms of the levels<- issue, could I ask for a simpler example that
isolates it, please?

Other thoughts:
Why can't all columns of x2y and y2x be factor?
More to the point, why store the x integers as character? Can't they be
kept as integers, so that the levels<- issue goes away? Is it at all
possible you didn't know that x2y[J(c(1,6,8))] joins using the integer
values and doesn't refer to the row numbers?
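
A small sketch of the difference, assuming a recent data.table and a
made-up x2y table (the real one is not shown in this message). It uses
integer literals such as 1L so the join values have the same type as the
key column.

```r
library(data.table)

# Hypothetical table: x is an integer key value, not a row number.
x2y <- data.table(x = c(8L, 1L, 6L, 3L, 1L),
                  y = c("e", "a", "d", "c", "b"))
setkey(x2y, x)   # sorts by x and marks x as the key

# J() in i joins on the key *values* 1, 6 and 8.
by_value <- x2y[J(c(1L, 6L, 8L))]
print(by_value)

# A bare integer vector in i selects *row numbers* instead.
by_row <- x2y[c(1L, 3L)]
print(by_row)
```

With x stored as integer there are no factor levels to rebuild on each
lookup.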

Matthew