[datatable-help] Using a data.table to perform a two-way lookup

Thu Apr 14 13:57:47 CEST 2011

On Thu, 14 Apr 2011 09:02:08 +0100, Matthew Dowle <mdowle at mdowle.plus.com>
wrote:
> On Wed, 2011-04-13 at 04:04 -0700, Karl Ove Hufthammer wrote:
>> I’ve tried to come up with a similar generated dataset and example code.
>> The result is much faster than similar code on my real data set, but
>> still shows (using ‘Rprof’) that ‘levels<-’ is the main bottle-neck, as
>> about one third of the time is spent there. Here’s the code. I’ve
>> included both a tiny example to show how the function is supposed to
>> work, and a larger and slower example (which is still pretty fast,
>> about thirty seconds on my computer):
> 
> I've been looking at the code for 40 minutes. It generates data and runs
> but I can't grasp the big picture. If I was doing iterative
> connectedness I'd just iterate bulk joins until no new connections
> turned up. I don't see why there is a processed vector as long as the
> table, or why it appears to do just the first y on the first iteration
> (why not all the unique y in one go on the first step?), or what the
> group column added at the end represents.

It’s very possible that I have been missing something obvious. The goal of
the algorithm is to group the y values that are connected. So a y value
with group index 5 can reach all other y values with group index 5, by
going to an x value, from the x value to a y value, perhaps back to an x
value, …, before finally ending up at the y value we are interested in.

I’m not sure what you mean by ‘iterate bulk joins’. Or, I thought that was
what I was doing … :)
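
To make the goal concrete, here is a minimal sketch of one possible reading
of ‘iterate bulk joins’: give every y its own label, then repeatedly let all
rows that share an x (and then all rows that share a y) take the smallest
label among them, until a full pass changes nothing. The table `edges`, the
column `grp` and the toy data are made up for illustration; this is not the
code from my real data set, and it may not be what was meant by the
suggestion.

```r
library(data.table)

# Hypothetical toy data: each row links one x value to one y value.
edges <- data.table(x = c("a", "a", "b", "c", "d", "d"),
                    y = c("p", "q", "q", "r", "r", "s"))

# Start with every distinct y in its own group, labelled by an integer.
edges[, grp := as.integer(factor(y))]

repeat {
  # Take a real copy, since := below updates the column in place.
  old <- copy(edges$grp)
  # Bulk step 1: rows sharing an x take the smallest label seen for that x.
  edges[, grp := min(grp), by = x]
  # Bulk step 2: rows sharing a y take the smallest label seen for that y.
  edges[, grp := min(grp), by = y]
  # Stop once a full pass changes nothing (no new connections turned up).
  if (identical(old, edges$grp)) break
}

# One row per y with its group index, renumbered 1, 2, ...
groups <- unique(edges[, list(y, grp)])
groups[, grp := as.integer(factor(grp))]
print(groups)
```

In this toy data p and q end up in one group (linked through x values a and
b) and r and s in another (linked through c and d). Each pass works on the
whole table at once rather than on one y value at a time.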

> In terms of the levels<-, could I ask for a simpler example isolating
> that please.

Well, I guess the relevant line is really just the one containing
  x2y[y2x[y.current,][,x]]
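
A minimal sketch isolating that two-step lookup, using made-up toy data
(`pairs` and its contents are assumptions; only `x2y`, `y2x` and `y.current`
are names from my code). It assumes both tables hold the same x/y pairs and
differ only in their key:

```r
library(data.table)

# Hypothetical toy data: one row per (x, y) link.
pairs <- data.table(x = c("a", "a", "b", "c"),
                    y = c("p", "q", "q", "r"))

# Two copies of the same pairs, keyed for lookup in each direction.
y2x <- copy(pairs); setkey(y2x, y)   # look up x values by y
x2y <- copy(pairs); setkey(x2y, x)   # look up y values by x

y.current <- "p"

# Step 1: all x values attached to the current y.
x.found <- y2x[y.current, ][, x]

# Step 2: all rows of x2y attached to those x values
# (same as x2y[y2x[y.current,][,x]]), then pull out the y column.
y.found <- x2y[x.found, ][, y]
print(y.found)
```

Here y value p is attached only to x value a, and a is attached to p and q,
so one step of the lookup reaches p and q.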

> Other thoughts :
> Why can't all columns of x2y and y2x be factor?

They can. They just happened to be characters. And having them as factors
doesn’t help, as the data.table functions don’t know whether they have
identical levels. (Though perhaps it’s faster to check if the levels are
identical, so you could skip the creation of a temporary object to match
the levels?)
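
The idea in the parenthesis could be sketched in plain R roughly as follows.
The helper `match_codes` is hypothetical and only illustrates the shortcut;
I am not claiming this is how data.table works internally.

```r
# Translate the integer codes of factor `from` into the level set of `to`.
# Hypothetical helper, for illustration only.
match_codes <- function(from, to) {
  if (identical(levels(from), levels(to))) {
    # Same levels in the same order: the codes are already comparable,
    # so no temporary remapping is needed.
    as.integer(from)
  } else {
    # Different levels: remap each level of `from` to its position in `to`.
    match(levels(from), levels(to))[as.integer(from)]
  }
}

f1 <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))
f2 <- factor(c("c", "a"),      levels = c("a", "b", "c"))  # same levels as f1
f3 <- factor(c("c", "a"),      levels = c("c", "a"))       # different levels

print(match_codes(f2, f1))  # takes the shortcut
print(match_codes(f3, f1))  # takes the remapping path
```

Both calls give the same codes, but only the second one has to build the
remapping vector.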

> More to the point, why store the x integers as character? Can't they be
> kept as integers and the levels<- thing goes away.

It’s only in this simplified example that they’re numbers. In my real
dataset they’re character strings.

-- 
Karl Ove Hufthammer