[datatable-help] Using a data.table to perform a two-way lookup
Matthew Dowle
mdowle at mdowle.plus.com
Thu Apr 14 10:02:08 CEST 2011
On Wed, 2011-04-13 at 04:04 -0700, Karl Ove Hufthammer wrote:
> Thank you for your detailed reply. Comments below:
>
> On Wed, 13 Apr 2011 10:05:56 +0100, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
> >>> options(stringsAsFactors=FALSE)
> >>> dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
> >>> dat
> >>> A <- B <- data.table(dat)
> >>> key(A)="x"
> >>> key(B)="y"
> >>>
> >>> A[B["a"][,x]][,y]
> >>>
> >>> The problem is performance (my real-life data.table is *much*
> >>> larger), since B["a"][,x] outputs a character vector.
> >
> > Not character, B["a"][,x] returns a factor for me.
>
> Did you remember to run the line ‘options(stringsAsFactors=FALSE)’?
> The ‘x’ column in ‘B’ is a character vector when the data.table is
> created from a data.frame with ‘x’ as a character vector (but a
> factor if I create the data.table directly).
I just created the data.table directly. Ok, I'm with you now.
> I have about 150,000 levels in one of the keys and 30,000 in the other.
Thanks. Might be the known issue, but looking at your code it's likely
something more basic.
> I’ve tried to come up with a similar generated dataset and example code.
> The result is much faster than similar code on my real data set, but still
> shows (using ‘Rprof’) that ‘levels<-’ is the main bottleneck, as about one
> third of the time is spent there. Here’s the code. I’ve included both
> a tiny example to show how the function is supposed to work, and a larger
> and slower example (which is still pretty fast, about thirty seconds
> on my computer):
>
> ------------------------------------------------------------------------
I've been looking at the code for 40 minutes. It generates data and runs,
but I can't grasp the big picture. If I were doing iterative
connectedness, I'd just iterate bulk joins until no new connections
turned up. I don't see why there is a processed vector as long as the
table, or why it appears to do just the first y on the first iteration
(why not all the unique y in one go on the first step?), or what the
group column added at the end represents.
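
A minimal sketch of the "iterate bulk joins until no new connections
turn up" idea, under stated assumptions: the edge list, the table names
x2y and y2x, and the helper connected_x() are all hypothetical and not
taken from the original code, and setkey() is used rather than the
key<- assignment form quoted earlier in the thread. Each pass joins every
currently known x in one go, then every resulting y in one go.

```r
library(data.table)

# Hypothetical edge list linking integer x values to character y values.
edges <- data.table(x = c(1L, 1L, 2L, 3L, 4L),
                    y = c("a", "b", "a", "c", "c"))

# Two copies of the same edges, one keyed on each side of the lookup.
x2y <- copy(edges); setkey(x2y, x)
y2x <- copy(edges); setkey(y2x, y)

# Hypothetical helper: all x values connected to `seed` through shared y.
connected_x <- function(seed) {
  xs <- seed
  repeat {
    # Bulk join: every y reachable from the x values found so far.
    ys <- unique(x2y[J(xs), nomatch = 0L]$y)
    # Bulk join back: every x reachable from those y values.
    new_xs <- unique(y2x[J(ys), nomatch = 0L]$x)
    # Stop as soon as a full pass adds nothing new.
    if (length(setdiff(new_xs, xs)) == 0L) break
    xs <- union(xs, new_xs)
  }
  sort(xs)
}

connected_x(1L)  # 1 2
connected_x(3L)  # 3 4
```

The loop ends after at most one pass per newly discovered x value, and
no per-row "processed" vector is needed because the set xs itself
records what has been seen.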
In terms of the levels<-, could I ask for a simpler example isolating
that, please?
Other thoughts:
Why can't all columns of x2y and y2x be factors?
More to the point, why store the x integers as character? Can't they be
kept as integers, so that the levels<- thing goes away? Is it at all
possible you didn't know that x2y[J(c(1,6,8))] joins using the integer
values and doesn't refer to the row numbers?
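
To illustrate the difference between joining on key values and
subsetting by row number, here is a small sketch. The contents of x2y
are made up for illustration (the real table is not shown in the
thread), and setkey() is used to set the key.

```r
library(data.table)

# Hypothetical x2y table; x is kept as integer and is deliberately not 1..n,
# so key values and row numbers cannot be confused.
x2y <- data.table(x = c(1L, 6L, 6L, 8L, 10L),
                  y = c("a", "b", "c", "a", "d"))
setkey(x2y, x)

# Join on key values: returns the rows whose x is 1, 6 or 8.
by_value <- x2y[J(c(1L, 6L, 8L))]

# Plain integer vector in i: returns rows number 1, 3 and 4 instead.
by_row <- x2y[c(1L, 3L, 4L)]

by_value$x  # 1 6 6 8
by_row$x    # 1 6 8
```

With an integer key, the join matches integers directly, so no factor
levels are involved in the lookup.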
Matthew