[datatable-help] Using a data.table to perform a two-way lookup
Karl Ove Hufthammer
karl at huftis.org
Wed Apr 13 13:04:52 CEST 2011
Thank you for your detailed reply. Comments below:
On Wed, 13 Apr 2011 10:05:56 +0100, Matthew Dowle <mdowle at mdowle.plus.com>
wrote:
>>> options(stringsAsFactors=FALSE)
>>> dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
>>> dat
>>> A <- B <- data.table(dat)
>>> key(A)="x"
>>> key(B)="y"
>>>
>>> A[B["a"][,x]][,y]
>>>
>>> The problem is performance (my real-life data.table is *much*
>>> larger), since B["a"][,x] outputs a character vector.
>
> Not character, B["a"][,x] returns a factor for me.
Did you remember to run the line ‘options(stringsAsFactors=FALSE)’?
The ‘x’ column in ‘B’ is a character vector when the data.table is
created from a data.frame with ‘x’ as a character vector (but a
factor if I create the data.table directly).
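For reference, a minimal session illustrating what I mean (behaviour as I
see it with my current data.table version; the class() calls are my own
additions):

------------------------------------------------------------------------
options(stringsAsFactors=FALSE)
library(data.table)
dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
class(data.table(dat)$x)          # "character" -- inherited from the data.frame
class(data.table(x=c("1","2"))$x) # "factor" -- data.table created directly
------------------------------------------------------------------------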
>>> When
>>> this is used in A[...], the character is converted to a factor
>>> with appropriate levels, and it turns out (shown using
>>> 'Rprof') that the majority of the time running the function
>>> is taken up by 'levels<-', i.e., creating this factor /
>>> attaching the levels.
>
> Interesting. I doubt it's converting character to factor (because it's
> already factor), rather matching the levels of the column in the i table to
> the levels of the column in the x table (in this case B). It then
> changes the levels in a working copy of the i column so the binary
> search can work on 2 integer columns where those integers mean the same
> level. There are some performance issues nestled in there which only
> come to light when you have very many levels (e.g. 100,000+ levels). I
> can't remember where it starts to bite. How many levels do you have in
> the real data?
I have about 150,000 levels in one of the keys and 30,000 in the other.
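If I understand the explanation right, the expensive step amounts to
something like the following (my own sketch of the idea, *not*
data.table's actual code):

------------------------------------------------------------------------
# Sketch: remap the integer codes of the i column onto the x column's
# level set, so the binary search can compare two integer columns.
i.col=factor(c("b","a","b"))  # key column of the i table
x.levels=c("a","b","c")       # level set of the x table's key column
new.codes=match(levels(i.col), x.levels)[as.integer(i.col)]
new.codes                     # 2 1 2 -- codes now refer to x.levels
# With 150,000+ levels, the match() over the level strings gets costly.
------------------------------------------------------------------------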
> Short answer then ... you might be hitting a known performance issue. If
> you have some way to generate a large dataset with a large number of
> levels it would be great to have some more test cases to play with.
I’ve tried to come up with a similar generated dataset and example code.
The result is much faster than similar code on my real data set, but still
shows (using ‘Rprof’) that ‘levels<-’ is the main bottleneck, as about
one third of the time is spent there. Here’s the code. I’ve included both
a tiny example to show how the function is supposed to work, and a larger
and slower example (which is still pretty fast, about thirty seconds
on my computer):
------------------------------------------------------------------------
# Don't automatically convert strings to factors ...
options(stringsAsFactors=FALSE)
library(data.table)
library(Hmisc) # For %nin%
## Small example
xy=matrix(c(1,"A",2,"A",2,"B",3,"B",3,"D",4,"B",5,"C",6,"C",7,"D"),
          ncol=2, byrow=TRUE)
xy=as.data.frame(xy)
names(xy)=c("x","y")
xy
# To run the small example, skip the large example below ...
## Large example
# Number of x values
nx=300e3
# Maximum number of unique y values
ny=60e3
# Probability of an x value being connected
# to 1:length(probs) y values
probs=c(40,12,5,3,1,1)
# Sample some x and y values
values=replicate(nx, sample(ny, sample(length(probs), 1, prob=probs)))
# Create the data.frame
xy=data.frame(x=as.character(rep(seq_along(values), times=sapply(values, length))),
              y=as.character(unlist(values)))
# Create lookup data.tables
x2y=data.table(xy)
key(x2y)="x"
y2x=data.table(xy)
key(y2x)="y"
## For each value of y, find the other values of y
## connected to y via x (directly or indirectly),
## (and group them in families)
processed=logical(nrow(xy)) # Indicates which rows we have already processed
group=numeric(nrow(xy))     # Vector giving the group number of each y (or x)
nr=1
while( any(!processed) ) {
  cat(nr, "\n") # Some progress info ...
  # Just grab the first y value not yet processed
  y.current=as.character(y2x$y[which(!processed)[1]])
  # List of connected y values
  y.group=as.list(y.current)
  # Find the y values 'directly' connected to the current values,
  # and repeat the process until no new y values are found
  repeat {
    # Directly connected y values
    y.connected=unique(x2y[y2x[y.current,][,x]]$y)
    # Only the *new* y values discovered
    y.new=y.connected[y.connected %nin% y.group]
    if(length(y.new) == 0) # Stop if we didn't find any new y values
      break
    # Note: Since this is a list, not a vector,
    # growing it has little performance cost
    y.group=c(y.group, y.new)
    # Repeat the process, but with the new y values
    # found as the starting point
    y.current=y.new
  }
  y.group=unlist(y.group)       # Convert to a vector
  used=y2x[y.group, which=TRUE] # Indices of the y values grouped this time
  processed[used]=TRUE          # Mark the rows as processed
  group[used]=nr                # Add the group numbers
  nr=nr+1                       # Increase the group number for the next iteration
}
max(group) # Number of groups
res=cbind(y2x, group) # Add the group numbers to the data
res # Show the results
------------------------------------------------------------------------
Note that it is possible to improve the speed a bit (at the cost of making
the code more difficult to read) by using numeric values and levels()[...]
instead of (implicit or explicit) as.character coercion inside the ‘repeat’
loop, but most of the time is still taken up by ‘levels<-’.
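To show the trick in isolation (a self-contained toy example, added here
for the archives; timings will of course vary):

------------------------------------------------------------------------
# Fast factor-to-character conversion: index the (short) vector of
# levels by the integer codes instead of coercing every element.
f=factor(sample(letters, 1e6, replace=TRUE))
system.time(a <- as.character(f))          # coerces each element
system.time(b <- levels(f)[as.integer(f)]) # indexes 26 levels instead
identical(a, b)                            # TRUE
------------------------------------------------------------------------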
>>> I also wonder if it is possible for a data.table to only
>>> return unique values of a column? For the above example
>>> I would like the output y=a,b.
>>> Note that for instance
>>> A[B["a"][,x],mult="first"][,y]
>>> does not work, as this returns the first value of ‘y’ for
>>> each x (here y=a,a).
> How about :
> A[J(B["a"][,x])][,unique(y)]
Thanks. I’m still getting used to the data.table syntax.
It works fine, but offers no speed benefit over just
running ‘unique()’ on the result, so I guess which syntax
to use just depends on what you think looks best.
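For the archives, the two equivalent forms on the small example from the
top of the thread:

A[J(B["a"][,x])][,unique(y)]   # unique() inside j
unique(A[J(B["a"][,x])][,y])   # unique() on the returned column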
>>> My last question is why
>>>
>>> A[B["a"][,x], y, mult="all"]
>>>
>>> returns a two-column data.table/data.frame, while
>>>
>>> A[B["a"][,x], y, mult="first"]
>>>
>>> returns a character vector. I would expect both of
>>> them to return a character vector. Is this a feature
>>> or simply a bug?
[snip]
> With all the above in mind, mult="first" and mult="last" are not
> considered grouping. In those cases the table is subset by i, then
> j is run once on the result. That's when j=y vs j=list(y)
> makes a difference. mult="all" is different
> because it has a dual meaning: subset and grouping.
OK. It’s at least somewhat clearer to me now. :)
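To check my understanding against the small A/B tables from the start of
the thread (my own test lines, based on the behaviour described above):

A[B["a"][,x], y, mult="first"]       # j=y run once on the subset: vector
A[B["a"][,x], y, mult="all"]         # i also groups: two-column data.table
A[B["a"][,x], list(y), mult="first"] # list(y) forces a data.table here too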
Thank you.
--
Karl Ove Hufthammer