[datatable-help] Using a data.table to perform a two-way lookup
Karl Ove Hufthammer
karl at huftis.org
Wed Apr 13 13:04:52 CEST 2011
Thank you for your detailed reply. Comments below:
On Wed, 13 Apr 2011 10:05:56 +0100, Matthew Dowle <mdowle at mdowle.plus.com>
wrote:
>>> options(stringsAsFactors=FALSE)
>>> dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
>>> dat
>>> A <- B <- data.table(dat)
>>> key(A)="x"
>>> key(B)="y"
>>>
>>> A[B["a"][,x]][,y]
>>>
>>> The problem is performance (my real-life data.table is *much*
>>> larger), since B["a"][,x] outputs a character vector.
>
> Not character, B["a"][,x] returns a factor for me.
Did you remember to run the line ‘options(stringsAsFactors=FALSE)’?
The ‘x’ column in ‘B’ is a character vector when the data.table is
created from a data.frame with ‘x’ as a character vector (but a
factor if I create the data.table directly).
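For reference, a minimal session illustrating what I mean (behaviour as I
see it with my current data.table version; the class() calls are my own
additions):

------------------------------------------------------------------------
options(stringsAsFactors=FALSE)
library(data.table)
dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
class(data.table(dat)$x)          # "character" -- inherited from the data.frame
class(data.table(x=c("1","2"))$x) # "factor" -- data.table created directly
------------------------------------------------------------------------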
>>> When
>>> this is used in A[...], the character is converted to a factor
>>> with appropriate levels, and it turns out (shown using
>>> 'Rprof') that the majority of the time running the function
>>> is taken up by 'levels<-', i.e., creating this factor /
>>> attaching the levels.
>
> Interesting. I doubt it's converting character to factor (because it's
> already factor), rather matching the levels of the column in the i table to
> the levels of the column in the x table (in this case B). It then
> changes the levels in a working copy of the i column so the binary
> search can work on 2 integer columns where those integers mean the same
> level. There are some performance issues nestled in there which only
> come to light when you have very many levels (e.g. 100,000+ levels). I
> can't remember where it starts to bite. How many levels do you have in
> the real data?
I have about 150,000 levels in one of the keys and 30,000 in the other.
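If I understand the explanation right, the expensive step amounts to
something like the following (my own sketch of the idea, *not*
data.table's actual code):

------------------------------------------------------------------------
# Sketch: remap the integer codes of the i column onto the x column's
# level set, so the binary search can compare two integer columns.
i.col=factor(c("b","a","b"))  # key column of the i table
x.levels=c("a","b","c")       # level set of the x table's key column
new.codes=match(levels(i.col), x.levels)[as.integer(i.col)]
new.codes                     # 2 1 2 -- codes now refer to x.levels
# With 150,000+ levels, the match() over the level strings gets costly.
------------------------------------------------------------------------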
> Short answer then ... you might be hitting a known performance issue. If
> you have some way to generate a large dataset with a large number of
> levels it would be great to have some more test cases to play with.
I’ve tried to come up with a similar generated dataset and example code.
The result is much faster than similar code on my real data set, but still
shows (using ‘Rprof’) that ‘levels<-’ is the main bottleneck, as about
one third of the time is spent there. Here’s the code. I’ve included both
a tiny example to show how the function is supposed to work, and a larger
and slower example (which is still pretty fast, about thirty seconds
on my computer):
------------------------------------------------------------------------
# Don't automatically convert strings to factors ...
options(stringsAsFactors=FALSE)
library(data.table)
library(Hmisc) # For %nin%
## Small example
xy=matrix(c(1,"A",2,"A",2,"B",3,"B",3,"D",4,"B",5,"C",6,"C",7,"D"),
          ncol=2, byrow=TRUE)
xy=as.data.frame(xy)
names(xy)=c("x","y")
xy
# To run the small example, skip the large example below ...
## Large example
# Number of x values
nx=300e3
# Maximum number of unique y values
ny=60e3
# Probability of an x value being connected
# to 1:length(probs) y values
probs=c(40,12,5,3,1,1)
# Sample some x and y values
values=replicate(nx, sample(ny, sample(length(probs), 1, prob=probs)))
# Create the data.frame
xy=data.frame(x=as.character(rep(seq_along(values), times=sapply(values, length))),
              y=as.character(unlist(values)))
# Create lookup data.tables
x2y=data.table(xy)
key(x2y)="x"
y2x=data.table(xy)
key(y2x)="y"
## For each value of y, find the other values of y
## connected to y via x (directly or indirectly),
## (and group them in families)
processed=logical(nrow(xy)) # Indicates which rows we have already processed
group=numeric(nrow(xy))     # Vector giving the group number of each y (or x)
nr=1
while( any(!processed) ) {
  cat(nr, "\n") # Some progress info ...
  # Just grab the first y value not yet processed
  y.current=as.character(y2x$y[which(!processed)[1]])
  # List of connected y values
  y.group=as.list(y.current)
  # Find the y values 'directly' connected to the current values,
  # and repeat the process until no new y values are found
  repeat {
    # Directly connected y values
    y.connected=unique(x2y[y2x[y.current,][,x]]$y)
    # Only the *new* y values discovered
    y.new=y.connected[y.connected %nin% y.group]
    if(length(y.new) == 0) # Stop if we didn't find any new y values
      break
    # Note: Since this is a list, not a vector,
    # growing it has little performance cost
    y.group=c(y.group, y.new)
    # Repeat the process, but with the new y values
    # found as the starting point
    y.current=y.new
  }
  y.group=unlist(y.group)       # Convert to a vector
  used=y2x[y.group, which=TRUE] # Indices of the y values grouped this time
  processed[used]=TRUE          # Mark the rows as processed
  group[used]=nr                # Add the group numbers
  nr=nr+1                       # Increase the group number for the next iteration
}
max(group) # Number of groups
res=cbind(y2x, group) # Add the group numbers to the data
res # Show the results
------------------------------------------------------------------------
Note that it is possible to improve the speed a bit (at the cost of making
the code more difficult to read) by using numeric values and levels()[...]
instead of (implicit or explicit) as.character coercion inside the ‘repeat’
loop, but most of the time is still taken up by ‘levels<-’.
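To show the trick in isolation (a self-contained toy example, added here
for the archives; timings will of course vary):

------------------------------------------------------------------------
# Fast factor-to-character conversion: index the (short) vector of
# levels by the integer codes instead of coercing every element.
f=factor(sample(letters, 1e6, replace=TRUE))
system.time(a <- as.character(f))          # coerces each element
system.time(b <- levels(f)[as.integer(f)]) # indexes 26 levels instead
identical(a, b)                            # TRUE
------------------------------------------------------------------------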
>>> I also wonder if it is possible for a data.table to only
>>> return unique values of a column? For the above example
>>> I would like the output y=a,b.
>>> Note that for instance
>>> A[B["a"][,x],mult="first"][,y]
>>> does not work, as this returns the first value of ‘y’ for
>>> each x (here y=a,a).
> How about :
> A[J(B["a"][,x])][,unique(y)]
Thanks. I’m still getting used to the data.table syntax.
It works fine, but offers no speed benefit over just
running ‘unique()’ on the result, so I guess which syntax
to use just depends on what you think looks best.
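For the archives, the two equivalent forms on the small example from the
top of the thread:

A[J(B["a"][,x])][,unique(y)]   # unique() inside j
unique(A[J(B["a"][,x])][,y])   # unique() on the returned column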
>>> My last question is why
>>>
>>> A[B["a"][,x], y, mult="all"]
>>>
>>> returns a two-column data.table/data.frame, while
>>>
>>> A[B["a"][,x], y, mult="first"]
>>>
>>> returns a character vector. I would expect both of
>>> them to return a character vector. Is this a feature
>>> or simply a bug?
[snip]
> With all the above in mind, mult="first" and mult="last" are not
> considered grouping. In those cases the table is subset by i, then
> j is run once on the result. That's when j=y vs j=list(y)
> makes a difference. mult="all" is different
> because it has a dual meaning: subset and grouping.
OK. It’s at least somewhat clearer to me now. :)
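To check my understanding against the small A/B tables from the start of
the thread (my own test lines, based on the behaviour described above):

A[B["a"][,x], y, mult="first"]       # j=y run once on the subset: vector
A[B["a"][,x], y, mult="all"]         # i also groups: two-column data.table
A[B["a"][,x], list(y), mult="first"] # list(y) forces a data.table here too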
Thank you.
--
Karl Ove Hufthammer