[datatable-help] Using a data.table to perform a two-way lookup

Matthew Dowle mdowle at mdowle.plus.com
Wed Apr 13 11:05:56 CEST 2011


Hi Karl,
For a while the "1" and "2" (integers stored as character) confused me.
I thought you wanted a,b because that was the value of A[1:2,x]. Now I'm
over that, comments inserted below ...

On Tue, 2011-04-12 at 03:53 -0700, Short, Tom wrote:
> > -----Original Message-----
> > From: datatable-help-bounces at r-forge.wu-wien.ac.at 
> > [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On 
> > Behalf Of Karl Ove Hufthammer
> >
> > I have found one way of achieving this, creating two 
> > identical data.tables with different keys:
> > 
> >   options(stringsAsFactors=FALSE)
> >   dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
> >   dat
> >   A <- B <- data.table(dat)
> >   key(A)="x"
> >   key(B)="y"
> > 
> >   A[B["a"][,x]][,y]
> > 
> > The problem is performance (my real-life data.table is *much* 
> > larger), since B["a"][,x] outputs a character vector.

Not character, B["a"][,x] returns a factor for me.

> > When 
> > this is used in A[...], the character is converted to a factor 
> > with appropriate levels, and it turns out (shown using 
> > 'Rprof') that the majority of the time running the function 
> > is taken up by 'levels<-', i.e., creating this factor / 
> > attaching the levels.

Interesting. I doubt it's converting character to factor (because it's
already factor), rather matching the levels of column in the i table to
the levels of the column in the x table (in this case B). It then
changes the levels in a working copy of the i column so the binary
search can work on 2 integer columns where those integers mean the same
level. There are some performance issues nestled in there which only
come to light when you have very many levels (e.g. 100,000+ levels). I
can't remember where it starts to bite. How many levels do you have in
the real data?

I've done quite a bit of work on allowing keys to be character. The need
to match levels to levels then goes away and saves some copies too.
Sorting character is very slow (which is why data.table likes integers
of course) but to solve that we now have a counting sort for character.
That C code is committed but not hooked up yet.

Short answer then ... you might be hitting a known performance issue. If
you have some way to generate the large dataset with large number of
levels it would be great to have some more test cases to play with.
More comments below ...

> > 
> > I believe one potential solution would be to have both 'x' 
> > and 'y' being factors, so that there is no conversion to/from 
> > characters. This would eliminate both the conversion '"a" to 
> > factor' and 'B["a"][,x] to factor'.
> > However, 'data.table' doesn't accept 'i' being a factor (and 
> > if I convert it to the internal numeric codes, it thinks I 
> > mean row numbers).
Yes, that means row numbers, don't do that! :-)
> > 
> > Any suggestions on how to solve this?
> 
> To answer part of your inquiry, you can use factors by enclosing i with
> J() as follows:
> 
> options(stringsAsFactors=TRUE)
> dat=data.frame(x=c("1","1","2","3"), y=c("a","b","a","c"))
> A <- B <- data.table(dat)
> key(A)="x"
> key(B)="y"
> A[J(B["a"][,x])][,y]
Only slight improvement on that for speed I can think of is :
  A[J(B["a"]$x)]$y
but that would only make a difference if this is being repeated many
times. Karl's example seems like a very rare but valid reason to iterate
the query.

>> I also wonder if it is possible for a data.table to only
>> return unique values of a column? For the above example
>> I would like the output y=a,b.
>> Note that for instance
>>  A[B["a"][,x],mult="first"][,y]
>> does not work, as this returns the first value of ‘y’ for
>> each x (here y=a,a).
How about :
A[J(B["a"][,x])][,unique(y)]
>>
>> My last question is why
>>
>>  A[B["a"][,x], y, mult="all"]
>>
>> returns a two-column data.table/data.frame, while
>>
>>  A[B["a"][,x], y, mult="first"]
>>
>> returns a character vector. I would expect both of
>> them to return a character vector. Is this a feature
>> or simply a bug?
Try FAQs (latest on homepage): 2.2 and 2.17
Also ...
It seems strange perhaps because "a" is a single group.
It may have been a variable containing "a" or c("a","b")
and in data.table we prefer consistent return types
regardless of the data passed in. [Background: I dislike
having to remember to write drop=FALSE in [.data.frame and
believe that is the source of many bugs in user code.]
In general the rule is that the type of j is the type of
the result. When j is a vector the result will be vector,
and if you want a data.table, then wrap with list(). [As per FAQ 1.3].
*Except* when grouping (either by i or by by) since you
are unlikely to want to lose which items in the result
correspond to which groups. That is tightly coupled to
.SD (now) excluding the grouping columns.
With all the above in mind, mult="first" and mult="last" is not
considered grouping. In those cases the table is subset by i, then
j is run once on the result. Thats when j=y vs j=list(y)
makes a difference.  mult="all" is different
because it has a dual meaning: subset and grouping.

All that said, it has been tricky to get right. If anyone can see a
better way to do it then please do suggest.

Matthew


> 
> - Tom
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list