[datatable-help] Question on data.table::chmatch and fastmatch::fmatch

Matthew Dowle mdowle at mdowle.plus.com
Wed Jan 30 01:16:15 CET 2013


On 29.01.2013 17:46, stat quant wrote:
> Hello all,
> I have a lot of character columns in my data.table (usually only a few
> factors like: 5e6 values drawn from {"A","B","C","D"}).
> Looking at pages 7-8 of the package vignette, M. Dowle mentions that:
>
> * the package fastmatch is a faster alternative for string lookups;
> using fastmatch::fmatch will build a hash map and will speed things up
> considerably
> * ...but points out that the first pass is less efficient (compared to
> data.table::chmatch)
> * and finishes by saying that he suggested to Simon Urbanek (the
> fastmatch package maintainer) that he adopt chmatch for the first call.
>
> I have a few questions regarding data.table/fastmatch:
>
> * if I use something like DT[ fmatch(X,"A"),...], shall I expect
> lightning-quick subsequent selects? I mean, would DT[
> fmatch(X,c("B","D")),...] be much quicker (the select part of it)?

Yes, it should be faster, because it's a single hash lookup rather than
a binary search. But remember that binary search is for finding groups
of tied values, across multiple columns too. When the binary search
finds group "FOO" in the first key column, the search in the 2nd column
is confined to the "FOO" group rather than the entire 2nd key column,
and so on. I'm not sure fmatch could do that. The more columns, the more
unique combinations, and the larger the hash table needs to be. For a
single-column unique key, fmatch may be a better choice if many repeated
lookups are needed. data.table's ordered multi-column keys aim to solve
a different problem.
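
A minimal sketch of the two kinds of lookup described above, assuming a
small made-up table (the column names g1, g2, v and the values are
hypothetical, chosen only for illustration). The keyed join narrows the
search to the "FOO" group before looking at the second key column, while
chmatch answers a plain single-column "where is this string?" question.

```r
library(data.table)

# Hypothetical example data with two candidate key columns
DT <- data.table(
  g1 = rep(c("BAR", "FOO"), each = 4),
  g2 = rep(c("a", "b"), times = 4),
  v  = 1:8
)

# Sort by g1, then by g2 within each g1 group, and mark as key
setkey(DT, g1, g2)

# Binary search on both key columns: find group "FOO" in g1 first,
# then search for "b" only within that group's rows of g2
res <- DT[.("FOO", "b")]

# Single-column lookup: position of each string in a table of
# unique values (NA where there is no match)
pos <- chmatch(c("FOO", "BAZ"), unique(DT$g1))
```

Here res holds the rows of the ("FOO", "b") group, and pos gives the
position of "FOO" among the unique values of g1, with NA for the
unmatched "BAZ".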

At least, that's my current thinking. It's something I might be wrong
about. It's best to test it yourself. Unexpected speed differences may
of course just be bugs, too, so please continue to report anything you
find.
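
One way to run such a test is sketched below, assuming both packages are
installed. The data (keys, lookups) and the sizes are invented for
illustration, and the timings will vary by machine and package version,
so no particular result is implied. Note that fmatch stores its hash on
the table object, so the same object (keys here) has to be passed again
for the second call to reuse it.

```r
library(data.table)
library(fastmatch)

set.seed(1)
n <- 1e5
# Hypothetical unique string keys and a sample of them to look up
keys    <- sprintf("id%06d", seq_len(n))
lookups <- sample(keys, 1e4)

# Keyed data.table for the binary-search variant
DT <- data.table(id = keys, v = seq_len(n), key = "id")

# First fmatch call builds the hash and attaches it to `keys`
t_first <- system.time(p1 <- fmatch(lookups, keys))
# Second call can reuse the stored hash
t_again <- system.time(p2 <- fmatch(lookups, keys))
# chmatch on the same inputs
t_ch    <- system.time(p3 <- chmatch(lookups, keys))
# Binary search join on the single-column key
t_bin   <- system.time(r <- DT[.(lookups)])

# Compare elapsed times (seconds) for the four approaches
print(rbind(fmatch_first = t_first, fmatch_again = t_again,
            chmatch = t_ch, keyed_join = t_bin)[, "elapsed"])
```

All four approaches should locate the same rows; only the time taken is
expected to differ.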

> * Are M.D or Simon Urbanek planning to use one another's code to
> enhance both packages?

There aren't any plans at the moment. You would need to persuade one or
both of us to act by presenting benchmarks or detailing use cases that
would benefit significantly. Or submit a patch to either package. But
they can both be used together already, of course, thanks to R's
package system.

>
> Thanks for reading
> Regards



