[datatable-help] Question on data.table::chmatch and fastmatch::fmatch
Matthew Dowle
mdowle at mdowle.plus.com
Wed Jan 30 01:16:15 CET 2013
On 29.01.2013 17:46, stat quant wrote:
> Hello all,
> I have a lot of character columns in my data.table (usually only a few
> factors like: 5e6 values drawn from {"A","B","C","D"}).
> Looking at pages 7-8 of the package vignette, M. Dowle mentions that:
>
> * the package fastmatch is a faster alternative for string lookups;
> using fastmatch::fmatch will build a hash map and will speed things up
> considerably
> * ...but he points out that the first pass is less efficient (compared to
> data.table::chmatch)
> * and finishes by saying that he suggested that Simon Urbanek (the
> fastmatch package maintainer) adopt chmatch for the first call.
>
> I have a few questions regarding data.table/fastmatch:
>
> * if I use something like DT[ fmatch(X,"A"),...], shall I expect
> lightning-quick subsequent selects? I mean, would DT[
> fmatch(X,c("B","D")),...] be much quicker (the select part of it)?
Yes, it should be faster, because it's a single hash lookup rather than
a binary search. But remember that binary search is for finding groups of
tied values, in multiple columns too. When the binary search finds group
"FOO" in the first key column, the binary search for the 2nd column is
just within the "FOO" group, not the entire 2nd key column, and so on.
I'm not sure fmatch could do that. The more columns, the more unique
combinations, and the larger the hash table needs to be. For a
single-column unique key, fmatch may be a better choice if many repeated
lookups are needed. data.table's ordered multi-column keys aim to solve
a different problem.
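To make the contrast above concrete, here is a minimal sketch in R. The
table, column names (id, grp, sub, v) and query values are made up for
illustration; only fmatch(), chmatch(), setkey() and the keyed join are
real package functions. It assumes both packages are installed.

```r
library(data.table)
library(fastmatch)

# Hypothetical toy table with a unique single-column key (ids).
n   <- 1e5
ids <- sprintf("id%06d", seq_len(n))
DT  <- data.table(id  = ids,
                  grp = rep(c("FOO", "BAR"), length.out = n),
                  sub = rep(1:5, length.out = n),
                  v   = seq_len(n))

query <- c("id000042", "id099999", "missing")

# Hash lookup on a single unique column: positions of `query` in `ids`.
# fmatch() can reuse the hash it attaches to `ids` on later calls.
pos_f <- fmatch(query, ids)
# chmatch() should give the same positions for character vectors.
pos_c <- chmatch(query, ids)
stopifnot(identical(pos_f, pos_c))
hits <- DT[pos_f[!is.na(pos_f)]]

# Multi-column key: the search on `sub` happens within the "FOO" group only.
setkey(DT, grp, sub)
foo3 <- DT[.("FOO", 3L)]
```

The first part is the repeated single-key lookup case where fmatch may
pay off; the second is the grouped, multi-column case that keys are
designed for.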
At least, that's my current thinking. It's something I might be wrong
about. Best to test yourself. Unexpected speed differences may of
course just be unintended bugs, too, so please continue to report if you
find anything.
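One rough way to test this yourself is sketched below. The data sizes
and variable names are arbitrary assumptions, and system.time() gives
only a coarse measurement, so treat the numbers as indicative at best
and rerun on your own data.

```r
library(data.table)
library(fastmatch)

set.seed(1)
# Hypothetical data: 1e6 unique strings, and 1e6 lookups drawn from them.
table_vals <- sprintf("k%07d", seq_len(1e6))
x <- sample(table_vals, 1e6, replace = TRUE)

t_match <- system.time(r1 <- match(x, table_vals))
t_ch    <- system.time(r2 <- chmatch(x, table_vals))
t_f1    <- system.time(r3 <- fmatch(x, table_vals))  # first call builds the hash
t_f2    <- system.time(r4 <- fmatch(x, table_vals))  # repeat call can reuse it

# All four approaches should return identical positions.
stopifnot(identical(r1, r2), identical(r1, r3), identical(r1, r4))

timings <- c(match         = t_match[["elapsed"]],
             chmatch       = t_ch[["elapsed"]],
             fmatch_first  = t_f1[["elapsed"]],
             fmatch_repeat = t_f2[["elapsed"]])
print(timings)
```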
> * Are M.D. or Simon Urbanek planning to use one another's code to enhance
> both packages?
There aren't any plans at the moment. You would need to convince one or
both of us into action by presenting benchmarks or detailing use cases
that would benefit significantly. Or submit a patch to either package.
But they can both be used together already of course thanks to R's
package system.
>
> Thanks for reading
> Regards