[datatable-help] matching for both equality and greater than

Matthew Dowle mdowle at mdowle.plus.com
Tue Jul 26 19:35:55 CEST 2011


Hi,

Good question. The vector scan B>=2 should be quite quick provided it
follows the DT["a"]. It will be a vector scan, yes, but only over a
small subset of DT$B, and that subset will be contiguous in memory. What
might be biting instead is that the chained query DT["a"][B>=2] will
subset all the columns of DT in the first []. That inefficiency could be
dominating depending on how many columns DT has vs how many you really
need to use. If that's the case you can speed it up a lot like this:
    DT["a", list(columns I know I need)][B>=2, expression using those columns]
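
To make that concrete, here is a small sketch on made-up data. The table
contents and the extra columns C and D are invented for illustration and are
not Steve's data. It only shows that the narrower first subset gives the same
answer as the plain chained query; any speed difference would only show up on
a much wider table.

```r
library(data.table)

# Toy table; C and D are hypothetical stand-ins for "the other columns".
DT = data.table(A = c("a", "a", "a", "a", "b", "b"),
                B = c(1, 2, 2, 5, 1, 3),
                C = 1:6,
                D = letters[1:6])
setkey(DT, A)

# Plain chained query: the first [] subsets every column for the "a" rows.
res_all = DT["a"][B >= 2, sum(C)]

# Narrower version: the first [] only pulls out the columns used afterwards.
res_narrow = DT["a", list(B, C)][B >= 2, sum(C)]
```

Both results should agree, since the second [] sees the same B and C values
either way.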

Or, on a different tack, to go as fast as possible (as requested),
perhaps (untested) :

setkey(DT,A,B)
start = DT[J("a",2),which=TRUE,mult="first"]   # first row with A=="a", B==2
end = DT["a",which=TRUE,mult="last"]           # last row with A=="a"
DT[start:end, ...]
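
A runnable sketch of that idea on a toy table (the data is made up, with "a"
as the value being looked up). One assumption worth flagging: as far as I
know an exact lookup that finds nothing returns NA with which=TRUE, so the
sketch guards against that before building start:end.

```r
library(data.table)

# Toy data, invented for illustration.
DT = data.table(A = c("a", "a", "a", "a", "b", "b"),
                B = c(1, 2, 2, 5, 1, 3),
                C = 1:6)
setkey(DT, A, B)

# First row where A == "a" and B == 2 (exact match on both key columns).
start = DT[J("a", 2), which = TRUE, mult = "first"]
# Last row of the "a" group (join on the first key column only).
end = DT["a", which = TRUE, mult = "last"]

# Guard: if either lookup found nothing, return zero rows instead of NA:NA.
res = if (is.na(start) || is.na(end)) DT[0] else DT[start:end]
```

Because the table is sorted by the key, start:end is a contiguous block and
no vector scan of B is needed.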

Or, getting fancy now in one less step (again, untested) :

w = DT[J("a",c(2,Inf)),which=TRUE,roll=TRUE]
DT[w[1]:w[2],...]

but that only works if you know 2 exists in B, and that there are no
duplicates of 2. If 2 is missing, roll=TRUE falls back to the last B below 2,
so possibly check DT$B[w[1]]>=2 and add 1 to w[1] if not.
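
Here is one way the check could be folded in, as a sketch only. The helper
name rows_at_least and its arguments are made up for this example, and adding
mult="first" is my own assumption for coping with duplicates of the lower
bound. It also assumes the key is (A,B) and that a lookup below every B in the
group comes back as NA.

```r
library(data.table)

# Toy data, invented for illustration.
DT = data.table(A = c("a", "a", "a", "a", "b", "b"),
                B = c(1, 2, 2, 5, 1, 3))
setkey(DT, A, B)

# Hypothetical helper: row numbers where A == key_value and B >= low.
rows_at_least = function(DT, key_value, low) {
  # One rolling lookup for both ends; mult="first" keeps one row per lookup
  # even when low is duplicated in B.
  w = DT[J(key_value, c(low, Inf)), which = TRUE, roll = TRUE, mult = "first"]
  if (is.na(w[2])) return(integer(0))   # key_value not present in A
  if (is.na(w[1])) {
    # low is below every B in the group: start at the group's first row.
    w[1] = DT[J(key_value), which = TRUE, mult = "first"]
  } else if (DT$B[w[1]] < low) {
    # low was not found and the lookup rolled back to a smaller B: step forward.
    w[1] = w[1] + 1L
  }
  if (w[1] > w[2]) integer(0) else w[1]:w[2]
}

rows_exact  = rows_at_least(DT, "a", 2)   # 2 exists (twice) in B
rows_rolled = rows_at_least(DT, "a", 3)   # 3 does not exist in B
```

Usage would then be DT[rows_at_least(DT, "a", 2)], with the helper called
once per value being looked up.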

Much neater would be the feature request (FR) for range (i.e. between) queries:

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=203&group_id=240&atid=978

So, if list() columns were easily created as per previous threads,
it might be simply :

setkey(DT,A,B)
DT[J("a",V(low,upp)),...]

where V() stands for vector, and would create a list() join column. Open and closed
ends can be done via +/-1 to low and upp. One sided via setting low to -Inf or
upp to +Inf.  That idiom might allow some funky queries such as a different range for
each row of i, efficiently both in terms of amount of code, and execution speed.
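
For comparison, and not as the V() syntax sketched above: later data.table
releases (1.9.8 onwards, as far as I know) support non-equi joins through the
on= argument, which cover much the same ground, including a different range
for each row of i. A minimal sketch on invented data, where the table ranges
and its columns low and upp are hypothetical names:

```r
library(data.table)

# Toy data, invented for illustration.
DT = data.table(A = c("a", "a", "a", "a", "b", "b"),
                B = c(1, 2, 2, 5, 1, 3),
                C = 1:6)

# One range per row of i: "a" with B in [2, 5], "b" with B in [0, 1].
ranges = data.table(A = c("a", "b"), low = c(2, 0), upp = c(5, 1))

# Non-equi join: equality on A, closed range on B. which=TRUE returns the
# matching row numbers of DT rather than the joined rows.
rows = DT[ranges, on = .(A, B >= low, B <= upp), which = TRUE]
res = DT[sort(rows)]
```

Asking for row numbers and then subsetting keeps B at its original values in
the result.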

Matthew


On Mon, 2011-07-25 at 20:46 -0700, Steve Harman wrote:
> Hello All,
> 
> I have a data table, DT, and two columns, A and B. A has character
> values and B has numeric values.
> I need to find the rows matching "a" AND greater than or equal to 2.
> After setkey(DT,A), I am using DT["a"][B>=2].
> 
> However, since this command needs to be repeated many times for many
> different values,
> I would like it to be as fast as possible.
> 
> If I had to test for equality for both variables, then I would use
> setkey(DT,A,B) followed by DT[J("A",2)]. However, the second condition
> is greater than or equal to, which, results in slower execution
> compared to matching for equality for both variables.
> 
> I wanted to direct this question to the list to take advantage of any
> speed improvement that can be possible and I might be missing. Thank
> you very much in advance.
> 
> Steve




