[datatable-help] matching for both equality and greater than

Steve Harman stvharman at gmail.com
Sat Jul 30 17:23:43 CEST 2011


Matthew,

I experimented with them and the one that worked best
for me was this:


setkey(DT,A,B)
start = DT[J("A",2),which=TRUE,mult="first"]
end = DT["A",which=TRUE,mult="last"]
DT[start:end, ...]
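
For anyone following along, here is a minimal self-contained sketch of that approach. The toy data, the column v, and the names res and chk are made up for illustration, and it assumes the value 2 actually occurs in B for the group (otherwise start comes back NA). I use lower-case "a" as in my original question, and J("a") for the group lookup.

```r
library(data.table)

# Toy data, invented for this example: A is character, B is numeric.
DT <- data.table(A = c("a", "a", "a", "a", "b", "b"),
                 B = c(1, 2, 2, 5, 1, 3),
                 v = 1:6)
setkey(DT, A, B)   # sort by A, then by B within A

# Row number of the first row with A == "a" and B == 2
# (NA if no such row exists, so check before using it).
start <- DT[J("a", 2), which = TRUE, mult = "first"]

# Row number of the last row in the A == "a" group.
end <- DT[J("a"), which = TRUE, mult = "last"]

# Because the table is sorted on (A, B), rows start:end are exactly
# the rows with A == "a" and B >= 2.
res <- DT[start:end]

# Cross-check against the plain vector scan.
chk <- DT[A == "a" & B >= 2]
stopifnot(identical(res$v, chk$v))
```

The row-range subset only relies on the key ordering, so it avoids scanning B at all.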

Thanks for the suggestions!

Steve

On Tue, Jul 26, 2011 at 1:35 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> Hi,
>
> Good question. The vector scan B>=2 should be quite quick provided it
> follows the DT["a"]. It will be a vector scan, yes, but only over a
> small subset of DT$B, and that subset will be contiguous in memory. What
> might be biting instead is that the chained query DT["a"][B>=2] will
> subset all the columns of DT in the first []. That inefficiency could be
> dominating depending on how many columns DT has vs how many you really
> need to use. If that's the case you can speed it up a lot like this :
>    DT["a",list(columns I know I need)][B>=2, expression using those
> columns]
>
> Or, on a different tack, to go as fast as possible (as requested),
> perhaps (untested) :
>
> setkey(DT,A,B)
> start = DT[J("A",2),which=TRUE,mult="first"]
> end = DT["A",which=TRUE,mult="last"]
> DT[start:end, ...]
>
> Or, getting fancy now in one less step (again, untested) :
>
> w = DT[J("A",c(2,Inf)),which=TRUE,roll=TRUE]
> DT[w[1]:w[2],...]
>
> but that only works if you know 2 exists in B, and that there are no
> duplicates of 2. Possibly check DT$B[w[1]]>=2 and +1 to w[1] if not.
>
> Much neater would be the FR to do range (i.e. between) queries :
>
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=203&group_id=240&atid=978
>
> So, if list() columns were easily created as per previous threads,
> it might be simply :
>
> setkey(DT,A,B)
> DT[J("a",V(low,upp)),...]
>
> where V() stands for vector, and would create a list() join column. Open
> and closed ends can be done via +/-1 to low and upp. One sided via setting
> low to -Inf or upp to +Inf. That idiom might allow some funky queries such
> as a different range for each row of i, efficiently both in terms of amount
> of code, and execution speed.
>
> Matthew
>
>
> On Mon, 2011-07-25 at 20:46 -0700, Steve Harman wrote:
> > Hello All,
> >
> > I have a data table, DT, and two columns, A and B. A has character
> > values and B has numeric values.
> > I need to find the rows where A matches "a" AND B is greater than or
> > equal to 2.
> > After setkey(DT,A), I am using DT["a"][B>=2].
> >
> > However, since this command needs to be repeated many times for many
> > different values,
> > I would like it to be as fast as possible.
> >
> > If I had to test for equality for both variables, then I would use
> > setkey(DT,A,B) followed by DT[J("A",2)]. However, the second condition
> > is greater than or equal to, which results in slower execution
> > compared to matching for equality on both variables.
> >
> > I wanted to direct this question to the list to take advantage of any
> > speed improvement that can be possible and I might be missing. Thank
> > you very much in advance.
> >
> > Steve
> >