[datatable-help] between() versus %between% - why different results?

Eduard Antonyan eduard.antonyan at gmail.com
Mon Oct 7 20:31:30 CEST 2013


This is because `x %between% y` works by calling `between(x, y[1], y[2])`,
so your call becomes:

   dt[date %between% c(start, end)]  ---->
   dt[between(date, c(start, end)[1], c(start, end)[2])]

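In your joined table the first two elements of c(start, end) are both 1, so
the filter is effectively between(date, 1, 1), which is why only the
date == 1 rows come back.

I don't know if there is anything that can be done about it (aside from not
using the operator version with vectors).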
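
To make the difference concrete, here is a minimal sketch in base R. The
names between_sketch and %between_sketch% are hypothetical stand-ins written
to mirror the behaviour described above; they are not copied from the
data.table source, and the three vectors are made-up illustration data.

```r
# Hypothetical stand-ins that mirror the behaviour described above;
# they are assumptions for illustration, not the data.table source.
between_sketch <- function(x, lower, upper) x >= lower & x <= upper
`%between_sketch%` <- function(x, y) between_sketch(x, y[1], y[2])

# Made-up data: one date per row, each with its own start and end.
date  <- c(1, 2, 5, 9)
start <- c(1, 4, 4, 7)
end   <- c(3, 5, 5, 10)

# Function form: lower and upper are compared element by element.
by_row <- between_sketch(date, start, end)

# Operator form: only c(start, end)[1] and c(start, end)[2] are used,
# i.e. start[1] and start[2], so every date is tested against [1, 4].
by_first_two <- date %between_sketch% c(start, end)

print(by_row)
print(by_first_two)
```

With per-row bounds, the function form (or the explicit >= and <=
comparisons) is the one that tests each row against its own start and end.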


On Sun, Oct 6, 2013 at 5:29 PM, drclark <clark9876 at airquality.dk> wrote:

> Dear data.table experts,
>
> I was inspired by the SO topic "How to match two data.frames with an inexact
> matching identifier (one identifier has to be in the range of the other)"
> for a problem I have: calculating pollutant statistics during various episodes
> from monitoring data. The episodes (like the fiscal quarters in the SO
> topic) are defined for each site in a lookup table with starting and ending
> dates. The start and end dates can be different at different sites. The SO
> answer used >= and <= to check the date was in the range from start to end.
>   mD[qD][Month>=startMonth & Month<=endMonth]
>
> This approach may suit my problem, but I thought that I could use "between"
> rather than the two logical comparisons.  I tried both the between()
> function and its equivalent %between% operator -- and I get two different
> results. The between() version is correct, but %between% gives a wrong
> answer. Am I missing something in the syntax for using between?
>
> My version of the SO data, merge and results below. I changed the variable
> names to suit my work: ID->site, Month->date, MonValue->conc,
> QTRValue->episodeID.
>
> require(data.table)   # data.table 1.8.10  on R 3.0.2 under Win7x64
> # the measurement data
> dat <- data.table(site = rep(c("A","B"), each=10),
>                   date = rep(1:10, times = 2),     # could be day or hour
>                   conc = sample(30:50,2*10,replace=TRUE),  # the pollutant data
>                   key="site,date")
> dat
> #    site date conc
> # 1:    A    1   48
> # 2:    A    2   44
> # 3:    A    3   50
> # 4:    A    4   47
> # 5:    A    5   35
> # 6:    A    6   47
> # 7:    A    7   38
> # 8:    A    8   34
> # 9:    A    9   46
> #10:    A   10   35
> #11:    B    1   45
> #12:    B    2   35
> #13:    B    3   40
> #14:    B    4   41
> #15:    B    5   37
> #16:    B    6   37
> #17:    B    7   32
> #18:    B    8   41
> #19:    B    9   31
> #20:    B   10   32
> #
> # definitions for the episodes
> episode <- data.table(
>                 site = rep(c("A", "B"), each = 3),
>                 start = c(1, 4, 7, 1, 3, 8),
>                 end = c(3, 5, 10, 2, 5, 10),
>                 episodeID = rep(1:3, 2),
>                 key="site")
> episode
> #   site start end episodeID
> # 1:    A     1   3         1
> # 2:    A     4   5         2
> # 3:    A     7  10         3
> # 4:    B     1   2         1
> # 5:    B     3   5         2
> # 6:    B     8  10         3
> #
> # join measurement data and episode list (for later aggregation using mean() etc.)
> # approach from the SO thread -- gives the right result
> dat[episode, allow.cartesian=TRUE][date>=start & date<=end]
> #     site date conc start end episodeID
> #   1:    A    1   48     1   3         1
> #   2:    A    2   44     1   3         1
> #   3:    A    3   50     1   3         1
> #   4:    A    4   47     4   5         2
> #   5:    A    5   35     4   5         2
> #   6:    A    7   38     7  10         3
> #   7:    A    8   34     7  10         3
> #   8:    A    9   46     7  10         3
> #   9:    A   10   35     7  10         3
> # 10:    B    1   45     1   2         1
> # 11:    B    2   35     1   2         1
> # 12:    B    3   40     3   5         2
> # 13:    B    4   41     3   5         2
> # 14:    B    5   37     3   5         2
> # 15:    B    8   41     8  10         3
> # 16:    B    9   31     8  10         3
> # 17:    B   10   32     8  10         3
>
> # using between() -- also gives the desired result
> dat[episode, allow.cartesian=TRUE][between(date,start,end)]
> #  (returns same result as above)
>
> # using %between% -- gives different result - not the right answer
> dat[episode, allow.cartesian=TRUE][date %between% c(start,end)]
> #    site date conc start end episodeID
> # 1:    A    1   48     1   3         1
> # 2:    A    1   48     4   5         2
> # 3:    A    1   48     7  10         3
> # 4:    B    1   45     1   2         1
> # 5:    B    1   45     3   5         2
> # 6:    B    1   45     8  10         3
>
> So why does the %between% operator give a different result than between()?
> There must be some detail of syntax I need to learn here.  I also tried
> putting the whole %between% expression in parentheses, but that doesn't
> make any difference:
>   dat[episode, allow.cartesian=TRUE][(date %between% c(start,end))]
>
> Best regards.
> Douglas Clark