[datatable-help] between() versus %between% - why different results?

drclark clark9876 at airquality.dk
Mon Oct 7 00:29:29 CEST 2013


Dear data.table experts,

I was inspired by the SO topic "How to match two data.frames with an inexact
matching identifier (one identifier has to be in the range of the other)" to
solve a problem of my own: calculating pollutant statistics for various
episodes from monitoring data. The episodes (like the fiscal quarters in the
SO topic) are defined for each site in a lookup table with starting and
ending dates, and those dates can differ between sites. The SO answer used
>= and <= to check that the date was in the range from start to end:
  mD[qD][Month>=startMonth & Month<=endMonth]

This approach may suit my problem, but I thought that I could use "between"
rather than the two logical comparisons.  I tried both the between()
function and its equivalent %between% operator -- and I get two different
results. The between() version is correct, but %between% gives a wrong
answer. Am I missing something in the syntax for using between?

My version of the SO data, the merge and the results are below. I changed
the variable names to suit my work: ID->site, Month->date, MonValue->conc,
QTRValue->episodeID.

require(data.table)   # data.table 1.8.10  on R 3.0.2 under Win7x64
# the measurement data
dat <- data.table(site = rep(c("A","B"), each=10),
                  date = rep(1:10, times = 2),     # could be day or hour
                  conc = sample(30:50,2*10,replace=TRUE),  # the pollutant data
                  key="site,date")
dat
#    site date conc
# 1:    A    1   48
# 2:    A    2   44
# 3:    A    3   50
# 4:    A    4   47
# 5:    A    5   35
# 6:    A    6   47
# 7:    A    7   38
# 8:    A    8   34
# 9:    A    9   46
#10:    A   10   35
#11:    B    1   45
#12:    B    2   35
#13:    B    3   40
#14:    B    4   41
#15:    B    5   37
#16:    B    6   37
#17:    B    7   32
#18:    B    8   41
#19:    B    9   31
#20:    B   10   32
#
# definitions for the episodes                  
episode <- data.table(
                site = rep(c("A", "B"), each = 3),
                start = c(1, 4, 7, 1, 3, 8),
                end = c(3, 5, 10, 2, 5, 10),
                episodeID = rep(1:3, 2),
                key="site")
episode
#   site start end episodeID
# 1:    A     1   3         1
# 2:    A     4   5         2
# 3:    A     7  10         3
# 4:    B     1   2         1
# 5:    B     3   5         2
# 6:    B     8  10         3
#
# join measurement data and episode list
# (for later aggregation using mean() etc.)
# approach from the SO thread -- gives the right result
dat[episode, allow.cartesian=TRUE][date>=start & date<=end]
#      site date conc start end episodeID
#   1:    A    1   48     1   3         1
#   2:    A    2   44     1   3         1
#   3:    A    3   50     1   3         1
#   4:    A    4   47     4   5         2
#   5:    A    5   35     4   5         2
#   6:    A    7   38     7  10         3
#   7:    A    8   34     7  10         3
#   8:    A    9   46     7  10         3
#   9:    A   10   35     7  10         3
# 10:    B    1   45     1   2         1
# 11:    B    2   35     1   2         1
# 12:    B    3   40     3   5         2
# 13:    B    4   41     3   5         2
# 14:    B    5   37     3   5         2
# 15:    B    8   41     8  10         3
# 16:    B    9   31     8  10         3
# 17:    B   10   32     8  10         3

# using between() -- also gives the desired result
dat[episode, allow.cartesian=TRUE][between (date,start,end)]
#  (returns same result as above)

# using %between% -- gives different result - not the right answer
dat[episode, allow.cartesian=TRUE][date %between% c(start,end)]
#    site date conc start end episodeID
# 1:    A    1   48     1   3         1
# 2:    A    1   48     4   5         2
# 3:    A    1   48     7  10         3
# 4:    B    1   45     1   2         1
# 5:    B    1   45     3   5         2
# 6:    B    1   45     8  10         3

So why does the %between% operator give a different result than between()?
There must be some detail of syntax I need to learn here.  I also tried
putting the whole %between% expression in parentheses, but that doesn't make
any difference:
  dat[episode, allow.cartesian=TRUE][(date %between% c(start,end))]
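
One possible explanation, sketched in base R so it runs without data.table:
c(start,end) pastes the two columns into a single flat vector, so the operator
no longer sees a separate lower and upper bound per row. The sketch ASSUMES
the operator reads only the first two elements of its right-hand side; that
is a guess which happens to match the six-row output above, not something
confirmed against the 1.8.10 source. The helper between_first_two() is
hypothetical and exists only for this illustration.

```r
# Columns as they look after the join: 6 episodes x 10 dates = 60 rows.
date  <- rep(1:10, times = 6)
start <- rep(c(1, 4, 7, 1, 3, 8), each = 10)
end   <- rep(c(3, 5, 10, 2, 5, 10), each = 10)

# Hypothetical stand-in for the operator, ASSUMING it uses only y[1] and y[2].
between_first_two <- function(x, y) x >= y[1] & x <= y[2]

bounds <- c(start, end)   # one flat vector, not a (lower, upper) pair per row
length(bounds)            # 120
bounds[1:2]               # 1 1  (both values come from 'start')

# Under the assumption, the filter collapses to date >= 1 & date <= 1.
op_like <- between_first_two(date, bounds)

# Element-wise comparison, equivalent to the >= / <= filter from the SO answer.
fn_like <- date >= start & date <= end

sum(op_like)   # 6  rows kept, all with date == 1
sum(fn_like)   # 17 rows kept, matching the result of the >= / <= version
```

The documentation of more recent data.table releases describes the right-hand
side of %between% as a length-2 vector or list, so there
date %between% list(start, end) may give per-row bounds; whether that form
works in 1.8.10 is not verified here.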

Best regards.
Douglas Clark 



--
View this message in context: http://r.789695.n4.nabble.com/between-versus-between-why-different-results-tp4677718.html
Sent from the datatable-help mailing list archive at Nabble.com.

