[datatable-help] between() versus %between% - why different results?
drclark
clark9876 at airquality.dk
Mon Oct 7 00:29:29 CEST 2013
Dear data.table experts,
I was inspired by the SO topic "How to match two data.frames with an inexact
matching identifier (one identifier has to be in the range of the other)" for
a problem I have: calculating pollutant statistics during various episodes
from monitoring data. The episodes (like the fiscal quarters in the SO
topic) are defined for each site in a lookup table with starting and ending
dates. The start and end dates can be different at different sites. The SO
answer used >= and <= to check the date was in the range from start to end.
mD[qD][Month>=startMonth & Month<=endMonth]
This approach may suit my problem, but I thought that I could use "between"
rather than the two logical comparisons. I tried both the between()
function and its equivalent %between% operator -- and I get two different
results. The between() version is correct, but %between% gives a wrong
answer. Am I missing something in the syntax for using between?
My version of the SO data, the merge, and the results are below. I changed the variable
names to suit my work: ID->site, Month->date, MonValue->conc,
QTRValue->episodeID.
require(data.table) # data.table 1.8.10 on R 3.0.2 under Win7x64
# the measurement data
dat <- data.table(site = rep(c("A","B"), each=10),
                  date = rep(1:10, times = 2), # could be day or hour
                  conc = sample(30:50,2*10,replace=TRUE), # the pollutant data
                  key="site,date")
dat
# site date conc
# 1: A 1 48
# 2: A 2 44
# 3: A 3 50
# 4: A 4 47
# 5: A 5 35
# 6: A 6 47
# 7: A 7 38
# 8: A 8 34
# 9: A 9 46
#10: A 10 35
#11: B 1 45
#12: B 2 35
#13: B 3 40
#14: B 4 41
#15: B 5 37
#16: B 6 37
#17: B 7 32
#18: B 8 41
#19: B 9 31
#20: B 10 32
#
# definitions for the episodes
episode <- data.table(
site = rep(c("A", "B"), each = 3),
start = c(1, 4, 7, 1, 3, 8),
end = c(3, 5, 10, 2, 5, 10),
episodeID = rep(1:3, 2),
key="site")
episode
# site start end episodeID
# 1: A 1 3 1
# 2: A 4 5 2
# 3: A 7 10 3
# 4: B 1 2 1
# 5: B 3 5 2
# 6: B 8 10 3
#
# join measurement data and episode list (for later aggregation using mean() etc.)
# approach from the SO thread -- gives the right result
dat[episode, allow.cartesian=TRUE][date>=start & date<=end]
# site date conc start end episodeID
# 1: A 1 48 1 3 1
# 2: A 2 44 1 3 1
# 3: A 3 50 1 3 1
# 4: A 4 47 4 5 2
# 5: A 5 35 4 5 2
# 6: A 7 38 7 10 3
# 7: A 8 34 7 10 3
# 8: A 9 46 7 10 3
# 9: A 10 35 7 10 3
# 10: B 1 45 1 2 1
# 11: B 2 35 1 2 1
# 12: B 3 40 3 5 2
# 13: B 4 41 3 5 2
# 14: B 5 37 3 5 2
# 15: B 8 41 8 10 3
# 16: B 9 31 8 10 3
# 17: B 10 32 8 10 3
# using between() -- also gives the desired result
dat[episode, allow.cartesian=TRUE][between(date,start,end)]
# (returns same result as above)
# using %between% -- gives different result - not the right answer
dat[episode, allow.cartesian=TRUE][date %between% c(start,end)]
# site date conc start end episodeID
# 1: A 1 48 1 3 1
# 2: A 1 48 4 5 2
# 3: A 1 48 7 10 3
# 4: B 1 45 1 2 1
# 5: B 1 45 3 5 2
# 6: B 1 45 8 10 3
So why does the %between% operator give a different result than between()?
There must be some detail of syntax I need to learn here. I also tried
putting the whole %between% expression in parentheses, but that doesn't make
any difference:
dat[episode, allow.cartesian=TRUE][(date %between% c(start,end))]
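One thing that may be relevant here (a sketch in base R only, not a confirmed explanation): c(start,end) concatenates the two columns into a single vector rather than pairing them row by row. If %between% takes its lower and upper bounds from only the first two elements of that vector, which is an assumption about the operator's implementation and not something verified above, both bounds would come from start. The vectors below are hypothetical stand-ins for the first ten joined rows (site A, episode 1).

```r
# Base R only; no data.table needed for this check.
# Hypothetical stand-ins for the first ten joined rows (site A, episode 1).
date  <- 1:10
start <- rep(1, 10)
end   <- rep(3, 10)

# c() concatenates: one vector of length 20, not a (lower, upper) pair per row.
bounds <- c(start, end)
length(bounds)
bounds[1:2]   # both elements come from 'start'

# ASSUMPTION: the operator uses only bounds[1] and bounds[2] as lower/upper.
first_two <- date >= bounds[1] & date <= bounds[2]

# Element-wise comparison, as in the date>=start & date<=end filter.
elementwise <- date >= start & date <= end

which(first_two)     # rows kept if only the first two elements are used
which(elementwise)   # rows kept by the row-by-row comparison
```

Under that assumption only date 1 passes the filter, which is the pattern in the %between% output above. Passing the bounds as separate arguments, as between(date,start,end) does, avoids the concatenation.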
Best regards.
Douglas Clark
--
View this message in context: http://r.789695.n4.nabble.com/between-versus-between-why-different-results-tp4677718.html
Sent from the datatable-help mailing list archive at Nabble.com.