[datatable-help] data.table and time series subsetting

Wed Jun 13 13:16:57 CEST 2012

Hi,

> Data-table experts,
>
> Thanks to this stack-overflow post
> (
> http://stackoverflow.com/questions/10748253/idiomatic-r-code-for-partitioning-a-vector-by-an-index-and-performing-an-operati
> ) i am convinced that my projects may be sped up via use of data table.
>
> At least i hope.
>
> I have read the introduction, and it is not clear if it is possible to do
> time
> series sub-setting. I presently use XTS, and find that it is decently
> quick
> for my large data sets which contain observations each second. Most
> importantly XTS is robust.
>
> My XTS sets have date/time strings that are formatted
> “%Y-%m-%d %H:%M:%S”, and i wish to sub set across these:

XTS is very nice the way it does that. In data.table it's up to you which
datetime class you use. People generally use ?IDateTime or POSIXct. XTS
has some nice range syntax which data.table doesn't have. OTOH, data.table
typically has multiple columns in one key, generally the last one being
datetime. I don't know XTS very well but I believe each symbol has its own
time series matrix?  So it uses hashed environments for quick symbol
retrieval which is neat.  data.table tries to be more general (ordered
queries which aren't necessarily time series) and has a general form
analogous to SQL (see FAQ 2.16).

>
> A typical data frame might be formatted as follows:
>
>>Names(dfx) = c(’price’, ’volume’)
>
>>Head(dfx)
>
> 2012-06-01 09:07:34 97.55 60
> 2012-06-01 09:07:36 97.58 150
> 2012-06-01 09:07:36 97.56 10
> 2012-06-01 09:07:36 97.57 19
> 2012-06-01 09:07:39 97.58 100
> 2012-06-01 09:07:40 97.55 40
>
> I would like to perform the following sub setting operations: for example,
> the minute after 9:07am (on a single date

Historically I've often used raw integers i.e. YYYYMMDD and HHMMSS.  I
know it seems strange but there are three advantages: easier to type,
conversions such as by=YYYYMMDD%%100L is very quick to aggregate by month
for example, and, no possible epoch/origin confusion. Serious downside are
that invalid dates are possible (20120231L) and datetime ops needs
conversion to an epoch base first to work. However, often the advantages
outweigh the disadvantages.

So,

onedate = 20120301L

dfx[J(onedate,090800L),roll=TRUE] # prevailing row at 09:08:00 am

> or all dates);

alldates = c(20120301L,20120302L,20120303L)
dfx[J(alldates,090800L),roll=TRUE]

There the J() function recycles 090800L to the length of alldates.

You can use IDateTime or POSIXct instead, above.

A common use case is CJ(alldates,alltimes) which retrieves all the times
on all the dates (cartesian product). See ?CJ.

> the first ten
> hours on each monday; 12 to 16 on the first friday of each month; all days
> in June for years 2001 to 2011; all mondays in 2008.

All these, use your favourite datetime class to create the regular series
and then pass that in to i. Time ranges in data.table are not as nice as
xts, though.  One way is by=hour(datetime) and then join to that hour. 
Aggregate first then join to the result, is a strong rule of thumb in
data.table.

>
> Additionally, i would like to sub-set such that i obtain a table showing
> the volume of trades at each price on a given time interval.

Easier in XTS I suspect.

In data.table for one range it's manual :

from = dfx[J(start),roll=TRUE,which=TRUE]
to = dfx[J(end),roll=TRUE,which=TRUE]
dfx[from:to,sum(volume),by=price]

or if the time intervals can be regularized, it's easier :

dfx[,sum(volume),by=list(hour(datetime),price)]

>
> For example, the printed data above would present as:
>
> Price volume
> 97.55 100
> 97.56 10
> 97.57 19
> 97.58 250

Yes, that would do it.  And grouping is very efficient in data.table. But
XTS has better syntax for time range queries.

>
> Are such sub-setting operations possible in data.table?
>
> I anticipate that the time series sub-setting in data.table may be more
> restrictive, but as my data sets are very large and the reported speed up
> relative to most methods is huge in big-data cases, i am hopeful that a
> modest amount of trouble will bring large performance gains.
>
> Is this the case?

If you're already happy with performance of xts then I doubt data.table
will improve on that. They both use binary search. If you have an example
dataset, input and output with xts, then that would really help to
confirm. I don't know how fast XTS is at grouping.

One thing to consider is data.table's :=. Adding and removing columns in
data.table is extremely quick due to over-allocation of the vector of
column pointers.  See example(":=") for speed example of := and set().  
XTS has this too using regular <- in R, by using matrix as its storage
class since data.frame is slow. The only downside of matrix is that all
columns need to be the same type (double), iiuc, which may well be the
case anyway.  In data.table we're not only using columns of different
types (character, integer and double) but list columns too where each cell
can itself be a vector.  That can be useful for genome sequences to save
creating long comma separated character, for example.  Those are the sort
of use cases we're thinking about that aren't time series.

> Sorry for the long question - an answer in the affirmative and link to a
> manual that i have missed would suffice.

?IDateTime perhaps.

But feel free to send in some examples and we'll see what we can do.

The original idea was that zoo/xts could be used a column type of key.
Then you could get their nice time series/range syntax inside a data.table
join. Don't know if that works or not.

Btw, 'double' is only just supported in keys as from v1.8.1. Before 1.8.1
POSIXct would be coerced to integer and any fractional seconds lost.

You asked about stability. I suspect XTS is more stable than data.table. 
We had a lot of crash bugs at the start of the year when over-allocation
was brought in. We seem to have got over that now.  But 22 bug fixes in
v1.8.1 is rather a lot I have to admit. We are adding a lot of new
features though, such as assignment by reference (:=) by group. We have
very few regression bugs due to being ruthless about adding tests for
every bug fix.  There are now over 650 tests in test.data.table().

All the above is just my current understanding about xts.  Please do
correct where I'm wrong.

Clear as mud?

Matthew

>
> Thanks + best regards
>
> matt johnson
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help