[datatable-help] Dealing with dateTime

Matthew Dowle mdowle at mdowle.plus.com
Thu Jan 3 19:38:24 CET 2013


 

This starts to get interesting. FYI, see:

http://codercorner.com/RadixSortRevisited.htm [9]

Btw, note the note in ?setkey: although base::sort.list calls it a
"radix" sort, it's actually a counting sort (and an order rather than a
sort, too). We call that, and data.table has implemented counting sort
for character (deviously). But double and bit64::integer64 (which is
stored in the double type) need some work for speed. Best to do some
benchmarking. I haven't done much benchmarking of sorting the various
types, and I haven't seen anyone else post any as far as I recall. It
would be great to see, and certainly a nudge if there are gremlins
there.
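
A rough sketch of the kind of benchmark meant here, in base R only so
that it runs without data.table. The vector size, value range and
variable names are assumptions for illustration, timings will vary by
machine and R version, and it says nothing about how setkey itself
performs.

```r
# Order the same values stored once as integer and once as double.
set.seed(42)
n <- 1e6L
x_int <- sample.int(1e5L, n, replace = TRUE)  # integer storage
x_dbl <- as.numeric(x_int)                    # same values, double storage

# Time the ordering of each; the permutations are kept for checking.
t_int <- system.time(o_int <- order(x_int))[["elapsed"]]
t_dbl <- system.time(o_dbl <- order(x_dbl))[["elapsed"]]

cat(sprintf("integer: %.3fs  double: %.3fs\n", t_int, t_dbl))
```

Repeating the timings a few times, and on keyed data.tables rather than
bare vectors, would be the natural next step.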

Yes, IMonth would be nice. What I sometimes do for dates in YYYYMMDD
integer format is by=date%/%100L. That's easy and fast. An advantage of
non-epoch based dates.
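
To illustrate the integer division trick, here is a minimal base R
sketch. The sample dates, values and variable names are made up for
illustration; in data.table the same expression would go in by=.

```r
# Dates held as YYYYMMDD integers (made-up sample data).
dates <- c(20121231L, 20130103L, 20130115L, 20130201L)
value <- c(1, 2, 3, 4)

# Integer division by 100 drops the day, leaving YYYYMM.
ym <- dates %/% 100L        # 201212 201301 201301 201302

# Group by year-month; with a data.table DT holding these columns the
# equivalent would be DT[, sum(value), by = dates %/% 100L].
totals <- tapply(value, ym, sum)
print(totals)
```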

For POSIXct there is as.integer(format(date, "%Y%m")), but a faster
version of that would be nice (one that doesn't go via character).
There's been a bit of benchmarking of converting strings to POSIXct on
S.O., which basically showed it was awful, but I don't recall any
benchmarking of as.integer(format(date,"%Y%m")).
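
One possible way to avoid the character step, sketched in base R: go
through POSIXlt and combine its year and mon fields arithmetically. The
sample timestamps and the fixed UTC time zone are assumptions for
illustration, and whether this is actually faster would need
benchmarking.

```r
# Made-up sample timestamps, with the time zone fixed for repeatability.
x <- as.POSIXct(c("2012-12-31 23:59:59", "2013-01-03 19:38:24"), tz = "UTC")

# Via character, as in the text above.
ym_chr <- as.integer(format(x, "%Y%m", tz = "UTC"))

# Via POSIXlt fields: year is years since 1900, mon is 0-based.
lt <- as.POSIXlt(x, tz = "UTC")
ym_lt <- (lt$year + 1900L) * 100L + lt$mon + 1L

print(ym_chr)
print(ym_lt)
```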

An IMonth.IDate method would be good too, to cover all options. Have you
checked zoo and xts to see if they have this? They can be used with
data.table, I think, although there were some S.O. question(s) about
that. I'm not sure about the name "IMonth" though; it needs to convey
that it includes the year, doesn't it: IYM, YM, YYYYMM, YYMM, YQ, Q,
etc.?

We've also discussed xts's nice ISO date syntax in the past on this
list.

To make progress it needs someone to provide benchmarks (perhaps showing
how bad or not sorting double is), and reproducible large data examples
of grouping and indexes that can be worked on, and to decide what we
need. I'm aware that as the number of columns in a key grows, setkey's
performance degrades and could be improved. So that's another dimension:
it's not just the column types, but how many there are in the key.
FR#2419 is to improve that.

On 03.01.2013 16:58, colin umansky wrote:

> Ok, but sorting on POSIXct (double) should be less efficient than on
> int64, shouldn't it (via a radix sort)?
> 
> Additionally, I don't know what you think of adding IMonth (looking
> like "2011-02"). When grouping, at present we can use month, but it
> does not distinguish the year. It could be quick and useful for stats
> computed by group.
> 
> Regards 
> 
> 2013/1/3 Matthew Dowle <mdowle at mdowle.plus.com [8]>
> 
>> Hi, 
>> 
>> One reason 'double' type was added to setkey was to allow POSIXct in
>> keys. That was as recently as v1.8.2:
>> 
>> o Numeric columns (type 'double') are now allowed in keys and ad hoc
>>   by. J() and SJ() no longer coerce 'double' to 'integer'. i join
>>   columns which mismatch on numeric type are coerced silently to match
>>   the type of x's join column. Two floating point values are
>>   considered equal (by grouping and binary search joins) if their
>>   difference is within sqrt(.Machine$double.eps), by default. See
>>   example in ?unique.data.table. Completes FRs #951, #1609 and #1075.
>>   This paves the way for other atomic types which use 'double' (such
>>   as POSIXct and bit64). Thanks to Chris Neff for beta testing and
>>   finding problems with keys of two numeric columns (bug #2004), fixed
>>   and tests added.
>> 
>> So, POSIXct, or using integer64 to store YYYYMMDDHHMMSSmmm, is another
>> possibility (no epoch has some pros as well as cons), or date and time
>> held in separate columns.
>> 
>> The thinking is, rightly or wrongly, that R already supports
>> milliseconds in various ways. data.table doesn't aim to prescribe
>> which datetime class you place in the data.table; it's up to you what
>> you use. It only has IDate because Date in R is (oddly) stored as
>> numeric rather than integer, which I (at least) have never really
>> understood. For a long time data.table only supported integer columns
>> in keys and joins (including factors, which are
>> integers/enumerations). But now double (and character) are fine in
>> keys too.
>> 
>> So to answer your question as asked:
>> as.POSIXct("2010-01-03 09:34:54.342697") already works. But note:
>>
>> http://stackoverflow.com/questions/10931972/r-issue-with-rounding-milliseconds [3]
>>
>> http://stackoverflow.com/questions/11136340/zoo-xts-microsecond-read-issue [4]
>>
>> http://stackoverflow.com/questions/8889554/milliseconds-puzzle-when-calling-strptime-in-r [5]
>>
>> http://stackoverflow.com/questions/2150138/how-to-parse-milliseconds-in-r [6]
>>
>> HTH, also:
>>
>> http://stackoverflow.com/a/14063077/403310 [7]
>> 
>> But yes, I'm sure we can do better, just not quite sure precisely how.
>> 
>> Matthew 
>>
>> On 03.01.2013 11:17, colin umansky wrote:
>> 
>>> Hello,
>>> I have been thinking about how data.table deals with dateTime and
>>> would like to share my questions/opinions.
>>> Where I think data.table is (likely to be wrong :))
>>> At the moment data.table deals independently with IDate and ITime
>>> (%H:%M:%S), which are simple (Matthew Dowle's words) derived classes.
>>> As I understand it, they are stored as integers to enable fast radix
>>> sorting etc.
>>> There is no milli/micro/nano, which is a problem as far as financial
>>> time series are concerned.
>>> Suggestions:
>>> Would it be possible to store an IDateTime as the number of
>>> microseconds since the epoch?
>>> An IDateTime object would be represented like
>>> a=as.IDateTime("2010-01-03 09:34:54.342697"), then
>>> year: as.IYear(a); #would display "2010"
>>> month: as.IMonth(a); #would display "2010-01"
>>> date: as.IDate(a); #would display "2010-01-03"
>>> etc...
>>> Having all those built-in types would probably be useful for
>>> efficient grouping.
>>> PS:
>>> The best software I have experienced for dealing with time series
>>> data is kdb (http://kx.com/ [1]).
>>> I particularly like the way datetimes are handled
>>> (http://code.kx.com/wiki/JB:QforMortals/atoms#time [2]); it may be a
>>> source of inspiration...

 

Links:
------
[1] http://kx.com/
[2] http://code.kx.com/wiki/JB:QforMortals/atoms#time
[3] http://stackoverflow.com/questions/10931972/r-issue-with-rounding-milliseconds
[4] http://stackoverflow.com/questions/11136340/zoo-xts-microsecond-read-issue
[5] http://stackoverflow.com/questions/8889554/milliseconds-puzzle-when-calling-strptime-in-r
[6] http://stackoverflow.com/questions/2150138/how-to-parse-milliseconds-in-r
[7] http://stackoverflow.com/a/14063077/403310
[8] mailto:mdowle at mdowle.plus.com
[9] http://codercorner.com/RadixSortRevisited.htm