<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN">
<html><body>
<p> </p>
<p>This starts to get interesting, FYI see :</p>
<p><a href="http://codercorner.com/RadixSortRevisited.htm">http://codercorner.com/RadixSortRevisited.htm</a></p>
<p> </p>
<p>Btw, see the note in ?setkey: although base::sort.list calls it a "radix" sort, it's actually a counting sort (and an order rather than a sort, too). We call that, and data.table has implemented counting sort for character (deviously). But double and bit64::integer64 (which is stored in the double type) need some work for speed. Best do some benchmarking. I haven't done much benchmarking of sorting the various types, and I haven't seen anyone else post any as far as I recall. It would be great to see some, and certainly a nudge if there are gremlins there.</p>
<p>Yes IMonth would be nice. What I sometimes do for dates in YYYYMMDD integer format is by=date%/%100L. That's easy and fast. An advantage of non-epoch based dates.</p>
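<p>A minimal sketch of that, on made-up toy data (the column names date and x and the values are just for illustration):</p>
<pre>library(data.table)

# Hypothetical toy data: dates held as YYYYMMDD integers
DT = data.table(date = c(20121230L, 20121231L, 20130101L, 20130102L),
                x    = c(1, 2, 3, 4))

# Integer division by 100 drops the day part, leaving YYYYMM,
# so the year stays attached to the month when grouping
res = DT[, list(total = sum(x)), by = list(ym = date %/% 100L)]
print(res)</pre>
<p>The same trick with %/%10000L would group by year.</p>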
<p>For POSIXct there is as.integer(format(date, "%Y%m")), but a faster version of that would be nice (that doesn't go via character). There's been a bit of benchmarking of converting strings to POSIXct on S.O. which basically showed it was awful, but I don't recall benchmarking of as.integer(format(date,"%Y%m")).</p>
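<p>One possible way to avoid the character step is to go via POSIXlt and do the arithmetic on its year and mon fields. This is only a sketch: ym_posixct is a made-up name (not in data.table), the "UTC" default is an assumption, and whether it is actually faster than format() is exactly what would need benchmarking.</p>
<pre># ym_posixct is a hypothetical helper name, not part of data.table
ym_posixct = function(x, tz = "UTC") {
  lt = as.POSIXlt(x, tz = tz)   # broken-down time, no string formatting
  # year is stored as years since 1900, mon as 0-11
  as.integer((lt$year + 1900L) * 100L + lt$mon + 1L)
}

x  = as.POSIXct(c("2012-12-31 23:59:59", "2013-01-01 00:00:00"), tz = "UTC")
ym = ym_posixct(x)

# Check it agrees with the route that goes via character
same = identical(ym, as.integer(format(x, "%Y%m", tz = "UTC")))
print(ym)
print(same)</pre>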
<p>An IMonth.IDate method would be good too, to cover all options. Have you checked zoo and xts to see if they have this? They can be used with data.table, I think, although there were some S.O. question(s) about that. I'm not sure about the name "IMonth" though; it needs to convey that it includes the year, doesn't it: IYM, YM, YYYYMM, YYMM, YQ, Q, etc?</p>
<p>We've also discussed xts's nice ISO date syntax in the past on this list.</p>
<p>To make progress it needs someone to provide benchmarks (perhaps showing how bad or not sorting double is), and reproducible large data examples of grouping and indexes that can be worked on, and to decide what we need. I'm aware that as the number of columns in a key grows, setkey's performance degrades and could be improved. So that's another dimension: it's not just the column types, but how many there are in the key. FR#2419 is to improve that.</p>
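<p>As a starting point, a reproducible comparison might look something like the sketch below. The size n, the number of distinct key values and the column names are arbitrary assumptions, and no timings are claimed here since they depend on the machine and the data.table version.</p>
<pre>library(data.table)

set.seed(1)
n = 1e6L   # arbitrary size; increase for a more demanding test

# Same key values held once as integer and once as double
DTint = data.table(k = sample.int(1e5L, n, replace = TRUE), v = runif(n))
DTdbl = data.table(k = as.numeric(DTint$k), v = DTint$v)

# Time setkey on each type
tint = system.time(setkey(DTint, k))
tdbl = system.time(setkey(DTdbl, k))

print(c(integer = tint[["elapsed"]], double = tdbl[["elapsed"]]))</pre>
<p>Adding further key columns to the same setup would cover the number-of-columns dimension as well.</p>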
<p> </p>
<p>On 03.01.2013 16:58, colin umansky wrote:</p>
<blockquote type="cite" style="padding-left:5px; border-left:#1010ff 2px solid; margin-left:5px; width:100%">
<p>Ok, but sorting on POSIXct (double) should be less efficient than on int64, shouldn't it (via a radix sort)?</p>
<div><br />
<div>Additionally, what do you think of adding IMonth (looking like "2011-02")? When grouping, at present we can use month, but it does not distinguish the year. It could be quick and useful for stats computed by group.<br /><br />Regards</div>
<div><br />
<div class="gmail_quote">2013/1/3 Matthew Dowle <span>&lt;<a href="mailto:mdowle@mdowle.plus.com">mdowle@mdowle.plus.com</a>&gt;</span><br />
<blockquote class="gmail_quote" style="margin: 0 0 0 .8ex; border-left: 1px #ccc solid; padding-left: 1ex;"><span style="text-decoration: underline;"></span>
<div>
<p> </p>
<p>Hi,</p>
<p>One reason 'double' type was added to setkey was to allow POSIXct in keys. That was as recently as v1.8.2 :</p>
<pre>o Numeric columns (type 'double') are now allowed in keys and ad hoc
by. J() and SJ() no longer coerce 'double' to 'integer'. i join columns
which mismatch on numeric type are coerced silently to match
the type of x's join column. Two floating point values
are considered equal (by grouping and binary search joins) if their
difference is within sqrt(.Machine$double.eps), by default. See example
in ?unique.data.table. Completes FRs #951, #1609 and #1075. This paves the
way for other atomic types which use 'double' (such as POSIXct and bit64).
Thanks to Chris Neff for beta testing and finding problems with keys
of two numeric columns (bug #2004), fixed and tests added.</pre>
<p>So, POSIXct, or using integer64 to store YYYYMMDDHHMMSSmmm is another possibility (no epoch has some pros as well as cons), or date and time held in separate columns.</p>
<p>The thinking is, rightly or wrongly, that R already supports milliseconds in various ways. data.table doesn't aim to prescribe which datetime class you place in the data.table; it's up to you what you use. It only has IDate because Date in R is (oddly) stored as numeric rather than integer, which I at least have never really understood. For a long time data.table only supported integer columns in keys and joins (including factors, which are integers/enumerations). But now double (and character) are fine in keys too.</p>
<p>So to answer your question as asked: as.POSIXct("2010-01-03 09:34:54.342697") already works. But note :</p>
<p><a href="http://stackoverflow.com/questions/10931972/r-issue-with-rounding-milliseconds">http://stackoverflow.com/questions/10931972/r-issue-with-rounding-milliseconds</a></p>
<p><a href="http://stackoverflow.com/questions/11136340/zoo-xts-microsecond-read-issue">http://stackoverflow.com/questions/11136340/zoo-xts-microsecond-read-issue</a></p>
<p><a href="http://stackoverflow.com/questions/8889554/milliseconds-puzzle-when-calling-strptime-in-r">http://stackoverflow.com/questions/8889554/milliseconds-puzzle-when-calling-strptime-in-r</a></p>
<p><a href="http://stackoverflow.com/questions/2150138/how-to-parse-milliseconds-in-r">http://stackoverflow.com/questions/2150138/how-to-parse-milliseconds-in-r</a></p>
<p>HTH, also :</p>
<p><a href="http://stackoverflow.com/a/14063077/403310">http://stackoverflow.com/a/14063077/403310</a></p>
<p>But yes I'm sure we can do better, just not quite sure precisely how.</p>
<p>Matthew</p>
<div>
<div class="h5">
<p> </p>
<p>On 03.01.2013 11:17, colin umansky wrote:</p>
<blockquote style="padding-left: 5px; border-left: #1010ff 2px solid; margin-left: 5px; width: 100%;">
<p>Hello,</p>
<div>I have been thinking about how data.table deals with dateTime and would like to share my questions/opinions.</div>
<div>Where I think data.table is (likely to be wrong :))</div>
<div>At the moment data.table deals independently with IDate and ITime (%H:%M:%S), which are simple (Matthew Dowle's words) derived classes. As I understand it, they are stored as integers to enable fast radix sorting etc...</div>
<div>There is no milli/micro/nanosecond resolution, which is a problem as far as financial time series are concerned.</div>
<div>Suggestions:</div>
<div>Would it be possible to store an IDateTime as the number of microseconds since the epoch?</div>
<div>An IDateTime object would be represented like a=as.IDateTime("2010-01-03 09:34:54.342697"), then </div>
<div>year: as.IYear(a); #would display "2010"</div>
<div>month: as.IMonth(a); #would display "2010-01"</div>
<div>date: as.IDate(a); #would display "2010-01-03"</div>
<div>etc...</div>
<div>Having all those built-in types would probably be useful for efficient grouping.</div>
<div>PS:</div>
<div>The best software I have used for dealing with time series data is kdb (<a href="http://kx.com/">http://kx.com/</a>)</div>
<div>I particularly like the way datetimes are handled (<a href="http://code.kx.com/wiki/JB:QforMortals/atoms#time">http://code.kx.com/wiki/JB:QforMortals/atoms#time</a>); it may be a source of inspiration...</div>
</blockquote>
<p> </p>
<div> </div>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
<p> </p>
<div> </div>
</body></html>