[datatable-help] join results aren't always sorted?

Matthew Dowle mdowle at mdowle.plus.com
Fri May 31 10:59:38 CEST 2013


Hi,

> why is clicks but not shares sorted by time?

the groups (each unique time in this case) are returned in the order of 
first appearance, when you use 'by'. This is important and relied upon. 
When this result is known, a quick check is made to see if this is 
ordered (using the very fast is.unsorted()) and if so the result is then 
marked as keyed (which is the "sorted" attribute seen in the str() 
output).

> What I actually want is a single data table keyed by time with ...

How about 'keyby' rather than 'by' :

     dates.dt[s.c=="SHARE", list(sum(count)), keyby="time"]

Even if I know the data is already sorted in group order, I often use 
'keyby' anyway for robustness.

Matthew


On 30.05.2013 20:48, Sam Steingold wrote:
> Hi,
> I have a table:
> --8<---------------cut here---------------start------------->8---
>> str(dates.dt)
> Classes ‘data.table’ and 'data.frame':	1343 obs. of  4 variables:
>  $ sid  : chr  "missing" "missing" "missing" "missing" ...
>  $ s.c  : chr  "CLICK" "CLICK" "CLICK" "CLICK" ...
>  $ count: int  70559 71555 79985 84385 88147 94130 100195 109031
> 116890 129726 ...
>  $ time : POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 
> 01:00:00" ...
>  - attr(*, ".internal.selfref")=<externalptr>
>  - attr(*, "sorted")= chr  "sid" "s.c" "time"
>> dates.dt
>           sid   s.c count                time
>    1: missing CLICK 70559 2013-05-15 00:00:00
>    2: missing CLICK 71555 2013-05-15 01:00:00
>    3: missing CLICK 79985 2013-05-15 02:00:00
>    4: missing CLICK 84385 2013-05-15 03:00:00
>    5: missing CLICK 88147 2013-05-15 04:00:00
>   ---
> 1339: present SHARE 35295 2013-05-28 19:00:00
> 1340: present SHARE 36284 2013-05-28 20:00:00
> 1341: present SHARE 69504 2013-05-28 21:00:00
> 1342: present SHARE 67037 2013-05-28 22:00:00
> 1343: present SHARE 61014 2013-05-28 23:00:00
> --8<---------------cut here---------------end--------------->8---
> I summarise them by various fields:
> --8<---------------cut here---------------start------------->8---
>> shares <- dates.dt[s.c=="SHARE", list(sum(count)) , by="time"]
>> clicks <- dates.dt[s.c=="CLICK", list(sum(count)) , by="time"]
>> str(shares)
> Classes ‘data.table’ and 'data.frame':	336 obs. of  2 variables:
>  $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" 
> ...
>  $ V1  : int  60531 57837 67495 76716 83465 86822 91318 100520 112352
> 124784 ...
>  - attr(*, ".internal.selfref")=<externalptr>
>> str(clicks)
> Classes ‘data.table’ and 'data.frame':	336 obs. of  2 variables:
>  $ time: POSIXct, format: "2013-05-15 00:00:00" "2013-05-15 01:00:00" 
> ...
>  $ V1  : int  129450 137222 157721 171319 183720 195652 216003 238295
> 260715 279235 ...
>  - attr(*, "sorted")= chr "time"
>  - attr(*, ".internal.selfref")=<externalptr>
> --8<---------------cut here---------------end--------------->8---
> why is clicks but not shares sorted by time?
> (if I make "time" the first key in dates.dt, the problem goes away, 
> so,
> I guess, this is expected).
>
> What I actually want is a single data table keyed by time with 
> columns
> shares,clicks,missing,present,missing/clicks &c
> I can, obviously, construct it by hand:
> --8<---------------cut here---------------start------------->8---
> setkeyv(shares,"time")
> stopifnot(identical(shares$time,clicks$time))
> dt <- data.table(time=shares$time, clicks=clicks$V1, 
> shares=shares$V1)
> --8<---------------cut here---------------end--------------->8---
> but I was wondering if there is a better way.
> Thanks.


More information about the datatable-help mailing list