[datatable-help] Basic join question

Yang Zhang yanghatespam at gmail.com
Thu Aug 18 03:26:17 CEST 2011


I'm going to continue here since the question is a bit more
complicated and SO isn't the best forum for back-and-forth.

If I'm trying to do a join where I'm trying to aggregate counts
(including 0s for nomatches), is there something more concise than the
following, which is what I'm currently using since it works?

# assume dt is a data.frame(user_id=..., age=...)
y = dt[, list(count=length(age)), by=user_id]
key(y) = 'user_id'
y = y[J(unique(x$user_id))]
y$count[is.na(y$count)] = 0

I tried:

> key(y) = 'user_id'
> y = y[J(unique(x$user_id)), list(count=length(age))]
> summary(y$count)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
     1.00      1.00      1.00     75.55      5.00 127200.00
> dim(y)
[1] 7655    2

which gives me the right number of output rows but none of the lengths
are 0, presumably because length(NA) == 1. (There are definitely users
in x that are not in y.)

But then when I tried (and there are no NAs in y$age):

> count = function(x) if (any(is.na(x))) integer(0) else length(x)
> key(y) = 'user_id'
> y = y[J(unique(x$user_id)), list(count=count(age))]
> summary(y$count)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
     1.0      2.0      6.0    160.4     21.0 127200.0
> dim(y)
[1] 3581    2

Rows seem to be disappearing, and still the min is 1.

At this point I'm pretty disoriented. Any explanation? Thanks in advance.


On Wed, Aug 17, 2011 at 12:34 PM, Yang Zhang <yanghatespam at gmail.com> wrote:
> Thanks, edited the question.
>
> On Wed, Aug 17, 2011 at 3:53 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>> Yang,
>> Since you also asked on SO, suggest we answer there (after your edit please)
>> :
>> http://stackoverflow.com/questions/7090621/how-to-do-a-basic-left-outer-join-with-data-table-in-r
>> Matthew
>>
>>
>> "Yang Zhang" <yanghatespam at gmail.com> wrote in message
>> news:CAKxBDU_o3i_+xujsCa0CmukDizctx6fnzbidOYZW9Co9w9iTvw at mail.gmail.com...
>>> How do I do the equivalent to the following?
>>>
>>> with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0
>>> union select 2, 1 union select 3, 1 union select 3, 1),
>>>  above as (select a, b from dt where b > .5),
>>>  below as (select a, b from dt where b < .5)
>>> select above.a, count(below.a) from above left outer join below on
>>> (above.a = below.a) group by above.a;
>>> a | count
>>> ---+-------
>>> 3 |     0
>>> 2 |     1
>>> (2 rows)
>>>
>>> How do I accomplish the same thing with data.tables?  This is what I
>>> have so far:
>>>
>>> DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
>>> above = DT[DT$b > .5]
>>> below = DT[DT$b < .5, list(a=a)]
>>> key(below) = 'a'
>>> below[above, list(count=length(a)), by=a]
>>>
>>> but this gives me:
>>>
>>>      a count
>>> [1,]  2 1
>>> [2,] NA 1
>>>
>>> Thanks in advance for any tips.
>>>
>>> --
>>> Yang Zhang
>>> http://yz.mit.edu/
>>
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>
>
> --
> Yang Zhang
> http://yz.mit.edu/
>



-- 
Yang Zhang
http://yz.mit.edu/


More information about the datatable-help mailing list