[datatable-help] calculating lagged values of panel data using data.table
Jelmer Ypma
jelmerypma at gmail.com
Fri Jul 15 22:09:35 CEST 2011
Hi,
many thanks for your reply, the Cross Join bit is very useful to know!
Thanks,
Jelmer
On Fri, Jul 15, 2011 at 20:36, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> Hi,
>
> Welcome!
>
> Yes, that's safe to assume. Others have asked that in the past, so I've
> now added this text to ?data.table (under 'by'): "The order of the rows
> within each group is preserved."
>
> The tmp.keys operation is a common one, so there's a short cut for it.
> We call it a Cross Join. If you run the CJ() bit on it's own you'll find
> it's the same as tmp.keys :
>
>> DT[CJ(id.vec,time.vec)]
> id time value
> [1,] 1 1 7
> [2,] 1 2 6
> [3,] 1 3 NA
> [4,] 1 4 5
> [5,] 1 5 4
> [6,] 2 1 NA
> [7,] 2 2 3
> [8,] 2 3 2
> [9,] 2 4 1
> [10,] 2 5 NA
>>
>
> Rather than filling the gaps with NA, you can roll the prevailing
> observation forward :
>
>> DT[CJ(id.vec,time.vec),roll=TRUE]
> id time value
> [1,] 1 1 7
> [2,] 1 2 6
> [3,] 1 3 6
> [4,] 1 4 5
> [5,] 1 5 4
> [6,] 2 1 NA # correct not to be 4 (new group)
> [7,] 2 2 3
> [8,] 2 3 2
> [9,] 2 4 1
> [10,] 2 5 1
>>
>
> and that's useful because, now, instead of lagging in the way you did
> it, you can just subtract one day from your dates or times, and find the
> prevailing record then. It is subtly different of course, it depends on
> what you want and what assumptions you make about the data.
>
>> DT[CJ(id.vec,time.vec-1),roll=TRUE]
> id time value
> [1,] 1 0 NA
> [2,] 1 1 7
> [3,] 1 2 6
> [4,] 1 3 6
> [5,] 1 4 5
> [6,] 2 0 NA
> [7,] 2 1 NA
> [8,] 2 2 3
> [9,] 2 3 2
> [10,] 2 4 1
>>
>
> On your final line I didn't see why you chained it into two queries, but
> fortunately I tested before posting and it seems there is a bug :
>
>> DT[ tmp.keys, # removed your "][," here
> + list(time,value,'value.lag'=c(NA,value[-length(value)])), by=id ]
> Error in abs(j) : Non-numeric argument to mathematical function
> In addition: Warning message:
> In is.na(j) : is.na() applied to non-(list or vector) of type 'NULL'
>>
>
> That should work fine, but doesn't. I'll raise it as a bug ...
>
> So, not bad going for a first post!
>
> Thanks,
> Matthew
>
>
> On Fri, 2011-07-15 at 11:33 +0200, Jelmer Ypma wrote:
>> Dear all,
>>
>> I found out about the data.table package yesterday, and I've been very
>> impressed with the speed so far.
>>
>> I'm currently trying to calculate lagged values of a variable for some
>> panel data set. The approach I take is the one below, but I'm not sure
>> whether that is correct, and whether it is safe to do in this way
>> (e.g. if data is re-ordered).
>>
>> Hopefully, one of you will be able to guide me in the right direction.
>>
>> Many thanks in advance,
>> Jelmer
>>
>> # create some data, with missing periods
>> DT <- data.table( 'id'=c(1,1,1,1,2,2,2), 'time'=c(1,2,4,5,2,3,4),
>> 'value'=c(7,6,5,4,3,2,1) )
>> setkey( DT, id, time )
>>
>> # extract id and time values to construct data.table with keys,
>> including missing periods
>> id.vec <- unique( DT$id )
>> time.vec <- unique( DT$time )
>> num.periods <- max( time.vec ) - min( time.vec ) + 1
>> tmp.keys <- data.table( 'id'=rep( id.vec, each=num.periods ),
>> 'time'=rep( time.vec, length(id.vec) ) )
>> setkey( tmp.keys, id, time )
>>
>> # show data
>> DT # original data
>> DT[ tmp.keys ] # fill in missing
>> periods with NA
>>
>> # shift 'value' column by id to get lagged 'value'
>> DT[ tmp.keys ][,
>> list(time,value,'value.lag'=c(NA,value[-length(value)])), by=id ]
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
More information about the datatable-help
mailing list