[datatable-help] calculating lagged values of panel data using data.table

Matthew Dowle mdowle at mdowle.plus.com
Fri Jul 15 20:36:03 CEST 2011


Hi,

Welcome!

Yes, that's safe to assume. Others have asked that in the past, so I've
now added this text to ?data.table (under 'by'):  "The order of the rows
within each group is preserved."

The tmp.keys operation is a common one, so there's a short cut for it.
We call it a Cross Join. If you run the CJ() bit on it's own you'll find
it's the same as tmp.keys :

> DT[CJ(id.vec,time.vec)]
      id time value
 [1,]  1    1     7
 [2,]  1    2     6
 [3,]  1    3    NA
 [4,]  1    4     5
 [5,]  1    5     4
 [6,]  2    1    NA
 [7,]  2    2     3
 [8,]  2    3     2
 [9,]  2    4     1
[10,]  2    5    NA
> 

Rather than filling the gaps with NA, you can roll the prevailing
observation forward :

> DT[CJ(id.vec,time.vec),roll=TRUE]
      id time value
 [1,]  1    1     7
 [2,]  1    2     6
 [3,]  1    3     6
 [4,]  1    4     5
 [5,]  1    5     4
 [6,]  2    1    NA     # correct not to be 4 (new group)
 [7,]  2    2     3
 [8,]  2    3     2
 [9,]  2    4     1
[10,]  2    5     1
> 

and that's useful because, now, instead of lagging in the way you did
it, you can just subtract one day from your dates or times, and find the
prevailing record then. It is subtly different of course, it depends on
what you want and what assumptions you make about the data.

> DT[CJ(id.vec,time.vec-1),roll=TRUE]
      id time value
 [1,]  1    0    NA
 [2,]  1    1     7
 [3,]  1    2     6
 [4,]  1    3     6
 [5,]  1    4     5
 [6,]  2    0    NA
 [7,]  2    1    NA
 [8,]  2    2     3
 [9,]  2    3     2
[10,]  2    4     1
> 

On your final line I didn't see why you chained it into two queries, but
fortunately I tested before posting and it seems there is a bug :

> DT[ tmp.keys,     # removed your "][," here
+ list(time,value,'value.lag'=c(NA,value[-length(value)])), by=id ]
Error in abs(j) : Non-numeric argument to mathematical function
In addition: Warning message:
In is.na(j) : is.na() applied to non-(list or vector) of type 'NULL'
> 

That should work fine, but doesn't. I'll raise it as a bug ...

So, not bad going for a first post!

Thanks,
Matthew


On Fri, 2011-07-15 at 11:33 +0200, Jelmer Ypma wrote:
> Dear all,
> 
> I found out about the data.table package yesterday, and I've been very
> impressed with the speed so far.
> 
> I'm currently trying to calculate lagged values of a variable for some
> panel data set. The approach I take is the one below, but I'm not sure
> whether that is correct, and whether it is safe to do in this way
> (e.g. if data is re-ordered).
> 
> Hopefully, one of you will be able to guide me in the right direction.
> 
> Many thanks in advance,
> Jelmer
> 
> # create some data, with missing periods
> DT <- data.table( 'id'=c(1,1,1,1,2,2,2), 'time'=c(1,2,4,5,2,3,4),
> 'value'=c(7,6,5,4,3,2,1) )
> setkey( DT, id, time )
> 
> # extract id and time values to construct data.table with keys,
> including missing periods
> id.vec <- unique( DT$id )
> time.vec <- unique( DT$time )
> num.periods <- max( time.vec ) - min( time.vec ) + 1
> tmp.keys <- data.table( 'id'=rep( id.vec, each=num.periods ),
> 'time'=rep( time.vec, length(id.vec) ) )
> setkey( tmp.keys, id, time )
> 
> # show data
> DT                                                  # original data
> DT[ tmp.keys ]                                      # fill in missing
> periods with NA
> 
> # shift 'value' column by id to get lagged 'value'
> DT[ tmp.keys ][,
> list(time,value,'value.lag'=c(NA,value[-length(value)])), by=id ]
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list