[datatable-help] J() casts to int?

Johann Hibschman jhibschman+r at gmail.com
Wed Oct 5 17:44:52 CEST 2011


This technique is something I just came up with yesterday.  I grepped
through my code for uses of J(), and I use the implicit conversion to
int all over the place.  However, most of those follow a pattern like
this:

  tmp12 <- mkt.data[J(month + 12), list(value12=value)][, -1, with=FALSE]
  tmp24 <- mkt.data[J(month + 24), list(value24=value)][, -1, with=FALSE]
  mkt.data <- cbind(mkt.data, tmp12, tmp24)
  mkt.data$diff1y <- mkt.data[, value - value12]
  mkt.data$diff2y <- mkt.data[, value - value24]

I want to compute lagged columns and add them back into the table.  I've
never found a way to do this that seems easy and straightforward to me.
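For what it's worth, the lagged-column pattern above can be sketched in base R, with match() standing in for the keyed join (hypothetical example data; the column names here are made up, not from mkt.data):

```r
# Hypothetical monthly data, keyed by an integer month column
mkt <- data.frame(month = 1:24, value = sin(1:24))

# Self-join: for each row, find the row whose month is 12 larger,
# i.e. the value one year ahead (NA where no such month exists)
idx12 <- match(mkt$month + 12L, mkt$month)
mkt$value12 <- mkt$value[idx12]
mkt$diff1y  <- mkt$value - mkt$value12
```

This is the same lookup the J(month + 12) join performs, just without the binary search a data.table key gives you.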

I'm inclined to leave J() as coercing-to-int, but make sure that
behavior is documented in its help text.  If I want to do a "fancy"
merge, I can always write out data.table() explicitly.  That leaves J()
as a simple way to specify join keys, with easily explainable
semantics.  If it gets too clever, it becomes too hard to remember
exactly what it does.

Similarly, I'd rather keep the current behavior (join columns must be
int) than have behavior that differs between small and large amounts
of data.  It's just simpler and keeps with the mental model of "key
columns must be integers".
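The worry about silent coercion is concrete: as.integer() in R truncates toward zero rather than rounding, so a key computed in floating point that is off by a hair lands on the wrong integer. A small base-R illustration, independent of data.table:

```r
# as.integer() truncates toward zero, it does not round
as.integer(2.9)         # 2
as.integer(-2.9)        # -2

# A key computed in floating point can silently miss its target:
as.integer(2.9999999)   # 2, not 3 -- a lookup on this key would miss
```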

Johann

Matthew Dowle <mdowle at mdowle.plus.com> writes:

> Thanks for illustrating so clearly. J() has always cast double columns
> to int (as far as I remember, anyway) for convenience when looking up
> data from the prompt, say, to save having to remember the L suffix on
> typed-in values inside J().  I didn't anticipate this case, where J()
> is deliberately created with more columns than x's key, using join
> inherited scope.  Or rather, I was planning to achieve that output via
> x. and i. prefixes in j (previous thread I think, but it seems no FR
> number).  The way you've done it is kind of a manual way of achieving
> 'i.', where 'i.' corresponds to your 'prev.'. I'm thinking automatic
> i. will still be nice for convenience, but I have to admit I thought
> it wasn't possible at all (at least, not as elegantly in one query).
> Presumably this is most useful with roll=TRUE: age as well as delta.
>
> Back to J() ... inside J() it doesn't know it's being called as the i
> argument, so it doesn't know the length of x's key. Otherwise, a simple
> fix would be for J() to only coerce double columns involved in the join
> to x's key. It should be possible to use parent.frame() inside J() to
> work out where it's being called from and the length of x's key.
>
> Or, perhaps all data.table joins should allow double columns in i to be
> joined to int, with an inefficiency warning if, say, the number of rows
> in i is > 1000, an error/warning if fractional data is truncated, and
> silence otherwise.  Then the coercion to int in J() could be removed
> and it would be more consistent.
>
> Thoughts anyone?
>
> In the meantime I can't think of any other way than using data.table()
> instead of J(), which looks to work and give the right result.
>
> Matthew
>
>
> On Tue, 2011-10-04 at 10:03 -0500, Johann Hibschman wrote:
>> I just noticed that J casts all its arguments to int.  Has this always
>> been the case?  I can't find it documented anywhere.
>> 
>> I came across this while trying to do a self join, like this:
>> 
>>   > tmp <- data.table(date=1:5, value=10*rnorm(5), key="date")
>>   > tmp
>>        date     value
>>   [1,]    1  3.710278
>>   [2,]    2  4.571288
>>   [3,]    3  2.009627
>>   [4,]    4  8.237882
>>   [5,]    5 -9.004814
>>   > with(tmp, J(date, value))
>>        date value
>>   [1,]    1     3
>>   [2,]    2     4
>>   [3,]    3     2
>>   [4,]    4     8
>>   [5,]    5    -9
>>   > tmp[J(date + 2, prev.date=date, prev.value=value),
>>         list(prev.date, value, prev.value, delta=value-prev.value)]
>>        date prev.date     value prev.value       delta
>>   [1,]    3         1  2.009627          3  -0.9903734
>>   [2,]    4         2  8.237882          4   4.2378817
>>   [3,]    5         3 -9.004814          2 -11.0048141
>>   [4,]    6         4        NA          8          NA
>>   [5,]    7         5        NA         -9          NA
>>   > tmp[data.table(date + 2L, prev.date=date, prev.value=value),
>>         list(prev.date, value, prev.value, delta=value-prev.value)]
>>        date prev.date     value prev.value      delta
>>   [1,]    3         1  2.009627   3.710278  -1.700652
>>   [2,]    4         2  8.237882   4.571288   3.666594
>>   [3,]    5         3 -9.004814   2.009627 -11.014441
>>   [4,]    6         4        NA   8.237882         NA
>>   [5,]    7         5        NA  -9.004814         NA
>> 
>> Is this intended?  Using J() let me be sloppy and write "+2" while
>> data.table() made me use "+2L", but then J() clobbered the non-int
>> values.
>> 
>> Is there a better way?
>> 
>> Thanks,
>> Johann
>> 
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
