[datatable-help] J() casts to int?

Wed Oct 5 21:38:20 CEST 2011

On Wed, 2011-10-05 at 10:44 -0500, Johann Hibschman wrote:
> This technique is something I just came up with yesterday.  I grepped
> through my code for uses of J(), and I use the implicit conversion to
> int all over the place.  
Then that sounds inefficient. Convenient, but inefficient. Every
conversion is a memory allocation and a copy. If that happens a lot,
especially in self joins (e.g. month+12 rather than month+12L), where the
conversion happens on a vector as long as the (large) table, then
changing to 12L everywhere might speed things up a lot. The month+12
happens first and creates a long vector of type double; that vector is
then coerced to integer (a copy). Even a single number (a vector of
length 1) converted from double to int many times inside a loop, say,
will create many copies of small objects, and that can be bad too.
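
To illustrate the type point, here is a minimal base R sketch. The
vector name month is made up for the example (it is not your table's
column), and nothing here depends on data.table:

```r
month <- 1:5                      # integer vector, like an integer key column

typeof(month + 12)                # "double": 12 is a double literal
typeof(month + 12L)               # "integer": 12L is an integer literal

x <- month + 12                   # first allocation: a double vector
y <- as.integer(x)                # second allocation: an integer copy of x
identical(y, month + 12L)         # TRUE: same values, but one step fewer
```

With the L suffix the result is integer from the start, so there is
nothing left to coerce.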

> However, most of those follow a pattern like
> this:
> 
>   tmp12 <- mkt.data[J(month + 12), list(value12=value)][, -1, with=FALSE]
>   tmp24 <- mkt.data[J(month + 24), list(value24=value)][, -1, with=FALSE]
>   mkt.data <- cbind(mkt.data, tmp12, tmp24)
>   mkt.data$diff1y <- mkt.data[, value - value12]
>   mkt.data$diff2y <- mkt.data[, value - value24]
> 
> I want to compute lagged columns and add them back into the table.  I've
> never found a way to do this that seems easy and straightforward to me.

This should be faster and shorter (easier to read and slightly less
error prone) :

mkt.data[,value12:=mkt.data[J(month + 12L), value][[2]]]
mkt.data[,value24:=mkt.data[J(month + 24L), value][[2]]]
mkt.data[,diff1y:=value-value12]
mkt.data[,diff2y:=value-value24]

and when FR#1492 (multiple := in j) is implemented it might be :

mkt.data[,{ value12:=mkt.data[J(month + 12L), value][[2]]
            value24:=mkt.data[J(month + 24L), value][[2]]
            diff1y:=value-value12
            diff2y:=value-value24 }]

> 
> I'm inclined to leave J() as coercing-to-int, but make sure that
> behavior is documented in its help text.  If I want to do a "fancy"
> merge, I can always write out data.table.  That leaves J as a simple way
> to specify join keys, with easily-explainable semantics.  If it gets too
> clever, it becomes too hard to remember what exactly it does.
Agreed. Something needs improving somewhere though, I'm thinking, to
encourage you (or at least inform you) not to rely on J()'s auto
conversion, but to use 12L instead.
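
As a rough idea of what "inform you" could look like, here is a sketch
in base R. The function as_join_int is hypothetical (it is not part of
data.table and not how J() works); it only shows a conversion that
warns about the copy and refuses to truncate fractional values:

```r
# Hypothetical helper, NOT part of data.table: convert a would-be join
# column to integer, but complain rather than truncate silently.
as_join_int <- function(x) {
  if (is.integer(x)) return(x)                 # already integer: no copy
  if (!is.double(x)) stop("expected an integer or double vector")
  fractional <- !is.na(x) & x != trunc(x)      # values that would lose data
  if (any(fractional)) stop("fractional values would be truncated")
  warning("coercing double to integer; write 12L rather than 12 to avoid the copy")
  as.integer(x)
}

as_join_int(1:5 + 2L)                          # integer in, returned as-is
suppressWarnings(as_join_int(1:5 + 2))         # double in, converted (warns)
as.integer(c(3.710278, -9.004814))             # plain coercion truncates toward zero
```

The last line is the silent truncation from Johann's example, where
3.710278 became 3 and -9.004814 became -9.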

> 
> Similarly, I'd rather have the current behaviour (join cols must be int),
> rather than have a difference between what happens with a small amount
> of data and with a large amount of data.
Good point, ok. A difference in behaviour for small amounts of data is
out then.

> It's just simpler and keeps
> with the mental model of "key columns must be integers".
> 
> Johann
> 
> Matthew Dowle <mdowle at mdowle.plus.com> writes:
> 
> > Thanks for illustrating so clearly. J() has always cast double columns
> > to int (as far as I remember anyway) for convenience when looking up
> > data from the prompt, say, to save having to remember L on typed in
> > values inside J().  This case, where J() is deliberately created with
> > more columns than x's key, using join inherited scope, I didn't
> > anticipate.  Or rather, was planning to achieve that output via x. and
> > i. prefixes in j (previous thread I think but it seems no FR number).
> > The way you've done it is kind of a manual way of achieving 'i.', where
> > 'i.' corresponds to your 'prev.'. I'm thinking automatic i. will still
> > be nice for convenience, but I have to admit I thought it wasn't
> > possible at all (at least, as elegantly in one query). Presumably this
> > is most useful with roll=TRUE : age as well as delta.
> >
> > Back to J() ... inside J() it doesn't know it's being called as the i
> > argument, so it doesn't know the length of x's key. Otherwise, simple
> > fix would be for J() to only coerce double columns involved in the join
> > to x's key. Should be possible to use parent.frame() inside J() to work
> > out where it's being called from and the length of x's key. 
> >
> > Or, perhaps all data.table joins should allow double columns in i to be
> > joined to int, with inefficiency warning if say the number of rows in i
> > is > 1000,  error/warning if fractional data is truncated, and silently
> > otherwise.  Then the coercion to int in J() could be removed and it's
> > more consistent.
> >
> > Thoughts anyone?
> >
> > In the meantime I can't think of any other way than using data.table()
> > instead of J(), which looks to work and give the right result.
> >
> > Matthew
> >
> >
> > On Tue, 2011-10-04 at 10:03 -0500, Johann Hibschman wrote:
> >> I just noticed that J casts all its arguments to int.  Has this always
> >> been the case?  I can't find it documented anywhere.
> >> 
> >> I came across this while trying to do a self join, like this:
> >> 
> >>   > tmp <- data.table(date=1:5, value=10*rnorm(5), key="date")
> >>   > tmp
> >>        date     value
> >>   [1,]    1  3.710278
> >>   [2,]    2  4.571288
> >>   [3,]    3  2.009627
> >>   [4,]    4  8.237882
> >>   [5,]    5 -9.004814
> >>   > with(tmp, J(date, value))
> >>        date value
> >>   [1,]    1     3
> >>   [2,]    2     4
> >>   [3,]    3     2
> >>   [4,]    4     8
> >>   [5,]    5    -9
> >>   > tmp[J(date + 2, prev.date=date, prev.value=value),
> >>         list(prev.date, value, prev.value, delta=value-prev.value)]
> >>        date prev.date     value prev.value       delta
> >>   [1,]    3         1  2.009627          3  -0.9903734
> >>   [2,]    4         2  8.237882          4   4.2378817
> >>   [3,]    5         3 -9.004814          2 -11.0048141
> >>   [4,]    6         4        NA          8          NA
> >>   [5,]    7         5        NA         -9          NA
> >>   > tmp[data.table(date + 2L, prev.date=date, prev.value=value),
> >>         list(prev.date, value, prev.value, delta=value-prev.value)]
> >>        date prev.date     value prev.value      delta
> >>   [1,]    3         1  2.009627   3.710278  -1.700652
> >>   [2,]    4         2  8.237882   4.571288   3.666594
> >>   [3,]    5         3 -9.004814   2.009627 -11.014441
> >>   [4,]    6         4        NA   8.237882         NA
> >>   [5,]    7         5        NA  -9.004814         NA
> >> 
> >> Is this intended?  Using J let me be sloppy and do "+2" while data.table
> >> made me use "+2L", but then it clobbered the non-int values.
> >> 
> >> Is there a better way?
> >> 
> >> Thanks,
> >> Johann
> >> 
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 