[datatable-help] J() casts to int?

Matthew Dowle mdowle at mdowle.plus.com
Sat Apr 28 03:43:03 CEST 2012


This one should now be cleared up in 1.8.1. J() was casting the
non-join i columns to int, as well as the i join columns. J() no longer
casts anything, even join columns, since 'double' is now allowed in keys.
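
For anyone finding this thread later, here is a small base-R sketch of
what the old cast did to a non-join column, using the values from
Johann's original report quoted further down. It only mimics the
coercion with as.integer() and is not J()'s actual code; the variable
names are made up for illustration.

```r
# Values of the 'value' column from the original report in this thread
value <- c(3.710278, 4.571288, 2.009627, 8.237882, -9.004814)

# Coercing double to integer truncates toward zero; this is the kind of
# loss that showed up in prev.value when J() cast every column to int
truncated <- as.integer(value)
print(truncated)

# A bare numeric literal is double, so date + 2 is a double vector,
# while the L suffix keeps the arithmetic in integer
date <- 1:5
type_plain  <- typeof(date + 2)
type_suffix <- typeof(date + 2L)
print(c(type_plain, type_suffix))
```

The point of writing 2L rather than 2 is that the sum stays integer, so
nothing needs converting afterwards.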

o Numeric columns (type 'double') are now allowed in keys and ad hoc
  by. J() and SJ() no longer coerce 'double' to 'integer'. i join 
  columns which mismatch on numeric type are coerced silently to match
  the type of x's join column. Other types which use 'double' (such as
  POSIXct and bit64) can now be fully supported. Two floating point
  values are considered equal (by grouping and binary search joins) if 
  their difference is within sqrt(.Machine$double.eps), by default. See 
  example in ?unique.data.table. Completes FRs #951, #1609 and #1075.
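
To give a feel for the tolerance mentioned in that item, a minimal
base-R sketch follows. It assumes the plain reading of the wording
above (absolute difference compared against sqrt(.Machine$double.eps))
and is not data.table's internal comparison; tol, a and b are
illustrative names only.

```r
# Default tolerance named in the NEWS item, roughly 1.5e-8
tol <- sqrt(.Machine$double.eps)

# Two values that differ only by floating point rounding error
a <- 0.1 + 0.2
b <- 0.3

exact_equal <- (a == b)           # exact comparison of doubles
within_tol  <- abs(a - b) < tol   # comparison within the tolerance

# A difference of about 1e-6 is far larger than the tolerance
far_apart_equal <- abs(1 - (1 + 1e-6)) < tol

print(c(exact_equal, within_tol, far_apart_equal))
```

Under that reading, a and b would fall in the same group or match in a
join even though == says they differ.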


On Wed, 2011-10-05 at 20:38 +0100, Matthew Dowle wrote:
> On Wed, 2011-10-05 at 10:44 -0500, Johann Hibschman wrote:
> > This technique is something I just came up with yesterday.  I grepped
> > through my code for uses of J(), and I use the implicit conversion to
> > int all over the place.  
> Then that sounds inefficient. Convenient, but inefficient. Every
> conversion is a memory allocation and copy. If that happens a lot,
> especially in self joins (e.g. month+12 rather than month+12L) where the
> conversion is happening on a vector as long as the (large) table, then
> changing everywhere to 12L might speed things up a lot. The month+12
> happens first to create a long vector of doubles, then that vector is
> coerced to integer (a copy). Even a single number (a vector of length 1)
> converted from double to int many times inside a loop will create many
> copies of small objects, and that can be bad too.
> 
> > However, most of those follow a pattern like
> > this:
> > 
> >   tmp12 <- mkt.data[J(month + 12), list(value12=value)][, -1, with=FALSE]
> >   tmp24 <- mkt.data[J(month + 24), list(value24=value)][, -1, with=FALSE]
> >   mkt.data <- cbind(mkt.data, tmp12, tmp24)
> >   mkt.data$diff1y <- mkt.data[, value - value12]
> >   mkt.data$diff2y <- mkt.data[, value - value24]
> > 
> > I want to compute lagged columns and add them back into the table.  I've
> > never found a way to do this that seems easy and straightforward to me.
> 
> This should be faster and shorter (easier to read and slightly less
> error prone) :
> 
> mkt.data[,value12:=mkt.data[J(month + 12L), value][[2]]]
> mkt.data[,value24:=mkt.data[J(month + 24L), value][[2]]]
> mkt.data[,diff1y:=value-value12]
> mkt.data[,diff2y:=value-value24]
> 
> and when FR#1492 (multiple := in j) is implemented it might be :
> 
> mkt.data[,{ value12:=mkt.data[J(month + 12L), value][[2]]
>             value24:=mkt.data[J(month + 24L), value][[2]]
>             diff1y:=value-value12
>             diff2y:=value-value24 }]
> 
> > 
> > I'm inclined to leave J() as coercing-to-int, but make sure that
> > behavior is documented in its help text.  If I want to do a "fancy"
> > merge, I can always write out data.table.  That leaves J as a simple way
> > to specify join keys, with easily-explainable semantics.  If it gets too
> > clever, it becomes too hard to remember what exactly it does.
> Agreed. Something needs improving somewhere though, I'm thinking, to
> encourage you (or at least inform you) not to rely on J()'s auto
> conversion, but to use 12L.
> 
> > 
> > Similarly, I'd rather have the current behaviour (join cols must be int),
> > rather than have a difference between what happens with a small amount
> > of data and with a large amount of data.
> Good point, ok. A difference for small amount of data is out then.
> 
> > It's just simpler and keeps
> > with the mental model of "key columns must be integers".
> > 
> > Johann
> > 
> > Matthew Dowle <mdowle at mdowle.plus.com> writes:
> > 
> > > Thanks for illustrating so clearly. J() has always cast double columns
> > > to int (as far as I remember anyway) for convenience when looking up
> > > data from the prompt, say, to save having to remember L on typed in
> > > values inside J().  This case, where J() is deliberately created with
> > > more columns than x's key, using join inherited scope, I didn't
> > > anticipate.  Or rather, was planning to achieve that output via x. and
> > > i. prefixes in j (previous thread I think but it seems no FR number).
> > > The way you've done it is kind of a manual way of achieving 'i.', where
> > > 'i.' corresponds to your 'prev.'. I'm thinking automatic i. will still
> > > be nice for convenience, but I have to admit I thought it wasn't
> > > possible at all (at least, as elegantly in one query). Presumably this
> > > is most useful with roll=TRUE : age as well as delta.
> > >
> > > Back to J() ... inside J() it doesn't know it's being called as the i
> > > argument, so it doesn't know the length of x's key. Otherwise, simple
> > > fix would be for J() to only coerce double columns involved in the join
> > > to x's key. Should be possible to use parent.frame() inside J() to work
> > > out where it's being called from and the length of x's key. 
> > >
> > > Or, perhaps all data.table joins should allow double columns in i to be
> > > joined to int, with inefficiency warning if say the number of rows in i
> > > is > 1000,  error/warning if fractional data is truncated, and silently
> > > otherwise.  Then the coercion to int in J() could be removed and it's
> > > more consistent.
> > >
> > > Thoughts anyone?
> > >
> > > In the meantime I can't think of any other way than using data.table()
> > > instead of J(), which looks to work and give the right result.
> > >
> > > Matthew
> > >
> > >
> > > On Tue, 2011-10-04 at 10:03 -0500, Johann Hibschman wrote:
> > >> I just noticed that J casts all its arguments to int.  Has this always
> > >> been the case?  I can't find it documented anywhere.
> > >> 
> > >> I came across this while trying to do a self join, like this:
> > >> 
> > >>   > tmp <- data.table(date=1:5, value=10*rnorm(5), key="date")
> > >>   > tmp
> > >>        date     value
> > >>   [1,]    1  3.710278
> > >>   [2,]    2  4.571288
> > >>   [3,]    3  2.009627
> > >>   [4,]    4  8.237882
> > >>   [5,]    5 -9.004814
> > >>   > with(tmp, J(date, value))
> > >>        date value
> > >>   [1,]    1     3
> > >>   [2,]    2     4
> > >>   [3,]    3     2
> > >>   [4,]    4     8
> > >>   [5,]    5    -9
> > >>   > tmp[J(date + 2, prev.date=date, prev.value=value),
> > >>         list(prev.date, value, prev.value, delta=value-prev.value)]
> > >>        date prev.date     value prev.value       delta
> > >>   [1,]    3         1  2.009627          3  -0.9903734
> > >>   [2,]    4         2  8.237882          4   4.2378817
> > >>   [3,]    5         3 -9.004814          2 -11.0048141
> > >>   [4,]    6         4        NA          8          NA
> > >>   [5,]    7         5        NA         -9          NA
> > >>   > tmp[data.table(date + 2L, prev.date=date, prev.value=value),
> > >>         list(prev.date, value, prev.value, delta=value-prev.value)]
> > >>        date prev.date     value prev.value      delta
> > >>   [1,]    3         1  2.009627   3.710278  -1.700652
> > >>   [2,]    4         2  8.237882   4.571288   3.666594
> > >>   [3,]    5         3 -9.004814   2.009627 -11.014441
> > >>   [4,]    6         4        NA   8.237882         NA
> > >>   [5,]    7         5        NA  -9.004814         NA
> > >> 
> > >> Is this intended?  Using J let me be sloppy and do "+2" while data.table
> > >> made me use "+2L", but then it clobbered the non-int values.
> > >> 
> > >> Is there a better way?
> > >> 
> > >> Thanks,
> > >> Johann
> > >> 
> > >> _______________________________________________
> > >> datatable-help mailing list
> > >> datatable-help at lists.r-forge.r-project.org
> > >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help