[datatable-help] Joins in data.table

Matthew Dowle mdowle at mdowle.plus.com
Wed Feb 2 22:28:03 CET 2011


Hi Harish,

You may remember this thread from a while back.

Now that join inherited scope is back on and X[Y]
includes Y's non-join columns, I revisited your example.

It now works as you originally thought it should :

> x <- data.table( a=c("a","b","c"), b=1:3, key="a" )
> y <- data.table( a=c("b","d","e"), d=c(8,9,10) )
> x
     a b
[1,] a 1
[2,] b 2
[3,] c 3
> y
     a  d
[1,] b  8
[2,] d  9
[3,] e 10
> x[y]
     a  b  d
[1,] b  2  8
[2,] d NA  9   # used to be a full row of NA
[3,] e NA 10   # i's non join column (d) is included now
> 

We'll now need to add a prefix to the x variables so the
value joined to can be accessed. In this case 'x.a' could
be used in j and would contain the NA's matched to. This
is useful for key only tables which have no non-key columns
to indicate the no match, and also rolling joins where it's
useful to calculate the staleness of data being rolled forward.

Since those example are more unusual, this seems the 'right'
way round now in terms of syntax; closer to what most people
naturally expect.

Feedback much appreciated.

Matthew


On Sun, 2010-07-18 at 11:18 +0100, Matthew Dowle wrote: 
> On question #1 there are quite a few reasons. I'll mention each one in
> brief, just let us know if more detail is required.
> 
> i) x may be a key only table, perhaps just one column. You need the NA
> returned in the key column when nomatch=NA otherwise you wouldn't know
> there wasn't a match because there are no non-key columns to hold the NA
> value in that case.
> 
> ii) You do know what the NA correspond to: the y you passed in. If you
> need the correspondence then you can cbind(y,x[y]).  Thats works when
> mult="first" which the default. When mult="all" you don't know how many
> rows of the result each row of y matched to, so either merge(), or we
> have discussed a new inci argument.
> 
> iii) However, why join to create and hold some bigger intermediate
> table, only then to calculate some aggregate.  I try to do both in one
> step:  x[y,<calc>]. This is quite a big feature of the syntax that
> allows to do more in less code.
> 
> iv) using roll=TRUE, it is normal for the keys not to match when there
> is a prevailing join on the last column. You need the value the y joined
> _to_ in x, so you can do a diff if needed. You may need to join to the
> prevailing row in x as long as it wasn't more than 20 days ago, for
> example. NA in the key tells you there was no prevailing row at all.
> 
> Yes join inherited scope [i.e. accessing y's columns within the j] is
> high on the to do list.  I'm thinking those will likely be accessed with
> an "i." prefix because we often use the same column name such as "date"
> or "id" in several tables e.g. in use-case iv above j might include a
> column 'stale=(date-i.date)>20'.
> 
> v) I was going to say consistency with n-dim-array[n-col-matrix] in
> base, where the x[y] syntax in data.table is generalised from. However,
> when I checked just know it gives "subscript out of bounds" rather than
> NA, but I mention it anyway as an aside.
> 
> HTH?
> Matthew
> 
> 
> On Fri, 2010-07-16 at 10:56 -0700, Harish wrote:
> > Thanks Tom.  That answers my question #2.
> > 
> > It looks rather obvious once I see the solution.  :)
> > 
> > 
> > Harish
> > 
> > 
> > --- On Fri, 7/16/10, Short, Tom <TShort at epri.com> wrote:
> > 
> > > From: Short, Tom <TShort at epri.com>
> > > Subject: RE: [datatable-help] Joins in data.table
> > > To: "Harish" <harishv_99 at yahoo.com>
> > > Cc: datatable-help at lists.r-forge.r-project.org
> > > Date: Friday, July 16, 2010, 4:37 AM
> > > Harish, I'm not sure on the answer to
> > > your first question.
> > > 
> > > I'm not sure what VLOOKUP does, but based on your
> > > description, is this
> > > what you want?
> > > 
> > > > x <- data.table( a=c("a","b","c"), b=1:3, key="a"
> > > )
> > > > y <- data.table( a=c("b","d","e"), d=c(8,9,10),
> > > key="a" ) # key added
> > > > 
> > > > x$d <- y[x, d]     # or 
> > > y[x]$d
> > > > x
> > >      a b  d
> > > [1,] a 1 NA
> > > [2,] b 2  8
> > > [3,] c 3 NA
> > > 
> > > 
> > > - Tom
> > > 
> > > 
> > >  
> > > 
> > > > -----Original Message-----
> > > > From: datatable-help-bounces at lists.r-forge.r-project.org
> > > 
> > > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > > 
> > > > On Behalf Of Harish
> > > > Sent: Friday, July 16, 2010 03:04
> > > > To: datatable-help at lists.r-forge.r-project.org
> > > > Subject: [datatable-help] Joins in data.table
> > > > 
> > > > Question 1 -- Why get a full row of NAs with
> > > x[y]?  Bug?
> > > > Question 2 -- What is an easy way to do the equivalent
> > > of 
> > > > VLOOKUP in Excel?
> > > > 
> > > > ====================================
> > > > 
> > > > Question 1 -- Why get a full row of NAs with
> > > x[y]?  Bug?
> > > > 
> > > > I was experimenting with x[y] syntax and I don't
> > > understand 
> > > > the use case for getting NA with nomatch=NA.
> > > > 
> > > > x <- data.table( a=c("a","b","c"), b=1:3, key="a" )
> > > y <- 
> > > > data.table( a=c("b","d","e"), d=c(8,9,10) )
> > > > 
> > > > x[y] returns:
> > > > 
> > > >         a  b
> > > > [1,]    b  2
> > > > [2,] <NA> NA
> > > > [3,] <NA> NA
> > > > 
> > > > 
> > > > What is the point of getting an entire row of
> > > NA?  I don't 
> > > > even know what the NA row corresponds to in data.table
> > > y.  If 
> > > > I got the values in column "a", then it would make
> > > more sense.
> > > > 
> > > > I would expect:
> > > > 
> > > >      a  b
> > > > [1,] b  2
> > > > [2,] d NA
> > > > [3,] e NA
> > > > 
> > > > Am I misunderstanding this join?
> > > > 
> > > > (I do realize that nomatch=0 will prevent the NA
> > > rows.)
> > > > 
> > > > Also, is there still a plan to allowing access to y's
> > > columns 
> > > > within the "j" expression as indicated in FAQ 1.10?
> > > >    x[y, d]
> > > > 
> > > > (I did ask this before and I believe Matthew said that
> > > it is 
> > > > in the works.  I just want to make sure I didn't
> > > misunderstand.)
> > > > 
> > > > Otherwise, the whole x[y] join is only a subset. 
> > > (I do 
> > > > understand that I can use merge() to do a full
> > > merge.)
> > > > 
> > > > ====================================
> > > > 
> > > > Question 2 -- What is an easy way to do the equivalent
> > > of 
> > > > VLOOKUP in Excel?
> > > > 
> > > > Now, suppose both my data.tables (x and y) had a lot
> > > of 
> > > > columns in it.  I want to add a new column "d" to
> > > x.  The 
> > > > value "d" will be from the data.table y such that x$a
> > > == y$a.
> > > > 
> > > > What is the easiest way for me to do it?  Do I
> > > have to 
> > > > merge() both the data.tables and remove the unwanted
> > > columns 
> > > > of y from the result?
> > > > 
> > > > If the feature in the FAQ 1.10 were implemented, could
> > > I could do
> > > >    x[y, list(.SD,d)]
> > > > or something to that effect?
> > > > 
> > > > 
> > > > Regards,
> > > > Harish
> > > > 
> > > > 
> > > > 
> > > >       
> > > > _______________________________________________
> > > > datatable-help mailing list
> > > > datatable-help at lists.r-forge.r-project.org
> > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
> > > atatable-help
> > > > 
> > > 
> > 
> > 
> >       
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help





More information about the datatable-help mailing list