[datatable-help] Joins in data.table

Matthew Dowle mdowle at mdowle.plus.com
Sun Jul 18 12:18:32 CEST 2010


On question #1 there are quite a few reasons. I'll mention each one in
brief, just let us know if more detail is required.

i) x may be a key only table, perhaps just one column. You need the NA
returned in the key column when nomatch=NA otherwise you wouldn't know
there wasn't a match because there are no non-key columns to hold the NA
value in that case.

ii) You do know what the NA correspond to: the y you passed in. If you
need the correspondence then you can cbind(y,x[y]).  Thats works when
mult="first" which the default. When mult="all" you don't know how many
rows of the result each row of y matched to, so either merge(), or we
have discussed a new inci argument.

iii) However, why join to create and hold some bigger intermediate
table, only then to calculate some aggregate.  I try to do both in one
step:  x[y,<calc>]. This is quite a big feature of the syntax that
allows to do more in less code.

iv) using roll=TRUE, it is normal for the keys not to match when there
is a prevailing join on the last column. You need the value the y joined
_to_ in x, so you can do a diff if needed. You may need to join to the
prevailing row in x as long as it wasn't more than 20 days ago, for
example. NA in the key tells you there was no prevailing row at all.

Yes join inherited scope [i.e. accessing y's columns within the j] is
high on the to do list.  I'm thinking those will likely be accessed with
an "i." prefix because we often use the same column name such as "date"
or "id" in several tables e.g. in use-case iv above j might include a
column 'stale=(date-i.date)>20'.

v) I was going to say consistency with n-dim-array[n-col-matrix] in
base, where the x[y] syntax in data.table is generalised from. However,
when I checked just know it gives "subscript out of bounds" rather than
NA, but I mention it anyway as an aside.

HTH?
Matthew


On Fri, 2010-07-16 at 10:56 -0700, Harish wrote:
> Thanks Tom.  That answers my question #2.
> 
> It looks rather obvious once I see the solution.  :)
> 
> 
> Harish
> 
> 
> --- On Fri, 7/16/10, Short, Tom <TShort at epri.com> wrote:
> 
> > From: Short, Tom <TShort at epri.com>
> > Subject: RE: [datatable-help] Joins in data.table
> > To: "Harish" <harishv_99 at yahoo.com>
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Date: Friday, July 16, 2010, 4:37 AM
> > Harish, I'm not sure on the answer to
> > your first question.
> > 
> > I'm not sure what VLOOKUP does, but based on your
> > description, is this
> > what you want?
> > 
> > > x <- data.table( a=c("a","b","c"), b=1:3, key="a"
> > )
> > > y <- data.table( a=c("b","d","e"), d=c(8,9,10),
> > key="a" ) # key added
> > > 
> > > x$d <- y[x, d]     # or 
> > y[x]$d
> > > x
> >      a b  d
> > [1,] a 1 NA
> > [2,] b 2  8
> > [3,] c 3 NA
> > 
> > 
> > - Tom
> > 
> > 
> >  
> > 
> > > -----Original Message-----
> > > From: datatable-help-bounces at lists.r-forge.r-project.org
> > 
> > > [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> > 
> > > On Behalf Of Harish
> > > Sent: Friday, July 16, 2010 03:04
> > > To: datatable-help at lists.r-forge.r-project.org
> > > Subject: [datatable-help] Joins in data.table
> > > 
> > > Question 1 -- Why get a full row of NAs with
> > x[y]?  Bug?
> > > Question 2 -- What is an easy way to do the equivalent
> > of 
> > > VLOOKUP in Excel?
> > > 
> > > ====================================
> > > 
> > > Question 1 -- Why get a full row of NAs with
> > x[y]?  Bug?
> > > 
> > > I was experimenting with x[y] syntax and I don't
> > understand 
> > > the use case for getting NA with nomatch=NA.
> > > 
> > > x <- data.table( a=c("a","b","c"), b=1:3, key="a" )
> > y <- 
> > > data.table( a=c("b","d","e"), d=c(8,9,10) )
> > > 
> > > x[y] returns:
> > > 
> > >         a  b
> > > [1,]    b  2
> > > [2,] <NA> NA
> > > [3,] <NA> NA
> > > 
> > > 
> > > What is the point of getting an entire row of
> > NA?  I don't 
> > > even know what the NA row corresponds to in data.table
> > y.  If 
> > > I got the values in column "a", then it would make
> > more sense.
> > > 
> > > I would expect:
> > > 
> > >      a  b
> > > [1,] b  2
> > > [2,] d NA
> > > [3,] e NA
> > > 
> > > Am I misunderstanding this join?
> > > 
> > > (I do realize that nomatch=0 will prevent the NA
> > rows.)
> > > 
> > > Also, is there still a plan to allowing access to y's
> > columns 
> > > within the "j" expression as indicated in FAQ 1.10?
> > >    x[y, d]
> > > 
> > > (I did ask this before and I believe Matthew said that
> > it is 
> > > in the works.  I just want to make sure I didn't
> > misunderstand.)
> > > 
> > > Otherwise, the whole x[y] join is only a subset. 
> > (I do 
> > > understand that I can use merge() to do a full
> > merge.)
> > > 
> > > ====================================
> > > 
> > > Question 2 -- What is an easy way to do the equivalent
> > of 
> > > VLOOKUP in Excel?
> > > 
> > > Now, suppose both my data.tables (x and y) had a lot
> > of 
> > > columns in it.  I want to add a new column "d" to
> > x.  The 
> > > value "d" will be from the data.table y such that x$a
> > == y$a.
> > > 
> > > What is the easiest way for me to do it?  Do I
> > have to 
> > > merge() both the data.tables and remove the unwanted
> > columns 
> > > of y from the result?
> > > 
> > > If the feature in the FAQ 1.10 were implemented, could
> > I could do
> > >    x[y, list(.SD,d)]
> > > or something to that effect?
> > > 
> > > 
> > > Regards,
> > > Harish
> > > 
> > > 
> > > 
> > >       
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
> > atatable-help
> > > 
> > 
> 
> 
>       
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list