[datatable-help] merging syntax

Ivan Alves papucho at me.com
Thu Jun 20 18:43:58 CEST 2013


Many thanks to both Eduard and Frank on the issue of the needed key.  One aspect of the merging that is not clear is how to do 'inner' vs. 'outer' 'joins'  (like in SQL).  Whereas it works with merge (using the all.x=TRUE option), how is it done with data.table?  In the improved example below

g_ctpty = data.table(ctpty_head=c("a","b","c","d"), ctpty_cty=c("US","DE","JP","CN"))
g_iss = data.table(iss_head=c("a","b","c","d"), iss_cty=c("US","DE","JP","CN"))
links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a"))
setkey(g_ctpty,ctpty_head)
setkey(g_iss,iss_head)
merge(
  merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"),
  g_iss,
  all.x = TRUE, by = "iss_head"
)
g_iss[g_ctpty[links]] # error
links[g_ctpty][g_iss] # still error
setkey(links,ctpty_head,iss_head) # keys are needed
links[g_ctpty][g_iss] # how to get inner join?

How do it not include the last line in the link? (Again, it works with merge).  Many thanks.

Ivan


On 18 Jun 2013, at 17:34, Eduard Antonyan <eduard.antonyan at gmail.com> wrote:

> Frank has already answered why you're getting the results you are.
> 
> Re "how do I tell DT to do separate matchings at each join?":
> 
> Currently you have to use 'merge' or 'data.table', see this - https://r-forge.r-project.org/tracker/?func=detail&atid=978&aid=4675&group_id=240 and the SO link inside.
> 
> 
> On Tue, Jun 18, 2013 at 10:08 AM, Ivan Alves <papucho at me.com> wrote:
> Hi Frank,
> 
> Many thanks for the thoughts.
> 
> It is something that has to do with the keys, that is for sure (see below). A double DT join does not understand that it has to do one join by one variable and the other by another variable: the output simply has lines where the two keys are the same (ctpty_head==iss_head),which is of course not optimal.
> 
> how do I tell DT to do separate matchings at each join?  Setting setkey(links,ctpty_head,iss_head) before the join does not work either.
> 
> > key(g_ctpty[links])
> NULL
> > key(g_ctpty)
> [1] "ctpty_head"
> > key(g_iss)
> [1] "iss_head"
> > key(links)
> NULL
> 
> Hi Arunkumar,
> 
> An example would look like follows:
> 
> g_ctpty = data.table(ctpty_head=c("a","b","c"), ctpty_cty=c("US","DE","JP"))
> g_iss = data.table(iss_head=c("a","b","c"), iss_cty=c("US","DE","JP"))
> links = data.table(ctpty_head=c("a","b","c"), iss_head=c("b","b","a"))
> 
> merge(
> +     merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"),
> +     g_iss,
> +     all.x = TRUE, by = "iss_head"
> +   )
>    iss_head ctpty_head ctpty_cty iss_cty
> 1:        a          c        JP      US
> 2:        b          a        US      DE
> 3:        b          b        DE      DE
> > g_iss[g_ctpty[links]]
>    iss_head iss_cty ctpty_cty iss_head.1
> 1:        a      US        US          b
> 2:        b      DE        DE          b
> 3:        c      JP        JP          a
> > setkey(links,ctpty_head,iss_head)
> > g_iss[g_ctpty[links]]
>    iss_head iss_cty ctpty_cty iss_head.1
> 1:        a      US        US          b
> 2:        b      DE        DE          b
> 3:        c      JP        JP          a
> 
> On 17 Jun 2013, at 14:36, Frank Erickson <FErickson at psu.edu> wrote:
> 
>> I think that key(g_ctpty[links]) == key(g_ctpty) == "ctpty_head", which is used when you do your second merge with [, instead of "iss_head". You can check this by running key(g_ctpty[links]) . --Frank
>> 
>> 
>> On Mon, Jun 17, 2013 at 1:57 AM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:
>> Since you have the data as well, why not provide it (a small part at least with which your issue is reproducible)? Isn't it much easier than to ask everyone who's willing to help to create a data and test your code?
>> 
>> Arun
>> 
>> On Monday, June 17, 2013 at 8:54 AM, Ivan Alves wrote:
>> 
>>> Dear all,
>>> 
>>> I am not sure I understand the syntax for merging data.tables. I have keyed the two 'satelite' tables from which I want to match information to the main table 'links'
>>> 
>>> g_ctpty <- gultimate[,list(ctpty_head,ctpty_cty)]
>>> setkey(g_ctpty,ctpty_head)
>>> g_iss <- gultimate[,list(iss_head,iss_cty)]
>>> setkey(g_iss,iss_head)
>>> 
>>> Why are the two below not equivalent?
>>> 
>>> This works:
>>> 
>>> data = merge(
>>> merge(links, g_ctpty, all.x = TRUE, by = "ctpty_head"),
>>> g_iss,
>>> all.x = TRUE, by = "iss_head"
>>> ),
>>> 
>>> And this does not:
>>> 
>>> data = g_iss[g_ctpty[links]],
>>> 
>>> Any guidance would be appreciated.
>>> Kind regards,
>>> Ivan
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> 
>> 
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> 
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20130620/0cf96b2d/attachment.html>


More information about the datatable-help mailing list