[datatable-help] help in joining two data tables

ravi rv15i at yahoo.se
Thu Dec 31 17:20:03 CET 2015


Hi,

I have some trouble understanding the data.table procedure for joining two tables. Let me start with two example data tables:
library(data.table)
############ the first data.table example
mt<-data.table(mtcars)
## some modifications to the data.table
s1<-1:32;s1[seq(2,32,by=2)]<-NA
mt[,"cntrl":=s1];mt[,"cylO":=cyl];mt[,"cyl":=cyl*2]
setkey(mt,gear,carb,cylO,cntrl)
mt
##  More modifications
mt[gear == 3 & carb ==3 & cylO == 8 & mpg == 16.4,cntrl:=14]
str(mt)
mt
############## the second data.table example
nt<-data.table(gear= c(3,3,3),carb=c(1,3,3),cylO=c(4,8,8),price=c(11,44,55),cntrl=c(21,13,14))
setkey(nt,gear,carb,cylO,cntrl)
############# merging as a data frame
rdJoin<-merge.data.frame(mt,nt,by.x=c("gear","carb","cylO","cntrl"),by.y=c("gear","carb","cylO","cntrl"),all.x=TRUE)
str(rdJoin)
rdJoin
############## questions
# What is the data.table command to get rdJoin?
# How is it possible to specify the key variables for the join -see below
# For example, c("gear","carb")      c("gear","carb","cylO")   etc.
# Also, where the variables have different names in the two tables
# For example, if the cntrl variable in the first DT is "cntrl1" and "cntrl2" in the second
Let me elaborate on the questions shown above. First, I would like to start with some general questions:

1. In the documentation for data.table (which includes the vignettes available so far), it is mentioned that it is sufficient if one of the two data tables being considered has keys. This is a bit confusing. The straightforward situation is when both tables have keys. When would it be an advantage to have keys on just one of them? It would be nice if this could be explained in the to-be-released vignette on joins.

2. The merge command in base R is very clear and easy to understand. It would be nice if the data.table procedure were transparent in the same way. To start with, I would like to know how I can do the following things with data.table:

   (i) The data.table equivalent of the base R command
       merge.data.frame(mt,nt,by.x=c("gear","carb","cylO","cntrl"),by.y=c("gear","carb","cylO","cntrl"),all.x=TRUE)

   (ii) How is it possible to choose the join variables from a list such as
       c("gear","carb")
       c("gear","carb","cylO")
       c("gear","carb","cylO","cntrl")
       It is very clear in the merge command how this is done. How is it done with data.table? The on argument can be used for one of the tables. How can it be specified for the other, that is, without having to use the setkey command each time a change is needed?

   (iii) How can this be done if the key variables in the two tables have different names? That is, if the cntrl variable is called "cntrl1" in the first DT and "cntrl2" in the second, for example.
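
For illustration, the three cases in (i) to (iii) might be written roughly as follows. This is only a sketch, assuming a recent data.table (1.9.6 or later, where `on=` and the `by.x`/`by.y` arguments of `merge` are available). The names `keys4`, `mt2`, `nt2` and `j1` to `j3b` are made up for the example, and the tables are deliberately left without keys to show that `setkey` is not required for these calls.

```r
library(data.table)

# Rebuild the two example tables from the post, without calling setkey()
mt <- data.table(mtcars)
s1 <- 1:32
s1[seq(2, 32, by = 2)] <- NA
mt[, cntrl := s1]
mt[, cylO := cyl]
mt[, cyl := cyl * 2]
# 14L (integer) because cntrl is an integer column
mt[gear == 3 & carb == 3 & cylO == 8 & mpg == 16.4, cntrl := 14L]

nt <- data.table(gear = c(3, 3, 3), carb = c(1, 3, 3), cylO = c(4, 8, 8),
                 price = c(11, 44, 55), cntrl = c(21, 13, 14))

# (i) Left join on all four columns
keys4 <- c("gear", "carb", "cylO", "cntrl")
j1  <- merge(mt, nt, by = keys4, all.x = TRUE)  # dispatches to merge.data.table
j1b <- nt[mt, on = keys4]                       # same rows: one per row of mt

# (ii) Join on fewer columns by changing the character vector.
# Non-join columns present in both tables get suffixes (cylO.x, cylO.y, ...).
# allow.cartesian = TRUE is set because some rows of mt match two rows of nt.
j2 <- merge(mt, nt, by = c("gear", "carb"), all.x = TRUE,
            allow.cartesian = TRUE)

# (iii) Join columns with different names in the two tables
mt2 <- copy(mt); setnames(mt2, "cntrl", "cntrl1")
nt2 <- copy(nt); setnames(nt2, "cntrl", "cntrl2")
j3 <- merge(mt2, nt2,
            by.x = c("gear", "carb", "cylO", "cntrl1"),
            by.y = c("gear", "carb", "cylO", "cntrl2"),
            all.x = TRUE)
# With on=, each name is a column of the outer table (nt2) and each value
# is the matching column of the table inside the brackets (mt2)
j3b <- nt2[mt2, on = c(gear = "gear", carb = "carb", cylO = "cylO",
                       cntrl2 = "cntrl1")]
```

In the `nt[mt, on = ...]` form the result keeps every row of `mt` (the table inside the brackets), so it plays the role of `all.x = TRUE` with `mt` as `x`. The column order differs from that of `merge`.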
I have found the data.table package to be very useful. It would be nice if I can understand its use better.
Thanks for any help that I can get.

Ravi



