[datatable-help] Merging where key is large

Matthew Dowle mdowle at mdowle.plus.com
Thu Jul 1 01:39:05 CEST 2010


No problem. Both do outer joins. nomatch=NA does a 'left outer
join' (from i to x), while merge does a 'full outer join' both ways. FAQ
1.10 has some more details explaining the background here, does it help
or make sense? That FAQ is quite new so it would be good to have
feedback on that.

But anyway thats an aside.

There are several ways to do this. One would be a cross-join i.e.
i=CJ(people,area). That would add in the NA, creating a larger long
format,  which could then be unflattened easily by taking the score
vector and giving it 2 dimensional attributes. Thats method 2 below.

Another way could be grouping the long format using by. Thats method 1
below. I can paste this into my R session and it gives me example data
to work with. I have an input. Then it produces an output. Is this what
you need? It can't be, because there is no year dimension. If its along
the right lines though, perhaps you could modify this or it gives enough
hints, a starting point.

# Create dummy input data
PEOPLE = 1:10  # 10 people ids
AREA = letters[1:3]   # 3 content areas
Y = rnorm(length(PEOPLE))
dt = data.table(CJ(p=PEOPLE,a=AREA),score=sample(100,30,replace=TRUE))
dt = dt[sort(sample(30,20))]  # some people don't have scores in all
content areas so remove some rows

# Method 1 
wide = dt[,as.list(score[match(AREA,a)]),by=p]
colnames(wide)[-1] = AREA
wide$y = Y
lm(y~a+b+c,wide)

# Method 2
setkey(dt,p,a)
wide = matrix( dt[CJ(PEOPLE,AREA),score] ,ncol=length(AREA),byrow=TRUE)
colnames(wide) = AREA
wide = data.table(wide,y=Y)
lm(y~a+b+c,wide)

Hope that helps,

Matthew


On Wed, 2010-06-30 at 14:39 -0500, Damian Betebenner wrote:
> Thanks,
> 
> The original data.table from which everything starts is in long format with each case representing a unique person by content area by year score. With, for example, over a million persons in three content areas across 6 years, the long file can have millions of rows. The analyses are done by content area and require the data to be in wide format where each year's score becomes a separate variable. That was what my contrived example was trying to motivate. 
> 
> The reason for going from this long format to a wide format is because of the need to do regression analyses which require the data to be lined up neatly in wide format where each case represents a unique person and the variables are broken out by year. Because there might be missing data for individual in a given year, the conversion from long to wide format adds in appropriate NAs where they belong. The function that gets applied to the wide data could be something as simple as lm(y ~ x1 + x2 + x3).
> 
> What I want is to get merge(x, y, all=TRUE). This is what I understood was an "outer join". However, this is not the same thing one gets with x[y, nomatch=NA]
> 
> For example, consider the following two data.tables:
> 
> > dt1_1
>       id       y1
>  [1,]  1 -1.22404
>  [2,]  2  1.39052
>  [3,]  3  0.12374
>  [4,]  4 -0.19946
>  [5,]  5  0.85520
>  [6,]  6  0.80668
>  [7,]  7  1.75842
>  [8,]  8 -0.20672
>  [9,]  9 -0.70667
> [10,] 10 -0.22142
> 
> > dt1_2
>       id y2
>  [1,]  6  a
>  [2,]  7  b
>  [3,]  8  c
>  [4,]  9  d
>  [5,] 10  e
>  [6,] 11  f
>  [7,] 12  g
>  [8,] 13  h
>  [9,] 14  i
> [10,] 15  j
> 
> > dt1_1[dt1_2, nomatch=NA]
>       id       y1
>  [1,]  6  0.80668
>  [2,]  7  1.75842
>  [3,]  8 -0.20672
>  [4,]  9 -0.70667
>  [5,] 10 -0.22142
>  [6,] NA       NA
>  [7,] NA       NA
>  [8,] NA       NA
>  [9,] NA       NA
> [10,] NA       NA
> 
> > merge(dt1_1, dt1_2, all=TRUE)
>       id       y1   y2
>  [1,]  1 -1.22404 <NA>
>  [2,]  2  1.39052 <NA>
>  [3,]  3  0.12374 <NA>
>  [4,]  4 -0.19946 <NA>
>  [5,]  5  0.85520 <NA>
>  [6,]  6  0.80668    a
>  [7,]  7  1.75842    b
>  [8,]  8 -0.20672    c
>  [9,]  9 -0.70667    d
> [10,] 10 -0.22142    e
> [11,] 11       NA    f
> [12,] 12       NA    g
> [13,] 13       NA    h
> [14,] 14       NA    i
> [15,] 15       NA    j
> 
> 
> Thanks for your patience and help,
> 
> Damian
> 
> 
> 
> -----Original Message-----
> From: mdowle at mdowle.plus.com [mailto:mdowle at mdowle.plus.com] 
> Sent: Wednesday, June 30, 2010 10:39 AM
> To: Damian Betebenner
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: RE: [datatable-help] Merging where key is large
> 
> In the random_ids, return factor to make things simpler later :
> 
>     factor(sample(100000:999999, num_ids, replace=FALSE)*10000)
> 
> Then setting keys should be simpler in one step.
> 
> Multiple tables can be joined using :
> 
>     dt1[dt2][dt3][dt4]
> 
> I don't know if that helps - depends on the bigger picture i.e. once you
> merged them all then you need to do something, whats the something? You
> can put j and by in each bit too.
> 
> Or how about something like :
> 
>     onedt[,as.list(table(...)),by=list(person,year)][,lm(...)]
> 
> Do you have a known number of tables in advance, like a database join of
> multiple tables, or do you need to join a variable number of tables ?
> 
> But like Tom says, why are you splitting by year into different tables,
> can't you have one table ?  It may help us to know a bit more about the
> bigger picture, perhaps via a small example of the full calculation.
> 
> 
> > Thanks Tom,
> >
> > The data actually starts out in long form and is being reshaped to wide
> > for regression analyses with independent and dependent variables in the
> > same row.
> > Because I know the exact format of the data, I wanted to construct my own
> > long-to-wide converter.
> >
> > I'm sure I'm making this too hard somehow. But that's my sticking point
> > right now.
> >
> > Thanks,
> >
> > Damian
> >
> >
> > -----Original Message-----
> > From: Short, Tom [mailto:TShort at epri.com]
> > Sent: Wednesday, June 30, 2010 8:58 AM
> > To: Damian Betebenner; mdowle at mdowle.plus.com
> > Cc: datatable-help at lists.r-forge.r-project.org
> > Subject: RE: [datatable-help] Merging where key is large
> >
> > Your my.merge.2 is trying to merge lists. For that, merge reverts to
> > merge.default defined as:
> >
> > function (x, y, ...)
> > merge(as.data.frame(x), as.data.frame(y), ...)
> >
> > That's why you end up with data.frames and why they both work. I'm not
> > sure why my.merge.2 is faster than my.merge.1 when using factors.
> >
> > For this problem, you may want to investigate rbinding everything together
> > (stacking) and using a long format instead of a wide format with the
> > following columns:
> >
> > ID, SCORE, YEAR
> >
> > - Tom
> >
> >
> >
> >
> >
> >> -----Original Message-----
> >> From: datatable-help-bounces at lists.r-forge.r-project.org
> >> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> >> On Behalf Of Damian Betebenner
> >> Sent: Wednesday, June 30, 2010 08:25
> >> To: mdowle at mdowle.plus.com
> >> Cc: datatable-help at lists.r-forge.r-project.org
> >> Subject: Re: [datatable-help] Merging where key is large
> >>
> >> Hi Matt,
> >>
> >> There are possibly millions of persons.
> >>
> >> This question stems from the desire to do fast joins across
> >> multiple tables in the style of merge(x, y, all=TRUE). Later,
> >> aggregations across many grouping variables will be done
> >> which data.table will greatly speed up as well.
> >>
> >> Consider the following example:
> >>
> >> # function to calculate random 10 digit ids
> >>
> >> random_ids <- function(num_ids) {
> >>      if (num_ids > 900000) stop("Please select fewer than
> >> 900,000 ids")
> >>      sample(100000:999999, num_ids, replace=FALSE)*10000 }
> >>
> >>
> >> # Create 4 overlapping data tables that I want to merge using the ID.
> >>
> >> my_ids <- random_ids(10000)
> >>
> >> dt_1 <- data.table(ID=my_ids[1:4000], SCORE_2006=rnorm(4000,
> >> mean=50, sd=10))
> >> dt_2 <- data.table(ID=my_ids[2001:6000],
> >> SCORE_2007=rnorm(4000, mean=50, sd=10))
> >> dt_3 <- data.table(ID=my_ids[4001:8000],
> >> SCORE_2008=rnorm(4000, mean=50, sd=10))
> >> dt_4 <- data.table(ID=my_ids[6001:10000],
> >> SCORE_2009=rnorm(4000, mean=50, sd=10))
> >>
> >>
> >>
> >>
> >> # Create list whose elements are the four data.tables defined above
> >>
> >> my.list <- vector("list", 4)
> >>
> >> for (i in 1:4) {
> >>    my.list[[i]] <- get(paste("dt_", i, sep="")) }
> >>
> >>
> >> # Want to merge files "elegantly" and FAST so create
> >> functions to do this (these functions return different
> >> results which will show up later)
> >>
> >> my.merge1 <- function(data) {
> >>     if (length(data) == 2) {
> >>         merge(data[[1]], data[[2]], all = TRUE)
> >>     }
> >>     else {
> >>         merge(data[[1]], Recall(data[-1]), all = TRUE)
> >>     }
> >> }
> >>
> >> my.merge2 <- function(data) {
> >>     if (length(data) == 2) {
> >>         merge(data[1], data[2], all = TRUE)
> >>     }
> >>     else {
> >>         merge(data[1], Recall(data[-1]), all = TRUE)
> >>     }
> >> }
> >>
> >>
> >> # Now merge together
> >>
> >> # without a key set
> >>
> >> system.time(my.merge.nokey.1 <- my.merge1(my.list))  ###
> >> DOESN'T WORK AND NOT SURE WHY
> >> system.time(my.merge.nokey.2 <- my.merge2(my.list))  ###
> >> WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????
> >>
> >>
> >> # Now trying to set a key
> >>
> >> setkey(my.list[[1]], "ID")
> >> Error in setkey(my.list[[1]], "ID") :
> >>   All keyed columns must be storage mode integer
> >>
> >> # If I convert the ID variable to integer, because some of
> >> the integers are bigger than the 32 integer limit, they are
> >> converted to NAs
> >>
> >> # If we work with the ID as a factor.
> >>
> >> dt_1f <- data.table(ID=as.factor(my_ids[1:4000]),
> >> SCORE_2006=rnorm(4000, mean=50, sd=10), key="ID") dt_2f <-
> >> data.table(ID=as.factor(my_ids[2001:6000]),
> >> SCORE_2007=rnorm(4000, mean=50, sd=10), key="ID") dt_3f <-
> >> data.table(ID=as.factor(my_ids[4001:8000]),
> >> SCORE_2008=rnorm(4000, mean=50, sd=10), key="ID") dt_4f <-
> >> data.table(ID=as.factor(my_ids[6001:10000]),
> >> SCORE_2009=rnorm(4000, mean=50, sd=10), key="ID")
> >>
> >>
> >> my.list.f <- vector("list", 4)
> >>
> >>
> >> for (i in 1:4) {
> >>    my.list.f[[i]] <- get(paste("dt_", i, "f", sep="")) }
> >>
> >>
> >> system.time(my.merge.1 <- my.merge1(my.list.f)) ### THIS DOES
> >> WORK THIS TIME AND NOT SURE WHY. RETURNS A DATA.TABLE
> >>
> >> system.time(my.merge.2 <- my.merge2(my.list.f)) ### WORKS BUT
> >> RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????  ALSO,
> >> MUCH FASTER THAN THE PRIOR LINE'S MERGE
> >>
> >>
> >> ### Would like to know if this is possible with the [,
> >> notation. Cannot figure out how to get this notation to
> >> return what one gets with merge(   ,all=TRUE)
> >>
> >>
> >>
> >>
> >> Thanks for the great package and your willingness to help,
> >>
> >> Best regards,
> >>
> >> Damian
> >>
> >>
> >>
> >>
> >> Damian Betebenner
> >> Center for Assessment
> >> PO Box 351
> >> Dover, NH   03821-0351
> >>  
> >>
> >>
> >> -----Original Message-----
> >> From: mdowle at mdowle.plus.com [mailto:mdowle at mdowle.plus.com]
> >> Sent: Wednesday, June 30, 2010 6:38 AM
> >> To: Damian Betebenner
> >> Cc: datatable-help at lists.r-forge.r-project.org
> >> Subject: Re: [datatable-help] Merging where key is large
> >>
> >> Damian,
> >> How many persons are there?
> >> How many rows are there?
> >> What is the 'hit' measured in seconds?
> >> Please post some code as often it turns out to be something
> >> simple e.g.
> >> we'd want to confirm you aren't vector scanning by using == in the i.
> >> Matthew
> >>
> >> > Amazing package! Was wondering how to use data table with keys that
> >> > exceed the max size of a 32 bit integer. I have unique person
> >> > identifiers that are 10 digits and that of exceed 2e-09. I've
> >> > converted that "integer" to a factor but the hit on performance and
> >> > object size using that solution isn't very good.
> >> >
> >> > Any insights greatly appreciated,
> >> >
> >> > Best,
> >> >
> >> > Damian
> >> >
> >> >
> >> > _______________________________________________
> >> > datatable-help mailing list
> >> > datatable-help at lists.r-forge.r-project.org
> >> >
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable
> >> > -help
> >> >
> >>
> >>
> >> _______________________________________________
> >> datatable-help mailing list
> >> datatable-help at lists.r-forge.r-project.org
> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
> > atatable-help
> >>
> >
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list