[datatable-help] Merging where key is large

Wed Jun 30 15:45:41 CEST 2010

Thanks Tom,

The data actually starts out in long form and is being reshaped to wide for regression analyses with independent and dependent variables in the same row. 
Because I know the exact format of the data, I wanted to construct my own long-to-wide converter.

I'm sure I'm making this too hard somehow. But that's my sticking point right now.

Thanks,

Damian

-----Original Message-----
From: Short, Tom [mailto:TShort at epri.com] 
Sent: Wednesday, June 30, 2010 8:58 AM
To: Damian Betebenner; mdowle at mdowle.plus.com
Cc: datatable-help at lists.r-forge.r-project.org
Subject: RE: [datatable-help] Merging where key is large

Your my.merge.2 is trying to merge lists. For that, merge reverts to merge.default defined as:

function (x, y, ...) 
merge(as.data.frame(x), as.data.frame(y), ...)

That's why you end up with data.frames and why they both work. I'm not sure why my.merge.2 is faster than my.merge.1 when using factors.

For this problem, you may want to investigate rbinding everything together (stacking) and using a long format instead of a wide format with the following columns:

ID, SCORE, YEAR

- Tom

> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Damian Betebenner
> Sent: Wednesday, June 30, 2010 08:25
> To: mdowle at mdowle.plus.com
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Merging where key is large
> 
> Hi Matt,
> 
> There are possibly millions of persons. 
> 
> This question stems from the desire to do fast joins across 
> multiple tables in the style of merge(x, y, all=TRUE). Later, 
> aggregations across many grouping variables will be done 
> which data.table will greatly speed up as well. 
> 
> Consider the following example:
> 
> # function to calculate random 10 digit ids
> 
> random_ids <- function(num_ids) {
>      if (num_ids > 900000) stop("Please select fewer than 
> 900,000 ids")
>      sample(100000:999999, num_ids, replace=FALSE)*10000 }
> 
> 
> # Create 4 overlapping data tables that I want to merge using the ID.
> 
> my_ids <- random_ids(10000)
> 
> dt_1 <- data.table(ID=my_ids[1:4000], SCORE_2006=rnorm(4000, 
> mean=50, sd=10))
> dt_2 <- data.table(ID=my_ids[2001:6000], 
> SCORE_2007=rnorm(4000, mean=50, sd=10))
> dt_3 <- data.table(ID=my_ids[4001:8000], 
> SCORE_2008=rnorm(4000, mean=50, sd=10))
> dt_4 <- data.table(ID=my_ids[6001:10000], 
> SCORE_2009=rnorm(4000, mean=50, sd=10))
> 
> 
> 
> 
> # Create list whose elements are the four data.tables defined above
> 
> my.list <- vector("list", 4)
> 
> for (i in 1:4) {
>    my.list[[i]] <- get(paste("dt_", i, sep="")) }
> 
> 
> # Want to merge files "elegantly" and FAST so create 
> functions to do this (these functions return different 
> results which will show up later)
> 
> my.merge1 <- function(data) {
>     if (length(data) == 2) {
>         merge(data[[1]], data[[2]], all = TRUE)
>     }
>     else {
>         merge(data[[1]], Recall(data[-1]), all = TRUE)
>     }
> }
> 
> my.merge2 <- function(data) {
>     if (length(data) == 2) {
>         merge(data[1], data[2], all = TRUE)
>     }
>     else {
>         merge(data[1], Recall(data[-1]), all = TRUE)
>     }
> }
> 
> 
> # Now merge together
> 
> # without a key set
> 
> system.time(my.merge.nokey.1 <- my.merge1(my.list))  ### 
> DOESN'T WORK AND NOT SURE WHY
> system.time(my.merge.nokey.2 <- my.merge2(my.list))  ### 
> WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????
> 
> 
> # Now trying to set a key
> 
> setkey(my.list[[1]], "ID")
> Error in setkey(my.list[[1]], "ID") : 
>   All keyed columns must be storage mode integer
> 
> # If I convert the ID variable to integer, because some of 
> the integers are bigger than the 32 integer limit, they are 
> converted to NAs
> 
> # If we work with the ID as a factor.
> 
> dt_1f <- data.table(ID=as.factor(my_ids[1:4000]), 
> SCORE_2006=rnorm(4000, mean=50, sd=10), key="ID") dt_2f <- 
> data.table(ID=as.factor(my_ids[2001:6000]), 
> SCORE_2007=rnorm(4000, mean=50, sd=10), key="ID") dt_3f <- 
> data.table(ID=as.factor(my_ids[4001:8000]), 
> SCORE_2008=rnorm(4000, mean=50, sd=10), key="ID") dt_4f <- 
> data.table(ID=as.factor(my_ids[6001:10000]), 
> SCORE_2009=rnorm(4000, mean=50, sd=10), key="ID")
> 
> 
> my.list.f <- vector("list", 4)
> 
> 
> for (i in 1:4) {
>    my.list.f[[i]] <- get(paste("dt_", i, "f", sep="")) }
> 
> 
> system.time(my.merge.1 <- my.merge1(my.list.f)) ### THIS DOES 
> WORK THIS TIME AND NOT SURE WHY. RETURNS A DATA.TABLE
> 
> system.time(my.merge.2 <- my.merge2(my.list.f)) ### WORKS BUT 
> RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????  ALSO, 
> MUCH FASTER THAN THE PRIOR LINE'S MERGE
> 
> 
> ### Would like to know if this is possible with the [, 
> notation. Cannot figure out how to get this notation to 
> return what one gets with merge(   ,all=TRUE) 
> 
> 
> 
> 
> Thanks for the great package and your willingness to help,
> 
> Best regards,
> 
> Damian
> 
> 
> 
> 
> Damian Betebenner
> Center for Assessment
> PO Box 351
> Dover, NH   03821-0351
>  
> 
> 
> -----Original Message-----
> From: mdowle at mdowle.plus.com [mailto:mdowle at mdowle.plus.com]
> Sent: Wednesday, June 30, 2010 6:38 AM
> To: Damian Betebenner
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Merging where key is large
> 
> Damian,
> How many persons are there?
> How many rows are there?
> What is the 'hit' measured in seconds?
> Please post some code as often it turns out to be something 
> simple e.g.
> we'd want to confirm you aren't vector scanning by using == in the i.
> Matthew
> 
> > Amazing package! Was wondering how to use data table with keys that 
> > exceed the max size of a 32 bit integer. I have unique person 
> > identifiers that are 10 digits and that of exceed 2e-09. I've 
> > converted that "integer" to a factor but the hit on performance and 
> > object size using that solution isn't very good.
> >
> > Any insights greatly appreciated,
> >
> > Best,
> >
> > Damian
> >
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > 
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable
> > -help
> >
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
atatable-help
>