[datatable-help] Merging where key is large

Damian Betebenner dbetebenner at nciea.org
Wed Jun 30 14:24:42 CEST 2010


Hi Matt,

There are possibly millions of persons. 

This question stems from the desire to do fast joins across multiple tables in the style of merge(x, y, all=TRUE). Later, aggregations across many grouping variables will be done, which data.table should also greatly speed up. 

Consider the following example:

library(data.table)

# Function to generate random 10-digit IDs (6-digit values times 10000,
# many of which exceed the 32-bit integer limit)

random_ids <- function(num_ids) {
     if (num_ids > 900000) stop("Please select at most 900,000 ids")
     sample(100000:999999, num_ids, replace=FALSE)*10000
}
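
# Quick check: the largest possible ID, 999999 * 10000 = 9,999,990,000, is well
# beyond .Machine$integer.max (2,147,483,647), so these IDs are stored as doubles.

range(random_ids(100))
.Machine$integer.max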


# Create 4 overlapping data.tables that I want to merge on ID

my_ids <- random_ids(10000)

dt_1 <- data.table(ID=my_ids[1:4000], SCORE_2006=rnorm(4000, mean=50, sd=10))
dt_2 <- data.table(ID=my_ids[2001:6000], SCORE_2007=rnorm(4000, mean=50, sd=10))
dt_3 <- data.table(ID=my_ids[4001:8000], SCORE_2008=rnorm(4000, mean=50, sd=10))
dt_4 <- data.table(ID=my_ids[6001:10000], SCORE_2009=rnorm(4000, mean=50, sd=10))




# Create a list whose elements are the four data.tables defined above

my.list <- vector("list", 4)

for (i in 1:4) {
   my.list[[i]] <- get(paste("dt_", i, sep=""))
}
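
# (An equivalent one-liner, if preferred: my.list <- mget(paste("dt_", 1:4, sep=""), envir=globalenv()))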


# Want to merge the tables "elegantly" and FAST, so define two recursive functions to do this (they return different results, which will show up later)

my.merge1 <- function(data) {
    if (length(data) == 2) {
        merge(data[[1]], data[[2]], all = TRUE)
    }
    else {
        merge(data[[1]], Recall(data[-1]), all = TRUE)
    }
}

my.merge2 <- function(data) {
    if (length(data) == 2) {
        merge(data[1], data[2], all = TRUE)
    }
    else {
        merge(data[1], Recall(data[-1]), all = TRUE)
    }
}
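
# The only difference between the two functions is [[ ]] versus [ ] subsetting:
# data[[1]] extracts the data.table itself, whereas data[1] is a length-1 list,
# which presumably is why merge() dispatches differently in each case, e.g.

class(my.list[[1]])   # "data.table" "data.frame"
class(my.list[1])     # "list"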


# Now merge together

# without a key set

system.time(my.merge.nokey.1 <- my.merge1(my.list))  ### DOESN'T WORK AND NOT SURE WHY
system.time(my.merge.nokey.2 <- my.merge2(my.list))  ### WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????


# Now trying to set a key

setkey(my.list[[1]], "ID")
Error in setkey(my.list[[1]], "ID") : 
  All keyed columns must be storage mode integer

# If I convert the ID variable to integer, the values bigger than the 32-bit integer limit are converted to NAs
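
# For example, a 10-digit value above .Machine$integer.max cannot be represented:

as.integer(9999990000)   # returns NA (with a coercion warning)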

# If instead we work with the ID as a factor:

dt_1f <- data.table(ID=as.factor(my_ids[1:4000]), SCORE_2006=rnorm(4000, mean=50, sd=10), key="ID")
dt_2f <- data.table(ID=as.factor(my_ids[2001:6000]), SCORE_2007=rnorm(4000, mean=50, sd=10), key="ID")
dt_3f <- data.table(ID=as.factor(my_ids[4001:8000]), SCORE_2008=rnorm(4000, mean=50, sd=10), key="ID")
dt_4f <- data.table(ID=as.factor(my_ids[6001:10000]), SCORE_2009=rnorm(4000, mean=50, sd=10), key="ID")


my.list.f <- vector("list", 4)


for (i in 1:4) {
   my.list.f[[i]] <- get(paste("dt_", i, "f", sep=""))
}


system.time(my.merge.1 <- my.merge1(my.list.f))  
### THIS DOES WORK THIS TIME AND NOT SURE WHY. RETURNS A DATA.TABLE

system.time(my.merge.2 <- my.merge2(my.list.f))  
### WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????  ALSO, MUCH FASTER THAN THE PRIOR LINE'S MERGE


### Would like to know if this is possible with the [ notation. Cannot figure out how to get that notation to return what one gets with merge(x, y, all=TRUE)
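
### To illustrate what I mean (just a sketch of the behaviour I'm seeing, using the keyed factor tables above):

dt_1f[dt_2f]                     # seems to keep a row for every row of dt_2f, but drops IDs that are only in dt_1f
merge(dt_1f, dt_2f, all=TRUE)    # keeps IDs from both tables, which is the result I'm after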




Thanks for the great package and your willingness to help,

Best regards,

Damian




Damian Betebenner
Center for Assessment
PO Box 351
Dover, NH   03821-0351


-----Original Message-----
From: mdowle at mdowle.plus.com [mailto:mdowle at mdowle.plus.com] 
Sent: Wednesday, June 30, 2010 6:38 AM
To: Damian Betebenner
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Merging where key is large

Damian,
How many persons are there?
How many rows are there?
What is the 'hit' measured in seconds?
Please post some code, as often it turns out to be something simple, e.g.
we'd want to confirm you aren't vector scanning by using == in the i.
Matthew

> Amazing package! Was wondering how to use data.table with keys that exceed
> the max size of a 32-bit integer. I have unique person identifiers that
> are 10 digits and that often exceed 2e+09. I've converted that "integer" to a
> factor but the hit on performance and object size using that solution
> isn't very good.
>
> Any insights greatly appreciated,
>
> Best,
>
> Damian
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>



