[datatable-help] Merging where key is large
Short, Tom
TShort at epri.com
Wed Jun 30 14:57:46 CEST 2010
Your my.merge2 is trying to merge lists: data[1] returns a one-element list, whereas data[[1]] returns the data.table itself. For lists, merge falls back to merge.default, defined as:
function (x, y, ...)
merge(as.data.frame(x), as.data.frame(y), ...)
That's why you end up with data.frames and why both my.merge2 calls work. I'm not sure why my.merge2 is faster than my.merge1 when using factors.
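To make the difference between the two indexing forms concrete, here is a minimal base R sketch. The toy data.frames a and b are hypothetical stand-ins for the data.tables in the thread; nothing here depends on data.table itself.

```r
# Two small data.frames standing in for the data.tables in the thread
# (hypothetical toy data, base R only).
a <- data.frame(ID = c(1, 2, 3), SCORE_2006 = c(50, 55, 60))
b <- data.frame(ID = c(2, 3, 4), SCORE_2007 = c(52, 58, 61))
my.list <- list(a, b)

# Single brackets keep the list wrapper; double brackets extract the element.
class(my.list[1])     # "list"
class(my.list[[1]])   # "data.frame"

# merge() has no method for plain lists, so this call dispatches to
# merge.default, which coerces both arguments with as.data.frame().
via_list <- merge(my.list[1], my.list[2], all = TRUE)

# With double brackets the method for the element's own class is used.
via_element <- merge(my.list[[1]], my.list[[2]], all = TRUE)
```

With data.tables in the list, the first form would be expected to lose the data.table class on the way through as.data.frame(), which matches the behaviour reported in the original message.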
For this problem, you may want to investigate rbinding everything together (stacking) into a long format with the following columns, instead of using a wide format:
ID, SCORE, YEAR
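A rough sketch of the stacking idea, in base R with made-up values. The helper to_long and the tables s2006 and s2007 are hypothetical names used only for illustration, and the sketch assumes the score is the second column of each per-year table.

```r
# Hypothetical per-year tables in the wide style used in the thread.
s2006 <- data.frame(ID = c(1000010000, 1000020000), SCORE_2006 = c(48.2, 51.7))
s2007 <- data.frame(ID = c(1000020000, 1000030000), SCORE_2007 = c(53.1, 49.9))

# Rename each score column to SCORE and record the year in its own column.
# Assumes the score sits in the second column of each table.
to_long <- function(d, year) {
  data.frame(ID = d$ID, SCORE = d[[2]], YEAR = year)
}

# Stack the pieces: one row per (ID, YEAR) pair, with no merge required.
long <- rbind(to_long(s2006, 2006L), to_long(s2007, 2007L))

# Aggregations can then group on ID and/or YEAR directly.
mean_by_id <- aggregate(SCORE ~ ID, data = long, FUN = mean)
```

In this layout an ID that is missing in some year simply has no row for that year, so the NA padding that merge(..., all=TRUE) produces in the wide format never has to be built.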
- Tom
> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> On Behalf Of Damian Betebenner
> Sent: Wednesday, June 30, 2010 08:25
> To: mdowle at mdowle.plus.com
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Merging where key is large
>
> Hi Matt,
>
> There are possibly millions of persons.
>
> This question stems from the desire to do fast joins across
> multiple tables in the style of merge(x, y, all=TRUE). Later,
> aggregations across many grouping variables will be done
> which data.table will greatly speed up as well.
>
> Consider the following example:
>
> # function to calculate random 10 digit ids
>
> random_ids <- function(num_ids) {
>   if (num_ids > 900000) stop("Please select fewer than 900,000 ids")
>   sample(100000:999999, num_ids, replace=FALSE)*10000
> }
>
>
> # Create 4 overlapping data tables that I want to merge using the ID.
>
> my_ids <- random_ids(10000)
>
> dt_1 <- data.table(ID=my_ids[1:4000], SCORE_2006=rnorm(4000,
> mean=50, sd=10))
> dt_2 <- data.table(ID=my_ids[2001:6000],
> SCORE_2007=rnorm(4000, mean=50, sd=10))
> dt_3 <- data.table(ID=my_ids[4001:8000],
> SCORE_2008=rnorm(4000, mean=50, sd=10))
> dt_4 <- data.table(ID=my_ids[6001:10000],
> SCORE_2009=rnorm(4000, mean=50, sd=10))
>
>
>
>
> # Create list whose elements are the four data.tables defined above
>
> my.list <- vector("list", 4)
>
> for (i in 1:4) {
> my.list[[i]] <- get(paste("dt_", i, sep="")) }
>
>
> # Want to merge files "elegantly" and FAST, so create functions to do
> # this (these functions return different results, which will show up later)
>
> my.merge1 <- function(data) {
> if (length(data) == 2) {
> merge(data[[1]], data[[2]], all = TRUE)
> }
> else {
> merge(data[[1]], Recall(data[-1]), all = TRUE)
> }
> }
>
> my.merge2 <- function(data) {
> if (length(data) == 2) {
> merge(data[1], data[2], all = TRUE)
> }
> else {
> merge(data[1], Recall(data[-1]), all = TRUE)
> }
> }
>
>
> # Now merge together
>
> # without a key set
>
> system.time(my.merge.nokey.1 <- my.merge1(my.list)) ### DOESN'T WORK AND NOT SURE WHY
> system.time(my.merge.nokey.2 <- my.merge2(my.list)) ### WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????
>
>
> # Now trying to set a key
>
> setkey(my.list[[1]], "ID")
> Error in setkey(my.list[[1]], "ID") :
> All keyed columns must be storage mode integer
>
> # If I convert the ID variable to integer, some of the IDs are bigger
> # than the 32-bit integer limit, so they are converted to NAs
>
> # If we work with the ID as a factor.
>
> dt_1f <- data.table(ID=as.factor(my_ids[1:4000]),
>   SCORE_2006=rnorm(4000, mean=50, sd=10), key="ID")
> dt_2f <- data.table(ID=as.factor(my_ids[2001:6000]),
>   SCORE_2007=rnorm(4000, mean=50, sd=10), key="ID")
> dt_3f <- data.table(ID=as.factor(my_ids[4001:8000]),
>   SCORE_2008=rnorm(4000, mean=50, sd=10), key="ID")
> dt_4f <- data.table(ID=as.factor(my_ids[6001:10000]),
>   SCORE_2009=rnorm(4000, mean=50, sd=10), key="ID")
>
>
> my.list.f <- vector("list", 4)
>
>
> for (i in 1:4) {
> my.list.f[[i]] <- get(paste("dt_", i, "f", sep="")) }
>
>
> system.time(my.merge.1 <- my.merge1(my.list.f)) ### THIS DOES WORK THIS TIME AND NOT SURE WHY. RETURNS A DATA.TABLE
>
> system.time(my.merge.2 <- my.merge2(my.list.f)) ### WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ????? ALSO, MUCH FASTER THAN THE PRIOR LINE'S MERGE
>
>
> ### Would like to know if this is possible with the [, notation. Cannot
> ### figure out how to get this notation to return what one gets with
> ### merge( ,all=TRUE)
>
>
>
>
> Thanks for the great package and your willingness to help,
>
> Best regards,
>
> Damian
>
>
>
>
> Damian Betebenner
> Center for Assessment
> PO Box 351
> Dover, NH 03821-0351
>
>
>
> -----Original Message-----
> From: mdowle at mdowle.plus.com [mailto:mdowle at mdowle.plus.com]
> Sent: Wednesday, June 30, 2010 6:38 AM
> To: Damian Betebenner
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Merging where key is large
>
> Damian,
> How many persons are there?
> How many rows are there?
> What is the 'hit' measured in seconds?
> Please post some code, as often it turns out to be something simple,
> e.g. we'd want to confirm you aren't vector scanning by using == in the i.
> Matthew
>
> > Amazing package! Was wondering how to use data.table with keys that
> > exceed the max size of a 32-bit integer. I have unique person
> > identifiers that are 10 digits and that often exceed 2e+09. I've
> > converted that "integer" to a factor, but the hit on performance and
> > object size using that solution isn't very good.
> >
> > Any insights greatly appreciated,
> >
> > Best,
> >
> > Damian
> >
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >
>
>