[datatable-help] Merging where key is large

mdowle at mdowle.plus.com mdowle at mdowle.plus.com
Wed Jun 30 16:39:22 CEST 2010


In the random_ids, return factor to make things simpler later :

    factor(sample(100000:999999, num_ids, replace=FALSE)*10000)

Then setting keys should be simpler in one step.

Multiple tables can be joined using :

    dt1[dt2][dt3][dt4]

I don't know if that helps - depends on the bigger picture i.e. once you
merged them all then you need to do something, whats the something? You
can put j and by in each bit too.

Or how about something like :

    onedt[,as.list(table(...)),by=list(person,year)][,lm(...)]

Do you have a known number of tables in advance, like a database join of
multiple tables, or do you need to join a variable number of tables ?

But like Tom says, why are you splitting by year into different tables,
can't you have one table ?  It may help us to know a bit more about the
bigger picture, perhaps via a small example of the full calculation.


> Thanks Tom,
>
> The data actually starts out in long form and is being reshaped to wide
> for regression analyses with independent and dependent variables in the
> same row.
> Because I know the exact format of the data, I wanted to construct my own
> long-to-wide converter.
>
> I'm sure I'm making this too hard somehow. But that's my sticking point
> right now.
>
> Thanks,
>
> Damian
>
>
> -----Original Message-----
> From: Short, Tom [mailto:TShort at epri.com]
> Sent: Wednesday, June 30, 2010 8:58 AM
> To: Damian Betebenner; mdowle at mdowle.plus.com
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: RE: [datatable-help] Merging where key is large
>
> Your my.merge.2 is trying to merge lists. For that, merge reverts to
> merge.default defined as:
>
> function (x, y, ...)
> merge(as.data.frame(x), as.data.frame(y), ...)
>
> That's why you end up with data.frames and why they both work. I'm not
> sure why my.merge.2 is faster than my.merge.1 when using factors.
>
> For this problem, you may want to investigate rbinding everything together
> (stacking) and using a long format instead of a wide format with the
> following columns:
>
> ID, SCORE, YEAR
>
> - Tom
>
>
>
>
>
>> -----Original Message-----
>> From: datatable-help-bounces at lists.r-forge.r-project.org
>> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
>> On Behalf Of Damian Betebenner
>> Sent: Wednesday, June 30, 2010 08:25
>> To: mdowle at mdowle.plus.com
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Merging where key is large
>>
>> Hi Matt,
>>
>> There are possibly millions of persons.
>>
>> This question stems from the desire to do fast joins across
>> multiple tables in the style of merge(x, y, all=TRUE). Later,
>> aggregations across many grouping variables will be done
>> which data.table will greatly speed up as well.
>>
>> Consider the following example:
>>
>> # function to calculate random 10 digit ids
>>
>> random_ids <- function(num_ids) {
>>      if (num_ids > 900000) stop("Please select fewer than
>> 900,000 ids")
>>      sample(100000:999999, num_ids, replace=FALSE)*10000 }
>>
>>
>> # Create 4 overlapping data tables that I want to merge using the ID.
>>
>> my_ids <- random_ids(10000)
>>
>> dt_1 <- data.table(ID=my_ids[1:4000], SCORE_2006=rnorm(4000,
>> mean=50, sd=10))
>> dt_2 <- data.table(ID=my_ids[2001:6000],
>> SCORE_2007=rnorm(4000, mean=50, sd=10))
>> dt_3 <- data.table(ID=my_ids[4001:8000],
>> SCORE_2008=rnorm(4000, mean=50, sd=10))
>> dt_4 <- data.table(ID=my_ids[6001:10000],
>> SCORE_2009=rnorm(4000, mean=50, sd=10))
>>
>>
>>
>>
>> # Create list whose elements are the four data.tables defined above
>>
>> my.list <- vector("list", 4)
>>
>> for (i in 1:4) {
>>    my.list[[i]] <- get(paste("dt_", i, sep="")) }
>>
>>
>> # Want to merge files "elegantly" and FAST so create
>> functions to do this (these functions return different
>> results which will show up later)
>>
>> my.merge1 <- function(data) {
>>     if (length(data) == 2) {
>>         merge(data[[1]], data[[2]], all = TRUE)
>>     }
>>     else {
>>         merge(data[[1]], Recall(data[-1]), all = TRUE)
>>     }
>> }
>>
>> my.merge2 <- function(data) {
>>     if (length(data) == 2) {
>>         merge(data[1], data[2], all = TRUE)
>>     }
>>     else {
>>         merge(data[1], Recall(data[-1]), all = TRUE)
>>     }
>> }
>>
>>
>> # Now merge together
>>
>> # without a key set
>>
>> system.time(my.merge.nokey.1 <- my.merge1(my.list))  ###
>> DOESN'T WORK AND NOT SURE WHY
>> system.time(my.merge.nokey.2 <- my.merge2(my.list))  ###
>> WORKS BUT RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????
>>
>>
>> # Now trying to set a key
>>
>> setkey(my.list[[1]], "ID")
>> Error in setkey(my.list[[1]], "ID") :
>>   All keyed columns must be storage mode integer
>>
>> # If I convert the ID variable to integer, because some of
>> the integers are bigger than the 32 integer limit, they are
>> converted to NAs
>>
>> # If we work with the ID as a factor.
>>
>> dt_1f <- data.table(ID=as.factor(my_ids[1:4000]),
>> SCORE_2006=rnorm(4000, mean=50, sd=10), key="ID") dt_2f <-
>> data.table(ID=as.factor(my_ids[2001:6000]),
>> SCORE_2007=rnorm(4000, mean=50, sd=10), key="ID") dt_3f <-
>> data.table(ID=as.factor(my_ids[4001:8000]),
>> SCORE_2008=rnorm(4000, mean=50, sd=10), key="ID") dt_4f <-
>> data.table(ID=as.factor(my_ids[6001:10000]),
>> SCORE_2009=rnorm(4000, mean=50, sd=10), key="ID")
>>
>>
>> my.list.f <- vector("list", 4)
>>
>>
>> for (i in 1:4) {
>>    my.list.f[[i]] <- get(paste("dt_", i, "f", sep="")) }
>>
>>
>> system.time(my.merge.1 <- my.merge1(my.list.f)) ### THIS DOES
>> WORK THIS TIME AND NOT SURE WHY. RETURNS A DATA.TABLE
>>
>> system.time(my.merge.2 <- my.merge2(my.list.f)) ### WORKS BUT
>> RETURNS A DATA.FRAME INSTEAD OF A DATA.TABLE ?????  ALSO,
>> MUCH FASTER THAN THE PRIOR LINE'S MERGE
>>
>>
>> ### Would like to know if this is possible with the [,
>> notation. Cannot figure out how to get this notation to
>> return what one gets with merge(   ,all=TRUE)
>>
>>
>>
>>
>> Thanks for the great package and your willingness to help,
>>
>> Best regards,
>>
>> Damian
>>
>>
>>
>>
>> Damian Betebenner
>> Center for Assessment
>> PO Box 351
>> Dover, NH   03821-0351
>>  
>>
>>
>> -----Original Message-----
>> From: mdowle at mdowle.plus.com [mailto:mdowle at mdowle.plus.com]
>> Sent: Wednesday, June 30, 2010 6:38 AM
>> To: Damian Betebenner
>> Cc: datatable-help at lists.r-forge.r-project.org
>> Subject: Re: [datatable-help] Merging where key is large
>>
>> Damian,
>> How many persons are there?
>> How many rows are there?
>> What is the 'hit' measured in seconds?
>> Please post some code as often it turns out to be something
>> simple e.g.
>> we'd want to confirm you aren't vector scanning by using == in the i.
>> Matthew
>>
>> > Amazing package! Was wondering how to use data table with keys that
>> > exceed the max size of a 32 bit integer. I have unique person
>> > identifiers that are 10 digits and that of exceed 2e-09. I've
>> > converted that "integer" to a factor but the hit on performance and
>> > object size using that solution isn't very good.
>> >
>> > Any insights greatly appreciated,
>> >
>> > Best,
>> >
>> > Damian
>> >
>> >
>> > _______________________________________________
>> > datatable-help mailing list
>> > datatable-help at lists.r-forge.r-project.org
>> >
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable
>> > -help
>> >
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/d
> atatable-help
>>
>




More information about the datatable-help mailing list