[datatable-help] data.table syntax Data Warehouse use case simulation

Jan Gorecki J.Gorecki at wit.edu.pl
Wed Jun 4 11:40:45 CEST 2014


Hi All,

I would rather not go deep into description of DW star schema model. It may
not be necessary as you have the initial structure and expected structure.

We have our measures (numeric values) in the "facts" tables. Facts are
connected to dimensions which contains the reference field from facts tables
plus some higher level attributes (may be seen as: dim1="Paris",
dim1h="France").

I'm looking for memory, time and syntax optimal solution to perform
denormalization of my data and join the facts table to all the dimension
tables.

# populate data
library(data.table)
facts <- data.table(dim1=letters[1:6], dim2=letters[7:12],
dim3=letters[13:18], dim4=letters[19:24],
                    quantity = rnorm(6,100,40),
                    value = rnorm(6,1000,200))
dim1 <- data.table(dim1=letters[1:6], dim1h=rep(letters[1:3],2), key="dim1")
dim2 <- data.table(dim2=letters[7:12], dim2h=rep(letters[7:9],2),
key="dim2")
dim3 <- data.table(dim3=letters[13:18], dim3h=rep(letters[13:15],2),
key="dim3")
dim4 <- data.table(dim4=letters[19:24], dim4h=rep(letters[19:21],2),
key="dim4")

# my proposed solution
joinby <- function(master, join, by){
  stopifnot(by %in% names(master) & by %in% names(join))
  join[setkeyv(master,by)]
}

# denormalize
dt <-
joinby(joinby(joinby(joinby(facts,dim1,"dim1"),dim2,"dim2"),dim3,"dim3"),dim4,"dim4")

# aggregate - expected results
dt[,list(quantity=sum(quantity),value=sum(value)),by=c("dim1h","dim2h","dim3h","dim4h")]


My solution assume the column names to be used on joins are identical.
The syntax isn't that great, but I couldn't figure out any better.
I'm not aware of the performance, it may be as issue because of resorting
the master tables on each join.

Anybody would propose better (more optimal) solution?

Regards,
Jan



--
View this message in context: http://r.789695.n4.nabble.com/data-table-syntax-Data-Warehouse-use-case-simulation-tp4691697.html
Sent from the datatable-help mailing list archive at Nabble.com.


More information about the datatable-help mailing list