[datatable-help] datatable-help Digest, Vol 17, Issue 10

Dennis Murphy djmuser at gmail.com
Fri Jul 15 20:06:15 CEST 2011


On Fri, Jul 15, 2011 at 8:23 AM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi Dennis,
>
> I didn't see your post before I sent my latest reply.
>
> Nice detective work!

Thanks, Steve. I just followed my nose and the docs, which I have
conveniently kept in a small binder for such occasions :) Like JV, I
don't use data.table every day, so some of its idiosyncrasies get
cobwebbed in the hard drive over time. The wiki entries helped a lot.

>
> For what it's worth, from what I understand your
> "punchline"/kewpie-prize solution is so much faster because it avoids
> building the .SD data.table within each group.

That was my deduction from having read the first entry in the wiki. I
still can't believe I got that thing to work :)

>
> I'll let Matthew leave a more detailed comment, since he's (obviously)
> much more intimately familiar w/ the inner voodoo of data.table. But
> as a last comment -- if the speed differences are so drastic because
> of the cost of creating the .SD data.table, maybe we should think
> about taking some "inspiration" from plyr and define a similar
> `colwise` function -- which would operate across each "column" of
> the supposedly-built .SD object, applying a function to each of them w/o
> actually building an .SD object itself.

Your clairvoyance skills are clearly operating today :)  More
seriously, this is what I would consider an 'obvious' "big-data"
problem - I could easily see situations arising in finance and genomic
applications where a fairly large subset of variables of the same
type, but not necessarily all of them, needs to be summarized in a
particular way. The colwise() functions would be problematic as well
in the scenario described in my earlier post, but I haven't tried
ddply() to verify that assertion, so I could be mistaken.
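
For reference, the plyr call I have in mind (completely untested here,
and using the toy dt2 and vars objects from my post below) would be
something along these lines:

library(plyr)
# Untested sketch: coerce the data.table to a data frame and apply sum()
# column-wise to the selected variables within each level of x
ddply(as.data.frame(dt2), 'x', colwise(sum, vars))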

It would be *really* helpful to have a convenient, fast mechanism in
data.table that allows one to substitute a (possibly large) vector of
variable names into a function. Alas, I don't have any bright ideas
about how to program it. Fortunately, there are some nice functions in
R to select variable subsets efficiently in data frames (e.g., the
grep() family of functions, regular expressions, %in% and so on), but
I don't know how easily that would translate to data.table since the
internals are so different.
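
To be concrete about the selection step: something like the following
(with a hypothetical vars1 built by grep()) at least works for pulling
out the columns, using the with = FALSE idiom from my post below; it is
pushing those names into the summary call that remains awkward.

# Hypothetical example: grab every column whose name starts with 'A1'
# and sum the selected columns within each level of the key
vars1 <- grep('^A1', names(dt2), value = TRUE)
dt2[, lapply(.SD[, vars1, with = FALSE], sum), by = 'x']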

Looking forward to the team's take on this...

Dennis

>
> -steve
>
> On Fri, Jul 15, 2011 at 10:34 AM, Dennis Murphy <djmuser at gmail.com> wrote:
>> Hi:
>>
>> <A bunch snipped because I get the archives in digest form>
>>
>> Re Prof. Voelkel's recent posts:
>>
>> (1) Quoting does not work well in data.table; this is mentioned in
>> several of the FAQs. Apropos to this discussion, some of the relevant
>> ones include 1.2, 1.6 and 2.1; there may be others :)
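>>
>> As a quick illustration of (1), using the toy dt from the original post
>> below: dropping the quotes and naming the list elements sidesteps both
>> the quoting issue and the Vi numbering gap (an untested sketch):
>>
>> dt <- data.table(x = c(1,2,3,1,2,3), y = c(1,1,2,2,3,3),
>>                  A1 = 1:6, A2 = 7:12, A3 = 13:18, key = 'x')
>> # explicit names in list() mean no auto-generated V* columns at all
>> dt[, list(S1 = sum(A1), S2 = sum(A2), S3 = sum(A3)), by = 'x']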
>>
>> (2) Steve's response seems to be the right way to go (although see
>> below), but I thought I'd up the stakes a little and assume that Prof.
>> Voelkel has a large number of variables, only a subset of which he may
>> want summarized in a particular go. To that end, I created the
>> following toy data frame cum data.table; this is as much for my own
>> edification as anyone else's (which explains the eventual length of
>> this post...I got curious :)
>>
>> This goes against the advice given in the first example of the
>> data.table wiki, but if you have, say, 100 variables to select out of
>> a possible 1000, it doesn't make sense to list them individually as
>> recommended on the wiki. (But see below...)
>>
>> library('data.table')
>> set.seed(1043)
>> m <- matrix(rpois(240, 10), nrow = 6)
>> colnames(m) <- paste('A', 1:40, sep = '')
>> m <- as.data.frame(m)
>> dt2 <- data.table(x = rep(1:3, 2), y = rep(1:3, each = 2), m, key = 'x')
>> dim(dt2)
>> # [1]  6 42       ...so far, so good
>>
>> # Subset of variables for which sums are desired
>> vars <- paste('A', c(1, 4, 10, 15, 31), sep = '')
>>
>> # One approach: use the select = argument of subset() to restrict
>> # the variables under consideration:
>> dt2[, lapply(subset(.SD, select = vars), sum), by = 'x']
>>     x A1 A4 A10 A15 A31
>> [1,] 1 18 21  22  22  24
>> [2,] 2 20 13  27  23  21
>> [3,] 3 22 15  16  23  15
>>
>> # Use the with = FALSE construct of data.table to do the same:
>> dt2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y']
>>     x y A1 A4 A10 A15 A31
>> [1,] 1 1 11 13  12  11  16
>> [2,] 1 2  7  8  10  11   8
>> [3,] 2 1 10  4  16   7  11
>> [4,] 2 3 10  9  11  16  10
>> [5,] 3 2 11  8   7  11   7
>> [6,] 3 3 11  7   9  12   8
>>
>> # For this example, it is the same (apart from the key variables) as
>> dt2[, vars, with = FALSE]
>>
>> Not bad for this small example, but what happens in a much larger data table?
>>
>> To find out, I created a 10000 x 1000 matrix that I converted into a
>> data table, added two grouping variables of 100 levels each and then
>> tried both approaches above again. Performance isn't bad when grouping
>> by one variable, but there is a definite hit when grouping by two. [That
>> makes some sense, since one is then grouping over 10000 level
>> combinations rather than 100, but once again, keep
>> reading.] Curiously, it makes no difference if there is one key
>> variable or two, which made me wonder what the preferred approach is
>> in this circumstance.
>>
>> m <- matrix(rpois(10000000, 10), nrow = 10000)
>> colnames(m) <- paste('A', 1:1000, sep = '')  # A1, ..., A1000, as in the toy example
>> m <- as.data.table(m)
>> m <- transform(m, x = rep(1:100, each = 100), y = rep(1:100, 100))
>> setkey(m, 'x')
>> dim(m)
>> # [1] 10000  1002
>>
>> # Randomly select 150 variables from the 1000
>> vars <- paste('A', sample(1:1000, 150, replace = FALSE), sep = '')
>> length(vars)
>> # [1] 150
>> key(m)
>> # [1] "x"
>>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x'])
>>   user  system elapsed
>>   0.75    0.00    0.75
>>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x'])
>>   user  system elapsed
>>   0.64    0.00    0.64
>>> system.time(m[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
>>   user  system elapsed
>>  53.65    0.00   53.85
>>> system.time(m[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
>>   user  system elapsed
>>  44.21    0.01   44.35
>>
>> m2 <- data.table(m, key = 'x, y')
>> rm(m)
>> key(m2)
>> # [1] "x" "y"
>>> system.time(m2[, lapply(subset(.SD, select = vars), sum), by = 'x, y'])
>>   user  system elapsed
>>  53.54    0.00   53.73
>>> system.time(m2[, lapply(.SD[, vars, with = FALSE], sum), by = 'x, y'])
>>   user  system elapsed
>>  44.30    0.04   44.60
>>
>> The first question in the wiki
>> (http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table) says
>> to use the columns directly rather than to rely on .SD. I wanted to
>> know how to pass new names to the summaries instead of overwriting the
>> original variable names. For the fun of it, I tried the following:
>>
>> select <- sample(1:1000, 150, replace = FALSE)
>> vars <- paste('A', select, sep = '')
>> outvars <- paste('S', select, sep = '')
>>
>> # Create a long expression of the form 'list(..., Sn = sum(An), ...)',
>> # n a subscript from 1 to 150.
>> expr <- paste('list(',
>>               paste(outvars, paste('sum(', vars, ')', sep = ''),
>>                     sep = '=', collapse = ','),
>>               ')', sep = '')
>> u <- m2[, eval(parse(text = expr)), by = 'x']
>>> dim(u)
>> # [1] 100 151     seems reasonable...
>>
>> This seemed to run rather fast, so I decided to time it:
>>
>>> system.time(m2[, eval(parse(text = expr)), by = 'x'])
>>   user  system elapsed
>>   0.03    0.00    0.03
>>> system.time(m2[, eval(parse(text = expr)), by = 'x, y'])
>>   user  system elapsed
>>   1.05    0.00    1.04
>>
>> I've got to admit, this is not the approach I would normally have taken,
>> it is certainly not intuitively obvious to me, and it flouts the usual
>> advice to avoid the eval(parse(text = )) idiom, but the data don't
>> lie :)  Please tell me there's a more code-efficient way to do this
>> (new variable names included), because my 'solution' was a complete
>> kludge and an accidental kewpie prize.
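>>
>> The only parse()-free alternative that comes to mind (unbenchmarked, and
>> the name jcall below is purely illustrative) is to build the same j
>> expression as an unevaluated call and eval() that instead:
>>
>> # Build list(S... = sum(A...), ...) as a call object rather than
>> # pasting and parsing a string (sketch only, not benchmarked)
>> jcall <- as.call(c(as.name('list'),
>>                    setNames(lapply(vars, function(v) call('sum', as.name(v))),
>>                             outvars)))
>> m2[, eval(jcall), by = 'x']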
>>
>> Cheers,
>> Dennis
>>
>>> Message: 1
>>> Date: Thu, 14 Jul 2011 16:36:11 -0400
>>> From: Joseph Voelkel <jgvcqa at rit.edu>
>>> Subject: [datatable-help] Skipping some Vi names
>>> To: "datatable-help at lists.r-forge.r-project.org"
>>>        <datatable-help at r-forge.wu-wien.ac.at>
>>> Message-ID:
>>>        <70EFCDD908F9264785FA08EC3A471320282158585C at ex02mail01.ad.rit.edu>
>>> Content-Type: text/plain; charset="us-ascii"
>>>
>>> I don't use data.table too much (though I probably should use it more...).
>>>
>>> I was surprised at the results below. It appears that the name V1 gets assigned to the first result, but then the keys ("in the background") are assigned the next set of Vi names, creating a gap in the names depending on the number of keys. I would like to see the Vi names appear in their natural, sequential, order. Not a show stopper, but it's annoying. (I have over 40 Vi's and it'd be good to have them numbered more rationally.) Thanks.
>>>
>>>> dt<-data.table(x=c(1,2,3,1,2,3),y=c(1,1,2,2,3,3),A1=1:6,A2=7:12,A3=13:18,key="x")
>>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x"]
>>>     x V1 V3 V4
>>> [1,] 1  5 17 29
>>> [2,] 2  7 19 31
>>> [3,] 3  9 21 33
>>>> key(dt)<-c("x","y")
>>>> dt[,list("sum(A1),sum(A2),sum(A3)"),by="x,y"]
>>>     x y V1 V4 V5
>>> [1,] 1 1  1  7 13
>>> [2,] 1 2  4 10 16
>>> [3,] 2 1  2  8 14
>>> [4,] 2 3  5 11 17
>>> [5,] 3 2  3  9 15
>>> [6,] 3 3  6 12 18
>>>
>>>
>>>
>>> Joseph G. Voelkel, Ph.D.
>>> Professor, Center for Quality and Applied Statistics
>>> Kate Gleason College of Engineering
>>> Rochester Institute of Technology
>>> V 585-475-2231
>>> F 585-475-5959
>>> joseph.voelkel at rit.edu
>>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>

