[datatable-help] Speeding up column references with roll

Stavros Macrakis (Σταῦρος Μακράκης) macrakis at alum.mit.edu
Tue Jul 1 00:51:36 CEST 2014


Thanks for your reply, but your code doesn't do the same thing as mine.
Here's a very small example of what I'm trying to do.

# Test data

> dd <- data.table(groups=rep(1:2,each=4),time=1:8,hit=1:8%%3==0,key=c("groups","time"))
> dd
   groups time   hit
1:      1    1 FALSE
2:      1    2 FALSE
3:      1    3  TRUE
4:      1    4 FALSE
5:      2    5 FALSE
6:      2    6  TRUE
7:      2    7 FALSE
8:      2    8 FALSE

# Desired output includes the time and the corresponding roll time

> (res1 <- dd[(hit)][dd,list(rolltime=time),roll=2,by=.EACHI][!is.na(rolltime)])
   groups time rolltime
1:      1    3       3
2:      1    4       3
3:      2    6       6
4:      2    7       6
5:      2    8       6

# Undesired output (without .EACHI)

> (res2 <- dd[hit==1][dd,list(rolltime=time),roll=2][!is.na(rolltime)])
   rolltime
1:       1
2:       2
3:       3
4:       4
5:       5
6:       6
7:       7
8:       8

# Undesired output (with allow.cartesian)

> res3 <- dd[hit==1][dd,list(rolltime=time),roll=2,allow.cartesian=TRUE][!is.na(rolltime)]
> identical(res2,res3)
[1] TRUE

Re rolltime vs. time, consider the following:

> dd[(hit)][dd,time,roll=2,by=.EACHI]
   groups time time
1:      1    1   NA
2:      1    2   NA
3:      1    3    3
4:      1    4    3
5:      2    5   NA
6:      2    6    6
7:      2    7    6
8:      2    8    6

There are two different output columns named 'time'. One is the time from
the right relation of the join, the other is the time from the left
relation of the join. There is nothing like the i.time convention for
distinguishing the time that comes from one of the tables from the (rolled)
time that comes from the other.
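
One possible workaround, offered only as a sketch: give the hit table's time a second, distinctly named copy before joining, so the rolled value can be referenced without a name clash. The names `hits` and `hittime` below are illustrative choices, not anything data.table provides, and this assumes a data.table version in which X[Y] without by=.EACHI returns the full join result (1.9.3 or later).

```r
library(data.table)

# Same test data as above
dd <- data.table(groups = rep(1:2, each = 4), time = 1:8,
                 hit = 1:8 %% 3 == 0, key = c("groups", "time"))

# Keep only the hits and carry a copy of their time under a distinct name
# ('hits' and 'hittime' are illustrative names, not part of data.table)
hits <- dd[(hit), list(groups, time, hittime = time)]
setkey(hits, groups, time)

# Rolling join: 'time' takes its values from dd (the i table),
# while 'hittime' is the rolled time from hits (the x table)
res <- hits[dd, roll = 2][!is.na(hittime), list(groups, time, hittime)]
res
```

If this behaves as intended, res should hold the same rows as res1 above, with the rolled column called hittime instead of rolltime, and no by=.EACHI is needed.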

           -s



On Mon, Jun 30, 2014 at 5:34 PM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:

> Your example doesn’t work without allow.cartesian=TRUE.
>
> You *shouldn’t* be using by=.EACHI here. This by was what was implicit in
> the earlier versions which made it slow. Please re-read the README.
>
> Here’s the function I tested on 1.9.3:
>
> calc1 <- function(d) {
>     d[ hit==1][ d,list(hittime=time),roll=-20, allow.cartesian=TRUE][ !is.na(hittime)]
> }
>
> calc2 <- function(d) {
>   temp <- d[ hit==1][ d,list(time),roll=-20, allow.cartesian=TRUE]
>   setnames(temp,1,"hittime")
>   temp[!is.na(hittime)]
> }
>
> # Generate sample data
> set.seed(12312391)
> data <- data.table(
>           group = sample(1e3,1e7,replace=T),
>           time = ceiling(runif(1e7, 0, 1e5)),
>           hit = rbinom(1e7, 1, p = 0.1),
>   key=c("group","time"))
>
> system.time(ans1 <- calc1(data))
> #   user  system elapsed
> #  2.083   0.189   2.344
> system.time(ans2 <- calc2(data))
> #   user  system elapsed
> #  2.012   0.241   2.426
> identical(ans1, ans2) # [1] TRUE
>
> You write:
> I also don't see any way to refer to the different time vs. hittime without renaming the second time column.
>
> I don’t quite follow what this means, but IIUC I think this is what you’re
> referring to: https://github.com/Rdatatable/data.table/issues/471
>
> You write:
> You mention some FR's, but they're hard to find without the specific numbers.
>
> I was referring to the first two points under *NEW FEATURES* within Changes
> in v1.9.3: the one that starts with "by=.EACHI runs j for each group in x
> that each row of i joins to", and the one that starts with "Accordingly,
> X[Y, j] now does what X[Y][, j] did".
>
> Maybe we should start numbering the fixes for easy reference. Will note it
> down.
>
> You write: Where can I find the 1.9.3 reference manual?
>
> This version is a development version. Necessary changes will be reflected
> in their corresponding ?... entry. And when we find some time, the
> introduction and FAQs will be updated. But that’s not yet.
>
> If you don’t wish to keep up-to-date by looking at the NEWS, you’ll have
> to wait until the next stable release on CRAN.
>
> You write: On my system (MacOSX), build_vignettes=TRUE gives an error in texi2dvi -- would that have generated the refman? If so, how do I fix that?
>
> I’m guessing it’s a PDF latex error. If so, you’ll have to install what
> the error message says is missing on your system. Sorry, can’t help you
> much there.
>
>
> Arun
>
> From: Stavros Macrakis (Σταῦρος Μακράκης) macrakis at alum.mit.edu
> Reply: Stavros Macrakis (Σταῦρος Μακράκης) macrakis at alum.mit.edu
> Date: June 30, 2014 at 10:40:24 PM
> To: Arunkumar Srinivasan aragorn168b at gmail.com
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Subject:  Re: [datatable-help] Speeding up column references with roll
>
>  OK, I'm retesting in 1.9.3, adding by=.EACHI. I don't see any
> significant difference in the timings -- setnames is still 25% faster than
> list(hittime=time). What exactly was fixed?
>
> I also don't see any way to refer to the different time vs. hittime
> without renaming the second time column.
>
> You mention some FR's, but they're hard to find without the specific
> numbers.
>
> Where can I find the 1.9.3 reference manual? I think it would be easier to
> understand for me than the incremental changes in the New Features
> listings. On my system (MacOSX), build_vignettes=TRUE gives an error in
> texi2dvi -- would that have generated the refman? If so, how do I fix that?
>
>  Thanks,
>
>                -s
>
>
> On Mon, Jun 30, 2014 at 1:00 PM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>>  Once again, this has been fixed in 1.9.3. A join now requires an explicit
>> `by=.EACHI` to perform a by-without-by.
>>  https://github.com/Rdatatable/data.table/blob/master/README.md
>>  Have a look at the first FR (by = .EACHI runs ...) that's been fixed in
>> 1.9.3. These changes, which have been discussed for quite some time, alter
>> what a join returns in order to bring more consistency to the
>> DT[i, j, by] syntax. Also have a look at the second FR and the links it
>> points to for the discussions.
>>
>>  In general, it's better to test with the devel version (and have a look
>> at README) for any bugs you may encounter.
>>
>>  Arun
>>
>> From: Stavros Macrakis (Σταῦρος Μακράκης) macrakis at alum.mit.edu
>> Reply: Stavros Macrakis (Σταῦρος Μακράκης) macrakis at alum.mit.edu
>> Date: June 30, 2014 at 5:38:10 PM
>> To: datatable-help at r-forge.wu-wien.ac.at
>> Subject:  [datatable-help] Speeding up column references with roll
>>
>>     In the following example, it is about 15-25% faster to use setnames
>> rather than j=list(name=var). Is there some better approach to referencing
>> the other joined column when using roll?
>>
>>  # Use j=list(name=var)
>> calc1 <- function(d) {
>>   d[ hit==1
>>    ][ d,list(hittime=time),roll=-20
>>    ][ !is.na(hittime)
>>    ]
>> }
>>
>> # Use setnames
>> calc2 <- function(d) {
>>   temp <- d[ hit==1
>>            ][ d,time,roll=-20
>>            ]
>>   setnames(temp,3,"hittime")
>>   temp[!is.na(hittime)]
>> }
>>
>>  # Generate sample data
>> set.seed(12312391)
>> data <- data.table(
>>           group = sample(1e3,1e7,replace=T),
>>           time = ceiling(runif(1e7, 0, 1e5)),
>>           hit = rbinom(1e7, 1, p = 0.1),
>>   key=c("group","time"))
>>
>> # Timing
>>
>> system.time(replicate(10,{gc();calc1(data)})) => 69 sec
>> system.time(replicate(10,{gc();calc2(data)})) => 52 sec
>