[datatable-help] Quicker w/o keys set

Matthew Dowle mdowle at mdowle.plus.com
Fri Mar 22 12:05:18 CET 2013


 

Whilst what Rick and Michael said is very true, I suspect that
you've found that setting a key on a *numeric* type column is much
slower than setkey on an *integer* column. There was an awful (but
correct) benchmark on S.O. recently and that's what I replied, but I
can't find it now. All I can think is that the OP deleted the question,
which would be a shame. If that OP is watching, and that is what
happened, please can they undelete it. 

Also, you have a setkey(DT) there, with no columns specified. In that
case it will key all the columns; think of a key-only table. But if you
have numeric value columns in there as well, or any non-key columns at
all, then that will be wasteful.
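
To see what a bare setkey() call does, a small sketch may help. The toy tables and their contents below are made up for illustration (only the column names bb and V1 echo the posted code); the point is just that key() reports every column after setkey() with no columns, versus only the grouping column when it is named.

```r
library(data.table)

# Hypothetical toy table: one grouping column and one numeric value column.
DT_all <- data.table(bb = c("f2", "f1", "f2", "f1"),
                     V1 = c(0.4, 0.1, 0.3, 0.2))
DT_one <- copy(DT_all)

# No columns given: every column becomes part of the key,
# so the numeric value column V1 is sorted on as well.
setkey(DT_all)
key_all <- key(DT_all)

# Naming the grouping column keys (and sorts) on bb only.
setkey(DT_one, bb)
key_one <- key(DT_one)

print(key_all)
print(key_one)
```

Comparing the two printed keys shows which columns the sort had to cover in each case.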

Anyway, in the code you posted, try changing

    as.numeric(aa)

to

    as.integer(aa)

and you should see setkey run dramatically faster. Then what Rick and
Michael said applies from there.
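
A rough way to check this on your own machine is sketched below. The table size, the number of distinct values and the names dt_num, dt_int, t_num, t_int and t_grp are arbitrary choices for illustration, not taken from the posted code. The size of the gap depends on the data.table version in use, and may be much smaller on later versions, so no expected timings are shown.

```r
library(data.table)

set.seed(42)
n  <- 1e6                               # arbitrary size for illustration
aa <- sample(1e4, n, replace = TRUE)    # whole-number values, like a group id

dt_num <- data.table(aa = as.numeric(aa), V1 = runif(n))  # double key column
dt_int <- data.table(aa = as.integer(aa), V1 = runif(n))  # integer key column

# Time only the keying step for each column type.
t_num <- system.time(setkey(dt_num, aa))[["elapsed"]]
t_int <- system.time(setkey(dt_int, aa))[["elapsed"]]
cat("numeric key:", t_num, "s; integer key:", t_int, "s\n")

# With the key already set, time the grouped sum on its own.
t_grp <- system.time(res <- dt_int[, sum(V1), by = aa])[["elapsed"]]
cat("grouped sum on keyed table:", t_grp, "s\n")
```

Timing setkey() separately from the grouped sum keeps the one-off sorting cost out of the figure for the operation of interest.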

Matthew

On 22.03.2013 04:31, Ricardo Saporta wrote: 

> When you set the key, it sorts the table -- this is part of what allows
> for the speed.
> This initial sorting is what is slowing down your benchmarks.
>
> While it makes sense to compare the initial sort time if you are trying
> to get a 'full' comparison, in most practical applications, you will
> only be setting the key once.
>
> Therefore, if you want to see what sort of speed increases you are
> actually getting, create your DTs first, then benchmark the specific
> operations of interest.
>
> Also, searching stackoverflow for [r] data.table and benchmarks will
> produce several useful results.
>
> Cheers
> Rick
>
> On Thursday, March 21, 2013, ekbrown wrote:
> 
>> Hello. I'm new to data.table(). I am apparently not setting the keys
>> correctly to get the increase in speed talked about in the vignettes,
>> as I get a (much) quicker time *without* keys set. Take a look at the
>> following benchmarking tests. Any ideas? Thanks. Earl Brown
>> 
>> > library("data.table")
>> > library("rbenchmark")
>> >
>> > # generates random data
>> > num.files > num.words > logical.vector > file.names
>> >
>> > # defines functions
>> > benDTNoKey + dt + dt[,sum(V1), by = bb][,V1]
>> + }
>> >
>> > benDTWithKey + dt + setkey(dt)
>> + dt[,sum(V1), by = bb][,V1]
>> + }
>> >
>> > benTapply
>> >
>> > # runs benchmarking
>> > benchmark(benTapply(logical.vector, file.names),
>> > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector,
>> > file.names), replications = 10, columns = c("test", "replications",
>> > "elapsed"))
>>                                       test replications elapsed
>> 3   benDTNoKey(logical.vector, file.names)           10 *0.753*
>> 2 benDTWithKey(logical.vector, file.names)           10 *4.776*
>> 1    benTapply(logical.vector, file.names)           10 6.218
>> >
>> > # tests for sameness among results
>> > one > two > three
>> > identical(as.integer(one), as.integer(two))
>> [1] TRUE
>> > identical(as.integer(two), as.integer(three))
>> [1] TRUE
>> 
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [1]
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
> 
> --
> Ricardo Saporta
> Graduate Student, Data Analytics
> Rutgers University, New Jersey
> e: saporta at rutgers.edu [3]




Links:
------
[1] http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:saporta at rutgers.edu