[datatable-help] Quicker w/o keys set
Matthew Dowle
mdowle at mdowle.plus.com
Fri Mar 22 13:01:06 CET 2013
And this nice answer by Michael might be of interest too:
http://stackoverflow.com/a/13694673/403310
On 22.03.2013 11:05, Matthew Dowle wrote:
> Whilst what Rick and Michael said is very true, I suspect that you've
> found that setting a key on a *numeric* type column is much slower than
> setkey on an *integer* column. There was an awful (but correct) benchmark
> on S.O. recently and that's what I replied, but I can't find it now. All I
> can think is that the OP deleted the question, which would be a shame. If
> that OP is watching, and that is what happened, could they please undelete
> it.
>
> Also you have a setkey(DT) there, with no columns specified. In that
> case, it will key all the columns; think of a key-only table. But if you
> have numeric value columns in there as well, or any non-key columns at
> all, then that will be wasteful.
>
> Anyway, in the code you posted, try changing
>
>   as.numeric(aa)
>
> to
>
>   as.integer(aa)
>
> and you should see setkey run dramatically faster. Then what Rick and
> Michael said applies from there.
>
> Matthew
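
For anyone wanting to try the two points above, here is a minimal sketch in R. The table names (DT_num, DT_int, DT_all), the column names and the random data are made up for illustration and are not the original poster's code. Timings depend heavily on the data.table version and the machine, so the gap between numeric and integer keys may be large, small or absent on your setup.

```r
library(data.table)

set.seed(1)
n <- 5e5
ids <- sample(1000L, n, replace = TRUE)   # hypothetical grouping values

# Same values, stored once as double ("numeric") and once as integer
DT_num <- data.table(id = as.numeric(ids), v = runif(n))
DT_int <- data.table(id = as.integer(ids), v = runif(n))

# Time only the setkey() call on each table
t_num <- system.time(setkey(DT_num, id))[["elapsed"]]
t_int <- system.time(setkey(DT_int, id))[["elapsed"]]
cat("numeric key:", t_num, "s; integer key:", t_int, "s\n")

# setkey() with no columns named keys on every column of the table
DT_all <- data.table(a = c(2L, 1L), b = c("y", "x"))
setkey(DT_all)
print(key(DT_all))
```

Naming the key column explicitly, as in setkey(DT_int, id), avoids sorting on value columns that are never used as a key.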
>
> On 22.03.2013 04:31, Ricardo Saporta wrote:
>
>> When you set the key, it sorts the table -- this is part of what allows
>> for the speed.
>> This initial sorting is what is slowing down your benchmarks.
>>
>> While it makes sense to include the initial sort time if you are trying
>> to get a 'full' comparison, in most practical applications you will only
>> be setting the key once.
>>
>> Therefore, if you want to see what sort of speed increases you are
>> actually getting, create your DTs first, then benchmark the specific
>> operations of interest.
>>
>> Also, searching stackoverflow for [r] data.table and benchmarks will
>> produce several useful results.
>>
>> Cheers
>> Rick
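
A rough sketch of what Rick describes: build the tables (and set the key) first, then time only the grouping step. The data below is randomly generated for illustration, since the original data-generation code is not readable in the archive; the names dt_nokey, dt_key, r_nokey and r_key are hypothetical, while V1 and bb follow the column names used in the original post.

```r
library(data.table)

set.seed(1)
n <- 1e5
aa <- sample(c(TRUE, FALSE), n, replace = TRUE)               # stand-in for logical.vector
bb <- sample(sprintf("file%03d", 1:200), n, replace = TRUE)   # stand-in for file.names

# Setup, done once and kept outside the timed region
dt_nokey <- data.table(V1 = as.integer(aa), bb = bb)
dt_key <- copy(dt_nokey)
setkey(dt_key, bb)          # the one-time sort happens here

# Time only the operation of interest
t_nokey <- system.time(r_nokey <- dt_nokey[, sum(V1), by = bb])[["elapsed"]]
t_key   <- system.time(r_key   <- dt_key[, sum(V1), by = bb])[["elapsed"]]
cat("no key:", t_nokey, "s; with key:", t_key, "s\n")

# Groups come back in a different order, so sort before comparing
same <- identical(r_nokey[order(bb)]$V1, r_key$V1)
print(same)
```

A single run on a table this small is noisy; repeating the timed step (for example with rbenchmark, as in the original post) gives steadier numbers.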
>>
>> On Thursday, March 21, 2013, ekbrown wrote:
>>
>>> Hello. I'm new to data.table(). I am apparently not setting the keys
>>> correctly to get the increase in speed talked about in the vignettes,
>>> as I get a (much) quicker time *without* keys set. Take a look at the
>>> following benchmarking tests. Any ideas? Thanks. Earl Brown
>>>
>>> > library("data.table")
>>> > library("rbenchmark")
>>> >
>>> > # generates random data
>>> > num.files > num.words > logical.vector > file.names >
>>> > # defines functions
>>> > benDTNoKey
>>> + dt + dt[,sum(V1), by = bb][,V1]
>>> + }
>>> >
>>> > benDTWithKey + dt
>>> + setkey(dt)
>>> + dt[,sum(V1), by = bb][,V1]
>>> + }
>>> >
>>> > benTapply >
>>> > # runs benchmarking
>>> > benchmark(benTapply(logical.vector, file.names),
>>> > benDTWithKey(logical.vector, file.names), benDTNoKey(logical.vector,
>>> > file.names), replications = 10, columns = c("test", "replications",
>>> > "elapsed"))
>>>                                       test replications elapsed
>>> 3   benDTNoKey(logical.vector, file.names)           10 *0.753*
>>> 2 benDTWithKey(logical.vector, file.names)           10 *4.776*
>>> 1    benTapply(logical.vector, file.names)           10   6.218
>>> >
>>> > # tests for sameness among results
>>> > one > two > three
>>> > identical(as.integer(one), as.integer(two))
>>> [1] TRUE
>>> > identical(as.integer(two), as.integer(three))
>>> [1] TRUE
>>>
>>>
>>> --
>>> View this message in context: http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html [1]
>>> Sent from the datatable-help mailing list archive at Nabble.com.
>>> _______________________________________________
>>> datatable-help mailing list
>>> datatable-help at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [2]
>>
>> --
>>
>> Ricardo Saporta
>> Graduate Student, Data Analytics
>> Rutgers University, New Jersey
>> e: saporta at rutgers.edu [3]
Links:
------
[1] http://r.789695.n4.nabble.com/Quicker-w-o-keys-set-tp4662157.html
[2] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[3] mailto:saporta at rutgers.edu