[datatable-help] Beta v1.8.0
Matthew Dowle
mdowle at mdowle.plus.com
Tue Mar 6 10:35:19 CET 2012
Since the 1st item below is a (potentially) backwards-incompatible change,
this is advance notice and a request for testing; please send feedback if it
causes any problems. R-Forge's daily build seems much better since its reboot
and appears to be up to date.
NEW FEATURES
o character columns are now allowed in keys and are preferred to
factor. data.table() and setkey() no longer coerce character to
factor. Factors are still supported. Implements FR#1493, FR#1224
and (partially) FR#951.
o unique(DT) and duplicated(DT) are now faster with character columns,
on unkeyed tables as well as keyed tables, FR#1724.
o New function set(DT,i,j,value) allows fast assignment to elements
of DT. Similar to := but avoids the overhead of [.data.table, so is
much faster inside a loop. Less flexible than :=, but as flexible
as matrix subassignment. Similar in spirit to setnames(), setcolorder(),
setkey() and setattr(); i.e., assigns by reference with no copy at all.
library(data.table)
M = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(M)
DT = as.data.table(M)
system.time(for (i in 1:1000) DF[i,1L] <- i) # 591.000s (data.frame)
system.time(for (i in 1:1000) DT[i,V1:=i]) # 1.158s (:= via [.data.table)
system.time(for (i in 1:1000) M[i,1L] <- i) # 0.016s (matrix)
system.time(for (i in 1:1000) set(DT,i,1L,i)) # 0.027s (set)
o New functions chmatch() and %chin%, faster versions of match()
and %in% for character vectors. R's internal string cache is
utilised (no hash table is built). They are about 4 times faster
than match() on the example in ?chmatch.
o Internal function sortedmatch() removed and replaced with chmatch()
when matching i levels to x levels for columns of type 'factor'. This
preliminary step was causing a (known) significant slowdown when the
number of levels of a factor column was large (e.g. >10,000).
Exacerbated in tests of joining four such columns, as demonstrated by
Wes McKinney (author of Python package Pandas). Matching 1 million
strings, of which 600,000 are unique, is now reduced from 16s
to 0.5s, for example.
Background here:
http://stackoverflow.com/questions/8991709/why-are-pandas-merges-in-python-faster-than-data-table-merges-in-r
o rbind.data.table() gains a use.names argument, by default TRUE.
Set to FALSE to combine columns in order rather than by name.
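
A minimal sketch of three of the features above (character keys, chmatch()
and %chin%, and the use.names argument of rbind), assuming data.table is
installed and attached. The table and column names (DT, A, B, id, v, x, y)
are made up for illustration and do not come from the package tests.

```r
library(data.table)

# Character key: setkey() sorts by id and leaves the column as character.
DT = data.table(id = c("b", "a", "c"), v = 1:3)
setkey(DT, id)
key_class = class(DT$id)      # "character", not "factor"
hit = DT[J("b")]$v            # keyed join on the character column

# chmatch() and %chin% mirror match() and %in% for character vectors.
x = c("c", "z", "a")
pos = chmatch(x, DT$id)       # positions in the sorted key; NA when absent
present = x %chin% DT$id      # logical vector, one element per element of x

# rbind(): match columns by name (the default) or by position.
A = data.table(x = 1:2, y = 3:4)
B = data.table(y = 5:6, x = 7:8)
by_name  = rbind(A, B)                     # B's x goes under A's x
by_order = rbind(A, B, use.names = FALSE)  # B's 1st column goes under A's 1st
```

With use.names=FALSE the result keeps the column names of the first table,
so by_order$x holds B's y values in its last two rows.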
BUG FIXES
o Fixed a `suffixes` handling bug in merge.data.table that was
introduced during the recent "fast-merge"-ing reboot.
Briefly, the bug was only triggered in scenarios where both
tables had identical column names that were not part of `by` and
ended with *.1. See the "merge and auto-increment columns in y[x]"
test in tests/test-data.frame-like.R for more information.
o Adding a column using := on a data.table just loaded from disk was
correctly detected and over-allocated, but incorrectly warned about
a previous copy. Test 462 tested loading from disk, but suppressed
warnings (sadly). Fixed.
o data.table-unaware packages that use DF[i] and DF[i]<-value syntax
were not compatible with data.table. Fixed. Many thanks to Prasad
Chalasani for providing a reproducible example with base::droplevels().
Test added.
o as.data.table(DF) already preserved DF's attributes but not any
inherited classes such as nlme's groupedData, so nlme was incompatible
with data.table. Fixed. Thanks to Dieter Menne for providing a
reproducible example. Test added.
THANKS TO
o Joshua Ulrich for spotting a missing PACKAGE="data.table"
in .Call in setkey.R, and suggesting as.list.default() and
unique.default() to avoid dispatch for speed, all implemented.