[datatable-help] Beta v1.8.0

Matthew Dowle mdowle at mdowle.plus.com
Tue Mar 6 10:35:19 CET 2012


Since the 1st item is a (potentially) backwards-incompatible change, this
is advance notice: please test, and send feedback if it causes any
problems. R-Forge's daily build seems much better since its reboot
and appears to be up to date.


NEW FEATURES

o  character columns are now allowed in keys and are preferred to
   factor. data.table() and setkey() no longer coerce character to
   factor. Factors are still supported. Implements FR#1493, FR#1224
   and (partially) FR#951.
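
   A minimal sketch of keying on a character column, assuming data.table
   1.8.0 or later is installed. The table and the column names `id` and `v`
   are made up for illustration.

```r
library(data.table)

# Toy table with a character column; names are illustrative only.
DT <- data.table(id = c("b", "a", "c"), v = 1:3)

# Key on the character column. setkey() sorts DT by reference.
setkey(DT, id)

# The key column is still character; it has not been coerced to factor.
is.character(DT$id)

# Join on the character key using J().
res <- DT[J("b")]
res
```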

o  unique(DT) and duplicated(DT) are now faster with character columns,
   on unkeyed tables as well as keyed tables, FR#1724.
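
   For illustration, a small sketch of unique() and duplicated() on an
   unkeyed table with character columns. The data and the column names `g`
   and `h` are invented, and the example shows only the behaviour, not
   the speed-up.

```r
library(data.table)

# Unkeyed toy table; the third row repeats the first.
DT <- data.table(g = c("a", "b", "a"), h = c("x", "y", "x"))

dup <- duplicated(DT)   # logical vector flagging repeated rows
u   <- unique(DT)       # table with the repeated row dropped

dup
u
```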

o  New function set(DT,i,j,value) allows fast assignment to elements
   of DT. Similar to := but avoids the overhead of [.data.table, so is
   much faster inside a loop. Less flexible than :=, but as flexible
   as matrix subassignment. Similar in spirit to setnames(), setcolorder(),
   setkey() and setattr(); i.e., assigns by reference with no copy at all.

      library(data.table)
      M = matrix(1,nrow=100000,ncol=100)
      DF = as.data.frame(M)
      DT = as.data.table(M)
      system.time(for (i in 1:1000) DF[i,1L] <- i)   # 591.000s
      system.time(for (i in 1:1000) DT[i,V1:=i])     #   1.158s
      system.time(for (i in 1:1000) M[i,1L] <- i)    #   0.016s
      system.time(for (i in 1:1000) set(DT,i,1L,i))  #   0.027s

o  New functions chmatch() and %chin%, faster versions of match()
   and %in% for character vectors. R's internal string cache is
   utilised (no hash table is built). They are about 4 times faster
   than match() on the example in ?chmatch.
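
   A short sketch comparing the new functions with their base equivalents.
   The vectors `x` and `tbl` are made-up inputs; the timing example itself
   is in ?chmatch.

```r
library(data.table)

# Illustrative character vectors.
x   <- c("b", "z", "a")
tbl <- c("a", "b", "c")

pos  <- chmatch(x, tbl)   # positions of x in tbl, NA where absent
isin <- x %chin% tbl      # logical membership test

# Both should agree with the base R versions on character input.
identical(pos, match(x, tbl))
identical(isin, x %in% tbl)
```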

o  Internal function sortedmatch() removed and replaced with chmatch()
   when matching i levels to x levels for columns of type 'factor'. This
   preliminary step was causing a (known) significant slowdown when the
   number of levels of a factor column was large (e.g. >10,000).
   Exacerbated in tests of joining four such columns, as demonstrated by
   Wes McKinney (author of Python package Pandas). Matching 1 million
   strings of which 600,000 are unique is now reduced from 16s
   to 0.5s, for example.
   Background here :
       http://stackoverflow.com/questions/8991709/why-are-pandas-merges-in-python-faster-than-data-table-merges-in-r

o  rbind.data.table() gains a use.names argument, by default TRUE.
   Set to FALSE to combine columns in order rather than by name.
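
   The difference between the two settings can be sketched as follows.
   The tables `A` and `B` are invented for the example, with the same
   columns in a different order.

```r
library(data.table)

# Same column names, different column order; values are illustrative.
A <- data.table(x = 1:2, y = 3:4)
B <- data.table(y = 5:6, x = 7:8)

# Match columns by name (the default).
byname <- rbind(A, B, use.names = TRUE)

# Match columns by position, ignoring names in B.
bypos <- rbind(A, B, use.names = FALSE)

byname
bypos
```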

BUG FIXES

o  Fixed a `suffixes` handling bug in merge.data.table that was
   introduced during the recent "fast-merge"-ing reboot.
   Briefly, the bug was only triggered when both tables had
   identical column names that were not part of `by` and
   ended with *.1. See the "merge and auto-increment columns in y[x]"
   test in tests/test-data.frame-like.R for more information.

o  Adding a column using := on a data.table just loaded from disk was
   correctly detected and over-allocated, but incorrectly warned about
   a previous copy. Test 462 tested loading from disk, but suppressed
   warnings (sadly). Fixed.

o  data.table-unaware packages that use DF[i] and DF[i]<-value syntax
   were not compatible with data.table. Fixed. Many thanks to Prasad
   Chalasani for providing a reproducible example with base::droplevels().
   Test added.

o  as.data.table(DF) already preserved DF's attributes but not any
   inherited classes such as nlme's groupedData, so nlme was incompatible
   with data.table. Fixed. Thanks to Dieter Menne for providing a
   reproducible example. Test added.

THANKS TO

o  Joshua Ulrich for spotting a missing PACKAGE="data.table"
   in .Call in setkey.R, and suggesting as.list.default() and
   unique.default() to avoid dispatch for speed, all implemented.




