[Rcpp-devel] Performance question about DataFrame

John Merrill john.merrill at gmail.com
Sat Jan 19 00:25:29 CET 2013


Sure.  I'll write something up for the gallery, but here's the crude
outline.

Here's the C++ code:

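#include <Rcpp.h>
#include <cstdio>   // std::snprintf

using namespace Rcpp;

// [[Rcpp::export]]
List BuildCheapDataFrame(List a) {
  // Deep-copy the input so that setting attributes below does not modify
  // the caller's list.
  List returned_frame = clone(a);
  // Assumes every column has the same length, so the first column gives
  // the row count.
  GenericVector sample_row = returned_frame(0);

  // Row names "0", "1", ...; 16 bytes holds any int plus the terminator.
  StringVector row_names(sample_row.length());
  for (int i = 0; i < sample_row.length(); ++i) {
    char name[16];
    std::snprintf(name, sizeof(name), "%d", i);
    row_names(i) = name;
  }
  returned_frame.attr("row.names") = row_names;

  // Column names "X.0", "X.1", ...
  StringVector col_names(returned_frame.length());
  for (int j = 0; j < returned_frame.length(); ++j) {
    char name[16];
    std::snprintf(name, sizeof(name), "X.%d", j);
    col_names(j) = name;
  }
  returned_frame.attr("names") = col_names;
  returned_frame.attr("class") = "data.frame";

  return returned_frame;
}

There are some subtleties in this code:

* The names are written into fixed-size buffers, so those buffers must be
large enough for the biggest index.  With undersized buffers and plain
sprintf (say, char name[5] for the row names), one can't send super-large
data frames to it because of buffer overflows; a 5-byte buffer overflows at
row 10000.  I've never seen that problem when I've written Rcpp functions
which exchanged SEXPs with R, but this one uses Rcpp::export in order to
use sourceCpp.
* Notice the invocation of clone() in the first line of the function.  If
you don't do that, you wind up modifying the caller's object as a side
effect, which is not what most people would expect.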
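
One way to sidestep fixed-size name buffers altogether is to build the
labels as std::string.  What follows is only a minimal sketch in plain C++
(it compiles without Rcpp); MakeLabels is a hypothetical helper name, not
part of Rcpp, and it has not been timed against the version above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): returns prefix + index for
// index = 0 .. count-1, for example "X.0", "X.1", ...
// std::to_string allocates whatever space each number needs, so no index
// can run past the end of a buffer.
std::vector<std::string> MakeLabels(const std::string& prefix,
                                    std::size_t count) {
  std::vector<std::string> labels;
  labels.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    labels.push_back(prefix + std::to_string(i));
  }
  return labels;
}
```

Inside the Rcpp function, each label could presumably be assigned to the
corresponding StringVector element, or the whole vector handed to
Rcpp::wrap; which of the two is cheaper is untested here.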

Here's the timing, as measured on an AWS node:

> sourceCpp('/tmp/test_adf.cc')
> a <- replicate(250, 1:100, simplify=FALSE)
> system.time(replicate( { as.data.frame(a) ; NULL }, n=100))
   user  system elapsed
  3.890   0.000   3.892
> system.time(replicate( { BuildCheapDataFrame(a) ; NULL }, n=100))
   user  system elapsed
  0.020   0.000   0.022

Yes, that really is a speedup of nearly a factor of 200 (3.892 s versus
0.022 s elapsed, or about 177x).


On Fri, Jan 18, 2013 at 8:16 AM, Paul Johnson <pauljohn32 at gmail.com> wrote:

> On Thu, Jan 17, 2013 at 9:54 PM, John Merrill <john.merrill at gmail.com>
> wrote:
> > As of 2.15.1, data.frame appears to no longer be O(n^2) in the number of
> > columns in the frame.  That's certainly an improvement, yes.
> >
> > However, by eliminating calls to data.frame and replacing them with
> > direct class modifications, I can take a routine which takes minutes
> > and reduce it to a routine which takes seconds.  So, pragmatically, in
> > Rcpp, I can get a rough factor of sixty, it appears.
> >
> >
> Wow.
>
> When you have this written out, will you post links to it?  I can
> learn from your examples, I think.
>
> pj
>
>
>
> > On Thu, Jan 17, 2013 at 7:46 PM, Paul Johnson <pauljohn32 at gmail.com>
> > wrote:
> >>
> >> On Tue, Jan 15, 2013 at 9:20 AM, John Merrill <john.merrill at gmail.com>
> >> wrote:
> >> > It appears that DataFrame::create is a thin layer on top of the R
> >> > data.frame call.  That guarantees correctness, but also means the
> >> > performance of an Rcpp routine which returns a large data frame is
> >> > limited by the performance of data.frame -- which is utterly horrible.
> >>
> >> Are you certain that this claim is still true?
> >>
> >> I was shocked/surprised by the package "dataframe" and the commentary
> >> about it. The author describes it this way: "This contains versions of
> >> standard data frame functions in R, modified to avoid making extra
> >> copies of inputs. This is faster, particularly for large data."
> >>
> >> He said that data.frame was slow because it was repeatedly copying
> >> some objects, and he provided a substantially faster approach.
> >>
> >> In the release notes for R-2.15.1, I recall seeing a note that R Core
> >> had responded by integrating several of those changes. But still
> >> data.frame is not fast for you?
> >>
> >> If they didn't make the core data.frame as fast, would you care to
> >> enlighten us by installing the dataframe package and letting us know
> >> if it is still faster?
> >>
> >> Or perhaps you are way ahead of me and you've already imitated
> >> Hesterberg's algorithms in your C++ design?
> >>
> >> pj
> >>
> >> --
> >> Paul E. Johnson
> >> Professor, Political Science      Assoc. Director
> >> 1541 Lilac Lane, Room 504      Center for Research Methods
> >> University of Kansas                 University of Kansas
> >> http://pj.freefaculty.org               http://quant.ku.edu