<div dir="ltr">Sure. I'll write something up for the gallery, but here's the crude outline.<div><br></div><div style>Here's the C++ code:</div><div style><pre><font color="#000000">#include <Rcpp.h>
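#include <cstdio>  // sprintf
using namespace Rcpp;

// Build a data.frame from a list of equal-length columns by setting the
// attributes directly, without calling R's data.frame().
// [[Rcpp::export]]
List BuildCheapDataFrame(List a) {
  // Deep copy, so that the caller's list is not modified.
  List returned_frame = clone(a);

  // The first column (index 0) gives the row count; all columns are
  // assumed to have the same length.
  GenericVector sample_row = returned_frame(0);

  // Row names "0", "1", ...
  StringVector row_names(sample_row.length());
  for (int i = 0; i < sample_row.length(); ++i) {
    char name[5];  // 4 digits + terminator: i must stay below 10,000
    sprintf(name, "%d", i);
    row_names(i) = name;
  }
  returned_frame.attr("row.names") = row_names;

  // Column names "X.0", "X.1", ...
  StringVector col_names(returned_frame.length());
  for (int j = 0; j < returned_frame.length(); ++j) {
    char name[6];  // "X." + 3 digits + terminator: j must stay below 1,000
    sprintf(name, "X.%d", j);
    col_names(j) = name;
  }
  returned_frame.attr("names") = col_names;

  // Setting the class attribute is what makes R treat the list as a data.frame.
  returned_frame.attr("class") = "data.frame";
  return returned_frame;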
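}</font></pre>There are some subtleties in this code:</div><div style><br></div><div style>* It turns out that one can't send very large data frames to it because of possible buffer overflows: name[5] holds at most four digits plus the terminator, so more than 10,000 rows overflows it, and name[6] holds "X." plus three digits, so more than 1,000 columns overflows it. I've never seen that problem when I've written Rcpp functions which exchanged SEXPs with R, but this one uses Rcpp::export in order to use sourceCpp.</div>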
<div style>* Notice the invocation of clone() in the first line of the function body. Without it, the attribute assignments modify the caller's list in place, because an Rcpp List refers to the same underlying R object as its argument, and that is not what most people would expect.</div><div style><br></div><div style>
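The buffer limits in the first point go away if the names are built as std::string values instead of being formatted into fixed-size char arrays. Here is a minimal sketch in plain C++, assuming a C++11 compiler for std::to_string. MakeNames is a hypothetical helper for illustration, not part of Rcpp or of the code above. Inside BuildCheapDataFrame its result could presumably be handed back to R with Rcpp::wrap, which converts a std::vector of std::string to a character vector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): returns `count` names of the form
// prefix + index, with indices starting at 0 as in the loops above.
// std::to_string allocates as many characters as the number needs, so there
// is no fixed-size buffer to overflow.
std::vector<std::string> MakeNames(const std::string& prefix, std::size_t count) {
  std::vector<std::string> names;
  names.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    names.push_back(prefix + std::to_string(i));
  }
  return names;
}
```

With a helper like this, the row names would come from MakeNames("", n) and the column names from MakeNames("X.", m), where n and m are the row and column counts.<br><br>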
Here's the timing, as measured on an AWS node:</div><div style><br></div><div style><pre style="color:rgb(0,0,0)">> sourceCpp('/tmp/test_adf.cc')
> a <- replicate(250, 1:100, simplify=FALSE)
> system.time(replicate( { as.data.frame(a) ; NULL }, n=100))
   user  system elapsed
  3.890   0.000   3.892
> system.time(replicate( { BuildCheapDataFrame(a) ; NULL }, n=100))
   user  system elapsed
  0.020   0.000   0.022</pre>Yes, that really is a speedup of nearly a factor of 200 (about 177 by elapsed time, 195 by user time).</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Jan 18, 2013 at 8:16 AM, Paul Johnson <span dir="ltr"><<a href="mailto:pauljohn32@gmail.com" target="_blank">pauljohn32@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On Thu, Jan 17, 2013 at 9:54 PM, John Merrill <<a href="mailto:john.merrill@gmail.com">john.merrill@gmail.com</a>> wrote:<br>
> As of 2.15.1, data.frame appears to no longer be O(n^2) in the number of<br>
> columns in the frame. That's certainly an improvement, yes.<br>
><br>
> However, by eliminating calls to data.frame and replacing them with direct<br>
> class modifications, I can take a routine which takes minutes and reduce it<br>
> to a routine which takes seconds. So, pragmatically, in Rcpp, I can get a<br>
> rough factor of sixty, it appears.<br>
><br>
><br>
</div>Wow.<br>
<br>
When you have this written out, will you post links to it? I can<br>
learn from your examples, I think.<br>
<span class="HOEnZb"><font color="#888888"><br>
pj<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
<br>
<br>
> On Thu, Jan 17, 2013 at 7:46 PM, Paul Johnson <<a href="mailto:pauljohn32@gmail.com">pauljohn32@gmail.com</a>> wrote:<br>
>><br>
>> On Tue, Jan 15, 2013 at 9:20 AM, John Merrill <<a href="mailto:john.merrill@gmail.com">john.merrill@gmail.com</a>><br>
>> wrote:<br>
>> > It appears that DataFrame::create is a thin layer on top of the R<br>
>> > data.frame<br>
>> > call. That guarantees correctness, but also means the performance of an<br>
>> > Rcpp<br>
>> > routine which returns a large data frame is limited by the performance<br>
>> > of<br>
>> > data.frame -- which is utterly horrible.<br>
>><br>
>> Are you certain that this claim is still true?<br>
>><br>
>> I was shocked/surprised by the package "dataframe" and the commentary<br>
>> about it. The author said that data.frame was slow because it was<br>
>> repeatedly copying some objects, and he demonstrated a substantially<br>
>> faster approach. From the package description:<br>
>><br>
>> "This contains versions of standard data frame functions in R, modified to<br>
>> avoid making extra copies of inputs. This is faster, particularly for<br>
>> large data."<br>
>><br>
>> In the release notes for R-2.15.1, I recall seeing a note that R Core<br>
>> had responded by integrating several of those changes. But still<br>
>> data.frame is not fast for you?<br>
>><br>
>> If they didn't make the core data.frame as fast, would you care to<br>
>> enlighten us by installing the dataframe package and letting us know<br>
>> if it is still faster?<br>
>><br>
>> Or perhaps you are way ahead of me and you've already imitated<br>
>> Hesterberg's algorithms in your C++ design?<br>
>><br>
>> pj<br>
>><br>
>> --<br>
>> Paul E. Johnson<br>
>> Professor, Political Science Assoc. Director<br>
>> 1541 Lilac Lane, Room 504 Center for Research Methods<br>
>> University of Kansas University of Kansas<br>
>> <a href="http://pj.freefaculty.org" target="_blank">http://pj.freefaculty.org</a> <a href="http://quant.ku.edu" target="_blank">http://quant.ku.edu</a><br>
><br>
><br>
<br>
</div></div></blockquote></div><br></div>