[Rcpp-devel] Speed up of the data.frame creation in DataFrame.h

Romain François romain at r-enthusiasts.com
Sat Jun 7 13:21:39 CEST 2014

On 7 June 2014 at 03:27, Dmitry Nesterov <dmitry.nesterov at gmail.com> wrote:

> Hello,
> I would like to report that creating Rcpp DataFrame objects is slow, and to propose a change that speeds it up.
> For system information, here is output from sessionInfo():
> R version 3.1.0 (2014-04-10)
> Platform: x86_64-apple-darwin13.1.0 (64-bit)
> ...
> other attached packages:
> [1] microbenchmark_1.3-0 Rcpp_0.11.1         
> I am using the Rcpp package to port my old functions, written with R's C interface, to the more convenient Rcpp style.
> While writing code that creates data frames, I noticed (using the microbenchmark package) that the Rcpp-based code ran considerably slower than my old implementation: roughly 40(!) times slower for a 50x2 (rows x columns) data frame.
> I have narrowed the speed difference down to the following call:
>    return Rcpp::DataFrame::create(Rcpp::Named("xdata")=x,
>                                   Rcpp::Named("ydata")=y);
> Where x and y are Rcpp::NumericVector objects.
> By stepping through my code and Rcpp in a debugger, I noticed that during creation Rcpp calls "as.data.frame" on the list holding the x and y vectors and their names "xdata" and "ydata". My previous code, which used the C interface, did not need this step.

Well, how then do you guarantee that the data frame is not corrupt?

Consider this code: 

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
DataFrame test(){
  NumericVector x = NumericVector::create( 1, 2, 3, 4 ) ;
  NumericVector y = NumericVector::create( 1, 2 ) ;
  return DataFrame::create(_["x"] = x, _["y"] = y ) ;
}

The benefit of calling as.data.frame is that it would handle recycling y correctly. 
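
To make the recycling concrete, here is a minimal standalone sketch in plain C++ (no Rcpp, so it compiles on its own). The `Column` alias and the `recycle_columns` function are hypothetical names used only for illustration and are not part of Rcpp. The rule modelled is the one R's data.frame() applies: a shorter column is repeated a whole number of times to match the longest one, and anything else is an error. This is an approximation of that behaviour under those assumptions, not the actual R implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a numeric column; not an Rcpp type.
using Column = std::vector<double>;

// Recycle every column to the length of the longest one.
// Throws if a column's length does not divide the target length evenly.
std::vector<Column> recycle_columns(const std::vector<Column>& cols) {
    std::size_t nrow = 0;
    for (const Column& c : cols) nrow = std::max(nrow, c.size());

    std::vector<Column> out;
    out.reserve(cols.size());
    for (const Column& c : cols) {
        if (c.size() == nrow) {          // already the right length
            out.push_back(c);
            continue;
        }
        if (c.empty() || nrow % c.size() != 0)
            throw std::invalid_argument("columns imply differing number of rows");
        Column recycled(nrow);
        for (std::size_t i = 0; i < nrow; ++i)
            recycled[i] = c[i % c.size()];   // repeat the values cyclically
        out.push_back(recycled);
    }
    return out;
}
```

With the x and y from the example above, y would come out as 1, 2, 1, 2. A y of length 3 would be rejected instead of being silently stored next to a column of length 4.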

Just setting the class attribute to "data.frame" by brute force would make a corrupt data frame. Perhaps you can get your suggestion approved on the basis of being consistent with other ways to get corrupt data frames in Rcpp. 

The basic idea is valid, but this would need more work and understanding of the conceptual requirements of a data frame. 
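
As a rough sketch of what such a check could look like, the standalone C++ below decides whether a named list could be tagged directly or should still go through as.data.frame. `ColumnInfo`, `Path` and `choose_path` are hypothetical names for illustration only; nothing here is existing Rcpp API, and the two conditions tested (every column is named, all columns have the same length) are assumptions about a minimal set of requirements, not a complete list.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one column of a named list; not an Rcpp type.
struct ColumnInfo {
    std::string name;
    std::size_t length;
};

enum class Path { TagDirectly, ConvertWithAsDataFrame };

// Take the fast path only when every column is named and all lengths agree.
// Anything else falls back to the slower but safer conversion.
Path choose_path(const std::vector<ColumnInfo>& cols) {
    for (const ColumnInfo& c : cols) {
        if (c.name.empty()) return Path::ConvertWithAsDataFrame;
        if (c.length != cols.front().length) return Path::ConvertWithAsDataFrame;
    }
    return Path::TagDirectly;
}
```

Even on the direct path, a data frame also needs a "row.names" attribute next to the class attribute, so setting the class alone would not be enough.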


> In Rcpp/DataFrame.h:87
>       static DataFrame_Impl from_list( Parent obj ){
> This in turn calls on line 104:
>                return DataFrame_Impl(obj) ;
> and which ultimately calls on line 78:
>        void set__(SEXP x){
>            if( ::Rf_inherits( x, "data.frame" )){
>                Parent::set__( x ) ;
>            } else{
>                SEXP y = internal::convert_using_rfunction( x, "as.data.frame" ) ;
>                Parent::set__( y ) ;
>            }
>        }
> Since DataFrame::create() has not yet set the class attribute to "data.frame" at this point, the "as.data.frame" conversion takes place and slows down the creation of the final object.
> I propose changing line 103 to set the class attribute to "data.frame", so that no further conversion takes place:
>            if( use_default_strings_as_factors ) {
>                Rf_setAttrib(obj, R_ClassSymbol, Rf_mkString("data.frame"));
>                return DataFrame_Impl(obj) ;
>            }
> I tested this change, and it brought the function's execution time to about the same as my earlier plain C API version.
> Please let me know whether this makes sense, or whether I should be using DataFrame::create() differently.
> Best,
> Dmitry
