[Rcpp-devel] Speed up of the data.frame creation in DataFrame.h

Romain François romain at r-enthusiasts.com
Sat Jun 7 15:10:09 CEST 2014


Hello, 

I was merely pointing out the problem. People who maintain and contribute to Rcpp will tell you what they expect. I am no longer one of them. So I don’t really care either way, unless it starts adding a bug that will cause issues for software I’m involved with that still has to depend on Rcpp for reasons out of my control.

On a general note, I’d argue that it makes sense to submit the pull request anyway: it creates a dedicated place where you can discuss the proposal, and it triggers continuous testing, so Travis will tell you if you break something.

Romain

On 7 June 2014 at 14:35, Dmitry Nesterov <dmitry.nesterov at gmail.com> wrote:

> Hello Romain,
> Maybe another function, such as force_create(), could then be made available? Or a check that every vector has the same number of elements.
> One of Rcpp's main advantages for users is its flexibility and speed compared with plain R code.
> I am not sure at this point what solution would be the best, but having fast methods in Rcpp would be really great.
> Should I wait then before submitting the pull request?
> Dmitry
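
As a rough illustration of the check for equal element counts suggested above, here is a minimal sketch in plain C++, kept free of Rcpp headers so that it stands alone. `columns_have_equal_length` is a hypothetical name, not an existing Rcpp function, and the columns are modelled as `std::vector<double>` rather than R vectors.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical check (not an Rcpp function): returns true when every column
// has the same number of elements, i.e. when no recycling would be needed.
bool columns_have_equal_length(const std::vector<std::vector<double>>& cols) {
    for (std::size_t i = 1; i < cols.size(); ++i) {
        if (cols[i].size() != cols[0].size()) {
            return false;  // at least one column differs from the first
        }
    }
    return true;  // also true for zero or one column
}
```

A fast path could be taken only when this returns true, with the existing as.data.frame conversion kept as the fallback. Note that a well-formed data frame also carries a row.names attribute in addition to its class, so a length check alone would probably not be sufficient.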
> 
> On Jun 7, 2014, at 7:21 AM, Romain François <romain at r-enthusiasts.com> wrote:
> 
>> 
>> On 7 June 2014 at 03:27, Dmitry Nesterov <dmitry.nesterov at gmail.com> wrote:
>> 
>>> Hello,
>>> I am reporting slow creation of Rcpp DataFrame objects and proposing a change to speed it up.
>>> For system information, here is output from sessionInfo():
>>> R version 3.1.0 (2014-04-10)
>>> Platform: x86_64-apple-darwin13.1.0 (64-bit)
>>> ...
>>> other attached packages:
>>> [1] microbenchmark_1.3-0 Rcpp_0.11.1         
>>> 
>>> I am using the Rcpp package to port my old functions, written with R's C interface, to the more convenient Rcpp style.
>>> While writing code that creates data frames, I noticed that the Rcpp-based code ran much slower (measured with the microbenchmark package) than my old implementation: approximately 40(!) times slower for a 50x2 (rows x columns) data frame.
>>> 
>>> I have narrowed the speed difference down to the following call:
>>> 
>>>   return Rcpp::DataFrame::create(Rcpp::Named("xdata")=x,
>>>                                  Rcpp::Named("ydata")=y);
>>> 
>>> Where x and y are Rcpp::NumericVector objects.
>>> Stepping through my code and Rcpp in a debugger, I noticed that during creation Rcpp applies the “as.data.frame” conversion to the list holding the x and y vectors and their names “xdata” and “ydata”, whereas this step was not needed in my previous code using the C interface.
>> 
>> Well, how then do you guarantee that the data frame is not corrupt?
>> 
>> Consider this code: 
>> 
>> #include <Rcpp.h>
>> using namespace Rcpp ;
>> 
>> // [[Rcpp::export]]
>> DataFrame test(){
>>  NumericVector x = NumericVector::create( 1, 2, 3, 4 ) ;
>>  NumericVector y = NumericVector::create( 1, 2 ) ;
>>  return DataFrame::create(_["x"] = x, _["y"] = y ) ;
>> }
>> 
>> The benefit of calling as.data.frame is that it would handle recycling y correctly. 
>> 
>> Just setting the class attribute to "data.frame" by brute force would make a corrupt data frame. Perhaps you can get your suggestion approved on the basis of being consistent with other ways to get corrupt data frames in Rcpp. 
>> https://github.com/RcppCore/Rcpp/issues/144 
>> 
>> The basic idea is valid, but this would need more work and understanding of the conceptual requirements of a data frame. 
>> 
>> Romain
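
To make the recycling point concrete, here is a small sketch in plain C++ (no Rcpp headers, so it compiles on its own). `recycle_to` is a hypothetical helper, not part of Rcpp or R; it assumes the rule that data.frame() applies to atomic columns, where a shorter column is repeated only if its length divides the number of rows evenly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Rcpp or R): repeat `col` until it has
// `nrow` elements, provided its length divides `nrow` evenly.
std::vector<double> recycle_to(const std::vector<double>& col, std::size_t nrow) {
    if (col.size() == nrow) {
        return col;  // already the right length, nothing to recycle
    }
    if (col.empty() || nrow % col.size() != 0) {
        throw std::invalid_argument("column length does not divide the row count");
    }
    std::vector<double> out;
    out.reserve(nrow);
    for (std::size_t i = 0; i < nrow; ++i) {
        out.push_back(col[i % col.size()]);  // cycle through the short column
    }
    return out;
}
```

For the test() example above, recycle_to(y, 4) with y = {1, 2} gives {1, 2, 1, 2}, whereas a column of length 3 would be rejected. Setting the class attribute alone performs neither step.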
>> 
>> 
>>> In Rcpp/DataFrame.h:87
>>>      static DataFrame_Impl from_list( Parent obj ){
>>> This in turn calls on line 104:
>>>               return DataFrame_Impl(obj) ;
>>> which ultimately calls, on line 78:
>>>       void set__(SEXP x){
>>>           if( ::Rf_inherits( x, "data.frame" )){
>>>               Parent::set__( x ) ;
>>>           } else{
>>>               SEXP y = internal::convert_using_rfunction( x, "as.data.frame" ) ;
>>>               Parent::set__( y ) ;
>>>           }
>>>       }
>>> Since DataFrame::create() has not yet set the class attribute to “data.frame” at this point, the “as.data.frame” conversion takes place and slows down creation of the final object.
>>> I propose changing line 103 to set the class attribute to “data.frame”, so that no further conversion takes place:
>>>           if( use_default_strings_as_factors ) {
>>>               Rf_setAttrib(obj, R_ClassSymbol, Rf_mkString("data.frame"));
>>>               return DataFrame_Impl(obj) ;
>>>           }
>>> 
>>> I tested it, and it brought the function's execution speed to about what it was with the plain C API.
>>> Please let me know whether this makes sense, or whether I should be using DataFrame::create() differently.
>>> 
>>> Best,
>>> Dmitry
>>> 
> 
