[Rcpp-devel] Modules and Boost and larger data sets

Simon Zehnder szehnder at uni-bonn.de
Fri Sep 6 14:42:55 CEST 2013


Hi Dirk,

thanks for the quick answer and for the many suggestions and corrections you gave! I now have a better idea of how to design the package.

On Sep 6, 2013, at 2:20 PM, Dirk Eddelbuettel <edd at debian.org> wrote:

> 
> On 6 September 2013 at 13:46, Simon Zehnder wrote:
> | Dear Rcpp-Users and Rcpp-Devels,
> | 
> | this goes especially to Dirk and Romain, the developers of RcppBDT. 
> 
> Well it's mostly me for the scope of it, with numerous invaluable assists
> from Romain.  The released version is far behind the SVN version;
> unfortunately the SVN version is far from release-ready.
> 
I will know better next time. Looking forward to the release.

> | I am currently writing a package for market microstructure data -
> | usually large tick datasets with trade times and security symbols.
> 
> Interesting. I do that for a living too.
> 
Well, when doing research with tick data it is sometimes a pain in the ass to match trades with the last quotes or to match spot prices with futures prices. Market microstructure (MM) research relies heavily on this, and it takes up most of the time. So I am trying to build a package that can do most of it, and fast (my idea is to use OpenMP in C++ for ordering and filtering). Furthermore, the tick data most used in research are either NYSE/WRDS or, for bonds, TRACE (regarding the spot markets). So the package should also handle the special formats of these to make things easier. There is a package 'highfrequency' which does something similar, but it is not appropriate for TRACE data.
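
To make the matching step concrete, here is a minimal sketch in plain C++ (no Rcpp, no OpenMP) of pairing each trade with the last quote at or before it. The record types `Trade` and `Quote`, their fields, and the function name `match_last_quote` are made up for illustration, and timestamps are assumed to be integer epoch values with the quotes already sorted by time. It only shows the idea, not a finished design:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record types; field names are illustrative only.
struct Quote { std::int64_t time; double bid; double ask; };
struct Trade { std::int64_t time; double price; };

// For each trade, return the index of the last quote whose time is
// less than or equal to the trade time, or -1 if there is none.
// Assumption: quotes are sorted by time in ascending order.
std::vector<std::ptrdiff_t> match_last_quote(const std::vector<Trade>& trades,
                                             const std::vector<Quote>& quotes) {
    assert(std::is_sorted(quotes.begin(), quotes.end(),
        [](const Quote& a, const Quote& b) { return a.time < b.time; }));

    std::vector<std::ptrdiff_t> idx;
    idx.reserve(trades.size());
    for (const Trade& t : trades) {
        // Binary search for the first quote strictly later than the trade.
        auto it = std::upper_bound(quotes.begin(), quotes.end(), t.time,
            [](std::int64_t time, const Quote& q) { return time < q.time; });
        // The quote just before it (if any) is the prevailing one.
        idx.push_back((it - quotes.begin()) - 1);
    }
    return idx;
}
```

Each lookup is a binary search, so the cost is O(log n) per trade. The loop iterations are independent of one another, which is why a parallel loop looks feasible here, but I have not measured anything yet.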

> | I read the Rcpp Book about Modules and when starting as usual with S4
> | classes in R, the Modules came into my mind. As I am operating on datasets
> | with usually around 1 million rows I am wondering, if maybe the implementation
> | via Modules is the better (better in regard to performance) one - in
> 
> That is not usually the motivation for modules.
> 
> "Straight up" functions, coded via inline or attributes, will be as fast.
> 
That settles it for me: S4 classes in R. I know these very well by now.

> | comparison to the usual S4 class implementation directly in R. With Modules
> 
> "The usual S4 class implementation"?  
> 
> I have done R for over a decade and I still hardly use S4, so "the usual" is,
> errmm, "unusual".
> 
That is true. The S4 class system is not very close to OOP in C++ or Java, and it has a lot of limitations. It does give me a good way to structure my code, though. By 'usual' I meant writing S4 classes in R, not defining them in C++. As far as I understood from the Modules chapter of your book, S4 classes are built automatically once Modules are defined? Please correct me if I am wrong.

> | I am able to define all functions on the datasets in C++ - which I expect
> | to be faster. Sorting the data and filtering the data in regard to
> | dates/times are of course one of the main tasks to be covered.
> 
> I have some trouble with the logic of your argument, but accept the end
> result that Boost Date.Time is good for dates and times. :)
> 

It's all about performance. Sorry for being imprecise. I expect sorting and filtering data by date/time in C++ to be faster than doing it in R with POSIXlt/POSIXct (at least for larger datasets).
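
As a rough illustration of what I mean by sorting and filtering on the C++ side, here is a sketch using only the standard library. The `Tick` struct and the function names are hypothetical, and times are assumed to be stored as integer epoch values rather than as Boost Date.Time objects, which would need the Boost headers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical tick record; field names are illustrative only.
struct Tick { std::int64_t time; std::string symbol; double price; };

// Sort ticks by time; stable_sort keeps the original order of equal times.
void sort_by_time(std::vector<Tick>& ticks) {
    std::stable_sort(ticks.begin(), ticks.end(),
        [](const Tick& a, const Tick& b) { return a.time < b.time; });
}

// Return all ticks with from <= time < to.
// Assumption: ticks are already sorted by time.
std::vector<Tick> filter_window(const std::vector<Tick>& ticks,
                                std::int64_t from, std::int64_t to) {
    auto by_time = [](const Tick& t, std::int64_t v) { return t.time < v; };
    assert(std::is_sorted(ticks.begin(), ticks.end(),
        [](const Tick& a, const Tick& b) { return a.time < b.time; }));

    // Two binary searches bound the half-open window [from, to).
    auto first = std::lower_bound(ticks.begin(), ticks.end(), from, by_time);
    auto last = std::lower_bound(first, ticks.end(), to, by_time);
    return std::vector<Tick>(first, last);
}
```

Once the data are sorted, each window query costs two binary searches plus the copy of the selected rows. Whether this actually beats POSIXct-based subsetting in R for my data is something I still have to benchmark.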

> | In RcppBDT I read in the DESCRIPTION file, that the Boost Header Files for
> | Date.Time must be included.
> 
> "On the system on which RcppBDT is to be compiled" -- different from where it
> is used (Windows, say). _No run-time depends_

Ah, the binaries that can be loaded for each system ...

> | As I have to choose one library for Date/Time formats in C++, boost just
> | seems so appropriate. But for usage in the Market Microstructure community
> | it is impossible to expect them to install Boost on their system.
> 
> Sorry but one has nothing to do with the other.
> 
True, I just want my colleagues and other researchers in the field to be able to use it very easily. You give me the answer below.

> Also please look at the CRAN package BH -- it _provides_ Boost headers for
> this very purpose. Several packages already use it.
> 
> | So, I would like to provide Boost already within the package.
> 
> Just don't do it. Seriously. Use a "Depends: BH"
> 

That is perfect! Thanks for this valuable information. 

> | As everything you two do makes sense, I think I haven't yet grasped the
> | reason, why Boost is not provided in the RcppBDT right alongside. Is there
> | something which restricts me from doing this?
> 
> It's inefficient. We don't ship the headers of the C library either. 
> 
> It's just a Depends. 
> 
> Better to hand-off to the system, and with R, we can (at least for pure
> template headers) via the BH package we created.
> 
> | I am very thankful for thoughts and opinions on my idea and my question. 
> 
> Sure, no problem.
> 
> Dirk
> 
> -- 
> Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com

So, to conclude: thanks again for your valuable comments and tips.

Best

Simon


