[Rprotobuf-yada] Extending RProtoBuf to read elephant-bird-style block-serialized protobufs
johansen at adobe.com
Fri Mar 1 02:08:48 CET 2013
So I recently asked about using writeDelimitedTo(...) in RProtoBuf. Since then, I've decided against that approach and am instead looking into the block serialization format used by the Java elephant-bird library, which aims to make Hadoop work with protocol buffer data. I'm pleased with what that gives me (Hadoop Map/Reduce, Hive, Pig, etc. support; splittability; small serialization size; compression). However, my use case still requires data to be usable in R. To that end, I'm interested in extending the RProtoBuf library to read elephant-bird block-serialized protocol buffers. Is RProtoBuf the right place to implement this capability? If so, what design guidance can you give me?
A bit more information about the block serialization format can be found in the BinaryBlockReader<https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java> and BinaryBlockWriter<https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockWriter.java> classes, and in block_storage.proto<https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/protobuf/block_storage.proto>. block_storage.proto defines the SerializedBlock message. BinaryBlockWriter basically stuffs serialized messages of the target type into the `repeated bytes proto_blobs = 3;` field in SerializedBlock, split over multiple SerializedBlocks. The example given in the .proto file:
SerializedBlock block = SerializedBlock.newBuilder().setVersion(1)
The SerializedBlock objects are then serialized in standard protobuf fashion, then written to the output stream with a certain byte sequence as delimiter, plus size information.
At any rate, a port to C++ of the reader and writer classes would probably not be too bad. The hard part from my perspective is how to connect that to the world of R in general, and to RProtoBuf in particular. Thoughts?
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Rprotobuf-yada