<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); "><div style="font-size: 14px; font-family: Calibri, sans-serif; ">I recently asked about using writeDelimitedTo(...) in RProtoBuf. Since then, I've decided against that approach and am instead looking into the block serialization format used by the Java elephant-bird library, which aims to make Hadoop work with protocol buffer data. I'm pleased with what that gives me (support for Hadoop Map/Reduce, Hive, Pig, etc.; splittability; small serialized size; compression). However, my use case still requires the data to be usable in R. To that end, I'm interested in extending the RProtoBuf library to read elephant-bird block-serialized protocol buffers. Is RProtoBuf the right place to implement this capability? If so, what design guidance can you give me?</div><div style="font-size: 14px; font-family: Calibri, sans-serif; "><br></div><div><span class="Apple-style-span" style="font-size: 14px; font-family: Calibri, sans-serif; ">More information about the block serialization format can be found in the <a href="https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java">BinaryBlockReader</a> and <a href="https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockWriter.java">BinaryBlockWriter</a> classes, and in <a href="https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/protobuf/block_storage.proto">block_storage.proto</a>, which defines the SerializedBlock message. </span>BinaryBlockWriter packs serialized messages of the target type into the `repeated bytes proto_blobs = 3;` field of SerializedBlock, spreading them across multiple SerializedBlock messages. 
The .proto file gives this example:</div><div><span class="Apple-tab-span" style="white-space:pre"> </span>SerializedBlock block = SerializedBlock.newBuilder().setVersion(1)</div><div> .setProtoClassName(Status.class.getName())</div><div> .addProtoBlobs(status1.toByteString())</div><div> .addProtoBlobs(status2.toByteString())</div><div> .build();</div><div><br></div><div>Each SerializedBlock is then serialized in the standard protobuf way and written to the output stream along with a fixed delimiter byte sequence and size information. </div><div><br></div><div>Porting the reader and writer classes to C++ should be fairly straightforward. The hard part, from my perspective, is connecting that code to R in general and to RProtoBuf in particular. Thoughts?</div><div>- Josh</div></body></html>