<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; "><div>I have now implemented a method in wrapper_Descriptor.cpp called "readMessagesFromFile" that reads messages serialized using the elephant-bird block serialization format. I'm working on returning these messages as a well-formed DataFrame but have a question.</div><div><br></div><div>wrapper_Message.cpp appears to define "get_payload" and "as_list" methods, which look ideal for constructing the data frame; however, neither the GPB::Message class nor the S4_Message class seems to expose either method on its instances. Where, then, are those methods defined?</div><div><br></div><div>Thanks for any information.</div><div>- Josh</div><div><br></div><span id="OLK_SRC_BODY_SECTION"><div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt"><span style="font-weight:bold">From: </span> Josh Hansen &lt;<a href="mailto:johansen@adobe.com">johansen@adobe.com</a>&gt;<br><span style="font-weight:bold">Date: </span> Thursday, February 28, 2013 6:08 PM<br><span style="font-weight:bold">To: </span> "<a href="mailto:rprotobuf-yada@lists.r-forge.r-project.org">rprotobuf-yada@lists.r-forge.r-project.org</a>" &lt;<a href="mailto:rprotobuf-yada@lists.r-forge.r-project.org">rprotobuf-yada@lists.r-forge.r-project.org</a>&gt;<br><span style="font-weight:bold">Subject: </span> Extending RProtoBuf to read elephant-bird-style block-serialized protobufs<br></div><div><br></div><div><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: 
rgb(0, 0, 0); "><div style="font-size: 14px; font-family: Calibri, sans-serif; ">I recently asked about using writeDelimitedTo(...) in RProtoBuf. Since then, I've decided against that approach and am instead looking into the block serialization format used by the Java elephant-bird
library, which aims to make Hadoop work with protocol buffer data. I'm pleased with what that gives me (support for Hadoop MapReduce, Hive, Pig, etc.; splittability; small serialization size; compression). However, my use case still requires data to be usable
in R. To that end, I'm interested in extending the RProtoBuf library to read elephant-bird block-serialized protocol buffers. Is RProtoBuf the right place to implement this capability? If so, what design guidance can you give me?</div><div style="font-size: 14px; font-family: Calibri, sans-serif; "><br></div><div><span class="Apple-style-span" style="font-size: 14px; font-family: Calibri, sans-serif; ">A bit more information about the block serialization format can be found in the
<a href="https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockReader.java">
BinaryBlockReader</a> and <a href="https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/io/BinaryBlockWriter.java">
BinaryBlockWriter</a> classes, and in <a href="https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/protobuf/block_storage.proto">
block_storage.proto</a>. block_storage.proto defines the SerializedBlock message. </span>BinaryBlockWriter packs serialized messages of the target type into the `repeated bytes proto_blobs = 3;` field of SerializedBlock, spreading them across multiple SerializedBlocks.
The .proto file gives this example:</div><div><span class="Apple-tab-span" style="white-space:pre"></span>SerializedBlock block = SerializedBlock.newBuilder().setVersion(1)</div><div>&nbsp;&nbsp;&nbsp;&nbsp;.setProtoClassName(Status.class.getName())</div><div>&nbsp;&nbsp;&nbsp;&nbsp;.addProtoBlobs(status1.toByteString())</div><div>&nbsp;&nbsp;&nbsp;&nbsp;.addProtoBlobs(status2.toByteString())</div><div>&nbsp;&nbsp;&nbsp;&nbsp;.build();</div><div><br></div><div>Each SerializedBlock is then serialized in standard protobuf fashion and written to the output stream together with a delimiter byte sequence and size information.</div><div><br></div><div>At any rate, porting the reader and writer classes to C++ would probably not be too difficult. The hard part, from my perspective, is connecting that code to R in general and to RProtoBuf in particular. Thoughts?</div><div>- Josh</div></div></div></span></body></html>