<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">Some tests you can run without changing much code:</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">1) Convert your data.table to a matrix before writing.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">2) Use write.table with these settings to save space and time:</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)"><div class="gmail_default">-> sep = "|" (the pipe is 1 byte and rarely appears in data)</div><div class="gmail_default">-> quote = FALSE (saves writing "" more than a billion times)</div><div class="gmail_default">-> latin1 instead of UTF-8</div></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">3) Write in chunks (that is, slice the output) with append=TRUE (this may work in parallel).</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">If that is still too slow, try installing a database (e.g. SQLite) on your 24-core system and loading the data into it.</div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)"><br></div><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">Hope this helps.</div></div><div class="gmail_extra"><br><div class="gmail_quote">2015-03-23 14:49 GMT+01:00 Gerald Jean <span dir="ltr">&lt;<a href="mailto:gerald.jean@dgag.ca" target="_blank">gerald.jean@dgag.ca</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div lang="FR-CA" link="blue" vlink="purple">
<div>
<p class="MsoNormal">Hello,<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal"><span lang="EN-US">I am currently on a project where I have to read, process and aggregate 10 to 12 million files, for roughly 10 billion lines of data.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">The files are arranged in roughly 64,000 directories; each directory holds one client’s data.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">I have written code that imports and “massages” the data per directory. The code is data.table driven. I am running this on a 24-core Linux machine (Red Hat) with 145 GB of RAM.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">For testing purposes I have parallelized the code using the doMC package; it runs fine and seems to be fast. But I haven’t yet tried to output the resulting files, three per client: a small one, a moderate-size one and
a large one, estimated at over 500 GB.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">My question:<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">What is the best way to output those files without creating bottlenecks?<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">I thought of breaking the list of input directories into 24 threads, supplying a list of lists to “foreach” where one of the components of each sub-list would be the name of the output files, but I am worried that “write.table”
would take forever to write this data to disk. One solution would be to use “save” and keep the output data in RData format, but that complicates further analysis by other software.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Any suggestions?<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">By the way, “data.table” has certainly helped so far in processing that data. Thanks to the developers for such an efficient package.<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
<p class="MsoNormal"><span lang="EN-US">Gérald<u></u><u></u></span></p>
<p class="MsoNormal"><span lang="EN-US"><u></u> <u></u></span></p>
</div>
</div>
<br>_______________________________________________<br>
datatable-help mailing list<br>
<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br></blockquote></div><br></div>
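<div dir="ltr"><div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">Suggestions 2) and 3) from the reply at the top could look roughly like the sketch below. It is only an illustration: the table dt, the output file name and the chunk size are made up for the example, and the chunks are written one after another because several workers appending to the same file at once could interleave their output (one file per worker is probably the safer layout for a parallel run).</div>
<pre>
library(data.table)

# Hypothetical per-client result; in practice this comes from the processing step.
dt &lt;- data.table(client = rep("c001", 6),
                 day    = 1:6,
                 value  = c(1.5, 2, 3.25, 4, 5.5, 6))

out_file   &lt;- file.path(tempdir(), "client_c001_large.psv")  # hypothetical name
chunk_rows &lt;- 4L                                             # tiny, for illustration only

# Write the table slice by slice: the first slice creates the file and writes
# the header, every later slice is appended without a header.
for (s in seq(1L, nrow(dt), by = chunk_rows)) {
  e     &lt;- min(s + chunk_rows - 1L, nrow(dt))
  first &lt;- s == 1L
  write.table(dt[s:e],
              file         = out_file,
              sep          = "|",       # 1-byte separator, rarely present in data
              quote        = FALSE,     # no "" around character fields
              row.names    = FALSE,
              col.names    = first,     # header only once
              append       = !first,    # overwrite on first slice, append afterwards
              fileEncoding = "latin1")
}
</pre>
<div class="gmail_default" style="font-family:tahoma,sans-serif;color:rgb(0,51,51)">With quote = FALSE the file is only safe to read back if no field contains the separator, so it is worth checking that "|" never occurs in the character columns before relying on this.</div></div>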