[datatable-help] Best way to export Huge data sets.

Nicolas Paris niparisco at gmail.com
Mon Mar 23 22:50:33 CET 2015


Some things you can test without many code changes:
1) convert your data.table to a matrix before writing
2) use write.table with these settings to save space and time:
-> sep = "|" (pipe: 1 byte and rarely found in data)
-> quote = FALSE (saves writing "" more than a billion times)
-> latin1 encoding instead of UTF-8
3) write in chunks (i.e. cut the output into slices) with append=TRUE (this
may work in parallel)
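
Points 2 and 3 could be sketched roughly as below. This is only an
illustration, not tested at the scale discussed here: the function name
write_chunked, the chunk_rows argument and the demo data are made up for the
example, and it assumes the input is a data.frame or matrix (for a data.table,
x[rows] does the same row slicing).

```r
# Hypothetical helper: write `x` to `path` in slices of `chunk_rows` rows,
# using a pipe separator, no quoting and latin1 output encoding.
write_chunked <- function(x, path, chunk_rows = 1e6) {
  n <- nrow(x)
  if (n == 0L) return(invisible(path))
  starts <- seq(1L, n, by = chunk_rows)
  for (i in seq_along(starts)) {
    rows  <- starts[i]:min(starts[i] + chunk_rows - 1L, n)
    first <- i == 1L
    write.table(
      x[rows, , drop = FALSE],
      file         = path,
      append       = !first,   # first slice creates the file, the rest append
      col.names    = first,    # write the header only once
      row.names    = FALSE,
      sep          = "|",
      quote        = FALSE,
      fileEncoding = "latin1"
    )
  }
  invisible(path)
}

# Small demo with made-up data
df  <- data.frame(id = 1:5, name = c("a", "b", "c", "d", "e"),
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".psv")
write_chunked(df, out, chunk_rows = 2)
lines_written <- readLines(out)
```

If the slices are written by parallel workers, appending to one shared file
may interleave output; giving each worker its own part file and concatenating
afterwards is probably the safer variant.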

If that is still too slow, try installing a database (e.g. SQLite) on your
24-core system and loading the data into it.
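
A minimal sketch of the database route, assuming the DBI and RSQLite packages
are installed (neither is mentioned above, they are just one common way to
reach SQLite from R). The table name "results" and the two small chunks are
invented for the example.

```r
library(DBI)

# Assumption: an on-disk SQLite file; a temporary path is used here.
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Made-up per-client results, appended chunk by chunk
chunk1 <- data.frame(client = c("c1", "c1"), value = c(1.5, 2.5),
                     stringsAsFactors = FALSE)
chunk2 <- data.frame(client = "c2", value = 3.0,
                     stringsAsFactors = FALSE)

# append = TRUE creates the table on the first call and adds rows afterwards
dbWriteTable(con, "results", chunk1, append = TRUE)
dbWriteTable(con, "results", chunk2, append = TRUE)

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM results")$n
dbDisconnect(con)
```

Note that SQLite allows only one writer at a time, so parallel workers would
have to take turns (or write to separate database files).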

Hope this helps.

2015-03-23 14:49 GMT+01:00 Gerald Jean <gerald.jean at dgag.ca>:

>  Hello,
>
>
>
> I am currently working on a project where I have to read, process and
> aggregate 10 to 12 million files, for roughly 10 billion lines of data.
>
>
>
> The files are arranged in roughly 64,000 directories; each directory holds
> one client’s data.
>
>
>
> I have written code importing and “massaging” the data per directory.  The
> code is data.table driven.  I am running this on a 24-core machine with
> 145 GB of RAM, a Linux box running RedHat.
>
>
>
> For testing purposes I have parallelized the code using the doMC package;
> it runs fine and seems to be fast.  But I haven’t yet tried to output the
> resulting files, three per client: a small one, a moderate-size one and a
> large one, over 500 GB estimated.
>
>
>
> My question:
>
>
>
> What is the best way to output those files without creating bottlenecks?
>
>
>
> I thought of breaking the list of input directories into 24 threads,
> supplying a list of lists to “foreach” where one of the components of each
> sub-list would be the name of the output files, but I am worried that
> “write.table” would take forever to write this data to disk.  One solution
> would be to use “save” and keep the output data in Rdata format, but that
> complicates further analysis by other software.
>
>
>
> Any suggestions???
>
>
>
> By the way, “data.table” sure helped so far in processing that data; thanks
> to the developers for such an efficient package.
>
>
>
> Gérald
>
>
>
> Gerald Jean, M.Sc. in Statistics
> Senior Statistical Advisor
>
> Corporate Actuarial,
> Modelling and Research
> Property and Casualty Insurance
> Mouvement Desjardins