[datatable-help] Best way to export Huge data sets.

Gerald Jean gerald.jean at dgag.ca
Tue Mar 24 13:03:21 CET 2015


Hello Nicolas,

Thanks for your suggestions.  The data.table can’t be converted to a matrix because it has mixed column types: POSIXct, character, logical, factor and numeric.
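To illustrate the matrix problem, here is a minimal sketch. It uses a base R data.frame as a stand-in for the data.table, and the column names are made up. A matrix can hold only one storage type, so as.matrix() falls back to character for every column:

```r
# Hypothetical mixed-type table; a plain data.frame stands in for the data.table.
df <- data.frame(
  when   = as.POSIXct("2015-03-24 13:03:21", tz = "UTC"),
  client = "A001",
  active = TRUE,
  region = factor("QC"),
  amount = 12.5,
  stringsAsFactors = FALSE
)

m <- as.matrix(df)

# A matrix holds a single type, so the mixed columns are all coerced to character.
print(typeof(m))

# The original column classes, for comparison.
print(sapply(df, function(col) class(col)[1]))
```

The numeric, logical and date-time information survives only as text, which is why this route does not help here.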

The admin is currently installing PostgreSQL on the server, so I’ll try that route.  Too bad data.table doesn’t yet have a writing routine as fast as “fread” is for reading!

Thanks,

Gérald


Gerald Jean, M.Sc. in Statistics
Senior Statistical Advisor

Corporate Actuarial,
Modelling and Research
Property and Casualty Insurance
Mouvement Desjardins

Lévis (head office)

418 835-4900, ext. 5527639
1 877 835-4900, ext. 5527639
Fax: 418 835-6657

From: Nicolas Paris [mailto:niparisco at gmail.com]
Sent: 23 March 2015 17:51
To: Gerald Jean
Cc: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] Best way to export Huge data sets.

Some tests you can run without much code change:
1) convert your data.table to a matrix before writing
2) use write.table with these settings to save space and time:
-> sep = pipe (1 byte, and rarely found in data)
-> disable quoting (saves writing "" more than a billion times)
-> latin1 instead of UTF-8
3) write in chunks (slice the output) with append=TRUE (this may work in parallel)
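Suggestions 2) and 3) could be combined roughly as follows. This is only a sketch: write_in_chunks and chunk_rows are hypothetical names, and a base R data.frame is assumed as input (for a data.table, dt[rows] would be the equivalent subset).

```r
# Hypothetical helper: write a table in row chunks using the settings above.
# Assumes a data.frame; for a data.table use dt[rows] for the subset instead.
write_in_chunks <- function(dt, path, chunk_rows = 1e6) {
  n <- nrow(dt)
  if (n == 0L) return(invisible(path))
  starts <- seq(1, n, by = chunk_rows)
  for (i in seq_along(starts)) {
    rows <- starts[i]:min(starts[i] + chunk_rows - 1, n)
    write.table(
      dt[rows, , drop = FALSE],
      file         = path,
      sep          = "|",       # one-byte separator, rare in real data
      quote        = FALSE,     # no quotes around strings
      row.names    = FALSE,
      col.names    = (i == 1),  # header only with the first chunk
      append       = (i > 1),   # later chunks are appended to the same file
      fileEncoding = "latin1"
    )
  }
  invisible(path)
}

# Small usage example with made-up data.
df  <- data.frame(id = 1:5, name = c("a", "b", "c", "d", "e"),
                  stringsAsFactors = FALSE)
out <- tempfile(fileext = ".txt")
write_in_chunks(df, out, chunk_rows = 2)
print(readLines(out))
```

Note that this loop appends sequentially. If the chunks were written in parallel, each worker would presumably need its own file, since concurrent appends to one file can interleave.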

If it is still too slow, try installing a database (SQLite) on your 24-core system and loading the data into it.

Hope this helps.

2015-03-23 14:49 GMT+01:00 Gerald Jean <gerald.jean at dgag.ca>:
Hello,

I am currently working on a project where I have to read, process and aggregate 10 to 12 million files, for roughly 10 billion lines of data.

The files are arranged in roughly 64,000 directories; each directory holds one client’s data.

I have written code that imports and “massages” the data one directory at a time.  The code is data.table driven.  I am running it on a 24-core machine with 145 GB of RAM, under Red Hat Linux.

For testing purposes I have parallelized the code using the doMC package; it runs fine and seems to be fast.  But I haven’t yet tried to write out the resulting files, three per client: a small one, a moderately sized one and a large one, estimated at over 500 GB.

My question:

What is the best way to output those files without creating bottlenecks?

I thought of splitting the list of input directories across 24 workers, supplying a list of lists to “foreach”, where one component of each sub-list would be the names of the output files.  But I am worried that “write.table” would take forever to write this data to disk.  One alternative would be to use “save” and keep the output in RData format, but that complicates further analysis with other software.
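The per-worker output scheme described above might look like the sketch below. It swaps in parallel::mclapply (which ships with R) for doMC/foreach so that the example has no extra dependencies; process_client, the client names and the output directory are all hypothetical placeholders.

```r
library(parallel)  # ships with R; used here in place of doMC/foreach

# Hypothetical per-client step: the real import/aggregation code would go here.
process_client <- function(client_id) {
  data.frame(client = client_id, value = seq_len(3))
}

out_dir <- file.path(tempdir(), "client_output")
dir.create(out_dir, showWarnings = FALSE)

clients <- sprintf("client_%02d", 1:4)  # made-up client directory names

# Each worker writes its own file, so no two processes share an output file.
# mclapply relies on forking, so mc.cores > 1 only works on Unix-like systems.
paths <- mclapply(clients, function(id) {
  res  <- process_client(id)
  path <- file.path(out_dir, paste0(id, ".txt"))
  write.table(res, path, sep = "|", quote = FALSE, row.names = FALSE)
  path
}, mc.cores = 2)  # would be 24 on the machine described above

print(unlist(paths))
```

Whether this avoids a bottleneck depends on the disk: separate files remove contention between writers, but all workers still share the same I/O bandwidth.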

Any suggestions?

By the way, “data.table” has certainly helped so far in processing this data; thanks to the developers for such an efficient package.

Gérald

_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
