[datatable-help] Best way to export Huge data sets.

Nicolas Paris niparisco at gmail.com
Tue Mar 24 13:34:35 CET 2015


> Thanks for your suggestions.  The data.table can’t be converted to a
> matrix as it is of mixed types: POSIXct, character, logical, factor and
> numeric columns.


What about casting every column to character? CSV makes no distinction
between types anyway, since quoting is disabled in the configuration I
proposed.
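
A minimal sketch of that idea in base R, under stated assumptions: the table `dt`, its column names, the helper `to_chr` and the tiny chunk size are all made up for illustration. Every column is cast to character (timestamps through an explicit format), then written pipe-separated, unquoted, as latin1, in chunks appended to one file.

```r
# Hypothetical mixed-type table standing in for the real data.
dt <- data.frame(
  when  = as.POSIXct(c("2015-03-24 13:34:35", "2015-03-23 17:51:00"), tz = "UTC"),
  name  = c("a", "b"),
  flag  = c(TRUE, FALSE),
  level = factor(c("low", "high")),
  value = c(1.5, 2.25),
  stringsAsFactors = FALSE
)

# Cast one column to character; timestamps get an explicit, stable format.
to_chr <- function(x) {
  if (inherits(x, "POSIXct")) {
    format(x, "%Y-%m-%d %H:%M:%S", tz = "UTC")
  } else {
    as.character(x)
  }
}
dt_chr <- as.data.frame(lapply(dt, to_chr), stringsAsFactors = FALSE)

# Write in chunks: header with the first chunk only, append afterwards.
out <- tempfile(fileext = ".psv")
chunk_size <- 1L  # tiny only for this demo; use a large value on real data
for (s in seq(1L, nrow(dt_chr), by = chunk_size)) {
  rows  <- s:min(s + chunk_size - 1L, nrow(dt_chr))
  first <- s == 1L
  write.table(dt_chr[rows, , drop = FALSE], file = out,
              sep = "|", quote = FALSE, row.names = FALSE,
              col.names = first, append = !first,
              fileEncoding = "latin1")
}
lines <- readLines(out)
```

With quoting disabled, a field that itself contains a pipe or a newline would corrupt the file, so this only suits data known to be free of them.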

As for PostgreSQL, I use it, and the fastest way to load data is the COPY
statement. I load 7 GB of data in 5 minutes, but COPY still needs a CSV
file as its source. A "binary" file can be used too, but I have never
tried that.
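
For illustration only, a COPY statement for a pipe-delimited file with a header row could be assembled from R as below. The helper `copy_sql`, the table name `client_data` and the file path are hypothetical, and the statement is only built here, not sent to a server.

```r
# Build a PostgreSQL COPY statement for a pipe-delimited file with a header.
# Table name and path are placeholders; the file must be readable by the
# PostgreSQL server process for a server-side COPY to work.
copy_sql <- function(table, path) {
  sprintf("COPY %s FROM '%s' WITH (FORMAT csv, DELIMITER '|', HEADER true);",
          table, path)
}

sql <- copy_sql("client_data", "/tmp/client_data.psv")
cat(sql, "\n")
```

Note that in csv format PostgreSQL still treats the double quote as its quote character, so unquoted output is only safe when the data contains no double quotes.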

The RSQLite package could help too; some people use it instead of writing
CSV files. I have never tried that either.
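
An untested-at-scale sketch of that route, assuming the DBI and RSQLite packages are installed; the table name `client_data` and the two small chunks are made up, and an in-memory database stands in for a real file.

```r
library(DBI)  # assumes DBI and RSQLite are installed

# In-memory database for the demo; pass a file path for real data.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Two hypothetical chunks of processed data.
chunk1 <- data.frame(id = 1:2, value = c(1.5, 2.25))
chunk2 <- data.frame(id = 3:4, value = c(3.0, 4.75))

# The first call creates the table, later calls append to it.
dbWriteTable(con, "client_data", chunk1)
dbWriteTable(con, "client_data", chunk2, append = TRUE)

n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM client_data")$n
dbDisconnect(con)
```

SQLite allows only one writer at a time on a database file, so parallel workers would have to take turns or each write to their own file.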

2015-03-24 13:03 GMT+01:00 Gerald Jean <gerald.jean at dgag.ca>:

> Hello Nicolas,
>
> thanks for your suggestions.  The data.table can’t be converted to a
> matrix as it is of mixed types: POSIXct, character, logical, factor and
> numeric columns.
>
> The admin is currently installing PostgreSQL on the server; I’ll try to go
> that route.  Too bad data.table doesn’t yet have a writing routine as
> fast as “fread” is for reading!
>
>
>
> Thanks,
>
>
>
> Gérald
>
>
>
> *Gerald Jean, M.Sc. in Statistics*
> Senior Statistical Advisor
>
> Corporate Actuarial Services,
> Modelling and Research
> Property and Casualty Insurance
> Mouvement Desjardins
>
> Lévis (head office)
> 418 835-4900, ext. 5527639
> 1 877 835-4900, ext. 5527639
> Fax: 418 835-6657
>
> Make a good impression and print only when necessary!
>
> This email is confidential, may be protected by professional secrecy and
> is addressed exclusively to the recipient. Any other person is strictly
> prohibited from disseminating, distributing or reproducing this message.
> If you have received it in error, please destroy it immediately and
> notify the sender. Thank you.
>
>
>
>
>
> *From:* Nicolas Paris [mailto:niparisco at gmail.com]
> *Sent:* 23 March 2015 17:51
> *To:* Gerald Jean
> *Cc:* datatable-help at lists.r-forge.r-project.org
> *Subject:* Re: [datatable-help] Best way to export Huge data sets.
>
>
>
> Some tests you can run without many code changes:
>
> 1) convert your data.table to a matrix before writing
>
> 2) use write.table with this configuration to save space and time:
>
> -> sep = pipe (1 byte and rarely used in data)
>
> -> disable quoting (saves writing "" more than a billion times)
>
> -> latin1 instead of UTF-8
>
> 3) write in chunks (slice the output) with append=T (this may work in
> parallel)
>
>
>
> If that is still too slow, try installing a database (SQLite) on your
> 24-core system and loading the data into it
>
>
>
> hope this helps
>
>
>
> 2015-03-23 14:49 GMT+01:00 Gerald Jean <gerald.jean at dgag.ca>:
>
> Hello,
>
>
>
> I am currently on a project where I have to read, process and aggregate
> 10 to 12 million files, for roughly 10 billion lines of data.
>
>
>
> The files are arranged in roughly 64,000 directories; each directory holds
> one client’s data.
>
>
>
> I have written code importing and “massaging” the data per directory.  The
> code is data.table driven.  I am running this on a 24-core Linux machine
> (RedHat) with 145 GB of RAM.
>
>
>
> For testing purposes I have parallelized the code using the doMC package;
> it runs fine and seems to be fast.  But I haven’t yet tried to output the
> resulting files, three per client: a small one, a moderate-size one and a
> large one, over 500 GB estimated.
>
>
>
> My question:
>
>
>
> What is the best way to output those files without creating bottlenecks?
>
>
>
> I thought of breaking the list of input directories into 24 threads,
> supplying a list of lists to “foreach” where one of the components of each
> sub-list would be the names of the output files, but I am worried that
> “write.table” would take forever to write this data to disk.  One solution
> would be to use “save” and keep the output data in RData format, but that
> complicates further analysis by other software.
>
>
>
> Any suggestions?
>
>
>
> By the way, “data.table” sure has helped so far in processing that data;
> thanks to the developers for such an efficient package.
>
>
>
> Gérald
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>