[datatable-help] fread on very large file

Paul Harding p.harding at paniscus.com
Wed May 1 11:28:52 CEST 2013


Here is the verbose output:

> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the
first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row
of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 9186293
Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002200 (+middle 5 rows)
Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
  Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
204038,2617097,20110803,0,0

But here is the wc output (via cygwin; the columns are newline count, word
count (whitespace-delimited, so each word is one 'line' here), and byte count):
$ wc spd_all_fixed.csv
 168997637  168997638 9078155125 spd_all_fixed.csv

[So fread counts about 9M rows, whereas wc counts about 169M.]
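
A rough sanity check on the two counts, as a sketch only. It assumes (this
is not confirmed anywhere in this thread) that the file size was truncated
to 32 bits somewhere along the way, which is one way a file over 4GB can
appear much shorter than it is. The numbers are copied from the wc and fread
output above; the variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Figures copied from the wc output and the fread verbose output above.
size=9078155125        # bytes in spd_all_fixed.csv according to wc
wc_lines=168997637     # newline count according to wc
fread_lines=9186293    # eol count reported by fread

# ASSUMPTION (hypothetical): the size was reduced modulo 2^32 at some point.
wrapped=$(( size % (1 << 32) ))
echo "size modulo 2^32: $wrapped bytes"

# Average bytes per line (integer division) for the whole file and for
# the hypothetical wrapped prefix.
echo "bytes/line, whole file:     $(( size / wc_lines ))"
echo "bytes/line, wrapped prefix: $(( wrapped / fread_lines ))"
```

Under that assumption the wrapped size is 488220533 bytes, and both
averages come out at about 53 bytes per line, so a 488MB prefix would hold
roughly the 9M rows that fread counted. That is suggestive, not proof.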

Regards
Paul


On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> Hi,
>
> Thanks for reporting this. Please set verbose=TRUE and let us know the
> output.
>
> Thanks, Matthew
>
>
>
> On 30.04.2013 18:01, Paul Harding wrote:
>
>  Problem with fread on a large file
> The file is 8GB, just short of 200,000 lines, produced as SQL output and
> modified by cygwin/perl to remove the second line.
>  Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>  fread("data/spd_all_fixed.csv",sep=",")
> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>   Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
> 204038,2617097,20110803,0,0
> Looking for the offending line, with line numbers in the output (I'm guessing
> the error refers to line 6 of the mid-file chunk examined),
>  $ grep -n '204038,2617097,201108' spd_all_fixed.csv
> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
> and comparing to surrounding lines and the first ten lines
>  $ head  spd_all_fixed.csv
> s_key,i_key,p_key,q,pq,d,l,epi,class
> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
> I can't see any difference. I wonder if this is a bug? I have no problems
> on a small test data set run through an identical process and using the
> same fread command.
> Regards
> Paul
>
>
>
>