[datatable-help] Fwd: fread on very large file

Thu May 2 11:19:22 CEST 2013

Some supplementary information, here is the portion of the file (with row
numbers, +1 for header) around where fread thinks the file ends.

$ nl spd_all_fixed.csv | head -n 9186300 |tail
9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0

9186294 (row 9186293 excl header) is where fread thinks the file ends,
mid-line by the look of it!
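
To see where that line sits in bytes rather than in rows, a small shell sketch like the following could be used. It assumes the same cygwin tools as above; `byte_offset_of_line` is a hypothetical helper name, not an existing command.

```shell
# Print the number of bytes contained in lines 1..N of a file,
# i.e. the byte offset at which line N+1 starts.
# byte_offset_of_line is a hypothetical helper, not part of any tool.
byte_offset_of_line() {
  file=$1
  n=$2
  # head keeps the first N lines; wc -c counts their bytes (CRLF included).
  head -n "$n" "$file" | wc -c | tr -d ' '
}
```

For example, `byte_offset_of_line spd_all_fixed.csv 9186293` would give the number of bytes before the start of line 9186294, which can then be compared with the file size.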

I've experimented by truncating the file. The error varies: either it reads
too few records or it gives the error I reported, presumably depending on
whether the last perceived line is complete.
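
Truncated copies of this kind can be produced along the following lines. This is a sketch under assumptions, not necessarily the commands used here; `truncate_rows` and `size_vs_4gib` are hypothetical helper names.

```shell
# Write the header line plus the first N data rows of a CSV to a new file.
# truncate_rows is a hypothetical helper, not an existing command.
truncate_rows() {
  src=$1
  n=$2
  dst=$3
  head -n "$((n + 1))" "$src" > "$dst"
}

# Report whether a file is below or at/above 4GiB (2^32 bytes).
# Assumes a shell whose integer tests handle 64-bit values, such as bash.
size_vs_4gib() {
  bytes=$(wc -c < "$1" | tr -d ' ')
  if [ "$bytes" -ge 4294967296 ]; then
    echo "at or over 4GiB"
  else
    echo "under 4GiB"
  fi
}
```

Running `truncate_rows` with successively larger row counts and checking each result with `size_vs_4gib` is one way to bracket the size at which the behaviour changes.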

The problem arises when the file reaches 4GB, in this case between
80,300,000 and 80,400,000 rows:

-rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May  1 12:02 spd_all_trunc_8030k.csv
-rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May  1 12:06 spd_all_trunc_8040k.csv
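
A 4GB threshold is what one would expect if a byte count were being held in 32 bits somewhere. That is only a guess and nothing in this thread confirms it. The arithmetic can be checked with the wc figures for the full file quoted further down (9078155125 bytes, 168997637 newlines). `wrapped_size` is a hypothetical helper name, and the sketch assumes a shell with 64-bit arithmetic such as bash.

```shell
# Value a byte count would take if stored in an unsigned 32-bit integer.
# wrapped_size is a hypothetical helper, used only for this illustration.
wrapped_size() {
  echo $(( $1 % 4294967296 ))
}

total_bytes=9078155125   # byte count reported by wc for spd_all_fixed.csv
total_lines=168997637    # newline count reported by wc for the same file

wrapped=$(wrapped_size "$total_bytes")
avg=$(( total_bytes / total_lines ))   # integer average bytes per line

echo "size modulo 2^32:       $wrapped bytes"
echo "average line length:    about $avg bytes"
echo "lines fitting in that:  about $(( wrapped / avg ))"
```

This gives 488220533 bytes, about 53 bytes per line, and about 9211708 lines, which is in the same region as the 9186293 rows fread counts for the full file. Line lengths vary, so the match is only approximate and suggestive at best.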

> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 80300000
Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Type codes: 000002000 (+last 5 rows)
0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) Sep and header detection
   0.000s (  0%) Count rows (wc -l)
   0.000s (  0%) Colmn type detection (first, middle and last 5 rows)
   0.000s (  0%) Allocation of 80299999x9 result (xMB) in RAM
 171.188s ( 65%) Reading data
1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
-1365231.809s (-518439%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.000s        Total

> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 18913
Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
Type codes: 000002000 (first 5 rows)
Type codes: 000002000 (+middle 5 rows)
Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
  Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
204650,724540,

Regards,
Paul

On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com> wrote:

> Here is the verbose output:
>
> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
> Count of eol after first data row: 9186293
> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
> data rows
> Type codes: 000002000 (first 5 rows)
> Type codes: 000002200 (+middle 5 rows)
> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>   Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
> 204038,2617097,20110803,0,0
>
> But here is the wc output (via cygwin; newline, word (whitespace delim so
> each word one 'line' here), byte):
> $ wc spd_all_fixed.csv
>  168997637  168997638 9078155125 spd_all_fixed.csv
>
> [So fread counts 9M rows, wc 168M.]
>
> Regards
> Paul
>
>
> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> Hi,
>>
>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>> output.
>>
>> Thanks, Matthew
>>
>>
>>
>> On 30.04.2013 18:01, Paul Harding wrote:
>>
>>  Problem with fread on a large file
>> The file is 8GB, just short of 200,000 lines, produced as SQL output and
>> modified by cygwin/perl to remove the second line.
>>  Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>>  fread("data/spd_all_fixed.csv",sep=",")
>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>   Expected sep (',') but '0' ends field 5 on line 6 when detecting types:
>> 204038,2617097,20110803,0,0
>> Looking for the offending line, with line numbers in the output (I'm
>> guessing this is line 6 of the mid-file chunk examined):
>>  $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>> and comparing to surrounding lines and the first ten lines
>>  $ head  spd_all_fixed.csv
>> s_key,i_key,p_key,q,pq,d,l,epi,class
>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>> I can't see any difference. I wonder if this is a bug? I have no problems
>> on a small test data set run through an identical process and using the
>> same fread command.
>> Regards
>> Paul
>>
>>
>>
>>
>
>