[datatable-help] Fwd: fread on very large file

Fri May 3 15:32:16 CEST 2013

Definitely a 64-bit machine. Here are the details:

Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
Installed memory (RAM): 128GB
System type: 64-bit Operating System
Windows edition: Server 2008 R2 Enterprise SP1
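
A quick sanity check on the numbers further down, assuming (and this is only a guess, not something confirmed from the fread source) that the file size ends up in a 32-bit value somewhere and so wraps at 4GB (2^32 bytes). Taking the byte and newline counts from the wc output quoted below, the wrapped size and the number of rows it would cover can be estimated in the shell:

```shell
# Figures taken from the wc output quoted later in this thread.
size=9078155125      # bytes in spd_all_fixed.csv
lines=168997637      # newline count for the same file

# Assumption (hypothetical): the size is truncated to 32 bits,
# i.e. only size modulo 2^32 is seen by the reader.
two32=4294967296
wrapped=$(( size % two32 ))

# Rows that would fit in the wrapped size at the file's average row length.
est_rows=$(( wrapped * lines / size ))

echo "wrapped size:   $wrapped bytes"
echo "estimated rows: $est_rows"
```

The wrapped size comes out at 488220533 bytes, which at the file's average row length is roughly 9.09 million rows. That is in the same region as the 9186293 rows fread counted, so the figures at least look compatible with a wrap at 4GB.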

Regards,
Paul

On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> Hi Paul,
>
> Thanks for all this!
>
> >  The problem arises when the file reaches 4GB, in this case between
> 8,030,000 and 8,040,000 rows:
>
> Ahah.  Are you using a 32bit or 64bit Windows machine?
>
> Thanks, Matthew
>
>
>
> On 02.05.2013 10:19, Paul Harding wrote:
>
> Some supplementary information, here is the portion of the file (with row
> numbers, +1 for header) around where fread thinks the file ends.
>   $ nl spd_all_fixed.csv | head -n 9186300 |tail
> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>  9186294 (row 9186293 excl header) is where fread thinks the file ends,
> mid-line by the look of it!
> I've experimented by truncating the file. The error varies: either it
> reads too few records or it gives the error I reported, presumably depending
> on whether the last perceived line is complete.
> The problem arises when the file reaches 4GB, in this case between
> 8,030,000 and 8,040,000 rows:
>  -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May  1 12:02
> spd_all_trunc_8030k.csv
> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May  1 12:06
> spd_all_trunc_8040k.csv
>  > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>  Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
>  Count of eol after first data row: 80300000
> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
> data rows
>  Type codes: 000002000 (first 5 rows)
>  Type codes: 000002000 (+middle 5 rows)
> Type codes: 000002000 (+last 5 rows)
> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
> '0.42634430000000001'
> Bumping column 7 from INT64 to REAL on data row 9, field contains
> '0.42634430000000001'
>    0.000s (  0%) Memory map (rerun may be quicker)
>    0.000s (  0%) Sep and header detection
>    0.000s (  0%) Count rows (wc -l)
>    0.000s (  0%) Colmn type detection (first, middle and last 5 rows)
>    0.000s (  0%) Allocation of 80299999x9 result (xMB) in RAM
>  171.188s ( 65%) Reading data
> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
> time if triggered
> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>    0.000s (  0%) Changing na.strings to NA
>    0.000s        Total
> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>  Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Looking for supplied sep ',' on line 30 (the last non blank line in the
> first 30) ... found
> Found 9 columns
> First row with 9 fields occurs on line 1 (either column names or first row
> of data)
> All the fields on line 1 are character fields. Treating as the column
> names.
>  Count of eol after first data row: 18913
> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data
> rows
>  Type codes: 000002000 (first 5 rows)
>  Type codes: 000002000 (+middle 5 rows)
> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>   Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
> 204650,724540,
>  Regards,
> Paul
>
>
> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com> wrote:
>
>> Here is the verbose output:
>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>> first 30) ... found
>> Found 9 columns
>> First row with 9 fields occurs on line 1 (either column names or first
>> row of data)
>> All the fields on line 1 are character fields. Treating as the column
>> names.
>> Count of eol after first data row: 9186293
>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>> data rows
>> Type codes: 000002000 (first 5 rows)
>> Type codes: 000002200 (+middle 5 rows)
>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>    Expected sep (',') but '0' ends field 5 on line 6 when detecting
>> types: 204038,2617097,20110803,0,0
>>  But here is the wc output (via cygwin; the columns are newline, word and
>> byte counts, and since words are whitespace-delimited each line counts as
>> one word here):
>>  $ wc spd_all_fixed.csv
>>  168997637  168997638 9078155125 spd_all_fixed.csv
>> [So fread sees 9M rows, wc 168M.]
>> Regards
>> Paul
>>
>>
>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>>
>>>
>>> Hi,
>>>
>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>> output.
>>>
>>> Thanks, Matthew
>>>
>>>
>>>
>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>
>>>  Problem with fread on a large file
>>> The file is 8GB, just short of 170 million lines, produced as SQL output
>>> and modified by cygwin/perl to remove the second line.
>>>  Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>>>  fread("data/spd_all_fixed.csv",sep=",")
>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>   Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>> types: 204038,2617097,20110803,0,0
>>> Looking for the offending line, with line numbers in the output (I'm
>>> guessing this is line 6 of the mid-file chunk examined):
>>>  $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>> and comparing to surrounding lines and the first ten lines
>>>  $ head  spd_all_fixed.csv
>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>> I can't see any difference. I wonder if this is a bug? I have no
>>> problems on a small test data set run through an identical process and
>>> using the same fread command.
>>> Regards
>>> Paul
>>>
>>>
>>>
>>>
>>
>
>