[datatable-help] Fwd: fread on very large file
Matthew Dowle
mdowle at mdowle.plus.com
Fri May 3 15:59:16 CEST 2013
Oh. Then it's likely a bug with fread on Windows for files > 4GB.
Think GetFileSize() should be GetFileSizeEx(), iirc. Please could you
file it as a bug on the tracker. Thanks.

Matthew
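
A quick arithmetic check, offered only as a sketch, is consistent with that guess. It assumes the file size is being truncated to an unsigned 32-bit value (a guess at the failure mode, not something confirmed from the fread source). Under that assumption the 9,078,155,125 bytes reported by wc further down wrap to about 488 MB, and scaling by the average line length from the same wc output puts the apparent end of file near line 9.09 million, the same region as the 9,186,293 rows fread counted. The helper names are hypothetical.

```python
# Sketch: what a 64-bit file size looks like if only its low 32 bits are kept.
# Assumption: the size is truncated modulo 2**32; this is a guess at the
# failure mode, not taken from the fread source.

UINT32_RANGE = 2 ** 32  # 4 GiB


def low32(size_bytes: int) -> int:
    """Return the size as seen through an unsigned 32-bit integer."""
    return size_bytes % UINT32_RANGE


def estimated_last_line(size_bytes: int, total_lines: int) -> int:
    """Estimate where a reader would stop, assuming evenly sized lines."""
    avg_bytes_per_line = size_bytes / total_lines
    return round(low32(size_bytes) / avg_bytes_per_line)


# Figures from the `wc spd_all_fixed.csv` output quoted further down.
TOTAL_LINES = 168_997_637
TOTAL_BYTES = 9_078_155_125

print(low32(TOTAL_BYTES))                             # 488220533
print(estimated_last_line(TOTAL_BYTES, TOTAL_LINES))  # roughly 9.09 million
```

The estimate is not exact because line lengths vary, but it lands close to the observed row count. The same wrap-around would also explain why the truncated test files only start to fail once they pass 4GB.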
On 03.05.2013 14:32, Paul Harding wrote:
> Definitely a 64-bit machine. Here are the details:
>
> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
> Installed memory (RAM): 128GB
> System type: 64-bit Operating System
> Windows edition: Server 2008 R2 Enterprise SP1
> Regards,
> Paul
>
> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>
>> Hi Paul,
>>
>> Thanks for all this!
>>
>>> The problem arises when the file reaches 4GB, in this case between
>>> 8,030,000 and 8,040,000 rows:
>>
>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>
>> Thanks, Matthew
>>
>> On 02.05.2013 10:19, Paul Harding wrote:
>>
>>> Some supplementary information, here is the portion of the file (with
>>> row numbers, +1 for header) around where fread thinks the file ends.
>>>
>>> $ nl spd_all_fixed.csv | head -n 9186300 | tail
>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>
>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>> mid-line by the look of it!
>>>
>>> I've experimented by truncating the file. The error varies: either it
>>> reads too few records or gives the error I reported, presumably
>>> determined by whether the last perceived line is entire.
>>>
>>> The problem arises when the file reaches 4GB, in this case between
>>> 8,030,000 and 8,040,000 rows:
>>>
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>
>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>> All the fields on line 1 are character fields. Treating as the column names.
>>> Count of eol after first data row: 80300000
>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Type codes: 000002000 (+last 5 rows)
>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>> 0.000s ( 0%) Sep and header detection
>>> 0.000s ( 0%) Count rows (wc -l)
>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>> 171.188s ( 65%) Reading data
>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>> 0.000s ( 0%) Changing na.strings to NA
>>> 0.000s Total
>>>
>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>> All the fields on line 1 are character fields. Treating as the column names.
>>> Count of eol after first data row: 18913
>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002000 (+middle 5 rows)
>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>
>>> Regards,
>>> Paul
>>>
>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com [2]> wrote:
>>>
>>>> Here is the verbose output:
>>>>
>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 9186293
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002200 (+middle 5 rows)
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>
>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>> so each word one 'line' here), byte):
>>>>
>>>> $ wc spd_all_fixed.csv
>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>
>>>> [So fread 9M, wc 168M rows].
>>>>
>>>> Regards
>>>> Paul
>>>>
>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com [1]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know
>>>>> the output.
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>
>>>>>> Problem with fread on a large file
>>>>>>
>>>>>> The file is 8GB, just short of 200,000 lines, produced as SQL output
>>>>>> and modified by cygwin/perl to remove the second line.
>>>>>>
>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>
>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>
>>>>>> Looking for the offending line, with line numbers in output so I'm
>>>>>> guessing this is line 6 of the mid-file chunk examined,
>>>>>>
>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>
>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>
>>>>>> $ head spd_all_fixed.csv
>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>
>>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>>> problems on a small test data set run through an identical process
>>>>>> and using the same fread command.
>>>>>>
>>>>>> Regards
>>>>>> Paul
Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
[3] mailto:mdowle at mdowle.plus.com