[datatable-help] Fwd: fread on very large file

Matthew Dowle mdowle at mdowle.plus.com
Sat May 11 03:39:10 CEST 2013


 

Paul, Vishal, 

Commit 859:

* fread now supports files larger than 4GB on 64bit Windows (#2767, thanks to
  Paul Harding) and files between 2GB and 4GB on 32bit Windows (#2655, thanks
  to Vishal). A C call to GetFileSize() needed to be GetFileSizeEx().
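
To illustrate why the choice of size call matters: the Windows GetFileSize() function returns only the low-order 32 bits of the size as its return value (the high-order part comes back through an optional second argument), whereas GetFileSizeEx() fills in a single 64-bit value. The following is a minimal sketch in Python, not the data.table C source, of what a caller would see if it kept only the low-order part. The helper names `split_dwords` and `apparent_size` are hypothetical, and the byte count is the one `wc` reported for Paul's file in the quoted thread.

```python
# Sketch only: models a 64-bit file size split into two 32-bit halves,
# the way a 32-bit size API reports it. Not taken from data.table's C code.

DWORD_BITS = 32
DWORD_MASK = (1 << DWORD_BITS) - 1  # 0xFFFFFFFF


def split_dwords(size_bytes):
    """Return (high, low) 32-bit halves of a non-negative 64-bit size."""
    if size_bytes < 0 or size_bytes >= 1 << 64:
        raise ValueError("size must fit in an unsigned 64-bit integer")
    return size_bytes >> DWORD_BITS, size_bytes & DWORD_MASK


def apparent_size(size_bytes):
    """Size seen by a caller that keeps only the low-order half (assumed bug)."""
    _, low = split_dwords(size_bytes)
    return low


if __name__ == "__main__":
    true_size = 9078155125  # bytes, from `wc spd_all_fixed.csv` in this thread
    high, low = split_dwords(true_size)
    print("high:", high, "low:", low)
    print("apparent size:", apparent_size(true_size))
```

Under that assumption the file of about 8.5 GiB would look like a file of about 466 MiB, which would fit fread stopping early and mid-line as reported in the thread.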

Please test and confirm ok now.

Thanks, Matthew

On 03.05.2013 14:59, Matthew Dowle wrote: 

> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think
> GetFileSize() should be GetFileSizeEx(), iirc.
>
> Please could you file it as a bug on the tracker. Thanks.
>
> Matthew
>
> On 03.05.2013 14:32, Paul Harding wrote:
>
>> Definitely a 64-bit machine. Here are the details:
>>
>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>> Installed memory (RAM): 128GB
>> System type: 64-bit Operating System
>> Windows edition: Server 2008 R2 Enterprise SP1
>> Regards,
>> Paul
>>
>> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>> 
>>> Hi Paul,
>>>
>>> Thanks for all this!
>>>
>>>> The problem arises when the file reaches 4GB, in this case between
>>>> 8,030,000 and 8,040,000 rows:
>>>
>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>
>>> Thanks, Matthew
>>>
>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>
>>>> Some supplementary information, here is the portion of the file (with
>>>> row numbers, +1 for header) around where fread thinks the file ends.
>>>>
>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>>> mid-line by the look of it!
>>>> I've experimented by truncating the file. The error varies, either it
>>>> reads too few records or gives the error I reported, presumably
>>>> determined by whether the last perceived line is entire.
>>>> The problem arises when the file reaches 4GB, in this case between
>>>> 8,030,000 and 8,040,000 rows:
>>>>
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>
>>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 80300000
>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002000 (+middle 5 rows)
>>>> Type codes: 000002000 (+last 5 rows)
>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>> 0.000s ( 0%) Sep and header detection
>>>> 0.000s ( 0%) Count rows (wc -l)
>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>> 171.188s ( 65%) Reading data
>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>> 0.000s ( 0%) Changing na.strings to NA
>>>> 0.000s Total
>>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>> Count of eol after first data row: 18913
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002000 (+middle 5 rows)
>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>> Regards,
>>>> Paul
>>>>
>>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com [2]> wrote:
>>>>
>>>>> Here is the verbose output:
>>>>>
>>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>> Found 9 columns
>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>> Count of eol after first data row: 9186293
>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>> Type codes: 000002000 (first 5 rows)
>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>>
>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim so
>>>>> each word one 'line' here), byte):
>>>>>
>>>>> $ wc spd_all_fixed.csv
>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>> [So fread 9M, wc 168M rows].
>>>>> Regards
>>>>> Paul
>>>>>
>>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com [1]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>>>>> output.
>>>>>>
>>>>>> Thanks, Matthew
>>>>>>
>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>
>>>>>>> Problem with fread on a large file. The file is 8GB, just short of
>>>>>>> 200,000 lines, produced as SQL output and modified by cygwin/perl to
>>>>>>> remove the second line.
>>>>>>>
>>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>>
>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>> Looking for the offending line, with line numbers in output so I'm
>>>>>>> guessing this is line 6 of the mid-file chunk examined,
>>>>>>>
>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>
>>>>>>> $ head spd_all_fixed.csv
>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>>>> problems on a small test data set run through an identical process and
>>>>>>> using the same fread command.
>>>>>>> Regards 
>>>>>>> Paul
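
As a rough cross-check of the figures quoted in this thread, the sketch below (Python, arithmetic only) assumes that fread saw only the low-order 32 bits of the file size and that lines are of roughly uniform length, then estimates how many lines fit in that apparent size. The function name `estimated_lines_seen` and the uniform-length assumption are illustrative and are not part of fread.

```python
# Sketch only: estimate where a reader would stop if the file size were
# truncated to 32 bits. Assumes roughly uniform line lengths.

def estimated_lines_seen(true_size_bytes, true_line_count):
    """Estimate how many lines fit in the low 32 bits' worth of the file."""
    if true_line_count <= 0:
        raise ValueError("line count must be positive")
    apparent_size = true_size_bytes % (1 << 32)  # low-order 32 bits only
    mean_line_bytes = true_size_bytes / true_line_count
    return apparent_size / mean_line_bytes


if __name__ == "__main__":
    # Figures from `wc spd_all_fixed.csv` quoted in the thread.
    size_bytes = 9078155125
    line_count = 168997637
    # "Count of eol after first data row" from fread's verbose output.
    fread_rows = 9186293

    estimate = estimated_lines_seen(size_bytes, line_count)
    relative_gap = abs(estimate - fread_rows) / fread_rows
    print(f"estimated lines in apparent size: {estimate:,.0f}")
    print(f"relative gap to fread's count: {relative_gap:.1%}")
```

The estimate lands within a couple of percent of the 9,186,293 rows fread counted, which is consistent with (though not proof of) a 32-bit truncation of the file size.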




Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
[3] mailto:mdowle at mdowle.plus.com