[datatable-help] Fwd: fread on very large file

Matthew Dowle mdowle at mdowle.plus.com
Mon May 13 22:38:31 CEST 2013


 

Hi Paul, 

Sorry for that hassle. As you've realised I don't develop
data.table on Windows. Those lines are switched in at compile time for
Windows, and so I rely on (the truly impressive) winbuilder to compile
and test for me. On this occasion, I did submit to winbuilder last night
but it didn't reply (even with a compile error) which is extremely
unusual. And R-Forge is stuck in 'building' state too (which is not
unusual, sadly). 

I'll let you know when it's passing on winbuilder,
and I'll update the Windows .zip on the homepage (since we can't rely
on R-Forge) ... 

Matthew 

On 13.05.2013 16:01, Paul Harding wrote: 

> I'd love to test it, pulled the latest commit with svn, not sure about
> building from source on Windows, got some compilation errors:
> 
> > install.packages("pkg/",type="source",repos=NULL)
> Warning in install.packages :
>   package 'pkg/' is not available (for R version 3.0.0)
> * installing *source* package 'data.table' ...
> ** libs
> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o
> fread.c: In function 'readfile':
> fread.c:343:9: error: 'hfile' undeclared (first use in this function)
> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in
> fread.c:346:115: error: expected ';' before ')' token
> fread.c:346:115: error: expected statement before ')' token
> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration]
> make: *** [fread.o] Error 1
> ERROR: compilation failed for package 'data.table'
>
> Regards
> Paul
> 
> On 11 May 2013 02:39, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>
>> Paul, Vishal,
>>
>> Commit 859 :
>>
>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks
>> to Paul Harding) and files between 2GB and 4GB on 32bit Windows (#2655
>> thanks to Vishal). A C call to GetFileSize() needed to be GetFileSizeEx().
>>
>> Please test and confirm ok now.
>>
>> Thanks, Matthew
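For readers tracing the 4GB threshold: GetFileSize() returns the low-order 32 bits of the file size as its return value and hands back the high-order 32 bits only through an optional second argument. The sketch below assumes (the thread does not show the original fread.c code) that only the low 32 bits were used, and applies that to the byte count wc reports further down the thread for spd_all_fixed.csv. The function name and constant names are illustrative and not taken from data.table.

```python
# Illustrative arithmetic only; assumes the size seen by fread was the
# true size truncated to its low 32 bits (not confirmed by the thread).

TRUE_SIZE_BYTES = 9078155125   # byte count from `wc spd_all_fixed.csv`
TRUE_LINE_COUNT = 168997637    # newline count from the same wc output
FREAD_EOL_COUNT = 9186293      # "Count of eol after first data row" from fread


def low_32_bits(size_bytes):
    """Return the size as it would appear if only the low 32 bits were kept."""
    return size_bytes & 0xFFFFFFFF


apparent_size = low_32_bits(TRUE_SIZE_BYTES)

# Estimate how many lines fit in the apparent size, using the file-wide
# average line length. This is a rough estimate: line lengths vary.
avg_line_bytes = TRUE_SIZE_BYTES / TRUE_LINE_COUNT
estimated_lines = apparent_size / avg_line_bytes

if __name__ == "__main__":
    print("apparent size (bytes):", apparent_size)  # 488220533
    print("estimated lines seen :", round(estimated_lines))
    print("fread eol count      :", FREAD_EOL_COUNT)
```

Under that assumption the apparent size is 488220533 bytes, which at the file's average line length works out to roughly 9.09 million lines, close to the 9186293 rows fread counted before stopping mid-line.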
>> 
>> On 03.05.2013 14:59, Matthew Dowle wrote:
>>
>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think
>>> GetFileSize() should be GetFileSizeEx(), iirc.
>>>
>>> Please could you file it as a bug on the tracker. Thanks.
>>>
>>> Matthew
>>>
>>> On 03.05.2013 14:32, Paul Harding wrote:
>>> 
>>>> Definitely a 64-bit machine. Here are the details:
>>>>
>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>>>> Installed memory (RAM): 128GB
>>>> System type: 64-bit Operating System
>>>> Windows edition: Server 2008 R2 Enterprise SP1
>>>>
>>>> Regards,
>>>> Paul
>>>> 
>>>> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>
>>>>> Hi Paul,
>>>>>
>>>>> Thanks for all this!
>>>>>
>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>
>>>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>>> 
>>>>>> Some supplementary information, here is the portion of the file (with
>>>>>> row numbers, +1 for header) around where fread thinks the file ends.
>>>>>>
>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 | tail
>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>>>
>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>>>>> mid-line by the look of it!
>>>>>> I've experimented by truncating the file. The error varies, either it
>>>>>> reads too few records or gives the error I reported, presumably
>>>>>> determined by whether the last perceived line is entire.
>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>> 
>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>>>
>>>>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>>>
>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>> Found 9 columns
>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>> Count of eol after first data row: 80300000
>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>>>
>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>> Type codes: 000002000 (+last 5 rows)
>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>>>> 0.000s ( 0%) Sep and header detection
>>>>>> 0.000s ( 0%) Count rows (wc -l)
>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>>>> 171.188s ( 65%) Reading data
>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>>>> 0.000s ( 0%) Changing na.strings to NA
>>>>>> 0.000s Total
>>>>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>>>
>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>> Found 9 columns
>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>> Count of eol after first data row: 18913
>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>>>
>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>>>>
>>>>>> Regards,
>>>>>> Paul
>>>>>> 
>>>>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com [2]> wrote:
>>>>>>
>>>>>>> Here is the verbose output:
>>>>>>>
>>>>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>> Found 9 columns
>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>> Count of eol after first data row: 9186293
>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>
>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>>>>> so each word one 'line' here), byte):
>>>>>>>
>>>>>>> $ wc spd_all_fixed.csv
>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>>>>
>>>>>>> [So fread 9M, wc 168M rows].
>>>>>>>
>>>>>>> Regards
>>>>>>> Paul
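Where cygwin's wc is not to hand, the same cross-check against fread's "Count of eol" line can be approximated by streaming the file in binary chunks and counting newline bytes. This is a minimal sketch, not part of data.table; the function name and chunk size are arbitrary choices, and the small file in the demo is a stand-in built from rows quoted in this thread.

```python
import os
import tempfile


def count_lines_and_bytes(path, chunk_size=1 << 20):
    """Count newline bytes and total bytes, similar in spirit to `wc -l -c`.

    The file is read in binary mode so CRLF line endings are not translated
    and the byte total matches the size on disk.
    """
    newlines = 0
    total = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
            total += len(chunk)
    return newlines, total


if __name__ == "__main__":
    # Stand-in data: the header and two rows from the thread, CRLF-terminated.
    sample = (
        b"s_key,i_key,p_key,q,pq,d,l,epi,class\r\n"
        b"203974,1107181,20110713,0,0,0.13700080000000001,0,0,0\r\n"
        b"203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0\r\n"
    )
    with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
        tmp.write(sample)
    try:
        print(count_lines_and_bytes(tmp.name))
    finally:
        os.unlink(tmp.name)
```

Python integers do not overflow, so the byte total stays correct for files beyond 4GB; a large gap between this newline count and fread's row count is the kind of discrepancy reported above.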
>>>>>>> 
>>>>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com [1]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>>>>
>>>>>>>> Thanks, Matthew
>>>>>>>>
>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>>> 
>>>>>>>>> Problem with fread on a large file
>>>>>>>>>
>>>>>>>>> The file is 8GB, just short of 200,000 lines, produced as SQL output
>>>>>>>>> and modified by cygwin/perl to remove the second line.
>>>>>>>>>
>>>>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>>>>>>>>>
>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>
>>>>>>>>> Looking for the offending line, with line numbers in output so I'm
>>>>>>>>> guessing this is line 6 of the mid-file chunk examined,
>>>>>>>>>
>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>>>>
>>>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>>>
>>>>>>>>> $ head spd_all_fixed.csv
>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>>>>
>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>>>>>> problems on a small test data set run through an identical process
>>>>>>>>> and using the same fread command.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Paul




Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com