[datatable-help] Fwd: fread on very large file

Mon May 13 23:26:57 CEST 2013

Passing on winbuilder now. 

The Windows .zip (rev 874) is uploaded to the homepage
(it will take an hour or two to refresh), but it is available now from here:

https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable

Matthew 

On 13.05.2013 21:38, Matthew Dowle wrote: 

> Hi Paul,
>
> Sorry for that hassle. As you've realised, I don't develop data.table
> on Windows. Those lines are switched in at compile time for Windows, and
> so I rely on (the truly impressive) winbuilder to compile and test for
> me. On this occasion, I did submit to winbuilder last night but it
> didn't reply (even with a compile error), which is extremely unusual. And
> R-Forge is stuck in 'building' state too (which is not unusual, sadly).
>
> I'll let you know when it's passing on winbuilder, and I'll
> update the Windows .zip on the homepage (since we can't rely on
> R-Forge) ...
>
> Matthew
> 
> On 13.05.2013 16:01, Paul Harding wrote:
>
>> I'd love to test it. I pulled the latest commit with svn, but I'm
>> not sure about building from source on Windows; I got some compilation
>> errors:
>>
>> > install.packages("pkg/",type="source",repos=NULL)
>> Warning in install.packages :
>> package 'pkg/' is not available (for R version 3.0.0)
>> * installing *source* package 'data.table' ...
>> ** libs
>> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o
>> fread.c: In function 'readfile':
>> fread.c:343:9: error: 'hfile' undeclared (first use in this function)
>> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in
>> fread.c:346:115: error: expected ';' before ')' token
>> fread.c:346:115: error: expected statement before ')' token
>> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration]
>> make: *** [fread.o] Error 1
>> ERROR: compilation failed for package 'data.table'
>> Regards
>> Paul
>>
>> On 11 May 2013 02:39, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>>
>>> Paul, Vishal,
>>>
>>> Commit 859 :
>>>
>>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to
>>> Paul Harding) and files between 2GB and 4GB on 32bit Windows (#2655
>>> thanks to Vishal). A C call to GetFileSize() needed to be
>>> GetFileSizeEx().
>>>
>>> Please test and confirm ok now.
>>>
>>> Thanks, Matthew
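
As a rough illustration of why a 32-bit size matters here (an analogy in shell, not the code from fread.c): GetFileSize() hands back the low-order 32 bits of the size as its return value, so if only that value is used, any file of 4GB or more appears shorter than it really is. The helper name `low32` and the first two sample sizes are made up for the example; 9078155125 is the byte count reported by `wc` elsewhere in this thread. The sketch assumes a shell with 64-bit arithmetic, such as bash.

```shell
# Keep only the low 32 bits of a byte count.
# low32 is a hypothetical helper for illustration, not part of data.table.
low32() {
  echo $(( $1 % 4294967296 ))   # 4294967296 = 2^32
}

low32 4294967295   # just under 4GB, unchanged: prints 4294967295
low32 4294967297   # just over 4GB, wraps around: prints 1
low32 9078155125   # the 8GB file from this thread: prints 488220533
```

With only the low 32 bits, the 8GB file looks as if it were about 488 million bytes long, so a reader trusting that size would stop far short of the real end.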
>>> 
>>> On 03.05.2013 14:59, Matthew Dowle wrote:
>>>
>>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB.
>>>> Think GetFileSize() should be GetFileSizeEx(), iirc.
>>>>
>>>> Please could you file it as a bug on the tracker. Thanks.
>>>>
>>>> Matthew
>>>>
>>>> On 03.05.2013 14:32, Paul Harding wrote:
>>>>
>>>>> Definitely a 64-bit machine. Here are the details:
>>>>>
>>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>>>>> Installed memory (RAM): 128GB
>>>>> System type: 64-bit Operating System
>>>>> Windows edition: Server 2008 R2 Enterprise SP1
>>>>> Regards,
>>>>> Paul
>>>>>
>>>>> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Thanks for all this!
>>>>>>
>>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>>
>>>>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>>>>
>>>>>> Thanks, Matthew
>>>>>>
>>>>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>>>>
>>>>>>> Some supplementary information, here is the portion of the file
>>>>>>> (with row numbers, +1 for header) around where fread thinks the file ends.
>>>>>>>
>>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>>>>
>>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>>>>>> mid-line by the look of it!
>>>>>>> I've experimented by truncating the file. The error varies, either it
>>>>>>> reads too few records or gives the error I reported, presumably
>>>>>>> determined by whether the last perceived line is entire.
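
Whether a given cut-off point leaves the last perceived line entire can be checked directly. The following is a minimal sketch, assuming a POSIX-like shell with `head -c`, `tail -c` and `od` (for example under cygwin); `ends_on_eol` is a hypothetical helper name and the tiny CSV is invented for the demonstration.

```shell
# Succeed if the first N bytes of FILE end exactly on a newline (0x0a).
# ends_on_eol is a hypothetical helper; the body runs in a subshell so
# its variable does not leak into the caller.
ends_on_eol() (
  last=$(head -c "$2" "$1" | tail -c 1 | od -An -tx1 | tr -d ' ')
  [ "$last" = "0a" ]
)

tmp=$(mktemp)
printf 'a,b\r\n1,2\r\n3,4\r\n' > "$tmp"   # three 5-byte CRLF lines
if ends_on_eol "$tmp" 10; then echo "10 bytes: ends on a line boundary"; fi
if ! ends_on_eol "$tmp" 12; then echo "12 bytes: cut mid-line"; fi
rm -f "$tmp"
```

On the real file the second argument would be the number of bytes the reader believes the file to contain; a cut that lands mid-line would be expected to give a parse error, while one that lands on a boundary would silently give too few rows.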
>>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>>>
>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>>>>
>>>>>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>>>>
>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>> Found 9 columns
>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>> Count of eol after first data row: 80300000
>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>>>>
>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>> Type codes: 000002000 (+last 5 rows)
>>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>>>>> 0.000s ( 0%) Sep and header detection
>>>>>>> 0.000s ( 0%) Count rows (wc -l)
>>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>>>>> 171.188s ( 65%) Reading data
>>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>>>>> 0.000s ( 0%) Changing na.strings to NA
>>>>>>> 0.000s Total
>>>>>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>>>>
>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>> Found 9 columns
>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>> Count of eol after first data row: 18913
>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>>>>
>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>>>>> Regards,
>>>>>>> Paul
>>>>>>>
>>>>>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com [2]> wrote:
>>>>>>>
>>>>>>>> Here is the verbose output:
>>>>>>>>
>>>>>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>> Found 9 columns
>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>> Count of eol after first data row: 9186293
>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>
>>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>>>>>> so each word one 'line' here), byte):
>>>>>>>>
>>>>>>>> $ wc spd_all_fixed.csv
>>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>>>>> [So fread 9M, wc 168M rows].
>>>>>>>> Regards
>>>>>>>> Paul
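
The gap between the two counts can be explored with some back-of-envelope arithmetic. This is only a sketch under two assumptions: that fread saw just the low 32 bits of the file size (as the later GetFileSizeEx() fix in this thread suggests), and that every line has the file's average length. `est_lines` is a hypothetical helper name; the byte and line counts are the ones from the `wc` output above.

```shell
# Estimate how many lines fit in the first SEEN bytes of a file with
# BYTES bytes and LINES lines, assuming all lines have average length.
# est_lines is a hypothetical helper; integer arithmetic, 64-bit shell assumed.
est_lines() (
  bytes=$1; lines=$2; seen=$3
  echo $(( seen * lines / bytes ))
)

bytes=9078155125                  # byte count from wc
lines=168997637                   # newline count from wc
seen=$(( bytes % 4294967296 ))    # low 32 bits of the size: 488220533

est_lines "$bytes" "$lines" "$seen"   # roughly 9.09 million
```

The estimate lands at roughly 9.09 million lines, in the same region as the 9,186,293 rows fread counted, which is at least consistent with a size that wrapped at 32 bits.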
>>>>>>>>
>>>>>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com [1]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>>>>>
>>>>>>>>> Thanks, Matthew
>>>>>>>>>
>>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>>>>
>>>>>>>>>> Problem with fread on a large file. The file is 8GB, just short of
>>>>>>>>>> 200,000 lines, produced as SQL output and modified by cygwin/perl to
>>>>>>>>>> remove the second line.
>>>>>>>>>>
>>>>>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>>>>>
>>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>>
>>>>>>>>>> Looking for the offending line, with line numbers in output so I'm
>>>>>>>>>> guessing this is line 6 of the mid-file chunk examined,
>>>>>>>>>>
>>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>>>>>
>>>>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>>>>
>>>>>>>>>> $ head spd_all_fixed.csv
>>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>>>>>
>>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>>>>>>> problems on a small test data set run through an identical process
>>>>>>>>>> and using the same fread command.
>>>>>>>>>> Regards
>>>>>>>>>> Paul

Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com