[datatable-help] Fwd: fread on very large file

Tue May 14 22:52:05 CEST 2013

Hi Paul, 

Great to hear, interesting timings. Yup - with a 16GB data.table in RAM,
now we're talking. This is the kind of size data.table was intended for.
Don't try names(DT)[1]<-"newname" on that!

I have changed the .zip file name on the homepage - thanks for mentioning it.
And I see R-Forge is up to date with "Current" status anyway after all
that, so installing via the R-Forge repo should be fine now, too.

Matthew 

On 14.05.2013 13:28, Paul Harding wrote:

> Hi Matthew, some frustration until I worked out I needed to rename the zip
> file to data.table.zip to install! I have regression tested on a 4GB file, and
> tested on a 19GB whopper. Obviously it is a tad slow, but read.csv would never
> get there! Delighted, I can't do what I need to do on these big datasets
> without data.table. All seems fine, correct record count etc. I'm not checking
> every line of data ;-)
> 
> > gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv")
> > big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
> Found 16 columns
> First row with 16 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 214038352
> Subtracted 1 for last eol and any trailing empty lines, leaving 214038351 data rows
> Type codes: 0003330030000000 (first 5 rows)
> Type codes: 0003330030000000 (+middle 5 rows)
> Type codes: 0003330030000000 (+last 5 rows)
>    0.050s (  0%) Memory map (rerun may be quicker)
>    0.020s (  0%) sep and header detection
>  159.560s ( 35%) Count rows (wc -l)
>    0.001s (  0%) Column type detection (first, middle and last 5 rows)
>   46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM
>  244.760s ( 54%) Reading data
>    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
>    0.000s (  0%) Coercing data already read in type bumps (if any)
>    5.258s (  1%) Changing na.strings to NA
>  455.916s        Total
> 
> $ wc data_extract_1_fixed.csv
> 214038352 414098500 19745071003 data_extract_1_fixed.csv
>
> > tables()
>      NAME            NROW    MB COLS                                                                             KEY
> [1,] big.dt   214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw
> [2,] gash.dt   46,535,426  3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw
> [3,] range.dt           1     1 startdt,enddt
> [4,] spd.dt    46,535,426  4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate
> [5,] test.dt            5     1 digits,letters                                                                   digits
> Total: 23,966MB
> 
> On 13 May 2013 22:26, Matthew Dowle <mdowle at mdowle.plus.com [6]> wrote:
> 
>> Passing on winbuilder now.
>>
>> .zip (rev 874) uploaded to homepage (will take an hour or two to refresh),
>> but available now from here:
>>
>> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable [5]
>> 
>> Matthew 
>> 
>> On 13.05.2013 21:38, Matthew Dowle wrote:

>> 
>>> Hi Paul,
>>>
>>> Sorry for that hassle. As you've realised, I don't develop data.table on
>>> Windows. Those lines are switched in at compile time for Windows, and so I
>>> rely on (the truly impressive) winbuilder to compile and test for me. On this
>>> occasion, I did submit to winbuilder last night but it didn't reply (even
>>> with a compile error), which is extremely unusual. And R-Forge is stuck in
>>> 'building' state too (which is not unusual, sadly).
>>>
>>> I'll let you know when it's passing on winbuilder, and I'll update the
>>> Windows .zip on the homepage (since we can't rely on R-Forge) ...
>>>
>>> Matthew
>>>
>>> On 13.05.2013 16:01, Paul Harding wrote:
>>> 
>>>> I'd love to test it, pulled the latest commit with svn, not sure about
>>>> building from source on windows, got some compilation errors:
>>>>
>>>> > install.packages("pkg/",type="source",repos=NULL)
>>>> Warning in install.packages :
>>>> package 'pkg/' is not available (for R version 3.0.0)
>>>> * installing *source* package 'data.table' ...
>>>> ** libs
>>>> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o
>>>> fread.c: In function 'readfile':
>>>> fread.c:343:9: error: 'hfile' undeclared (first use in this function)
>>>> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in
>>>> fread.c:346:115: error: expected ';' before ')' token
>>>> fread.c:346:115: error: expected statement before ')' token
>>>> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration]
>>>> make: *** [fread.o] Error 1
>>>> ERROR: compilation failed for package 'data.table'
>>>>
>>>> Regards
>>>> Paul
>>>> 
>>>> On 11 May 2013 02:39, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>>>> 
>>>>> Paul, Vishal,
>>>>>
>>>>> Commit 859 :
>>>>>
>>>>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks
>>>>>   to Paul Harding) and files between 2GB and 4GB on 32bit Windows (#2655
>>>>>   thanks to Vishal). A C call to GetFileSize() needed to be
>>>>>   GetFileSizeEx().
>>>>>
>>>>> Please test and confirm ok now.
>>>>> 
>>>>> Thanks, Matthew
>>>>> 
>>>>> On 03.05.2013 14:59, Matthew Dowle wrote:
>>>>> 
>>>>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB. Think
>>>>>> GetFileSize() should be GetFileSizeEx(), iirc.
>>>>>>
>>>>>> Please could you file it as a bug on the tracker. Thanks.
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>> On 03.05.2013 14:32, Paul Harding wrote:
>>>>>> 
>>>>>>> Definitely a 64-bit machine. Here are the details:
>>>>>>>
>>>>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>>>>>>> Installed memory (RAM): 128GB
>>>>>>> System type: 64-bit Operating System
>>>>>>> Windows edition: Server 2008 R2 Enterprise SP1
>>>>>>>
>>>>>>> Regards,
>>>>>>> Paul
>>>>>>>
>>>>>>> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>>>> 
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Thanks for all this!
>>>>>>>>
>>>>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>>>>
>>>>>>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>>>>>>
>>>>>>>> Thanks, Matthew
>>>>>>>>
>>>>>>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>>>>>> 
>>>>>>>>> Some supplementary information, here is the portion of the file (with
>>>>>>>>> row numbers, +1 for header) around where fread thinks the file ends.
>>>>>>>>>
>>>>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>>>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>>>>>>
>>>>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>>>>>>>> mid-line by the look of it!
>>>>>>>>> I've experimented by truncating the file. The error varies, either it
>>>>>>>>> reads too few records or gives the error I reported, presumably
>>>>>>>>> determined by whether the last perceived line is entire.
>>>>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>>>>>
>>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>>>>>>

>>>>>>>>> > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>>>>>>
>>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>>> Found 9 columns
>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>>> Count of eol after first data row: 80300000
>>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>>>>>>
>>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>>>> Type codes: 000002000 (+last 5 rows)
>>>>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>>>>>>> 0.000s ( 0%) Sep and header detection
>>>>>>>>> 0.000s ( 0%) Count rows (wc -l)
>>>>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>>>>>>> 171.188s ( 65%) Reading data
>>>>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>>>>>>> 0.000s ( 0%) Changing na.strings to NA
>>>>>>>>> 0.000s Total
>>>>>>>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>>>>>>
>>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>>> Found 9 columns
>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>>> Count of eol after first data row: 18913
>>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>>>>>>
>>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Paul

>>>>>>>>> 
>>>>>>>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com [2]> wrote:
>>>>>>>>> 
>>>>>>>>>> Here is the verbose output:
>>>>>>>>>>
>>>>>>>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>>>> Found 9 columns
>>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>>>> Count of eol after first data row: 9186293
>>>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>>
>>>>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>>>>>>>> so each word one 'line' here), byte):
>>>>>>>>>>
>>>>>>>>>> $ wc spd_all_fixed.csv
>>>>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>>>>>>>
>>>>>>>>>> [So fread 9M, wc 168M rows].
>>>>>>>>>> Regards
>>>>>>>>>> Paul
>>>>>>>>>> 
>>>>>>>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com [1]> wrote:
>>>>>>>>>>

>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>>>>>>>>>> output.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Matthew
>>>>>>>>>>>
>>>>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Problem with fread on a large file. The file is 8GB, just short of
>>>>>>>>>>>> 200,000 lines, produced as SQL output and modified by cygwin/perl to
>>>>>>>>>>>> remove the second line.
>>>>>>>>>>>>
>>>>>>>>>>>> Using data.table 1.8.8 on R 3.0.0 I get an fread error
>>>>>>>>>>>>
>>>>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>>>>
>>>>>>>>>>>> Looking for the offending line, with line numbers in output, so I'm
>>>>>>>>>>>> guessing this is line 6 of the mid-file chunk examined,
>>>>>>>>>>>>
>>>>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>>>>>>>
>>>>>>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>>>>>>
>>>>>>>>>>>> $ head spd_all_fixed.csv
>>>>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>>>>>>>
>>>>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>>>>>>>>> problems on a small test data set run through an identical process
>>>>>>>>>>>> and using the same fread command.
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Paul

Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com
[5] https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable
[6] mailto:mdowle at mdowle.plus.com