[datatable-help] Fwd: fread on very large file
Matthew Dowle
mdowle at mdowle.plus.com
Tue May 14 22:52:05 CEST 2013
Hi Paul,
Great to hear, interesting timings. Yup - with a 16GB data.table in RAM,
now we're talking. It's this kind of size data.table was intended for.
Don't try names(DT)[1]<-"newname" on that!
Have changed the .zip file name on the homepage - thanks for mentioning it.
And I see R-Forge is up to date and "Current" status anyway after all
that, so via the R-Forge repo should be fine now, too.
Matthew
On 14.05.2013 13:28, Paul Harding wrote:
> Hi Matthew, some frustration until I worked out I needed to rename the
> zip file to data.table.zip to install! I have regression tested on a 4GB
> file, and tested on a 19GB whopper. Obviously it is a tad slow, but
> read.csv would never get there! Delighted, I can't do what I need to do
> on these big datasets without data.table. All seems fine, correct record
> count etc. I'm not checking every line of data ;-)
>
>> gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv")
>> big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T)
> Detected eol as \r\n (CRLF) in that order, the Windows standard.
> Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
> Found 16 columns
> First row with 16 fields occurs on line 1 (either column names or first row of data)
> All the fields on line 1 are character fields. Treating as the column names.
> Count of eol after first data row: 214038352
> Subtracted 1 for last eol and any trailing empty lines, leaving 214038351 data rows
> Type codes: 0003330030000000 (first 5 rows)
> Type codes: 0003330030000000 (+middle 5 rows)
> Type codes: 0003330030000000 (+last 5 rows)
> 0.050s ( 0%) Memory map (rerun may be quicker)
> 0.020s ( 0%) sep and header detection
> 159.560s ( 35%) Count rows (wc -l)
> 0.001s ( 0%) Column type detection (first, middle and last 5 rows)
> 46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM
> 244.760s ( 54%) Reading data
> 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
> 0.000s ( 0%) Coercing data already read in type bumps (if any)
> 5.258s ( 1%) Changing na.strings to NA
> 455.916s Total
>
> $ wc data_extract_1_fixed.csv
> 214038352 414098500 19745071003 data_extract_1_fixed.csv
>
>> tables()
>      NAME            NROW    MB COLS                                                                              KEY
> [1,] big.dt   214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw
> [2,] gash.dt   46,535,426  3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw
> [3,] range.dt           1     1 startdt,enddt
> [4,] spd.dt    46,535,426  4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate
> [5,] test.dt            5     1 digits,letters                                                                    digits
> Total: 23,966MB
>
> On 13 May 2013 22:26, Matthew Dowle <mdowle at mdowle.plus.com [6]> wrote:
>
>> Passing on winbuilder now.
>>
>> .zip (rev 874) uploaded to homepage (will take an hour or two to
>> refresh), but available now from here :
>>
>> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable [5]
>>
>> Matthew
>>
>> On 13.05.2013 21:38, Matthew Dowle wrote:
>>
>>> Hi Paul,
>>>
>>> Sorry for that hassle. As you've realised I don't develop data.table
>>> on Windows. Those lines are switched in at compile time for Windows,
>>> and so I rely on (the truly impressive) winbuilder to compile and test
>>> for me. On this occasion, I did submit to winbuilder last night but it
>>> didn't reply (even with a compile error) which is extremely unusual.
>>> And R-Forge is stuck in 'building' state too (which is not unusual, sadly).
>>>
>>> I'll let you know when it's passing on winbuilder, and I'll update the
>>> Windows .zip on the homepage (since we can't rely on R-Forge) ...
>>>
>>> Matthew
>>>
>>> On 13.05.2013 16:01, Paul Harding wrote:
>>>
>>>> I'd love to test it, pulled the latest commit with svn, not sure
>>>> about building from source on windows, got some compilation errors:
>>>>
>>>>> install.packages("pkg/",type="source",repos=NULL)
>>>> Warning in install.packages :
>>>> package 'pkg/' is not available (for R version 3.0.0)
>>>> * installing *source* package 'data.table' ...
>>>> ** libs
>>>> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG -I"d:/RCompile/CRANpkg/extralibs64/local/include" -O2 -Wall -std=gnu99 -mtune=core2 -c fread.c -o fread.o
>>>> fread.c: In function 'readfile':
>>>> fread.c:343:9: error: 'hfile' undeclared (first use in this function)
>>>> fread.c:343:9: note: each undeclared identifier is reported only once for each function it appears in
>>>> fread.c:346:115: error: expected ';' before ')' token
>>>> fread.c:346:115: error: expected statement before ')' token
>>>> fread.c:350:17: warning: implicit declaration of function 'nanosleep' [-Wimplicit-function-declaration]
>>>> make: *** [fread.o] Error 1
>>>> ERROR: compilation failed for package 'data.table'
>>>> Regards
>>>> Paul
>>>>
>>>> On 11 May 2013 02:39, Matthew Dowle <mdowle at mdowle.plus.com [4]> wrote:
>>>>
>>>>> Paul, Vishal,
>>>>>
>>>>> Commit 859 :
>>>>>
>>>>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files
>>>>> between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to
>>>>> be GetFileSizeEx().
>>>>>
>>>>> Please test and confirm ok now.
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>> On 03.05.2013 14:59, Matthew Dowle wrote:
>>>>>
>>>>>> Oh. Then it's likely a bug with fread on Windows for files > 4GB.
>>>>>> Think GetFileSize() should be GetFileSizeEx(), iirc.
>>>>>>
>>>>>> Please could you file it as a bug on the tracker. Thanks.
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>> On 03.05.2013 14:32, Paul Harding wrote:
>>>>>>
>>>>>>> Definitely a 64-bit machine. Here are the details:
>>>>>>>
>>>>>>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>>>>>>> Installed memory (RAM): 128GB
>>>>>>> System type: 64-bit Operating System
>>>>>>> Windows edition: Server 2008 R2 Enterprise SP1
>>>>>>> Regards,
>>>>>>> Paul
>>>>>>>
>>>>>>> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com [3]> wrote:
>>>>>>>
>>>>>>>> Hi Paul,
>>>>>>>>
>>>>>>>> Thanks for all this!
>>>>>>>>
>>>>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>>>>
>>>>>>>> Ahah. Are you using a 32bit or 64bit Windows machine?
>>>>>>>>
>>>>>>>> Thanks, Matthew
>>>>>>>>
>>>>>>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>>>>>>
>>>>>>>>> Some supplementary information, here is the portion of the file (with
>>>>>>>>> row numbers, +1 for header) around where fread thinks the file ends.
>>>>>>>>>
>>>>>>>>> $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>>>>>>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>>>>>>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>>>>>>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>>>>>>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>>>>>>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>>>>>>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>>>>>>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>>>>>>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>>>>>>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>>>>>>> 9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>>>>>>>> mid-line by the look of it!
>>>>>>>>> I've experimented by truncating the file. The error varies, either it
>>>>>>>>> reads too few records or gives the error I reported, presumably
>>>>>>>>> determined by whether the last perceived line is entire.
>>>>>>>>> The problem arises when the file reaches 4GB, in this case between
>>>>>>>>> 8,030,000 and 8,040,000 rows:
>>>>>>>>>
>>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May 1 12:02 spd_all_trunc_8030k.csv
>>>>>>>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May 1 12:06 spd_all_trunc_8040k.csv
>>>>>>>>>
>>>>>>>>>> dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>>>>>>>
>>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>>> Found 9 columns
>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>>> Count of eol after first data row: 80300000
>>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999 data rows
>>>>>>>>>
>>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>>>> Type codes: 000002000 (+last 5 rows)
>>>>>>>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains '0.42634430000000001'
>>>>>>>>> Bumping column 7 from INT64 to REAL on data row 9, field contains '0.42634430000000001'
>>>>>>>>> 0.000s ( 0%) Memory map (rerun may be quicker)
>>>>>>>>> 0.000s ( 0%) Sep and header detection
>>>>>>>>> 0.000s ( 0%) Count rows (wc -l)
>>>>>>>>> 0.000s ( 0%) Colmn type detection (first, middle and last 5 rows)
>>>>>>>>> 0.000s ( 0%) Allocation of 80299999x9 result (xMB) in RAM
>>>>>>>>> 171.188s ( 65%) Reading data
>>>>>>>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc time if triggered
>>>>>>>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>>>>>>>> 0.000s ( 0%) Changing na.strings to NA
>>>>>>>>> 0.000s Total
>>>>>>>>>> dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>>>>>>>
>>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>>> Found 9 columns
>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>>> Count of eol after first data row: 18913
>>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913 data rows
>>>>>>>>>
>>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>>> Type codes: 000002000 (+middle 5 rows)
>>>>>>>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>>>>>>> Expected sep (',') but ',' ends field 2 on line 6 when detecting types: 204650,724540,
>>>>>>>>> Regards,
>>>>>>>>> Paul
>>>>>>>>>
>>>>>>>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com [2]> wrote:
>>>>>>>>>
>>>>>>>>>> Here is the verbose output:
>>>>>>>>>>
>>>>>>>>>>> dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the first 30) ... found
>>>>>>>>>> Found 9 columns
>>>>>>>>>> First row with 9 fields occurs on line 1 (either column names or first row of data)
>>>>>>>>>> All the fields on line 1 are character fields. Treating as the column names.
>>>>>>>>>> Count of eol after first data row: 9186293
>>>>>>>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293 data rows
>>>>>>>>>> Type codes: 000002000 (first 5 rows)
>>>>>>>>>> Type codes: 000002200 (+middle 5 rows)
>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>>>>>>>
>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>> But here is the wc output (via cygwin; newline, word (whitespace delim
>>>>>>>>>> so each word one 'line' here), byte):
>>>>>>>>>>
>>>>>>>>>> $ wc spd_all_fixed.csv
>>>>>>>>>> 168997637 168997638 9078155125 spd_all_fixed.csv
>>>>>>>>>> [So fread 9M, wc 168M rows].
>>>>>>>>>> Regards
>>>>>>>>>> Paul
>>>>>>>>>>
>>>>>>>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com [1]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the output.
>>>>>>>>>>>
>>>>>>>>>>> Thanks, Matthew
>>>>>>>>>>>
>>>>>>>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Problem with fread on a large file. The file is 8GB, just short of
>>>>>>>>>>>> 200,000 lines, produced as SQL output and modified by cygwin/perl to
>>>>>>>>>>>> remove the second line.
>>>>>>>>>>>>
>>>>>>>>>>>> Using data.table 1.8.8 on R3.0.0 I get an fread error
>>>>>>>>>>>>
>>>>>>>>>>>> fread("data/spd_all_fixed.csv",sep=",")
>>>>>>>>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>>>>>>>> Expected sep (',') but '0' ends field 5 on line 6 when detecting types: 204038,2617097,20110803,0,0
>>>>>>>>>>>> Looking for the offending line, with line numbers in output so I'm
>>>>>>>>>>>> guessing this is line 6 of the mid-file chunk examined,
>>>>>>>>>>>>
>>>>>>>>>>>> $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>>>>>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>>>>>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>>>>>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>>>>>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>>>>>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>>>>>>>>> and comparing to surrounding lines and the first ten lines
>>>>>>>>>>>>
>>>>>>>>>>>> $ head spd_all_fixed.csv
>>>>>>>>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>>>>>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>>>>>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>>>>>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>>>>>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>>>>>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>>>>>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>>>>>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>>>>>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>>>>>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>>>>>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>>>>>>>>> problems on a small test data set run through an identical process
>>>>>>>>>>>> and using the same fread command.
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Paul
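
Editorial sketch of the arithmetic behind this thread: the fix in commit 859 was to call GetFileSizeEx() instead of GetFileSize(). GetFileSize() returns only the low-order 32 bits of a file's size unless the caller also asks for the high-order part. The C below is a portable model of that truncation, not the code from fread.c. The helper names low32_size and estimated_visible_lines are hypothetical, and it assumes (this is not confirmed in the thread) that the old code simply used the low 32 bits as the whole size.

```c
#include <stdint.h>

/* Hypothetical helper: the size a caller would see if only the low
   32 bits of the true file size were kept (the high 32 bits dropped). */
uint64_t low32_size(uint64_t true_size)
{
    return true_size & UINT64_C(0xFFFFFFFF);
}

/* Hypothetical helper: rough number of lines that fit in the visible
   part of the file, assuming every line has the mean line length. */
double estimated_visible_lines(uint64_t true_size, uint64_t total_lines)
{
    if (true_size == 0 || total_lines == 0) {
        return 0.0;
    }
    double mean_line_bytes = (double)true_size / (double)total_lines;
    return (double)low32_size(true_size) / mean_line_bytes;
}
```

With the wc figures quoted above for spd_all_fixed.csv (9078155125 bytes, 168997637 lines), low32_size gives 488220533 bytes, and the estimate comes to roughly 9.09 million lines. That is close to the 9186293 rows fread counted before stopping mid-line, which fits the GetFileSize() diagnosis. The gap is expected, since real line lengths vary.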
Links:
------
[1] mailto:mdowle at mdowle.plus.com
[2] mailto:p.harding at paniscus.com
[3] mailto:mdowle at mdowle.plus.com
[4] mailto:mdowle at mdowle.plus.com
[5] https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable
[6] mailto:mdowle at mdowle.plus.com