[datatable-help] Fwd: fread on very large file

Tue May 14 14:28:53 CEST 2013

Hi Matthew, some frustration until I worked out I needed to rename the zip
file to data.table.zip to install! I have regression tested on a 4GB file,
and tested on a 19GB whopper. Obviously it is a tad slow, but read.csv
would never get there! Delighted, I can't do what I need to do on these big
datasets without data.table. All seems fine, correct record count etc.  I'm
not checking every line of data ;-)

> gash.dt<-fread("data/data_extract_1_fixed_trunc_fixed.csv")
> big.dt<-fread("data/data_extract_1_fixed.csv",verbose=T)
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Using line 30 to detect sep (the last non blank line in the first
'autostart') ... sep=','
Found 16 columns
First row with 16 fields occurs on line 1 (either column names or first row
of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 214038352
Subtracted 1 for last eol and any trailing empty lines, leaving 214038351
data rows
Type codes: 0003330030000000 (first 5 rows)
Type codes: 0003330030000000 (+middle 5 rows)
Type codes: 0003330030000000 (+last 5 rows)
   0.050s (  0%) Memory map (rerun may be quicker)
   0.020s (  0%) sep and header detection
 159.560s ( 35%) Count rows (wc -l)
   0.001s (  0%) Column type detection (first, middle and last 5 rows)
  46.267s ( 10%) Allocation of 214038351x16 result (xMB) in RAM
 244.760s ( 54%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if
triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   5.258s (  1%) Changing na.strings to NA
 455.916s        Total

$ wc data_extract_1_fixed.csv
  214038352   414098500 19745071003 data_extract_1_fixed.csv

> tables()
     NAME            NROW    MB COLS                                                                             KEY
[1,] big.dt   214,038,351 16330 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw
[2,] gash.dt   46,535,426  3551 STORE_KEY,ITEM_KEY,period_key,ULC_CATEGORY,format,state,pqty,tqty,weekday,dayofw
[3,] range.dt           1     1 startdt,enddt
[4,] spd.dt    46,535,426  4083 caldate,store_key,item_key,period_key,ulc_category,format,state,pqty,tqty,weekda store_key,item_key,caldate
[5,] test.dt            5     1 digits,letters                                                                   digits
Total: 23,966MB

On 13 May 2013 22:26, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
> Passing on winbuilder now.
>
> .zip (rev 874) uploaded to homepage (will take an hour or two to refresh),
>  but available now from here :
>
>
> https://r-forge.r-project.org/scm/viewvc.php/*checkout*/www/data.table_1.8.9_rev874.zip?revision=875&root=datatable
>
> Matthew
>
>
>
> On 13.05.2013 21:38, Matthew Dowle wrote:
>
>
>
> Hi Paul,
>
> Sorry for that hassle.  As you've realised I don't develop data.table on
> Windows.  Those lines are switched in at compile time for Windows,  and so
> I rely on (the truly impressive) winbuilder to compile and test for me.  On
> this occasion,  I did submit to winbuilder last night but it didn't reply
> (even with a compile error) which is extremely unusual.  And R-Forge is
> stuck in 'building' state too (which is not unusual, sadly).
>
> I'll let you know when it's passing on winbuilder,  and I'll update the
> Windows .zip on the homepage (since we can't rely on R-Forge) ...
>
> Matthew
>
>
>
> On 13.05.2013 16:01, Paul Harding wrote:
>
> I'd love to test it, pulled the latest commit with svn, not sure about
> building from source on windows, got some compilation errors:
> > install.packages("pkg/",type="source",repos=NULL)
> Warning in install.packages :
>   package ‘pkg/’ is not available (for R version 3.0.0)
> * installing *source* package 'data.table' ...
> ** libs
> gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG
> -I"d:/RCompile/CRANpkg/extralibs64/local/include"     -O2 -Wall  -std=gnu99
> -mtune=core2 -c fread.c -o fread.o
> fread.c: In function 'readfile':
> fread.c:343:9: error: 'hfile' undeclared (first use in this function)
> fread.c:343:9: note: each undeclared identifier is reported only once for
> each function it appears in
> fread.c:346:115: error: expected ';' before ')' token
> fread.c:346:115: error: expected statement before ')' token
> fread.c:350:17: warning: implicit declaration of function 'nanosleep'
> [-Wimplicit-function-declaration]
> make: *** [fread.o] Error 1
> ERROR: compilation failed for package 'data.table'
>  Regards
> Paul
>
>
> On 11 May 2013 02:39, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>> Paul, Vishal,
>>
>> Commit 859 :
>>
>> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files
>>   between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to
>>   be GetFileSizeEx().
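
A note on why the switch to GetFileSizeEx() matters: the older GetFileSize() call reports the size as two 32-bit halves, returning the low half and handing back the high half only through an optional pointer. The portable sketch below mimics that split without calling the Win32 API. The names `split_size`, `split_file_size`, `size_ignoring_high` and `size_full` are hypothetical, for illustration only, and whether fread dropped the high half in exactly this way is an assumption rather than something confirmed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: a file size carried as two 32-bit halves,
   similar in spirit to what the older Win32 GetFileSize() reports. */
typedef struct {
    uint32_t low;   /* low-order 32 bits of the size  */
    uint32_t high;  /* high-order 32 bits of the size */
} split_size;

/* What a caller sees if it only looks at the low half. */
uint64_t size_ignoring_high(split_size s) {
    return (uint64_t)s.low;
}

/* What a caller sees if it recombines both halves into 64 bits. */
uint64_t size_full(split_size s) {
    return ((uint64_t)s.high << 32) | (uint64_t)s.low;
}

/* Split a true 64-bit size into its two halves. */
split_size split_file_size(uint64_t bytes) {
    split_size s;
    s.low = (uint32_t)(bytes & 0xFFFFFFFFu);
    s.high = (uint32_t)(bytes >> 32);
    assert(size_full(s) == bytes);  /* round trip must be lossless */
    return s;
}
```

For the 9,078,155,125-byte file reported further down the thread, the high half is 2 and the low half is 488,220,533, so a caller that ignores the high half would see a file of under 500MB.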
>>
>>
>>
>> Please test and confirm ok now.
>>
>>
>>
>> Thanks, Matthew
>>
>>
>>
>> On 03.05.2013 14:59, Matthew Dowle wrote:
>>
>>
>>
>> Oh. Then it's likely a bug with fread on Windows for files > 4GB.  Think
>> GetFileSize() should be GetFileSizeEx(), iirc.
>>
>> Please could you file it as a bug on the tracker.  Thanks.
>>
>> Matthew
>>
>>
>>
>> On 03.05.2013 14:32, Paul Harding wrote:
>>
>> Definitely a 64-bit machine. Here are the details:
>>
>> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
>> Installed memory (RAM): 128GB
>> System type: 64-bit Operating System
>> Windows edition: Server 2008 R2 Enterprise SP1
>>  Regards,
>> Paul
>>
>>
>> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>
>>>
>>>
>>> Hi Paul,
>>>
>>> Thanks for all this!
>>>
>>> >  The problem arises when the file reaches 4GB, in this case between
>>> 80,300,000 and 80,400,000 rows:
>>>
>>> Ahah.  Are you using a 32bit or 64bit Windows machine?
>>>
>>> Thanks, Matthew
>>>
>>>
>>>
>>> On 02.05.2013 10:19, Paul Harding wrote:
>>>
>>> Some supplementary information, here is the portion of the file (with
>>> row numbers, +1 for header) around where fread thinks the file ends.
>>>   $ nl spd_all_fixed.csv | head -n 9186300 |tail
>>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>>  9186294 (row 9186293 excl header) is where fread thinks the file ends,
>>> mid-line by the look of it!
>>> I've experimented by truncating the file. The error varies, either it
>>> reads too few records or gives the error I reported, presumably determined
>>> by whether the last perceived line is entire.
>>> The problem arises when the file reaches 4GB, in this case between
>>> 80,300,000 and 80,400,000 rows:
>>>  -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May  1 12:02
>>> spd_all_trunc_8030k.csv
>>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May  1 12:06
>>> spd_all_trunc_8040k.csv
>>>  > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>>  Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first
>>> row of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>>  Count of eol after first data row: 80300000
>>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
>>> data rows
>>>  Type codes: 000002000 (first 5 rows)
>>>  Type codes: 000002000 (+middle 5 rows)
>>> Type codes: 000002000 (+last 5 rows)
>>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
>>> '0.42634430000000001'
>>> Bumping column 7 from INT64 to REAL on data row 9, field contains
>>> '0.42634430000000001'
>>>    0.000s (  0%) Memory map (rerun may be quicker)
>>>    0.000s (  0%) Sep and header detection
>>>    0.000s (  0%) Count rows (wc -l)
>>>    0.000s (  0%) Colmn type detection (first, middle and last 5 rows)
>>>    0.000s (  0%) Allocation of 80299999x9 result (xMB) in RAM
>>>  171.188s ( 65%) Reading data
>>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
>>> time if triggered
>>> -1365231.809s (-518439%) Coercing data already read in type bumps (if
>>> any)
>>>    0.000s (  0%) Changing na.strings to NA
>>>    0.000s        Total
>>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>>  Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first
>>> row of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>>  Count of eol after first data row: 18913
>>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913
>>> data rows
>>>  Type codes: 000002000 (first 5 rows)
>>>  Type codes: 000002000 (+middle 5 rows)
>>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>>   Expected sep (',') but ',' ends field 2 on line 6 when detecting
>>> types: 204650,724540,
>>>  Regards,
>>> Paul
>>>
>>>
>>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com> wrote:
>>>
>>>> Here is the verbose output:
>>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>>> first 30) ... found
>>>> Found 9 columns
>>>> First row with 9 fields occurs on line 1 (either column names or first
>>>> row of data)
>>>> All the fields on line 1 are character fields. Treating as the column
>>>> names.
>>>> Count of eol after first data row: 9186293
>>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>>>> data rows
>>>> Type codes: 000002000 (first 5 rows)
>>>> Type codes: 000002200 (+middle 5 rows)
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>>    Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>>> types: 204038,2617097,20110803,0,0
>>>>  But here is the wc output (via cygwin; newline, word and byte counts,
>>>> where words are whitespace-delimited, so each line counts as one word):
>>>>  $ wc spd_all_fixed.csv
>>>>  168997637  168997638 9078155125 spd_all_fixed.csv
>>>> [So fread  9M, wc 168M rows].
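
The gap between the two row counts is consistent with the file size wrapping at 32 bits. The sketch below is a rough check, not a reproduction of fread's logic: it assumes only the low 32 bits of the byte count were visible and that lines are of roughly uniform length. The function names `wrapped_bytes` and `estimated_rows` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes visible if only the low 32 bits of the true size are used
   (an assumption about the failure mode, not taken from fread.c). */
uint64_t wrapped_bytes(uint64_t true_bytes) {
    return true_bytes & 0xFFFFFFFFull;
}

/* Rough row estimate: visible bytes divided by the mean line length
   of the whole file (true_bytes / true_lines). */
uint64_t estimated_rows(uint64_t true_bytes, uint64_t true_lines) {
    assert(true_lines > 0);
    double mean_line_bytes = (double)true_bytes / (double)true_lines;
    return (uint64_t)((double)wrapped_bytes(true_bytes) / mean_line_bytes);
}
```

With the wc figures above (9,078,155,125 bytes and 168,997,637 newlines), `wrapped_bytes` gives 488,220,533 and `estimated_rows` gives about 9.09 million, close to the 9,186,293 eol that fread counted. The small difference is expected, since line lengths are not exactly uniform through the file.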
>>>> Regards
>>>> Paul
>>>>
>>>>
>>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>>>> output.
>>>>>
>>>>> Thanks, Matthew
>>>>>
>>>>>
>>>>>
>>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>>
>>>>>  Problem with fread on a large file
>>>>> The file is 8GB, just short of 200 million lines, produced as SQL output
>>>>> and modified by cygwin/perl to remove the second line.
>>>>>  Using data.table 1.8.8 on R3.0.0 I get an fread error
>>>>>  fread("data/spd_all_fixed.csv",sep=",")
>>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>>   Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>>>> types: 204038,2617097,20110803,0,0
>>>>> Looking for the offending line, with line numbers in the output (so I'm
>>>>> guessing this is line 6 of the mid-file chunk examined),
>>>>>  $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>>> and comparing to surrounding lines and the first ten lines
>>>>>  $ head  spd_all_fixed.csv
>>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>>> problems on a small test data set run through an identical process and
>>>>> using the same fread command.
>>>>> Regards
>>>>> Paul