[datatable-help] Fwd: fread on very large file

Mon May 13 17:01:15 CEST 2013

I'd love to test it. I pulled the latest commit with svn, but I'm not sure
about building from source on Windows, and I got some compilation errors:

> install.packages("pkg/",type="source",repos=NULL)
Warning in install.packages :
  package ‘pkg/’ is not available (for R version 3.0.0)
* installing *source* package 'data.table' ...
** libs
gcc -m64 -I"C:/Users/PAUL~1.HAR/R/R-30~1.0/include" -DNDEBUG
-I"d:/RCompile/CRANpkg/extralibs64/local/include"     -O2 -Wall  -std=gnu99
-mtune=core2 -c fread.c -o fread.o
fread.c: In function 'readfile':
fread.c:343:9: error: 'hfile' undeclared (first use in this function)
fread.c:343:9: note: each undeclared identifier is reported only once for
each function it appears in
fread.c:346:115: error: expected ';' before ')' token
fread.c:346:115: error: expected statement before ')' token
fread.c:350:17: warning: implicit declaration of function 'nanosleep'
[-Wimplicit-function-declaration]
make: *** [fread.o] Error 1
ERROR: compilation failed for package 'data.table'

Regards
Paul
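
For readers following the thread, the GetFileSize() vs GetFileSizeEx()
diagnosis quoted below can be checked with a little arithmetic. The sketch
below is not the code in fread.c; the function names low32_of_size and
file_size_64 are hypothetical, and it assumes the old call kept only the low
32 bits of the size (which is what GetFileSize() returns when no high-order
output pointer is supplied). Under that assumption the 9078155125-byte file
from the wc output would appear to be 488220533 bytes long, which is in line
with fread stopping at roughly 9M rows instead of 168M.

```c
#define _FILE_OFFSET_BITS 64   /* ask for a 64-bit off_t on 32-bit POSIX builds */
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <sys/stat.h>
#endif

/* Size a 32-bit-only query would report: the true size modulo 2^32.
   Hypothetical helper, for illustration only. */
uint32_t low32_of_size(uint64_t true_size)
{
    return (uint32_t)(true_size & 0xFFFFFFFFu);
}

/* Query the full 64-bit size of a file.
   Returns 0 and stores the size in *out on success, -1 on failure.
   Hypothetical helper: a sketch of the technique, not fread's implementation. */
int file_size_64(const char *path, uint64_t *out)
{
#ifdef _WIN32
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return -1;
    LARGE_INTEGER li;                  /* 64-bit result lands in li.QuadPart */
    BOOL ok = GetFileSizeEx(h, &li);
    CloseHandle(h);
    if (!ok) return -1;
    *out = (uint64_t)li.QuadPart;
    return 0;
#else
    struct stat st;                    /* POSIX fallback so the sketch builds anywhere */
    if (stat(path, &st) != 0) return -1;
    *out = (uint64_t)st.st_size;
    return 0;
#endif
}
```

Calling low32_of_size(9078155125ULL) gives 488220533, since
9078155125 - 2 * 4294967296 = 488220533. The same wrap-around would make a
file just over 4GB look tiny, which fits the 4.1G truncated file being read
as only 18913 rows.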

On 11 May 2013 02:39, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

>
>
>
> Paul, Vishal,
>
> Commit 859 :
>
> * fread now supports files larger than 4GB on 64bit Windows (#2767 thanks to Paul Harding) and files
>   between 2GB and 4GB on 32bit Windows (#2655 thanks to Vishal). A C call to GetFileSize() needed to
>   be GetFileSizeEx().
>
>
>
> Please test and confirm ok now.
>
>
>
> Thanks, Matthew
>
>
>
> On 03.05.2013 14:59, Matthew Dowle wrote:
>
>
>
> Oh. Then it's likely a bug with fread on Windows for files > 4GB.  Think
> GetFileSize() should be GetFileSizeEx(), iirc.
>
> Please could you file it as a bug on the tracker.  Thanks.
>
> Matthew
>
>
>
> On 03.05.2013 14:32, Paul Harding wrote:
>
> Definitely a 64-bit machine. Here are the details:
>
> Processor: Intel Xeon CPU E7-4830 @2.13GHz (4 processors)
> Installed memory (RAM): 128GB
> System type: 64-bit Operating System
> Windows edition: Server 2008 R2 Enterprise SP1
>  Regards,
> Paul
>
>
> On 3 May 2013 10:51, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>>
>> Hi Paul,
>>
>> Thanks for all this!
>>
>> >  The problem arises when the file reaches 4GB, in this case between
>> 8,030,000 and 8,040,000 rows:
>>
>> Ahah.  Are you using a 32bit or 64bit Windows machine?
>>
>> Thanks, Matthew
>>
>>
>>
>> On 02.05.2013 10:19, Paul Harding wrote:
>>
>> Some supplementary information, here is the portion of the file (with row
>> numbers, +1 for header) around where fread thinks the file ends.
>>   $ nl spd_all_fixed.csv | head -n 9186300 |tail
>> 9186291 204029,2617097,20110803,0,0,0.3014501,0,0,0
>> 9186292 204030,2617097,20110803,0,0,0.52049100000000004,0,0,0
>> 9186293 204034,2617097,20110803,0,0,0.86560269999999995,0.86560269999999995,2,13
>> 9186294 204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>> 9186295 204039,2617097,20110803,0,0,0.24952240000000001,0,0,0
>> 9186296 204041,2617097,20110803,1,0,1.0032293000000001,0,0,0
>> 9186297 204042,2617097,20110803,0,0,0.1375876,0,0,0
>> 9186298 204043,2617097,20110803,0,0,0.53391279999999997,0,0,0
>> 9186299 204044,2617097,20110803,0,0,0.16047169999999999,0,0,0
>> 9186300 204045,2617097,20110803,1,0,0.78766970000000003,0,0,0
>>  9186294 (row 9186293 excl header) is where fread thinks the file ends,
>> mid-line by the look of it!
>> I've experimented by truncating the file. The error varies: either it
>> reads too few records or it gives the error I reported, presumably
>> depending on whether the last perceived line is complete.
>> The problem arises when the file reaches 4GB, in this case between
>> 8,030,000 and 8,040,000 rows:
>>  -rw-r--r--+ 1 Paul.Harding Domain Users 4.0G May  1 12:02
>> spd_all_trunc_8030k.csv
>> -rw-r--r--+ 1 Paul.Harding Domain Users 4.1G May  1 12:06
>> spd_all_trunc_8040k.csv
>>  > dt<-fread("data/spd_all_trunc_8030k.csv", sep=",",verbose=T)
>>  Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>> first 30) ... found
>> Found 9 columns
>> First row with 9 fields occurs on line 1 (either column names or first
>> row of data)
>> All the fields on line 1 are character fields. Treating as the column
>> names.
>>  Count of eol after first data row: 80300000
>> Subtracted 1 for last eol and any trailing empty lines, leaving 80299999
>> data rows
>>  Type codes: 000002000 (first 5 rows)
>>  Type codes: 000002000 (+middle 5 rows)
>> Type codes: 000002000 (+last 5 rows)
>> 0%Bumping column 7 from INT to INT64 on data row 9, field contains
>> '0.42634430000000001'
>> Bumping column 7 from INT64 to REAL on data row 9, field contains
>> '0.42634430000000001'
>>    0.000s (  0%) Memory map (rerun may be quicker)
>>    0.000s (  0%) Sep and header detection
>>    0.000s (  0%) Count rows (wc -l)
>>    0.000s (  0%) Colmn type detection (first, middle and last 5 rows)
>>    0.000s (  0%) Allocation of 80299999x9 result (xMB) in RAM
>>  171.188s ( 65%) Reading data
>> 1365231.809s (518439%) Allocation for type bumps (if any), including gc
>> time if triggered
>> -1365231.809s (-518439%) Coercing data already read in type bumps (if any)
>>    0.000s (  0%) Changing na.strings to NA
>>    0.000s        Total
>> > dt<-fread("data/spd_all_trunc_8040k.csv", sep=",",verbose=T)
>>  Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>> first 30) ... found
>> Found 9 columns
>> First row with 9 fields occurs on line 1 (either column names or first
>> row of data)
>> All the fields on line 1 are character fields. Treating as the column
>> names.
>>  Count of eol after first data row: 18913
>> Subtracted 0 for last eol and any trailing empty lines, leaving 18913
>> data rows
>>  Type codes: 000002000 (first 5 rows)
>>  Type codes: 000002000 (+middle 5 rows)
>> Error in fread("data/spd_all_trunc_8040k.csv", sep = ",", verbose = T) :
>>   Expected sep (',') but ',' ends field 2 on line 6 when detecting types:
>> 204650,724540,
>>  Regards,
>> Paul
>>
>>
>> On 1 May 2013 10:28, Paul Harding <p.harding at paniscus.com> wrote:
>>
>>> Here is the verbose output:
>>> > dt<-fread("data/spd_all_fixed.csv", sep=",",verbose=T)
>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>> Looking for supplied sep ',' on line 30 (the last non blank line in the
>>> first 30) ... found
>>> Found 9 columns
>>> First row with 9 fields occurs on line 1 (either column names or first
>>> row of data)
>>> All the fields on line 1 are character fields. Treating as the column
>>> names.
>>> Count of eol after first data row: 9186293
>>> Subtracted 0 for last eol and any trailing empty lines, leaving 9186293
>>> data rows
>>> Type codes: 000002000 (first 5 rows)
>>> Type codes: 000002200 (+middle 5 rows)
>>> Error in fread("data/spd_all_fixed.csv", sep = ",", verbose = T) :
>>>    Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>> types: 204038,2617097,20110803,0,0
>>>  But here is the wc output (via cygwin; the columns are newline, word and
>>> byte counts, and since words are whitespace-delimited each line counts as
>>> one word here):
>>>  $ wc spd_all_fixed.csv
>>>  168997637  168997638 9078155125 spd_all_fixed.csv
>>> [So fread counts about 9M rows, wc about 169M.]
>>> Regards
>>> Paul
>>>
>>>
>>> On 30 April 2013 18:52, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Thanks for reporting this. Please set verbose=TRUE and let us know the
>>>> output.
>>>>
>>>> Thanks, Matthew
>>>>
>>>>
>>>>
>>>> On 30.04.2013 18:01, Paul Harding wrote:
>>>>
>>>>  Problem with fread on a large file
>>>> The file is 8GB, just short of 200,000 lines, produced as SQL output and
>>>> modified by cygwin/perl to remove the second line.
>>>>  Using data.table 1.8.8 on R 3.0.0 I get an fread error:
>>>>  fread("data/spd_all_fixed.csv",sep=",")
>>>> Error in fread("data/spd_all_fixed.csv", sep = ",") :
>>>>   Expected sep (',') but '0' ends field 5 on line 6 when detecting
>>>> types: 204038,2617097,20110803,0,0
>>>> Looking for the offending line, with line numbers in the output (I'm
>>>> guessing this is line 6 of the mid-file chunk examined):
>>>>  $ grep -n '204038,2617097,201108' spd_all_fixed.csv
>>>> 8316105:204038,2617097,20110801,0,0,0.64220529999999998,0,0,0
>>>> 8751106:204038,2617097,20110802,1,0,0.65744469999999999,0,0,0
>>>> 9186294:204038,2617097,20110803,0,0,0.49455500000000002,0,0,0
>>>> 9621619:204038,2617097,20110804,0,0,0.3461342,0,0,0
>>>> 10057189:204038,2617097,20110805,0,0,0.34128710000000001,0,0,0
>>>> and comparing to surrounding lines and the first ten lines
>>>>  $ head  spd_all_fixed.csv
>>>> s_key,i_key,p_key,q,pq,d,l,epi,class
>>>> 203974,1107181,20110713,0,0,0.13700080000000001,0,0,0
>>>> 203975,1107181,20110713,0,0,5.8352899999999999E-2,0,0,0
>>>> 203976,1107181,20110713,0,0,7.1298999999999998E-3,0,0,0
>>>> 203978,1107181,20110713,0,0,0.78346819999999995,0,0,0
>>>> 203979,1107181,20110713,0,0,0.61627779999999999,0,0,0
>>>> 203981,1107181,20110713,1,0,0.38610509999999998,0,0,0
>>>> 203982,1107181,20110713,0,0,4.0657899999999997E-2,0,0,0
>>>> 203983,1107181,20110713,2,0,0.71278109999999995,0,0,0
>>>> 203984,1107181,20110713,0,0,0.42634430000000001,0.42634430000000001,2,13
>>>> I can't see any difference. I wonder if this is a bug? I have no
>>>> problems on a small test data set run through an identical process and
>>>> using the same fread command.
>>>> Regards
>>>> Paul
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
>
>
>
>
>