[datatable-help] New function fread() in v1.8.7

Matthew Dowle mdowle at mdowle.plus.com
Wed Dec 26 23:21:10 CET 2012


sep is now passed through to fread(), and I have added your example as a test.
Hope it works now.
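
A minimal sketch of the override in use, assuming a data.table build in which sep is honoured. The inline text stands in for the reported file, using the header and rows quoted further down this thread; the object names are illustrative only.

```r
library(data.table)

# Stand-in for the reported file: header plus two of the quoted data rows.
txt <- paste(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  sep = "\n"
)

# Force the comma so the "/" inside the timestamp is not taken as a separator.
dt <- fread(txt, sep = ",")
print(dt)
```

With the comma forced, the timestamp should stay in one column and the five price/volume fields should each get their own.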

Thanks,
Matthew

On 24.12.2012 14:18, Hideyoshi Maeda wrote:
> using autostart=1 gives the following error
>
> Error in fread(file.path, autostart = 1) :
> ' ends field 2 on line 1 when detecting types: Date and
> Time,Open,High,Low,Close,Volume
> 2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>
>
> On 24 Dec 2012, at 13:48, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> Yes, autostart is the line on which it detects the separator; it then
>> searches upwards to find the first row with the same number of columns.
>> If that row is all character then it deems that to be the column name
>> row. So if you set autostart to 1, it's already at the top, and it might
>> catch the right separator because it avoids the data rows for separator
>> detection.
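
The search described above can be sketched in base R. This is a hypothetical illustration, not fread's actual C code: the candidate separator list, the rule of taking the first candidate character met on the line, and the helper names detect_sep, count_fields and find_header are all assumptions made for the sketch.

```r
# Assumed candidate separators; the real list may differ.
candidate_seps <- c(",", "\t", "|", ";", ":", "/")

# Assumption: the separator is the first candidate character on the line.
detect_sep <- function(line, seps = candidate_seps) {
  chars <- strsplit(line, "", fixed = TRUE)[[1]]
  hit <- chars[chars %in% seps]
  if (length(hit) == 0L) NA_character_ else hit[1L]
}

# Number of fields when the line is split on sep.
count_fields <- function(line, sep) {
  length(strsplit(line, sep, fixed = TRUE)[[1]])
}

find_header <- function(lines, autostart) {
  sep <- detect_sep(lines[autostart])
  if (is.na(sep)) stop("no candidate separator found on line ", autostart)
  ncol <- count_fields(lines[autostart], sep)

  # Search upwards while the previous line has the same number of fields.
  first <- autostart
  while (first > 1L && count_fields(lines[first - 1L], sep) == ncol) {
    first <- first - 1L
  }

  # "All character" here means no field parses as a number.
  fields <- strsplit(lines[first], sep, fixed = TRUE)[[1]]
  is_header <- all(is.na(suppressWarnings(as.numeric(fields))))

  list(sep = sep, ncol = ncol, first_row = first, header = is_header)
}

lines <- c(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17"
)

r_data <- find_header(lines, autostart = 3L)  # start on a data row
r_top  <- find_header(lines, autostart = 1L)  # start on the header row
str(r_data)
str(r_top)
```

With these three sample lines, starting on a data row picks "/" and 3 fields, in line with the verbose output quoted further down, while starting on line 1 picks "," and 6 fields and treats that line as the column names.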
>>
>> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>>> Thanks for the quick response.
>>>
>>> I'm not sure I understood you correctly, but isn't the problem the way
>>> that autostart finds separators?
>>>
>>> Also, my example has headers, so I think it would need to start from
>>> row 2, wouldn't it, i.e. the first row that has non-header values?
>>>
>>> Thanks
>>>
>>> On 24 Dec 2012, at 11:44, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> Ah yes, I haven't hooked up the sep override yet, apologies, will fix.
>>>> Maybe setting autostart to the row number of the header row (probably 1)
>>>> might work.
>>>>
>>>> Thanks,
>>>> Matthew
>>>>
>>>>
>>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>>> Oops, I forgot to add the verbose output. Here it is:
>>>>>
>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>> Starting format detection on line 30 (the last non blank line in the first 30)
>>>>> Detected sep as '/' and 3 columns
>>>>> Type codes: 003
>>>>> Found first row with 3 fields occuring on line 1 (either column names or first row of data)
>>>>> The first data row has some non character fields. Treating as a data row and using default column names.
>>>>> Count of eol after pos: 1143699
>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 1143698 data rows
>>>>>  0.153s ( 21%) Memory map (quicker if you rerun)
>>>>>  0.000s (  0%) Format detection
>>>>>  0.095s ( 13%) Count rows (wc -l)
>>>>>  0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>>>  0.480s ( 66%) Reading data
>>>>>  0.000s (  0%) Bumping column type midread and coercing data already read
>>>>>  0.002s (  0%) Changing na.strings to NA
>>>>>  0.731s        Total
>>>>>
>>>>>
>>>>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda <hideyoshi.maeda at gmail.com> wrote:
>>>>>
>>>>>> Hi Matthew,
>>>>>>
>>>>>> I am using the new `data.table` `fread()` function to read my csv
>>>>>> file, which looks as follows when read with the read.csv function:
>>>>>>
>>>>>>          Date.and.Time Open High  Low Close Volume
>>>>>>  1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>>>>>  2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>>>>>  3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>>>>
>>>>>> The first column holds the whole of `2007/01/01 22:53:00`; the next
>>>>>> 5 columns are separated by commas.
>>>>>>
>>>>>> But when reading the same file using fread I get the following
>>>>>> output:
>>>>>>
>>>>>>      V1 V2                                             V3
>>>>>>  1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>>>  2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>>>>>  3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>>>>
>>>>>> This is because the autodetect is using the "/" as a 
>>>>>> separator...
>>>>>>
>>>>>> I tried overriding this using the `sep=","` argument but this 
>>>>>> does not seem to be used in the function anywhere.
>>>>>>
>>>>>> Furthermore, when using verbose I get the following output, which
>>>>>> suggests that I was right in thinking that "/" is used as a
>>>>>> separator rather than ",".
>>>>>>
>>>>>> Is there any way to fix this, so that it correctly reads all 6 
>>>>>> columns separately?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> HLM
>>>>>>
>>>>>> On 21 Dec 2012, at 18:28, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi datatablers,
>>>>>>>
>>>>>>> Feedback and bug reports much appreciated :
>>>>>>>
>>>>>>> =====
>>>>>>> New function fread(), a fast and friendly file reader.
>>>>>>> * header, skip, nrows, sep and colClasses are all auto detected.
>>>>>>> * integers>2^31 are detected and read natively as bit64::integer64.
>>>>>>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>>>>>>> * new implementation entirely in C
>>>>>>> * with a 50MB .csv, 1 million rows x 6 columns :
>>>>>>> read.csv("test.csv")                                   # 30-60 sec
>>>>>>> read.table("test.csv",<all known tricks, known nrows>) #    10 sec
>>>>>>> fread("test.csv")                                      #     3 sec
>>>>>>> * airline data: 658MB csv (7 million rows x 29 columns)
>>>>>>> read.table("2008.csv",<all known tricks, known nrows>) #   360 sec
>>>>>>> fread("2008.csv")                                      #    50 sec
>>>>>>> See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
>>>>>>> discussions and beta testing.
>>>>>>> =====
>>>>>>>
>>>>>>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>>>>>>
>>>>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>>>>> require(data.table)
>>>>>>> ?fread
>>>>>>> fread("your biggest baddest file")
>>>>>>>
>>>>>>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization
>>>>>>> rather than -O3 (Win32 is built with -O3 as expected), so speedups
>>>>>>> might not be as great on Win64 until that is resolved on R-Forge,
>>>>>>> unless you compile yourself. -O3 has some optimizations that fread
>>>>>>> may benefit from. Interested to hear either way.
>>>>>>>
>>>>>>> Season's greetings!
>>>>>>>
>>>>>>> Matthew