[datatable-help] New function fread() in v1.8.7

Matthew Dowle mdowle at mdowle.plus.com
Fri Dec 28 23:22:05 CET 2012


Btw we like backticks in data.table :

     DT[,`Date and Time`]
     setkey(DT,`Date and Time`)   # [*]

although you'd probably  setnames(DT,"Date and Time","datetime")  for a 
core column like that.

[*] which I've just noticed doesn't work, will file new bug report.
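
A minimal sketch of the two approaches above, assuming a small made-up table (the values are illustrative only). It keys the table with setkeyv() and a character column name rather than the backticked setkey() call, since that is the form noted as not working.

```r
library(data.table)

# Hypothetical toy table with a column name containing spaces
DT <- data.table(`Date and Time` = c("2007/01/01 22:52:00", "2007/01/01 22:51:00"),
                 Open = c(5675, 5683))

# Backticks let j refer to the non-syntactic column name
x <- DT[, `Date and Time`]

# Key by the column, passing its name as a string
setkeyv(DT, "Date and Time")

# Or rename it once to something easier to type
setnames(DT, "Date and Time", "datetime")
```

Renaming once up front means later code can refer to the column as plain `datetime`, with no backticks.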


On 28.12.2012 22:06, Matthew Dowle wrote:
> Great. Thanks for confirming.
>
> The file itself has "Date and Time" as the column name, doesn't it,
> i.e. with spaces not dots? fread retains exactly what's in the file,
> whereas read.csv runs the column names through base::make.names(),
> which converts the spaces to dots to make the column names
> syntactically valid, iiuc. data.table's general policy is to allow
> spaces and other unusual characters in column names and retain them
> throughout (forgiving the odd bug, now fixed, caused by some make.names
> calls which should have been make.unique).
>
> To do the same as read.csv :
>
>     DT = fread(...)
>     setnames(DT,make.names(names(DT)))
>
> Not sure I understood correctly and I didn't test.
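
To illustrate the difference between the two base functions mentioned above, here is a small sketch using only base R. The header names are the ones from the file in this thread; the duplicated names in the second part are made up.

```r
# Column names as they appear in the file's header line
nms <- c("Date and Time", "Open", "High", "Low", "Close", "Volume")

# make.names() replaces characters that are not valid in an R name with "."
fixed <- make.names(nms)
print(fixed[1])  # "Date.and.Time"

# make.unique() only de-duplicates; spaces are left alone
dups <- make.unique(c("Date and Time", "Date and Time"))
print(dups)      # "Date and Time"   "Date and Time.1"
```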
>
>
> On 28.12.2012 21:36, Hideyoshi Maeda wrote:
>> The sep argument now works, thank you!
>>
>> Just out of curiosity (not a major problem): when I use
>> fread(file.path,sep=",") on my csv file, the column names no longer
>> include the "." shown in my original email; the output automatically
>> removes the "." from the column names. Is there a way to stop it from
>> doing that? I.e. the first column becomes "Date and Time" when using
>> fread, rather than the original "Date.and.Time" when using read.csv.
>>
>>
>> On 26 Dec 2012, at 22:21, Matthew Dowle <mdowle at mdowle.plus.com> 
>> wrote:
>>
>>>
>>> sep is now passed through and have added your example as a test.
>>> Hope ok now.
>>>
>>> Thanks,
>>> Matthew
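
As a hedged illustration of the sep override, the first rows quoted in this thread can be passed as inline text. This assumes a data.table version in which sep is honoured. What the automatic detection would pick without it may differ between versions.

```r
library(data.table)

# Header and two data rows from the thread, passed directly as text
txt <- paste(
  "Date and Time,Open,High,Low,Close,Volume",
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  sep = "\n"
)

# Force the comma so the "/" inside the timestamps is not taken as the separator
DT <- fread(txt, sep = ",")
print(names(DT))
print(dim(DT))
```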
>>>
>>> On 24.12.2012 14:18, Hideyoshi Maeda wrote:
>>>> using autostart=1 gives the following error
>>>>
>>>> Error in fread(file.path, autostart = 1) :
>>>> ' ends field 2 on line 1 when detecting types: Date and
>>>> Time,Open,High,Low,Close,Volume
>>>> 2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>
>>>>
>>>> On 24 Dec 2012, at 13:48, Matthew Dowle <mdowle at mdowle.plus.com> 
>>>> wrote:
>>>>
>>>>>
>>>>> Yes, autostart is the line on which it detects separators; it then
>>>>> searches upwards to find the first row with the same number of
>>>>> columns. If that row is all character then it deems that the column
>>>>> name row. So if you set autostart to 1, it's already at the top and
>>>>> it might catch the right separator by avoiding the data rows for
>>>>> separator detection.
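
One reading of that description, sketched in plain R rather than fread's C code. find_header_row() and its arguments are hypothetical names, the separator is taken as given instead of detected, and a field counts as "character" here if it does not parse as a number. The real detection may differ on all of these points.

```r
# Hypothetical helper: not part of data.table
find_header_row <- function(lines, start, sep) {
  nfields <- function(x) length(strsplit(x, sep, fixed = TRUE)[[1]])
  target <- nfields(lines[start])

  # Walk upwards while rows keep the same number of fields
  top <- start
  while (top > 1 && nfields(lines[top - 1]) == target) top <- top - 1

  # Treat the top row as column names only if no field looks numeric
  fields <- strsplit(lines[top], sep, fixed = TRUE)[[1]]
  is_char <- all(is.na(suppressWarnings(as.numeric(fields))))
  list(row = top, header = is_char)
}

lines <- c("Date and Time,Open,High,Low,Close,Volume",
           "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
           "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17")

# Starting from the last line, the search climbs to line 1 and calls it a header
res <- find_header_row(lines, start = 3, sep = ",")
print(res$row)     # 1
print(res$header)  # TRUE
```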
>>>>>
>>>>> On 24.12.2012 11:52, Hideyoshi Maeda wrote:
>>>>>> Thanks for the quick response.
>>>>>>
>>>>>> I wasn't sure if I understood you correctly, but isn't the problem
>>>>>> the way that autostart finds separators?
>>>>>>
>>>>>> And in my example it had headers, so I think it would need to start
>>>>>> from row 2, wouldn't it, i.e. the first row that has non-header values?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On 24 Dec 2012, at 11:44, Matthew Dowle <mdowle at mdowle.plus.com> 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Ah yes, haven't hooked up the sep override yet, apologies, will fix.
>>>>>>> Maybe setting autostart to the row number of the header row
>>>>>>> (probably 1) might work.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Matthew
>>>>>>>
>>>>>>>
>>>>>>> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>>>>>>>> oups…forgot to add the output from the verbose part…here it is...
>>>>>>>>
>>>>>>>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>>>>>>>> Starting format detection on line 30 (the last non blank line in the first 30)
>>>>>>>> Detected sep as '/' and 3 columns
>>>>>>>> Type codes: 003
>>>>>>>> Found first row with 3 fields occuring on line 1 (either column names or first row of data)
>>>>>>>> The first data row has some non character fields. Treating as a data row and using default column names.
>>>>>>>> Count of eol after pos: 1143699
>>>>>>>> Subtracted 1 for last eol and any trailing empty lines, leaving 1143698 data rows
>>>>>>>> 0.153s ( 21%) Memory map (quicker if you rerun)
>>>>>>>> 0.000s (  0%) Format detection
>>>>>>>> 0.095s ( 13%) Count rows (wc -l)
>>>>>>>> 0.001s (  0%) Allocation of 1143698x3 result (xMB) in RAM
>>>>>>>> 0.480s ( 66%) Reading data
>>>>>>>> 0.000s (  0%) Bumping column type midread and coercing data already read
>>>>>>>> 0.002s (  0%) Changing na.strings to NA
>>>>>>>> 0.731s        Total
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda 
>>>>>>>> <hideyoshi.maeda at gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Matthew,
>>>>>>>>>
>>>>>>>>> I am using the new `data.table` `fread()` function to read my
>>>>>>>>> csv file, which looks like this when read with the read.csv
>>>>>>>>> function:
>>>>>>>>>
>>>>>>>>>         Date.and.Time Open High  Low Close Volume
>>>>>>>>> 1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>>>>>>>> 2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>>>>>>>> 3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>>>>>>>
>>>>>>>>> The value of the first column is all of `2007/01/01 22:53:00`;
>>>>>>>>> the next 5 columns are separated by commas.
>>>>>>>>>
>>>>>>>>> But when reading the same file using fread I get the
>>>>>>>>> following output:
>>>>>>>>>
>>>>>>>>>     V1 V2                                             V3
>>>>>>>>> 1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>>>>>>>> 2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>>>>>>>> 3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>>>>>>>
>>>>>>>>> This is because the autodetect is using the "/" as a 
>>>>>>>>> separator...
>>>>>>>>>
>>>>>>>>> I tried overriding this using the `sep=","` argument but this 
>>>>>>>>> does not seem to be used in the function anywhere.
>>>>>>>>>
>>>>>>>>> Furthermore, when using verbose I get the following output,
>>>>>>>>> which suggests that I was right in thinking that "/" is used as
>>>>>>>>> a separator rather than ",".
>>>>>>>>>
>>>>>>>>> Is there any way to fix this, so that it correctly reads all 
>>>>>>>>> 6 columns separately?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> HLM
>>>>>>>>>
>>>>>>>>> On 21 Dec 2012, at 18:28, Matthew Dowle 
>>>>>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi datatablers,
>>>>>>>>>>
>>>>>>>>>> Feedback and bug reports much appreciated :
>>>>>>>>>>
>>>>>>>>>> =====
>>>>>>>>>> New function fread(), a fast and friendly file reader.
>>>>>>>>>> * header, skip, nrows, sep and colClasses are all auto detected.
>>>>>>>>>> * integers>2^31 are detected and read natively as bit64::integer64.
>>>>>>>>>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>>>>>>>>>> * new implementation entirely in C
>>>>>>>>>> * with a 50MB .csv, 1 million rows x 6 columns :
>>>>>>>>>>     read.csv("test.csv")                                   # 30-60 sec
>>>>>>>>>>     read.table("test.csv",<all known tricks, known nrows>) #    10 sec
>>>>>>>>>>     fread("test.csv")                                      #     3 sec
>>>>>>>>>> * airline data: 658MB csv (7 million rows x 29 columns)
>>>>>>>>>>     read.table("2008.csv",<all known tricks, known nrows>) #   360 sec
>>>>>>>>>>     fread("2008.csv")                                      #    50 sec
>>>>>>>>>> See ?fread. Many thanks to Chris Neff and Garrett See for 
>>>>>>>>>> ideas,
>>>>>>>>>> discussions and beta testing.
>>>>>>>>>> =====
>>>>>>>>>>
>>>>>>>>>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>>>>>>>>>
>>>>>>>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>>>>>>>> require(data.table)
>>>>>>>>>> ?fread
>>>>>>>>>> fread("your biggest baddest file")
>>>>>>>>>>
>>>>>>>>>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization
>>>>>>>>>> rather than -O3 (but -O3 on Win32 ok), so speedups might not be as
>>>>>>>>>> great on Win64 until that can be resolved on R-Forge, unless you
>>>>>>>>>> compile yourself. -O3 has some optimizations that fread may benefit
>>>>>>>>>> from. But interested to hear.
>>>>>>>>>>
>>>>>>>>>> Season's greetings!
>>>>>>>>>>
>>>>>>>>>> Matthew
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> datatable-help mailing list
>>>>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>>>>> 
>>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>>>>>
>>>>>>>
>>>>>


