[datatable-help] New function fread() in v1.8.7
Hideyoshi Maeda
hideyoshi.maeda at gmail.com
Mon Dec 24 12:08:33 CET 2012
Oops, forgot to add the output from the verbose part. Here it is:
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Starting format detection on line 30 (the last non blank line in the first 30)
Detected sep as '/' and 3 columns
Type codes: 003
Found first row with 3 fields occuring on line 1 (either column names or first row of data)
The first data row has some non character fields. Treating as a data row and using default column names.
Count of eol after pos: 1143699
Subtracted 1 for last eol and any trailing empty lines, leaving 1143698 data rows
0.153s ( 21%) Memory map (quicker if you rerun)
0.000s ( 0%) Format detection
0.095s ( 13%) Count rows (wc -l)
0.001s ( 0%) Allocation of 1143698x3 result (xMB) in RAM
0.480s ( 66%) Reading data
0.000s ( 0%) Bumping column type midread and coercing data already read
0.002s ( 0%) Changing na.strings to NA
0.731s Total
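
For anyone trying to reproduce this, here is a minimal sketch of the call I would expect to read all 6 columns. It assumes a data.table release whose fread() honors an explicit `sep` and supports the `text=` and `col.names=` arguments; none of that is confirmed for v1.8.7. The three rows are the ones quoted in this thread, passed inline in place of the real file, and the column names are the ones read.csv reports.

```r
library(data.table)

# Three sample rows in the same shape as the file discussed in this thread.
# Inline text stands in for the real file, whose path is not given.
sample_lines <- c(
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  "2007/01/01 22:53:00,5674.00,5674.00,5673.00,5674.00,42"
)

# Name the separator explicitly so that "/" inside the timestamp is not
# picked up by auto detection, and supply column names because the file
# has no header row.
dt <- fread(
  text      = sample_lines,
  sep       = ",",
  header    = FALSE,
  col.names = c("Date.and.Time", "Open", "High", "Low", "Close", "Volume")
)

print(dt)
```

With the real file, the path would go in as the first argument instead of `text=`, and adding `verbose = TRUE` should show which separator was actually used.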
On 24 Dec 2012, at 11:04, Hideyoshi Maeda <hideyoshi.maeda at gmail.com> wrote:
> Hi Matthew,
>
> I am using the new `data.table` `fread()` function to read my csv files, which have the following format when read with the read.csv function:
>
>         Date.and.Time Open High  Low Close Volume
> 1 2007/01/01 22:51:00 5683 5683 5673  5673     64
> 2 2007/01/01 22:52:00 5675 5676 5674  5674     17
> 3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>
> The first column holds the whole timestamp, e.g. `2007/01/01 22:53:00`; the next 5 columns are separated by commas.
>
> But when reading the same file using fread I get the following output:
>
>     V1 V2                                             V3
> 1 2007  1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
> 2 2007  1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
> 3 2007  1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>
> This is because the autodetect is using the "/" as a separator...
>
> I tried overriding this using the `sep=","` argument but this does not seem to be used in the function anywhere.
>
> Furthermore, when using verbose I get the following output, which suggests that I was right in thinking that "/" is used as a separator rather than ",".
>
> Is there any way to fix this, so that it correctly reads all 6 columns separately?
>
> Thanks
>
> HLM
>
> On 21 Dec 2012, at 18:28, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
>>
>> Hi datatablers,
>>
>> Feedback and bug reports much appreciated :
>>
>> =====
>> New function fread(), a fast and friendly file reader.
>> * header, skip, nrows, sep and colClasses are all auto detected.
>> * integers>2^31 are detected and read natively as bit64::integer64.
>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>> * new implementation entirely in C
>> * with a 50MB .csv, 1 million rows x 6 columns :
>> read.csv("test.csv") # 30-60 sec
>> read.table("test.csv",<all known tricks, known nrows>) # 10 sec
>> fread("test.csv") # 3 sec
>> * airline data: 658MB csv (7 million rows x 29 columns)
>> read.table("2008.csv",<all known tricks, known nrows>) # 360 sec
>> fread("2008.csv") # 50 sec
>> See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
>> discussions and beta testing.
>> =====
>>
>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>
>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>> require(data.table)
>> ?fread
>> fread("your biggest baddest file")
>>
>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather
>> than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64
>> until that can be resolved on R-Forge, unless you compile yourself. -O3
>> has some optimizations that fread may benefit from. But interested to hear.
>>
>> Season's greetings!
>>
>> Matthew
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>