[datatable-help] New function fread() in v1.8.7
Hideyoshi Maeda
hideyoshi.maeda at gmail.com
Mon Dec 24 12:52:18 CET 2012
Thanks for the quick response.
I wasn't sure if I understood you correctly, but isn't the problem the way that autostart finds separators?
Also, my example file has a header row, so I think autostart would need to be row 2, i.e. the first row that has non-header values, wouldn't it?
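
To make the ambiguity concrete, here is a minimal base R sketch (not fread's actual detection code) that counts the fields each candidate separator produces on three data lines reconstructed from the fread output quoted below. The `candidates` vector and the `count_fields()` helper are hypothetical names used only for illustration.

```r
# Data lines reconstructed from the fread output quoted in this thread
# (V1/V2/V3 joined back together with "/").
lines <- c(
  "2007/01/01 22:51:00,5683.00,5683.00,5673.00,5673.00,64",
  "2007/01/01 22:52:00,5675.00,5676.00,5674.00,5674.00,17",
  "2007/01/01 22:53:00,5674.00,5674.00,5673.00,5674.00,42"
)

# Hypothetical list of candidate separators, for illustration only.
candidates <- c(",", "/", " ", "\t")

# Number of fields per line when splitting on one literal separator.
count_fields <- function(x, sep) {
  lengths(strsplit(x, sep, fixed = TRUE))
}

# One column per candidate separator, one row per line.
field_counts <- sapply(candidates, function(s) count_fields(lines, s))
print(field_counts)
```

Every candidate gives the same count on all three lines: "," gives 6 fields and "/" gives 3. So a consistent field count alone cannot tell the two apart on these lines, which is why I think an explicit sep override is needed here.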
Thanks
On 24 Dec 2012, at 11:44, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Hi,
>
> Ah yes, haven't hooked up the sep override yet, apologies, will fix.
> Maybe setting autostart to the row number of the header row (probably 1)
> might work.
>
> Thanks,
> Matthew
>
>
> On 24.12.2012 11:08, Hideyoshi Maeda wrote:
>> Oops, I forgot to add the output from the verbose part. Here it is:
>>
>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Starting format detection on line 30 (the last non blank line in the
>> first 30)
>> Detected sep as '/' and 3 columns
>> Type codes: 003
>> Found first row with 3 fields occuring on line 1 (either column names
>> or first row of data)
>> The first data row has some non character fields. Treating as a data
>> row and using default column names.
>> Count of eol after pos: 1143699
>> Subtracted 1 for last eol and any trailing empty lines, leaving
>> 1143698 data rows
>> 0.153s ( 21%) Memory map (quicker if you rerun)
>> 0.000s ( 0%) Format detection
>> 0.095s ( 13%) Count rows (wc -l)
>> 0.001s ( 0%) Allocation of 1143698x3 result (xMB) in RAM
>> 0.480s ( 66%) Reading data
>> 0.000s ( 0%) Bumping column type midread and coercing data already read
>> 0.002s ( 0%) Changing na.strings to NA
>> 0.731s Total
>>
>>
>> On 24 Dec 2012, at 11:04, Hideyoshi Maeda <hideyoshi.maeda at gmail.com> wrote:
>>
>>> Hi Matthew,
>>>
>>> I am using the new `fread()` function in `data.table` to read my CSV files. When read with the read.csv function, the data look as follows:
>>>
>>>         Date.and.Time Open High  Low Close Volume
>>> 1 2007/01/01 22:51:00 5683 5683 5673  5673     64
>>> 2 2007/01/01 22:52:00 5675 5676 5674  5674     17
>>> 3 2007/01/01 22:53:00 5674 5674 5673  5674     42
>>>
>>> The first column holds the whole timestamp, e.g. `2007/01/01 22:53:00`. It and the next 5 columns are separated by commas.
>>>
>>> But when I read the same file using fread, I get the following output:
>>>
>>> V1 V2 V3
>>> 1 2007 1 01 22:51:00,5683.00,5683.00,5673.00,5673.00,64
>>> 2 2007 1 01 22:52:00,5675.00,5676.00,5674.00,5674.00,17
>>> 3 2007 1 01 22:53:00,5674.00,5674.00,5673.00,5674.00,42
>>>
>>> This is because the autodetect is using the "/" as a separator...
>>>
>>> I tried overriding this using the `sep=","` argument but this does not seem to be used in the function anywhere.
>>>
>>> Furthermore, when using verbose I get the following output, which suggests that I was right in thinking that "/" is used as the separator rather than ",".
>>>
>>> Is there any way to fix this, so that it correctly reads all 6 columns separately?
>>>
>>> Thanks
>>>
>>> HLM
>>>
>>> On 21 Dec 2012, at 18:28, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>>>
>>>>
>>>> Hi datatablers,
>>>>
>>>> Feedback and bug reports much appreciated :
>>>>
>>>> =====
>>>> New function fread(), a fast and friendly file reader.
>>>> * header, skip, nrows, sep and colClasses are all auto detected.
>>>> * integers>2^31 are detected and read natively as bit64::integer64.
>>>> * accepts filenames, URLs and "A,B\n1,2\n3,4" directly
>>>> * new implementation entirely in C
>>>> * with a 50MB .csv, 1 million rows x 6 columns :
>>>>     read.csv("test.csv") # 30-60 sec
>>>>     read.table("test.csv",<all known tricks, known nrows>) # 10 sec
>>>>     fread("test.csv") # 3 sec
>>>> * airline data: 658MB csv (7 million rows x 29 columns)
>>>>     read.table("2008.csv",<all known tricks, known nrows>) # 360 sec
>>>>     fread("2008.csv") # 50 sec
>>>> See ?fread. Many thanks to Chris Neff and Garrett See for ideas,
>>>> discussions and beta testing.
>>>> =====
>>>>
>>>> 1.8.7 is passing checks on Unix and Windows (but not Mac yet) :
>>>>
>>>> install.packages("data.table", repos="http://R-Forge.R-project.org")
>>>> require(data.table)
>>>> ?fread
>>>> fread("your biggest baddest file")
>>>>
>>>> Oddly, R-Forge appears to be compiling Win64 with -O2 optimization rather
>>>> than -O3 (but -O3 on Win32 ok), so speedups might not be as great on Win64
>>>> until that can be resolved on R-Forge, unless you compile yourself. -O3
>>>> has some optimizations that fread may benefit from. But interested to hear.
>>>>
>>>> Season's greetings!
>>>>
>>>> Matthew
>>>>
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>
>