[datatable-help] colClasses and fread
Matthew Dowle
mdowle at mdowle.plus.com
Fri Sep 13 00:52:40 CEST 2013
But I think in the diagnostics you sent, the final result was still
correct. The initial type guess may have been poor, but fread bumped
the affected columns mid-read and worked it out. Why do you need to set
colClasses? What was wrong in the final result?
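One way to answer that question is to compare the classes fread ended up with against the documented ones, and to pass colClasses only for the columns that differ. The following is a rough sketch rather than a tested recipe for the real file: the temporary CSV, the column names (id, dx1, dx2) and the `expected` vector are all made up for illustration, and it assumes a data.table version whose fread accepts a named colClasses vector (v1.8.10 or later, per NEWS).

```r
library(data.table)

# Hypothetical stand-in for the real file: dx1 holds all-digit codes,
# so automatic detection is likely to read it as integer.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,dx1,dx2",
             "1,25000,V5867",
             "2,4019,",
             "3,4280,V140"), tmp)

# Classes fread settles on without any help
guessed <- sapply(fread(tmp), class)

# Classes according to the (hypothetical) file documentation
expected <- c(id = "integer", dx1 = "character", dx2 = "character")

# Columns whose final class differs from the documented one
mismatch <- names(expected)[guessed[names(expected)] != expected]

# Re-read, overriding only the mismatched columns by name
dt <- fread(tmp, colClasses = expected[mismatch])
final <- sapply(dt, class)
print(final)
```

If `mismatch` comes back empty on the real data, the mid-read bumping already produced the documented types and colClasses is not needed.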
(BTW, this thread was failing the mailman size filter, which caps
messages at 100k. I let the earlier messages through and chopped the
history on this one for that reason.)
On 12/09/13 23:42, Matthew Dowle wrote:
>
> Is that v1.8.10 as on CRAN? It doesn't look like it from a few clues
> in the output below.
> v1.8.10 has colClasses working, see NEWS.
>
> On 12/09/13 22:32, Ari Friedman wrote:
>> Dear maintainers of that most wonderful package that makes R fast with
>> big data,
>>
>> I've recently discovered fread. It's amazing. My call to read.fwf on a
>> 4GB file that took all night now takes under a minute after conversion
>> to csv via csvkit/in2csv.
>>
>> However, automatic type detection is working very poorly, probably due
>> to the presence of a large number of columns with high rates of
>> missingness, plus a large number of character columns with encoded
>> values (these are medical and diagnostic codes).
>>
>> Normally I'd specify colClasses, and the warning messages even tell me I
>> should specify colClasses, but there's no colClasses argument to fread.
>>
>> Any thoughts on solving this? Verbose output, warnings, and a
>> comparison of the guesses vs. what the documentation on the file says it
>> is are found below. Unfortunately the data can't be shared, even in
>> small portions so I can't make this reproducible.
>>
>> Thanks!
>> Ari
>> > dt <- fread('myfile.csv', verbose=TRUE)
>> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
>> Using line 30 to detect sep (the last non blank line in the first 30) ... ','
>> Found 393 columns
>> First row with 393 fields occurs on line 1 (either column names or first row of data)
>> All the fields on line 1 are character fields. Treating as the column names.
>> Count of eol after first data row: 2994440
>> Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows
>> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows)
>> Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows)
>> Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows)
>> 0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'
>> Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'
>> Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'
>> Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'
>> Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'
>> Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'
>> Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'
>> Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'
>> Bumping column 142 from REAL to STR on data row 10, field contains 'V140'
>> Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'
>>