<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix"><br>
But I think the final result was still correct in the diagnostics
you sent. The initial type guess may have been poor, but fread
bumped the affected columns mid-read and worked out the right
types. Why do you need to set colClasses? What was wrong in the
final result?<br>
<br>
(BTW, the messages in this thread were failing the mailman size
filter (100k message size). I let them through and chopped the
history on this one for that reason.)<br>
<br>
<br>
On 12/09/13 23:42, Matthew Dowle wrote:<br>
</div>
<blockquote cite="mid:5232434E.3050608@mdowle.plus.com" type="cite">
<div class="moz-cite-prefix"><br>
Is that v1.8.10 as on CRAN? It doesn't look like it from a few
clues in the output below.<br>
v1.8.10 has colClasses working, see NEWS.<br>
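<br>
As a minimal sketch of what that looks like (the file contents and
column names below are made up for illustration, and this assumes the
named-vector form of colClasses as in read.csv; check ?fread for your
installed version):<br>
<pre>library(data.table)

# Hypothetical stand-in for the real file: a code column that looks
# like an integer in its first rows and only later holds 'V5867'.
tmp = tempfile(fileext = ".csv")
writeLines(c("id,dx_code,charge",
             "1,4019,100.5",
             "2,25000,80",
             "3,V5867,42.25"), tmp)

# Declare the diagnostic code column as character up front, so fread
# does not have to guess it as integer and bump it mid read.
dt = fread(tmp, colClasses = c(dx_code = "character"))

print(sapply(dt, class))  # dx_code should be reported as character
unlink(tmp)
</pre>
Columns not named in colClasses are still auto-detected.<br>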
<br>
On 12/09/13 22:32, Ari Friedman wrote:<br>
</div>
<blockquote
cite="mid:CAAT1DuNBmEvJJOx71G2XB9Xz5zF+xTXpf256B96s=M3kq7u+gw@mail.gmail.com"
type="cite">
<div dir="ltr">
<div lang="x-western">
<pre style="font-family:-moz-fixed;font-size:12px">Dear maintainers of that most wonderful package that makes R fast with
big data,
I've recently discovered fread. It's amazing. My call to read.fwf on a
4GB file that took all night now takes under a minute after conversion
to csv via csvkit/in2csv.
However, automatic type detection is working very poorly, probably due
to the presence of a large number of columns with high rates of
missingness, plus a large number of character columns with encoded
values (these are medical and diagnostic codes).
Normally I'd specify colClasses, and the warning messages even tell me I
should specify colClasses, but there's no colClasses argument to fread.
Any thoughts on solving this? Verbose output, warnings, and a
comparison of the guesses vs. what the documentation on the file says it
is are found below. Unfortunately the data can't be shared, even in
small portions so I can't make this reproducible.
Thanks!
Ari
</pre>
<pre>&gt; dt &lt;- fread('myfile.csv', verbose=TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 30) ... ','
Found 393 columns
First row with 393 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 2994440
Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows
Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows)
Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows)
Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows)
0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'
Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'
Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'
Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'
Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'
Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'
Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'
Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'
Bumping column 142 from REAL to STR on data row 10, field contains 'V140'
Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'
</pre>
</div>
</div>
</blockquote>
</blockquote>
<br>
</body>
</html>