<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix"><br>

      But I think that in the diagnostics you sent, the final result was

      still correct. The initial type guess may have been poor, but fread

      bumped the columns mid-read and worked it out. Why do you need to

      set colClasses? What was wrong in the final result?<br>

      <br>

      (BTW, messages in this thread were failing the Mailman size filter

      (100k message size). I let them through and chopped the history on

      this one for that reason.)<br>

      <br>

      <br>

      On 12/09/13 23:42, Matthew Dowle wrote:<br>

    </div>

    <blockquote cite="mid:5232434E.3050608@mdowle.plus.com" type="cite">


      <div class="moz-cite-prefix"><br>

        Is that v1.8.10 as on CRAN?   It doesn't look like it from a few

        clues in the output below.<br>

        v1.8.10 has colClasses working; see NEWS.<br>

        <br>

        On 12/09/13 22:32, Ari Friedman wrote:<br>

      </div>

      <blockquote

cite="mid:CAAT1DuNBmEvJJOx71G2XB9Xz5zF+xTXpf256B96s=M3kq7u+gw@mail.gmail.com"

        type="cite">

        <div dir="ltr">

          <div lang="x-western">

            <pre style="font-family:-moz-fixed;font-size:12px">Dear maintainers of that most wonderful package that makes R fast with

big data,

I've recently discovered fread.  It's amazing.  My call to read.fwf on a

4GB file that took all night now takes under a minute after conversion

to csv via csvkit/in2csv.

However, automatic type detection is working very poorly, probably due

to the presence of a large number of columns with high rates of

missingness, plus a large number of character columns with encoded

values (these are medical and diagnostic codes).

Normally I'd specify colClasses, and the warning messages even tell me I

should specify colClasses, but there's no colClasses argument to fread.

Any thoughts on solving this?  Verbose output, warnings, and a

comparison of the guesses vs. what the file's documentation says the

types should be are found below.  Unfortunately, the data can't be shared,

even in small portions, so I can't make this reproducible.

Thanks!

Ari

</pre>

            <pre>> dt <- fread('myfile.csv', verbose=TRUE)

Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.

Using line 30 to detect sep (the last non blank line in the first 30) ... ','

Found 393 columns

First row with 393 fields occurs on line 1 (either column names or first row of data)

All the fields on line 1 are character fields. Treating as the column names.

Count of eol after first data row: 2994440

Subtracted 1 for last eol and any trailing empty lines, leaving 2994439 data rows

Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (first 5 rows)

Type codes: 000000000000000000330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+middle 5 rows)

Type codes: 000000000000003303330330000000000000000000000000000000000000000000000000303000000000000000000000000000000000000000000000000000000000000000000003300000000000000000000000000000000000000000000000000000000000000000000030000000000000000000000000000003100300000000000000000000000020000000000000000000000000000000000000000000000000000000000030000300000002000000000000000000000000000000000000000000000 (+last 5 rows)

0%Bumping column 146 from INT to INT64 on data row 9, field contains 'V5867'

Bumping column 146 from INT64 to REAL on data row 9, field contains 'V5867'

Bumping column 146 from REAL to STR on data row 9, field contains 'V5867'

Bumping column 147 from INT to INT64 on data row 9, field contains 'V5869'

Bumping column 147 from INT64 to REAL on data row 9, field contains 'V5869'

Bumping column 147 from REAL to STR on data row 9, field contains 'V5869'

Bumping column 142 from INT to INT64 on data row 10, field contains 'V140'

Bumping column 142 from INT64 to REAL on data row 10, field contains 'V140'

Bumping column 142 from REAL to STR on data row 10, field contains 'V140'

Bumping column 17 from INT to INT64 on data row 12, field contains 'J1885'

</pre>

          </div>

        </div>

      </blockquote>

    </blockquote>

    <br>

  </body>

</html>