[datatable-help] fread for flat files

Michael Smith my.r.help at gmail.com
Mon Apr 21 03:13:46 CEST 2014


If you want a completely R-centric version, do it as follows. First,
create the following file, save it as `flat2csv.R` and make it
executable (chmod u+x flat2csv.R).

#! /usr/bin/env Rscript
col.start <- c(1, 9, 12)
col.end <- c(8, 11, 13)
con  <- file("stdin", open = "r")
while (length(this.line <- readLines(con, n = 1, warn = FALSE)) > 0)
  writeLines(
    paste0(
      substring(
        this.line,
        col.start,
        col.end),
      collapse = ","))
close(con)


Then open R and run the following code to read your flat file:

library("data.table")
fread("./flat2csv.R <my-flat-file.txt")

That works on my Linux box with the example you gave in your earlier
email (saved as my-flat-file.txt).

pat000010381
pat000020292
pat000030571

Hope that helps.

Cheers,

M


On 04/21/2014 08:16 AM, Michael Smith wrote:
> Take a look here:
> 
> http://stackoverflow.com/questions/8630053/unix-cut-command-adding-own-delimiter
> 
> I think the `sed` command at the bottom should work easily, but you
> would need to adjust the number of dots to your column width.
> 
> Once you've figured out the correct command, you can then use it like this:
> 
> DT <- fread(pipe("sed ..... <flat-file.txt"))
> 
> On the other hand, if you can get it into SAS easily, then you can just
> convert it from there using Stat/Transfer. But I do understand that
> you're looking for a data.table-centric solution, so maybe the above
> attempt will help.
> 
> Cheers,
> 
> M
> 
> 
> 
> On 04/20/2014 11:48 PM, Mark Danese wrote:
>> There is no column separator for what I am referring to as "flat files"
>> which is the challenge.  My apologies if my terminology is off. By “flat
>> file” I mean that every field has a constant width for every row.  So, a
>> file with 3 variables might have an patient id character field that is 8
>> wide, followed by an age in a numeric field 3 wide, followed by a gender
>> as an integer that is 1 wide.  It would look like this.  Sometimes there
>> are spaces if fields are empty or if the variables are smaller than the
>> space allotted.  
>> pat000010381
>> pat000020292
>> pat000030571 
>>
>> which needs to be separated into
>>
>> Patid		age	gender
>> pat00001	038	1
>> pat00002	029	2
>> pat00003	057	1 
>>
>>
>> We have Stat/Transfer and it doesn’t do flat files as far as I can tell.
>> As I said, we can do it quickly and easily in SAS.  Most datasets ship
>> with a SAS program for conversion and Adam Damico has written the R
>> package SAScii to parse SAS load files into an R script that goes to
>> read.fwf().  But I am trying to see if it is possible to allow fread to
>> take a vector to split the file.  For the above it would be something like
>> passing sep=(8,3,1).
>>
>> All Medicare claims data and most health related national surveys come
>> this way.  For an example, the bottom of this file shows how such a file
>> is loaded
>>
>> https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core
>> .SAS
>>
>> This is the first 20% of the list:
>> *** Read data elements from the ASCII file ***;
>> INPUT 
>>       @1      AGE                      N3PF.
>>       @4      AGEDAY                   N3PF.
>>       @7      AMONTH                   N2PF.
>>       @9      ASOURCE                  N2PF.
>>       @11     ASOURCEUB92              $CHAR1.
>>       @12     ASOURCE_X                $CHAR3.
>>       @15     ATYPE                    N2PF.
>>       @17     AWEEKEND                 N2PF.
>>       @19     DIED                     N2PF.
>>       @21     DISCWT                   N11P7F.
>>
>>
>>
>>
>> On 4/20/14, 7:36 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>>
>>> Not sure exactly what you mean by "flat file." I previously assumed you
>>> mean fixed width formatted data, but now you say they are concatenated
>>> and there are no spaces. So what's the column separator? Tab, comma, ...?
>>>
>>> If everything else fails, try Stat/Transfer.
>>>
>>> M
>>>
>>> On 04/20/2014 03:04 PM, Mark Danese wrote:
>>>> Thanks Michael.  The flat file format doesn’t have spaces between
>>>> fields.
>>>> They are all concatenated.  It may be possible to use sed with a vector
>>>> of
>>>> widths, but I am not a command-line person (yet).
>>>>
>>>> It just may be one of those things that isn’t easy to implement in
>>>> fread.
>>>> In healthcare in the US there are still a lot of flat files out there.
>>>> We
>>>> usually use SAS but I am trying to get away from that.  And R can read
>>>> flat files(read.fwf), but it is pretty slow.  From what I understand,
>>>> read.fwf actually does insert commas and then reads the file.  So, it
>>>> might be possible to hack read.fwf and fread together somehow.
>>>>
>>>> My first experience with fread was to read in a 1.6 GB file in 30
>>>> seconds.
>>>>  That was pretty impressive.
>>>>
>>>>
>>>> On 4/19/14, 5:07 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>>>>
>>>>> Probably you could do this from the Linux command line using `sed`,
>>>>> i.e.
>>>>> to replace several spaces with a comma.
>>>>>
>>>>> https://www.google.com/search?q=sed+replace+space+with+comma
>>>>>
>>>>> If you're on Windows, you probably can do the same using Cygwin.
>>>>>
>>>>> M
>>>>>
>>>>>
>>>>> On 04/19/2014 12:37 AM, Mark Danese wrote:
>>>>>> Is it possible to pass a vector of column widths to have fread read
>>>>>> in a
>>>>>> flat file?  I saw that someone suggested using csvkit to add commas
>>>>>> and
>>>>>> then use data table, but that is beyond my skill set.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>
>>>>>>
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-
>>>>>> he
>>>>>> lp
>>>>>>
>>>>
>>


More information about the datatable-help mailing list