[datatable-help] fread for flat files
Michael Smith
my.r.help at gmail.com
Mon Apr 21 03:13:46 CEST 2014
If you want a completely R-centric version, do it as follows. First,
create the following file, save it as `flat2csv.R` and make it
executable (chmod u+x flat2csv.R).
#! /usr/bin/env Rscript
col.start <- c(1, 9, 12)
col.end <- c(8, 11, 13)
con <- file("stdin", open = "r")
while (length(this.line <- readLines(con, n = 1, warn = FALSE)) > 0)
writeLines(
paste0(
substring(
this.line,
col.start,
col.end),
collapse = ","))
close(con)
Then open R and run the following code to read your flat file:
library("data.table")
fread("./flat2csv.R <my-flat-file.txt")
That works on my Linux box with the example you gave in your earlier
email (saved as my-flat-file.txt).
pat000010381
pat000020292
pat000030571
Hope that helps.
Cheers,
M
On 04/21/2014 08:16 AM, Michael Smith wrote:
> Take a look here:
>
> http://stackoverflow.com/questions/8630053/unix-cut-command-adding-own-delimiter
>
> I think the `sed` command at the bottom should work easily, but you
> would need to adjust the number of dots to your column width.
>
> Once you've figured out the correct command, you can then use it like this:
>
> DT <- fread(pipe("sed ..... <flat-file.txt"))
>
> On the other hand, if you can get it into SAS easily, then you can just
> convert it from there using Stat/Transfer. But I do understand that
> you're looking for a data.table-centric solution, so maybe the above
> attempt will help.
>
> Cheers,
>
> M
>
>
>
> On 04/20/2014 11:48 PM, Mark Danese wrote:
>> There is no column separator for what I am referring to as "flat files"
>> which is the challenge. My apologies if my terminology is off. By “flat
>> file” I mean that every field has a constant width for every row. So, a
>> file with 3 variables might have an patient id character field that is 8
>> wide, followed by an age in a numeric field 3 wide, followed by a gender
>> as an integer that is 1 wide. It would look like this. Sometimes there
>> are spaces if fields are empty or if the variables are smaller than the
>> space allotted.
>> pat000010381
>> pat000020292
>> pat000030571
>>
>> which needs to be separated into
>>
>> Patid age gender
>> pat00001 038 1
>> pat00002 029 2
>> pat00003 057 1
>>
>>
>> We have Stat/Transfer and it doesn’t do flat files as far as I can tell.
>> As I said, we can do it quickly and easily in SAS. Most datasets ship
>> with a SAS program for conversion and Adam Damico has written the R
>> package SAScii to parse SAS load files into an R script that goes to
>> read.fwf(). But I am trying to see if it is possible to allow fread to
>> take a vector to split the file. For the above it would be something like
>> passing sep=(8,3,1).
>>
>> All Medicare claims data and most health related national surveys come
>> this way. For an example, the bottom of this file shows how such a file
>> is loaded
>>
>> https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core
>> .SAS
>>
>> This is the first 20% of the list:
>> *** Read data elements from the ASCII file ***;
>> INPUT
>> @1 AGE N3PF.
>> @4 AGEDAY N3PF.
>> @7 AMONTH N2PF.
>> @9 ASOURCE N2PF.
>> @11 ASOURCEUB92 $CHAR1.
>> @12 ASOURCE_X $CHAR3.
>> @15 ATYPE N2PF.
>> @17 AWEEKEND N2PF.
>> @19 DIED N2PF.
>> @21 DISCWT N11P7F.
>>
>>
>>
>>
>> On 4/20/14, 7:36 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>>
>>> Not sure exactly what you mean by "flat file." I previously assumed you
>>> mean fixed width formatted data, but now you say they are concatenated
>>> and there are no spaces. So what's the column separator? Tab, comma, ...?
>>>
>>> If everything else fails, try Stat/Transfer.
>>>
>>> M
>>>
>>> On 04/20/2014 03:04 PM, Mark Danese wrote:
>>>> Thanks Michael. The flat file format doesn’t have spaces between
>>>> fields.
>>>> They are all concatenated. It may be possible to use sed with a vector
>>>> of
>>>> widths, but I am not a command-line person (yet).
>>>>
>>>> It just may be one of those things that isn’t easy to implement in
>>>> fread.
>>>> In healthcare in the US there are still a lot of flat files out there.
>>>> We
>>>> usually use SAS but I am trying to get away from that. And R can read
>>>> flat files(read.fwf), but it is pretty slow. From what I understand,
>>>> read.fwf actually does insert commas and then reads the file. So, it
>>>> might be possible to hack read.fwf and fread together somehow.
>>>>
>>>> My first experience with fread was to read in a 1.6 GB file in 30
>>>> seconds.
>>>> That was pretty impressive.
>>>>
>>>>
>>>> On 4/19/14, 5:07 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>>>>
>>>>> Probably you could do this from the Linux command line using `sed`,
>>>>> i.e.
>>>>> to replace several spaces with a comma.
>>>>>
>>>>> https://www.google.com/search?q=sed+replace+space+with+comma
>>>>>
>>>>> If you're on Windows, you probably can do the same using Cygwin.
>>>>>
>>>>> M
>>>>>
>>>>>
>>>>> On 04/19/2014 12:37 AM, Mark Danese wrote:
>>>>>> Is it possible to pass a vector of column widths to have fread read
>>>>>> in a
>>>>>> flat file? I saw that someone suggested using csvkit to add commas
>>>>>> and
>>>>>> then use data table, but that is beyond my skill set.
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>
>>>>>>
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-
>>>>>> he
>>>>>> lp
>>>>>>
>>>>
>>
More information about the datatable-help
mailing list