[datatable-help] fread for flat files
Mark Danese
mark at outins.com
Sun Apr 20 17:48:25 CEST 2014
There is no column separator for what I am referring to as "flat files"
which is the challenge. My apologies if my terminology is off. By “flat
file” I mean that every field has a constant width for every row. So, a
file with 3 variables might have an patient id character field that is 8
wide, followed by an age in a numeric field 3 wide, followed by a gender
as an integer that is 1 wide. It would look like this. Sometimes there
are spaces if fields are empty or if the variables are smaller than the
space allotted.
pat000010381
pat000020292
pat000030571
which needs to be separated into
Patid age gender
pat00001 038 1
pat00002 029 2
pat00003 057 1
We have Stat/Transfer and it doesn’t do flat files as far as I can tell.
As I said, we can do it quickly and easily in SAS. Most datasets ship
with a SAS program for conversion and Adam Damico has written the R
package SAScii to parse SAS load files into an R script that goes to
read.fwf(). But I am trying to see if it is possible to allow fread to
take a vector to split the file. For the above it would be something like
passing sep=(8,3,1).
All Medicare claims data and most health related national surveys come
this way. For an example, the bottom of this file shows how such a file
is loaded
https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core
.SAS
This is the first 20% of the list:
*** Read data elements from the ASCII file ***;
INPUT
@1 AGE N3PF.
@4 AGEDAY N3PF.
@7 AMONTH N2PF.
@9 ASOURCE N2PF.
@11 ASOURCEUB92 $CHAR1.
@12 ASOURCE_X $CHAR3.
@15 ATYPE N2PF.
@17 AWEEKEND N2PF.
@19 DIED N2PF.
@21 DISCWT N11P7F.
On 4/20/14, 7:36 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>Not sure exactly what you mean by "flat file." I previously assumed you
>mean fixed width formatted data, but now you say they are concatenated
>and there are no spaces. So what's the column separator? Tab, comma, ...?
>
>If everything else fails, try Stat/Transfer.
>
>M
>
>On 04/20/2014 03:04 PM, Mark Danese wrote:
>> Thanks Michael. The flat file format doesn’t have spaces between
>>fields.
>> They are all concatenated. It may be possible to use sed with a vector
>>of
>> widths, but I am not a command-line person (yet).
>>
>> It just may be one of those things that isn’t easy to implement in
>>fread.
>> In healthcare in the US there are still a lot of flat files out there.
>>We
>> usually use SAS but I am trying to get away from that. And R can read
>> flat files(read.fwf), but it is pretty slow. From what I understand,
>> read.fwf actually does insert commas and then reads the file. So, it
>> might be possible to hack read.fwf and fread together somehow.
>>
>> My first experience with fread was to read in a 1.6 GB file in 30
>>seconds.
>> That was pretty impressive.
>>
>>
>> On 4/19/14, 5:07 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>>
>>> Probably you could do this from the Linux command line using `sed`,
>>>i.e.
>>> to replace several spaces with a comma.
>>>
>>> https://www.google.com/search?q=sed+replace+space+with+comma
>>>
>>> If you're on Windows, you probably can do the same using Cygwin.
>>>
>>> M
>>>
>>>
>>> On 04/19/2014 12:37 AM, Mark Danese wrote:
>>>> Is it possible to pass a vector of column widths to have fread read
>>>>in a
>>>> flat file? I saw that someone suggested using csvkit to add commas
>>>>and
>>>> then use data table, but that is beyond my skill set.
>>>>
>>>>
>>>> _______________________________________________
>>>> datatable-help mailing list
>>>> datatable-help at lists.r-forge.r-project.org
>>>>
>>>>
>>>>https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-
>>>>he
>>>> lp
>>>>
>>
More information about the datatable-help
mailing list