[datatable-help] fread for flat files

Michael Smith my.r.help at gmail.com
Mon Apr 21 02:16:06 CEST 2014


Take a look here:

http://stackoverflow.com/questions/8630053/unix-cut-command-adding-own-delimiter

I think the `sed` command at the bottom should work easily, but you
would need to adjust the number of dots to your column width.

Once you've figured out the correct command, you can then use it like this:

DT <- fread(pipe("sed ..... <flat-file.txt"))

On the other hand, if you can get it into SAS easily, then you can just
convert it from there using Stat/Transfer. But I do understand that
you're looking for a data.table-centric solution, so maybe the above
attempt will help.

Cheers,

M



On 04/20/2014 11:48 PM, Mark Danese wrote:
> There is no column separator for what I am referring to as "flat files"
> which is the challenge.  My apologies if my terminology is off. By “flat
> file” I mean that every field has a constant width for every row.  So, a
> file with 3 variables might have an patient id character field that is 8
> wide, followed by an age in a numeric field 3 wide, followed by a gender
> as an integer that is 1 wide.  It would look like this.  Sometimes there
> are spaces if fields are empty or if the variables are smaller than the
> space allotted.  
> pat000010381
> pat000020292
> pat000030571 
> 
> which needs to be separated into
> 
> Patid		age	gender
> pat00001	038	1
> pat00002	029	2
> pat00003	057	1 
> 
> 
> We have Stat/Transfer and it doesn’t do flat files as far as I can tell.
> As I said, we can do it quickly and easily in SAS.  Most datasets ship
> with a SAS program for conversion and Adam Damico has written the R
> package SAScii to parse SAS load files into an R script that goes to
> read.fwf().  But I am trying to see if it is possible to allow fread to
> take a vector to split the file.  For the above it would be something like
> passing sep=(8,3,1).
> 
> All Medicare claims data and most health related national surveys come
> this way.  For an example, the bottom of this file shows how such a file
> is loaded
> 
> https://www.hcup-us.ahrq.gov/db/nation/nis/tools/pgms/SASLoad_NIS_2011_Core
> .SAS
> 
> This is the first 20% of the list:
> *** Read data elements from the ASCII file ***;
> INPUT 
>       @1      AGE                      N3PF.
>       @4      AGEDAY                   N3PF.
>       @7      AMONTH                   N2PF.
>       @9      ASOURCE                  N2PF.
>       @11     ASOURCEUB92              $CHAR1.
>       @12     ASOURCE_X                $CHAR3.
>       @15     ATYPE                    N2PF.
>       @17     AWEEKEND                 N2PF.
>       @19     DIED                     N2PF.
>       @21     DISCWT                   N11P7F.
> 
> 
> 
> 
> On 4/20/14, 7:36 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
> 
>> Not sure exactly what you mean by "flat file." I previously assumed you
>> mean fixed width formatted data, but now you say they are concatenated
>> and there are no spaces. So what's the column separator? Tab, comma, ...?
>>
>> If everything else fails, try Stat/Transfer.
>>
>> M
>>
>> On 04/20/2014 03:04 PM, Mark Danese wrote:
>>> Thanks Michael.  The flat file format doesn’t have spaces between
>>> fields.
>>> They are all concatenated.  It may be possible to use sed with a vector
>>> of
>>> widths, but I am not a command-line person (yet).
>>>
>>> It just may be one of those things that isn’t easy to implement in
>>> fread.
>>> In healthcare in the US there are still a lot of flat files out there.
>>> We
>>> usually use SAS but I am trying to get away from that.  And R can read
>>> flat files(read.fwf), but it is pretty slow.  From what I understand,
>>> read.fwf actually does insert commas and then reads the file.  So, it
>>> might be possible to hack read.fwf and fread together somehow.
>>>
>>> My first experience with fread was to read in a 1.6 GB file in 30
>>> seconds.
>>>  That was pretty impressive.
>>>
>>>
>>> On 4/19/14, 5:07 AM, "Michael Smith" <my.r.help at gmail.com> wrote:
>>>
>>>> Probably you could do this from the Linux command line using `sed`,
>>>> i.e.
>>>> to replace several spaces with a comma.
>>>>
>>>> https://www.google.com/search?q=sed+replace+space+with+comma
>>>>
>>>> If you're on Windows, you probably can do the same using Cygwin.
>>>>
>>>> M
>>>>
>>>>
>>>> On 04/19/2014 12:37 AM, Mark Danese wrote:
>>>>> Is it possible to pass a vector of column widths to have fread read
>>>>> in a
>>>>> flat file?  I saw that someone suggested using csvkit to add commas
>>>>> and
>>>>> then use data table, but that is beyond my skill set.
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-
>>>>> he
>>>>> lp
>>>>>
>>>
> 


More information about the datatable-help mailing list