[datatable-help] datatable-help Digest, Vol 41, Issue 3

Paul Harding p.harding at paniscus.com
Wed Jul 3 12:56:09 CEST 2013


For me, in a similar context, this would be particularly useful with SQL
Server output, where if you need headers it's not possible to lose the
second line of underlining:

header1 header2 header3
------- ------- -------
tom   dick   harry

and possibly for other flavours of SQL too. For the huge files (20GB) that I
read with fread, I use a perl script; for smaller ones:
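  # con is the file path: skip the header and underline rows, then read the data
  df <- read.csv(con, header=FALSE, skip=2, na.strings="NULL")
  # take the column names from the first line of the same file
  names(df) <- strsplit(readLines(con, 1), ",")[[1]]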

Such a pain. Since this is a SQL Server 'feature', it would be really
useful if fread could discard unwanted header lines. Perhaps a regexp
parameter?
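
To illustrate the regexp idea, here is a minimal sketch in base R. It is not
an fread feature. The helper name read_csv_dropping and its drop_pattern
argument are made up for this example, and the sample file assumes
comma-separated output like the read.csv call above. It reads every line into
memory, so it only suits the smaller files.

```r
# Hypothetical helper (not part of data.table): drop every line that matches
# a regular expression, then parse what is left with base R's read.csv.
read_csv_dropping <- function(path, drop_pattern = "^[-, ]+$", ...) {
  lines <- readLines(path)
  keep <- !grepl(drop_pattern, lines)   # TRUE for lines we want to keep
  read.csv(text = lines[keep], ...)
}

# Made-up sample mimicking comma-separated SQL Server output.
tmp <- tempfile(fileext = ".csv")
writeLines(c("header1,header2,header3",
             "-------,-------,-------",
             "tom,dick,NULL"), tmp)

df <- read_csv_dropping(tmp, na.strings = "NULL", stringsAsFactors = FALSE)
print(df)
```

The default pattern only matches lines made up entirely of dashes, commas and
spaces, so ordinary data rows are left alone.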

Regards
Paul




On 3 July 2013 11:00, <datatable-help-request at lists.r-forge.r-project.org> wrote:

> Today's Topics:
>
>    1. Re: fread -- multiple header lines and multiple whitespace
>       characters (Eduard Antonyan)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 2 Jul 2013 10:29:57 -0500
> From: Eduard Antonyan <eduard.antonyan at gmail.com>
> To: Harish <harishv_99 at yahoo.com>
> Cc: "datatable-help at lists.r-forge.r-project.org"
>         <datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] fread -- multiple header lines and
>         multiple whitespace characters
> Message-ID:
>         <CAHZcBOpkh+05wNLYD17YQxXx+JbOL3SmkwoP+Y=dWZ5hNEKzog at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I don't know how to do this with fread, but it sounds like a good feature
> request.
>
> If you want to do this in R (without fread), you could use readLines to
> read until you get to the header, count the number of lines it took, and use
> the 'skip' param in read.table to read the file in. I think I remember seeing
> something like that done on SO at some point, but you can always post there to
> get more advice, as there are generally more people there who'll be able to
> help you.
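
The readLines approach described above could look roughly like this. It is a
sketch only: lines_before_header, header_pattern and max_lines are
hypothetical names, and the sample file is invented (it has no footer, which
read.table would otherwise trip over).

```r
# Hypothetical helper: count the lines that come before the header row.
# header_pattern is an assumed regular expression that identifies the header.
lines_before_header <- function(path, header_pattern, max_lines = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  while (n < max_lines) {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break               # reached end of file
    if (grepl(header_pattern, line)) return(n)  # n lines precede the header
    n <- n + 1L
  }
  stop("header not found")
}

# Invented sample: free-text preamble, then a whitespace-separated table.
tmp <- tempfile()
writeLines(c("Garbage header information",
             "that I need to skip when reading...",
             "",
             "id   value",
             "a    1.5",
             "b    2.5"), tmp)

skip <- lines_before_header(tmp, "^id[[:space:]]+value")
dat <- read.table(tmp, header = TRUE, skip = skip, stringsAsFactors = FALSE)
print(dat)
```

Reading one line at a time means only the preamble is scanned before
read.table takes over, rather than the whole file.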
>
>
> On Sun, Jun 30, 2013 at 3:21 AM, Harish <harishv_99 at yahoo.com> wrote:
>
> > Hi,
> >
> > I am wondering whether it is possible to read a file using fread() with:
> > 1) Multiple header lines, and
> > 2) Multiple whitespace characters separating fields
> >
> > The sample of the input file is as follows:
> > -------------
> > Garbage header information
> > that I need to skip when reading...
> > Number of lines here are variable.
> >
> >              Serial_Number   PHIv     Lu/W
> >                     (-)      (lm)     (lm/W)
> >            ABCDEFG  27.0264 103.58
> >            HIJKLMNO  33.9143  91.03
> >
> > Some footer information
> > that spans multiple lines
> > -------------
> >
> > To handle the multiple lines of headers, I would have to read the file
> > using fread() first, reprocess the file using a similar algorithm to
> > identify the actual header -- i.e. one line above what fread() would
> > identify as the header, then throw away the names of the columns fread()
> > created and rename them to the actual ones I find.  However, this seems to
> > be highly inefficient since I would replicate what fread() did within R --
> > not to mention I do not quite know how to do that.
> >
> > As far as handling the multiple (and variable) spaces for separator, I do
> > not see fread() being able to handle this either.  read.table() however
> > does with the default sep="" value.  Of course, that does not handle the
> > garbage headers and footers that fread() so beautifully avoids with its
> > autostart algorithm.
> >
> > Any suggestions as to how I would do this easily?  I have lots of these
> > files to read, and doing manual editing is not desirable.  If there is a
> > hack I can do with fread(), that would be ideal.
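
One way to handle a file shaped like the sample above in base R, without
fread, is sketched here. read_block and header_pattern are hypothetical names;
the code assumes the units row directly follows the column names and that a
blank line ends the data. It reads the whole file into memory, so it is not a
substitute for fread on large inputs.

```r
# Hypothetical helper for files laid out as: free-text preamble, one row of
# column names, one row of units, whitespace-separated data, blank line, footer.
read_block <- function(path, header_pattern) {
  lines <- readLines(path)
  h <- grep(header_pattern, lines)[1]
  if (is.na(h)) stop("header not found")
  # Column names come from the header row, split on runs of whitespace.
  col_names <- strsplit(trimws(lines[h]), "[[:space:]]+")[[1]]
  # Drop the preamble, the header row and the units row.
  body <- lines[-seq_len(h + 1L)]
  # Keep only the lines before the first blank line (assumed end of data).
  blank <- which(!nzchar(trimws(body)))
  if (length(blank) > 0L) body <- body[seq_len(blank[1] - 1L)]
  # sep = "" makes read.table treat any run of whitespace as one separator.
  read.table(text = body, sep = "", header = FALSE, col.names = col_names,
             check.names = FALSE, stringsAsFactors = FALSE)
}

tmp <- tempfile()
writeLines(c("Garbage header information",
             "that I need to skip when reading...",
             "Number of lines here are variable.",
             "",
             "             Serial_Number   PHIv     Lu/W",
             "                    (-)      (lm)     (lm/W)",
             "           ABCDEFG  27.0264 103.58",
             "           HIJKLMNO  33.9143  91.03",
             "",
             "Some footer information",
             "that spans multiple lines"), tmp)

dat <- read_block(tmp, "Serial_Number")
print(dat)
```

check.names = FALSE keeps the column name "Lu/W" as written instead of
letting read.table rewrite it into a syntactic R name.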
> >
> > Thanks a lot for your help.
> >
> >
> > Regards,
> > Harish
> >
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >