[datatable-help] fread: skip

Sun May 12 14:26:35 CEST 2013

1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
from R-Forge. The sample csv I was using does indeed work there: fread
does the best it can with the mucked-up header. Maybe that is
sufficient and a skip is not needed, but the fact remains that there
is no facility to skip over the bad header had I wanted to.
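
To make the behaviour under discussion concrete, here is a minimal base R
sketch of the two ways of locating the first row to read: the upward search
from autostart described in ?fread, and the proposed skip (a number, or a
string to match). The helper names (count_fields, first_row_by_autostart,
first_row_by_skip) and the sample lines are hypothetical, quoting is
ignored, and this is not how fread is implemented.

```r
# Hypothetical helpers in base R; an illustration only, not fread's code.

# Number of fields per line for a given separator (quoting is ignored).
count_fields <- function(lines, sep = ",") {
  lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE))) + 1L
}

# Take the field count on line `autostart`, then search upwards until a
# line with a different field count (or the start of the file) is reached.
first_row_by_autostart <- function(lines, autostart = 30L, sep = ",") {
  autostart <- min(autostart, length(lines))
  n <- count_fields(lines, sep)
  i <- autostart
  while (i > 1L && n[i - 1L] == n[autostart]) i <- i - 1L
  i
}

# Numeric skip: start at line skip + 1, with no upward search.
# Character skip: start at the first line containing that string.
first_row_by_skip <- function(lines, skip) {
  if (is.character(skip)) {
    hit <- grep(skip, lines, fixed = TRUE)
    if (length(hit) == 0L) stop("skip string not found: ", skip)
    return(hit[1L])
  }
  as.integer(skip) + 1L
}

# Made-up sample: a banner line, a damaged header, then two data rows.
lines <- c("Report generated 2013-05-12",
           "a,,",
           "1,2,3",
           "4,5,6")

# Skip the banner and the bad header, then read the rest without a header.
start <- first_row_by_skip(lines, 2L)
dat <- read.csv(text = lines[start:length(lines)], header = FALSE)
```

With these sample lines the upward search stops at the damaged header
(line 2), whereas skip = 2 starts at the first data row (line 3).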

On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>
>> Not with the csv I tried.  The header is messed up (most of the header
>> fields are missing) and fread misconstrues it as data.
>
>
> That was fixed a while ago in v1.8.9, from NEWS :
>
> "  [fread] If some column names are blank they are now given default names
>    rather than causing the header row to be read as a data row "
>
>
>> The automation is great but some way to force its behavior when you
>> know what it should do seems essential since heuristics can't be
>> expected to work in all cases.
>
>
> I suspect the heuristics in v1.8.9 work on all your examples so far, but
> OK, point taken.
>
> fread allows control of 'autostart' already. This is a line number (default
> 30) within the regular data block. It is used to detect the separator, and
> the search for the first data row and/or column names proceeds upwards
> from it.
>
> Will add 'skip' then. It'll be like setting autostart=skip+1 but with the
> upward search turned off. Line skip+1 will be used to detect the separator
> when sep="auto", and will be treated as column names according to
> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
> autostart and skip in the same call. Does that sound ok?
>
> Matthew
>
>
>
>>
>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>> <mdowle at mdowle.plus.com> wrote:
>>>
>>>
>>> Hi,
>>>
>>> Does the auto skip feature of fread cover both of those?  From ?fread :
>>>
>>>   " Once the separator is found on line autostart, the number of
>>> columns is determined. Then the file is searched backwards from
>>> autostart until a row is found that doesn't have that number of
>>> columns, or the start of file is reached. Thus, the first data row is
>>> found and any human readable banners are automatically skipped. This
>>> feature can be particularly useful for loading a set of files which
>>> may not all have consistently sized banners. "
>>>
>>> There were also some issues with header=FALSE in the first release (1.8.8)
>>> which have since been fixed in 1.8.9.
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>
>>>>
>>>> I would find it useful if fread had a skip= argument as in read.table
>>>> since I have files from time to time that have garbage at the top.
>>>> Another situation I find from time to time is that the header is
>>>> messed up but one can still read the file if one can skip over the
>>>> header and specify header = FALSE.
>>>>
>>>> An extra feature that would be nice but less important would be if one
>>>> could specify skip = "string" and have it skip all lines until it
>>>> found one with "string" in it and then start reading from the matched
>>>> row onward. Normally the string would be chosen to be a string found
>>>> in the header and not likely found prior to the header. read.xls in
>>>> gdata has a similar feature and I find it quite handy at times.
>>>>
>>>> --
>>>> Statistics & Software Consulting
>>>> GKX Group, GKX Associates Inc.
>>>> tel: 1-877-GKX-GROUP
>>>> email: ggrothendieck at gmail.com

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com