[datatable-help] fread: skip

Sun May 12 12:29:35 CEST 2013

On 12.05.2013 00:47, Gabor Grothendieck wrote:
> Not with the csv I tried.  The header is messed up (most of the header
> fields are missing) and it misconstrues it as data.

That was fixed a while ago in v1.8.9. From NEWS:

"  [fread] If some column names are blank they are now given default names
    rather than causing the header row to be read as a data row "

> The automation is great but some way to force its behavior when you
> know what it should do seems essential since heuristics can't be
> expected to work in all cases.

I suspect the heuristics in v1.8.9 work on all your examples so far, 
but ok, point taken.

fread already allows control of 'autostart'. This is a line number 
(default 30) within the regular data block. It is used to detect the 
separator, and fread searches upwards from it to find the first data row 
and/or column names.
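
To make the upward search concrete, here is a rough sketch of the logic in
Python. It is not fread's actual code: the names detect_sep and
find_block_start, the candidate separator list, and counting fields with a
plain split (no quoting rules) are all assumptions made for illustration.

```python
def detect_sep(line, candidates=(",", "\t", "|", ";", ":")):
    # Assumption: pick the candidate character that occurs most often on the line.
    return max(candidates, key=line.count)


def count_fields(line, sep):
    # Naive field count; real parsers also have to handle quoted fields.
    return len(line.split(sep))


def find_block_start(lines, autostart=30):
    """Return (sep, ncol, start), where start is the 0-based index of the
    first line of the regular block (column names or first data row)."""
    # autostart is a 1-based line number; clamp it for short files.
    i = min(autostart, len(lines)) - 1
    sep = detect_sep(lines[i])
    ncol = count_fields(lines[i], sep)
    # Search upwards until a line has a different number of columns,
    # or the start of the file is reached.
    while i > 0 and count_fields(lines[i - 1], sep) == ncol:
        i -= 1
    return sep, ncol, i


lines = [
    "Report generated 2013-05-11",   # banner
    "Source: lab 4",                 # banner
    "id,x,y",
    "1,2.5,a",
    "2,3.5,b",
]
sep, ncol, start = find_block_start(lines)
print(sep, ncol, lines[start])
```

With this toy input the two banner lines are skipped because neither splits
into three comma-separated fields.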

I'll add 'skip' then. It will behave like setting autostart=skip+1 but 
with the upward search turned off. Line skip+1 will be used to detect 
the separator when sep="auto", and will be treated as column names 
according to header="auto"|TRUE|FALSE as usual.  It will be an error to 
specify both autostart and skip in the same call.  Does that sound ok?
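
Roughly, the proposed behaviour could be sketched like this in Python. This
is only an illustration of the semantics described above, not an
implementation: the function name read_with_skip, the separator candidates,
the V1, V2, ... default names and the "no field looks numeric" rule standing
in for header="auto" are all assumptions.

```python
def is_number(text):
    # Helper for the assumed header="auto" rule below.
    try:
        float(text)
        return True
    except ValueError:
        return False


def read_with_skip(lines, skip=None, autostart=None, header="auto",
                   candidates=(",", "\t", "|", ";", ":")):
    """Sketch: start at line skip+1 (1-based) with no upward search."""
    if skip is not None and autostart is not None:
        raise ValueError("specify either autostart or skip, not both")
    if skip is None:
        raise ValueError("this sketch only covers the skip case")
    first = lines[skip]                       # line skip+1 in 1-based terms
    sep = max(candidates, key=first.count)    # sep="auto": detect on that line
    fields = first.split(sep)
    if header == "auto":
        # Assumption: treat the line as column names if no field is numeric.
        header = not any(is_number(f) for f in fields)
    if header:
        names, data_start = fields, skip + 1
    else:
        names = ["V%d" % (k + 1) for k in range(len(fields))]
        data_start = skip
    rows = [line.split(sep) for line in lines[data_start:]]
    return names, rows


lines = [
    "garbage, with, commas, everywhere",
    "more garbage",
    "id,x,y",
    "1,2.5,a",
    "2,3.5,b",
]
print(read_with_skip(lines, skip=2))
print(read_with_skip(lines, skip=3, header=False))
```

Note that the first garbage line has four comma-separated fields, which is
the kind of case where being able to force the start line helps.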

Matthew

>
> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>>
>> Hi,
>>
>> Does the auto skip feature of fread cover both of those?  From ?fread :
>>
>>   " Once the separator is found on line autostart, the number of columns
>> is determined. Then the file is searched backwards from autostart until a
>> row is found that doesn't have that number of columns, or the start of
>> file is reached. Thus, the first data row is found and any human readable
>> banners are automatically skipped. This feature can be particularly
>> useful for loading a set of files which may not all have consistently
>> sized banners. "
>>
>> There were also some issues with header=FALSE in the first release
>> (1.8.8) which have since been fixed in 1.8.9.
>>
>> Matthew
>>
>>
>>
>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>
>>> I would find it useful if fread had a skip= argument as in read.table
>>> since I have files from time to time that have garbage at the top.
>>> Another situation I find from time to time is that the header is
>>> messed up but one can still read the file if one can skip over the
>>> header and specify header = FALSE.
>>>
>>> An extra feature that would be nice but less important would be if one
>>> could specify skip = "string" and have it skip all lines until it
>>> found one with "string" in it and then start reading from the matched
>>> row onward.   Normally the string would be chosen to be a string found
>>> in the header and not likely found prior to the header. read.xls in
>>> gdata has a similar feature and I find it quite handy at times.
>>>
>>> --
>>> Statistics & Software Consulting
>>> GKX Group, GKX Associates Inc.
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com