[datatable-help] fread: skip
Matthew Dowle
mdowle at mdowle.plus.com
Sun May 12 12:29:35 CEST 2013
On 12.05.2013 00:47, Gabor Grothendieck wrote:
> Not with the csv I tried. The header is messed up (most of the header
> fields are missing) and it misconstrues it as data.
That was fixed a while ago in v1.8.9, from NEWS:
" [fread] If some column names are blank they are now given default names
rather than causing the header row to be read as a data row "
> The automation is great but some way to force its behavior when you
> know what it should do seems essential since heuristics can't be
> expected to work in all cases.
I suspect the heuristics in v1.8.9 work on all your examples so far,
but ok point taken.
fread already allows control of 'autostart'. This is a line number
(default 30) within the regular data block. The separator is detected on
that line, and fread then searches upwards from it to find the first data
row and/or column names.
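
To make the upward search concrete, here is a minimal sketch in Python. It is not fread's actual implementation (which is in C); the function names `detect_sep` and `find_first_row`, the candidate separator list, and the clamping of autostart to the file length are all assumptions made for illustration.

```python
def detect_sep(line, candidates=(",", "\t", "|", ";", " ")):
    # Simplification: pick the candidate that occurs most often on the line.
    counts = {s: line.count(s) for s in candidates}
    best = max(candidates, key=lambda s: counts[s])
    if counts[best] == 0:
        raise ValueError("no separator found on the chosen line")
    return best


def find_first_row(lines, autostart=30):
    # Line numbers are 1-based, as in the description above.
    # Assumption: if the file is shorter than autostart, use the last line.
    start = min(autostart, len(lines))
    sep = detect_sep(lines[start - 1])
    ncol = lines[start - 1].count(sep) + 1

    # Walk upwards while the previous line has the same number of columns.
    first = start
    while first > 1 and lines[first - 2].count(sep) + 1 == ncol:
        first -= 1
    return sep, ncol, first


lines = [
    "Report generated 2013-05-11",
    "Source: lab A",
    "id,x,y",
    "1,2,3",
    "4,5,6",
]
print(find_first_row(lines, autostart=5))
```

The returned line is the first one with the same column count as line autostart, so it may be either the column names or the first data row; deciding which is the job of the header argument.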
Will add 'skip' then. It'll be like setting autostart=skip+1 but
turning off the search upwards part. Line skip+1 will be used to detect
the separator when sep="auto" and used as column names according to
header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
autostart and skip in the same call. If that sounds ok?
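
The proposed interaction between the two arguments can be sketched as follows. This is only an illustration in Python of the rules described above, not code from data.table; the helper name `resolve_start` and its return shape are hypothetical, as is clamping autostart to the file length.

```python
def resolve_start(n_lines, autostart=None, skip=None):
    """Return (line_number, search_upwards) for a file with n_lines lines.

    line_number is the 1-based line used to detect the separator.
    search_upwards says whether the upward search for the first row runs.
    """
    if autostart is not None and skip is not None:
        # Proposed rule: specifying both in one call is an error.
        raise ValueError("specify either autostart or skip, not both")

    if skip is not None:
        if skip < 0 or skip >= n_lines:
            raise ValueError("skip must be between 0 and n_lines - 1")
        # Like autostart = skip + 1, but with the upward search turned off.
        return skip + 1, False

    # Default behaviour: autostart (default 30) with the upward search on.
    line = 30 if autostart is None else autostart
    return min(line, n_lines), True


print(resolve_start(100, skip=2))
print(resolve_start(100))
```

With skip given, line skip+1 is taken as the start unconditionally, which is what makes it useful when the heuristic would otherwise misread a damaged header.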
Matthew
>
> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>>
>> Hi,
>>
>> Does the auto skip feature of fread cover both of those? From ?fread :
>>
>> " Once the separator is found on line autostart, the number of columns is
>> determined. Then the file is searched backwards from autostart until a row
>> is found that doesn't have that number of columns, or the start of file is
>> reached. Thus, the first data row is found and any human readable banners
>> are automatically skipped. This feature can be particularly useful for
>> loading a set of files which may not all have consistently sized banners. "
>>
>> There were also some issues with header=FALSE in the first release (1.8.8)
>> which have since been fixed in 1.8.9.
>>
>> Matthew
>>
>>
>>
>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>
>>> I would find it useful if fread had a skip= argument as in read.table
>>> since I have files from time to time that have garbage at the top.
>>> Another situation I find from time to time is that the header is
>>> messed up but one can still read the file if one can skip over the
>>> header and specify header = FALSE.
>>>
>>> An extra feature that would be nice but less important would be if one
>>> could specify skip = "string" and have it skip all lines until it
>>> found one with "string" in it and then start reading from the matched
>>> row onward. Normally the string would be chosen to be a string found
>>> in the header and not likely found prior to the header. read.xls in
>>> gdata has a similar feature and I find it quite handy at times.
>>>
>>> --
>>> Statistics & Software Consulting
>>> GKX Group, GKX Associates Inc.
>>> tel: 1-877-GKX-GROUP
>>> email: ggrothendieck at gmail.com