[datatable-help] fread: skip
Matthew Dowle
mdowle at mdowle.plus.com
Sun May 12 15:01:49 CEST 2013
Hi,
I suspect you may not have scrolled further down in my reply where I
wrote more?
Matthew
On 12.05.2013 13:26, Gabor Grothendieck wrote:
> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
> from R-Forge, and the sample csv I was using does indeed work, with
> fread attempting to do the best it can with the mucked-up header. Maybe
> this is sufficient and a skip is not needed, but the fact is that there
> is no facility to skip over the bad header had I wanted to.
>
> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>
>>> Not with the csv I tried. The header is messed up (most of the header
>>> fields are missing) and fread misconstrues it as data.
>>
>>
>> That was fixed a while ago in v1.8.9, from NEWS:
>>
>> " [fread] If some column names are blank they are now given default names
>> rather than causing the header row to be read as a data row "
>>
>>
>>> The automation is great but some way to force its behavior when you
>>> know what it should do seems essential since heuristics can't be
>>> expected to work in all cases.
>>
>>
>> I suspect the heuristics in v1.8.9 work on all your examples so far, but
>> OK, point taken.
>>
>> fread already allows control of 'autostart'. This is a line number
>> (default 30) within the regular data block. That line is used to detect
>> the separator, and fread searches upwards from it to find the first data
>> row and/or column names.
>>
>> Will add 'skip' then. It'll be like setting autostart=skip+1 but with the
>> upwards search turned off. Line skip+1 will be used to detect the
>> separator when sep="auto" and will be treated as column names according to
>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
>> autostart and skip in the same call. Does that sound OK?
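
The proposed interaction between 'autostart' and 'skip' can be illustrated with a small Python sketch. This is only a reading of the proposal above, not fread's code: the function name resolve_start_line and its return shape are hypothetical, and the default of 30 is taken from the description of autostart.

```python
def resolve_start_line(autostart=None, skip=None):
    """Return (start_line, search_upwards) for the proposed semantics.

    Hypothetical helper: start_line is 1-based, and search_upwards says
    whether the upwards search for the first data row is still performed.
    """
    if autostart is not None and skip is not None:
        # The proposal makes it an error to give both in the same call.
        raise ValueError("specify either autostart or skip, not both")
    if skip is not None:
        # skip behaves like autostart = skip + 1 with the upwards search off.
        return skip + 1, False
    # No skip given: fall back to autostart (default 30) and search upwards.
    return (30 if autostart is None else autostart), True


print(resolve_start_line(skip=2))        # (3, False)
print(resolve_start_line(autostart=10))  # (10, True)
```

With skip=2, line 3 would be the line used for separator detection and, depending on header, for column names.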
>>
>> Matthew
>>
>>
>>
>>>
>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>> <mdowle at mdowle.plus.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Does the auto skip feature of fread cover both of those? From ?fread:
>>>>
>>>> " Once the separator is found on line autostart, the number of columns is
>>>> determined. Then the file is searched backwards from autostart until a row
>>>> is found that doesn't have that number of columns, or the start of file is
>>>> reached. Thus, the first data row is found and any human readable banners
>>>> are automatically skipped. This feature can be particularly useful for
>>>> loading a set of files which may not all have consistently sized banners. "
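
The backwards search described in that excerpt can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not fread's implementation: the function name find_first_regular_row is hypothetical, fields are counted by naive separator counting (no quoting rules), and autostart is clamped to the file length for short files.

```python
def find_first_regular_row(lines, sep, autostart=30):
    """Return the 1-based number of the first line of the regular block.

    Count the fields on line `autostart`, then walk upwards while the line
    above has the same field count. Banner lines with a different count
    stop the search.
    """
    if not lines:
        raise ValueError("no lines to search")
    # Assumption for this sketch: clamp autostart when the file is short.
    start = min(autostart, len(lines))
    ncol = lines[start - 1].count(sep) + 1
    first = start
    # lines[first - 2] is the line directly above 1-based line `first`.
    while first > 1 and lines[first - 2].count(sep) + 1 == ncol:
        first -= 1
    return first


banner_file = [
    "Report generated for demo purposes",
    "Source: hypothetical",
    "id,name,value",
    "1,a,10",
    "2,b,20",
    "3,c,30",
]
print(find_first_regular_row(banner_file, sep=","))  # 3
```

The result points at the column-names line here because it has the same number of fields as the data rows; deciding whether that line is a header or data is a separate step.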
>>>>
>>>> There were also some issues with header=FALSE in the first release (1.8.8)
>>>> which have since been fixed in 1.8.9.
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>
>>>>>
>>>>> I would find it useful if fread had a skip= argument as in read.table
>>>>> since I have files from time to time that have garbage at the top.
>>>>> Another situation I find from time to time is that the header is
>>>>> messed up but one can still read the file if one can skip over the
>>>>> header and specify header = FALSE.
>>>>>
>>>>> An extra feature that would be nice but less important would be if one
>>>>> could specify skip = "string" and have it skip all lines until it
>>>>> found one with "string" in it and then start reading from the matched
>>>>> row onward. Normally the string would be chosen to be a string found
>>>>> in the header and not likely found prior to the header. read.xls in
>>>>> gdata has a similar feature and I find it quite handy at times.
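
The skip = "string" idea amounts to finding the first line that contains a marker and starting the read there. A minimal Python sketch of that lookup follows; the helper name lines_before_marker is hypothetical and the plain substring match is an assumption, since the request does not say whether matching should be literal or pattern-based.

```python
def lines_before_marker(lines, marker):
    """Return how many lines precede the first line containing `marker`.

    That count is the numeric skip value that would make reading start
    on the matched row.
    """
    for index, line in enumerate(lines):
        if marker in line:  # plain substring match, an assumption here
            return index
    raise ValueError(f"marker {marker!r} not found in any line")


sample = [
    "garbage at the top",
    "more garbage",
    "id,name,value",
    "1,a,10",
]
print(lines_before_marker(sample, "id,name"))  # 2
```

Choosing a marker that appears in the header but is unlikely to appear earlier, as suggested above, keeps the match from stopping on a banner line.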
>>>>>
>>>>> --
>>>>> Statistics & Software Consulting
>>>>> GKX Group, GKX Associates Inc.
>>>>> tel: 1-877-GKX-GROUP
>>>>> email: ggrothendieck at gmail.com
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help