[datatable-help] fread: skip

Matthew Dowle mdowle at mdowle.plus.com
Sun May 12 15:01:49 CEST 2013


Hi,

I suspect you may not have scrolled further down in my reply where I 
wrote more?

Matthew


On 12.05.2013 13:26, Gabor Grothendieck wrote:
> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
> from R-Forge, and the sample csv I was using does indeed work, with fread
> attempting to do the best it can with the mucked-up header.  Maybe
> this is sufficient and a skip is not needed, but the fact is that there
> is no facility to skip over the bad header had I wanted to.
>
> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>
>>> Not with the csv I tried.  The header is messed up (most of the header
>>> fields are missing) and it misconstrues it as data.
>>
>>
>> That was fixed a while ago in v1.8.9, from NEWS :
>>
>> "  [fread] If some column names are blank they are now given default names
>>    rather than causing the header row to be read as a data row "
>>
>>
>>> The automation is great but some way to force its behavior when you
>>> know what it should do seems essential since heuristics can't be
>>> expected to work in all cases.
>>
>>
>> I suspect the heuristics in v1.8.9 work on all your examples so far, but ok,
>> point taken.
>>
>> fread allows control of 'autostart' already. This is a line number (default
>> 30) within the regular data block. It is used to detect the separator, and
>> fread searches upwards from it to find the first data row and/or column names.
>>
>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
>> the search upwards part. Line skip+1 will be used to detect the separator
>> when sep="auto" and used as column names according to
>> header="auto"|TRUE|FALSE as usual.  It'll be an error to specify both
>> autostart and skip in the same call.  Does that sound ok?
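To make the proposed semantics concrete, here is a rough Python sketch of how the starting line might be chosen under the rules described above. It is an illustration only, not fread's implementation: the function name `start_line` is hypothetical, the separator is passed in rather than detected, and fields are counted naively with no handling of quoted separators.

```python
def start_line(lines, autostart=None, skip=None, sep=","):
    """Return the 1-based line where reading would begin (hypothetical helper)."""
    if autostart is not None and skip is not None:
        raise ValueError("specify autostart or skip, not both")
    if not lines:
        raise ValueError("no lines to read")
    if skip is not None:
        # skip behaves like autostart = skip + 1, with the upward search off.
        if not 0 <= skip < len(lines):
            raise ValueError("skip must leave at least one line to read")
        return skip + 1
    # Default path: begin at autostart (30 unless given, clamped to the
    # file length as a simplifying assumption) and search upwards.
    line = min(30 if autostart is None else autostart, len(lines))
    ncol = lines[line - 1].count(sep) + 1
    while line > 1 and lines[line - 2].count(sep) + 1 == ncol:
        line -= 1
    return line


# A garbage first line that happens to have the same number of fields.
lines = ["junk,junk,junk", "a,b,c", "1,2,3"]
print(start_line(lines))          # the upward search cannot tell line 1 apart
print(start_line(lines, skip=1))  # skip forces reading to begin at line 2
```

The example shows the case raised in the thread: when the unwanted line has the same field count as the data, the upward search keeps it, and only an explicit skip can exclude it.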
>>
>> Matthew
>>
>>
>>
>>>
>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>> <mdowle at mdowle.plus.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Does the auto skip feature of fread cover both of those?  From ?fread :
>>>>
>>>>   " Once the separator is found on line autostart, the number of columns is
>>>> determined. Then the file is searched backwards from autostart until a row
>>>> is found that doesn't have that number of columns, or the start of file is
>>>> reached. Thus, the first data row is found and any human readable banners
>>>> are automatically skipped. This feature can be particularly useful for
>>>> loading a set of files which may not all have consistently sized banners.
>>>> "
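The backward search quoted from ?fread can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not fread's actual code: the name `find_first_data_row` is hypothetical, the separator is supplied rather than detected, fields are counted naively (quoted separators are ignored), and an autostart beyond the end of the file is clamped to the last line.

```python
def find_first_data_row(lines, autostart=30, sep=","):
    """Return the 1-based number of the first line of the regular block."""
    if not lines:
        raise ValueError("no lines to search")
    # Assumption: if the file is shorter than autostart, use its last line.
    start = min(autostart, len(lines))
    # The number of columns is fixed by the line at autostart.
    ncol = lines[start - 1].count(sep) + 1
    first = start
    # Walk upwards until a line with a different column count, or line 1.
    while first > 1 and lines[first - 2].count(sep) + 1 == ncol:
        first -= 1
    return first


sample = [
    "Report generated 2013-05-11",  # banner, 1 field
    "Source: lab A",                # banner, 1 field
    "a,b,c",                        # column names
    "1,2,3",
    "4,5,6",
]
print(find_first_data_row(sample))  # 3
```

Because the search starts inside the data and moves up, banners of different lengths in different files are skipped without the caller having to count them.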
>>>>
>>>> There were also some issues with header=FALSE in the first release (1.8.8)
>>>> which have since been fixed in 1.8.9.
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>
>>>>>
>>>>> I would find it useful if fread had a skip= argument as in read.table
>>>>> since I have files from time to time that have garbage at the top.
>>>>> Another situation I find from time to time is that the header is
>>>>> messed up but one can still read the file if one can skip over the
>>>>> header and specify header = FALSE.
>>>>>
>>>>> An extra feature that would be nice but less important would be if one
>>>>> could specify skip = "string" and have it skip all lines until it
>>>>> found one with "string" in it and then start reading from the matched
>>>>> row onward.  Normally the string would be chosen to be a string found
>>>>> in the header and not likely found prior to the header. read.xls in
>>>>> gdata has a similar feature and I find it quite handy at times.
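The skip = "string" idea requested above amounts to a linear scan for the first line containing a marker. A minimal Python sketch follows, with the hypothetical name `skip_to_string`. It only illustrates the requested behaviour and makes no claim about how read.xls or fread implement it.

```python
def skip_to_string(lines, marker):
    """Return the 1-based number of the first line containing marker."""
    for number, line in enumerate(lines, start=1):
        if marker in line:
            return number
    # Assumption: a marker that never appears is treated as an error.
    raise ValueError(f"marker {marker!r} not found in any line")


lines = ["Exported by tool X", "", "id,value", "1,10", "2,20"]
first = skip_to_string(lines, "id,value")
# Reading would start at the matched row, so it serves as the header here.
print(lines[first - 1:])
```

As noted in the request, the marker works best when it is text from the header that is unlikely to occur in the lines before it.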
>>>>>
>>>>> --
>>>>> Statistics & Software Consulting
>>>>> GKX Group, GKX Associates Inc.
>>>>> tel: 1-877-GKX-GROUP
>>>>> email: ggrothendieck at gmail.com


