[datatable-help] fread: skip

Sun May 12 19:33:35 CEST 2013

Since I'm in the fread code at the moment, I've added 'skip' (rev 864).
4 tests added:

> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"
> fread(input)
    some bad data
1:    A   B    C
2:    1   3    5
3:    2   4    6
> fread(input, skip=1)
    A B C
1: 1 3 5
2: 2 4 6
> fread(input, skip=2)
    V1 V2 V3
1:  1  3  5
2:  2  4  6
> fread(input, skip=2, header=TRUE)
    1 3 5
1: 2 4 6
>
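
For comparison with read.table's skip=, which the original request below refers to, here is a rough base-R sketch of the same calls on the same input. It also sketches the skip = "string" idea from that request. Note that read_from_match is a hypothetical helper written only for this illustration; it is not part of fread or base R, and it assumes a plain substring match on the first matching line.

```r
# Same input as in the fread examples above.
input <- "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"

# skip = 1: drop the bad first line; the next line supplies the column names.
d1 <- read.csv(text = input, skip = 1)

# skip = 2 with header = FALSE: drop both header lines, default names V1..V3.
d2 <- read.csv(text = input, skip = 2, header = FALSE)

# Hypothetical helper (an assumption for illustration, not an existing API):
# skip every line before the first one containing `string`, then read from
# the matched line onward.
read_from_match <- function(text, string, ...) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  hit <- grep(string, lines, fixed = TRUE)
  if (length(hit) == 0L) stop("no line contains: ", string)
  read.csv(text = text, skip = hit[1L] - 1L, ...)
}

# "A,B" first appears on line 2, so this skips one line, like skip = 1.
d3 <- read_from_match(input, "A,B")
```

The helper only turns the match position into a numeric skip, so everything else (separator, header handling) is left to the reader function as usual.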

On 12.05.2013 14:24, Gabor Grothendieck wrote:
> Sorry, I did indeed miss the portion of the reply at the very bottom.
> Yes, that seems good.
>
> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>>
>> Hi,
>>
>> I suspect you may not have scrolled further down in my reply, where I
>> wrote more?
>>
>> Matthew
>>
>>
>>
>> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>>
>>> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
>>> from R-Forge, and the sample csv I was using does indeed work, with fread
>>> attempting to do the best it can with the mucked up header.  Maybe
>>> this is sufficient and a skip is not needed, but the fact is that there
>>> is no facility to skip over the bad header had I wanted to.
>>>
>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
>>> <mdowle at mdowle.plus.com> wrote:
>>>>
>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>>
>>>>>
>>>>> Not with the csv I tried.  The header is messed up (most of the header
>>>>> fields are missing) and it misconstrues it as data.
>>>>
>>>>
>>>>
>>>> That was fixed a while ago in v1.8.9, from NEWS :
>>>>
>>>> "  [fread] If some column names are blank they are now given default names
>>>>    rather than causing the header row to be read as a data row "
>>>>
>>>>
>>>>> The automation is great but some way to force its behavior when you
>>>>> know what it should do seems essential since heuristics can't be
>>>>> expected to work in all cases.
>>>>
>>>>
>>>>
>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but
>>>> ok, point taken.
>>>>
>>>> fread allows control of 'autostart' already. This is a line number
>>>> (default 30) within the regular data block. It is used to detect the
>>>> separator, and fread searches upwards from it to find the first data row
>>>> and/or column names.
>>>>
>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning
>>>> off the search upwards part. Line skip+1 will be used to detect the
>>>> separator when sep="auto" and used as column names according to
>>>> header="auto"|TRUE|FALSE as usual.  It'll be an error to specify both
>>>> autostart and skip in the same call.  If that sounds ok?
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>>>
>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Does the auto skip feature of fread cover both of those?  From ?fread :
>>>>>>
>>>>>>   " Once the separator is found on line autostart, the number of columns
>>>>>> is determined. Then the file is searched backwards from autostart until a
>>>>>> row is found that doesn't have that number of columns, or the start of file
>>>>>> is reached. Thus, the first data row is found and any human readable
>>>>>> banners are automatically skipped. This feature can be particularly useful
>>>>>> for loading a set of files which may not all have consistently sized
>>>>>> banners. "
>>>>>>
>>>>>> There were also some issues with header=FALSE in the first release
>>>>>> (1.8.8) which have since been fixed in 1.8.9.
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I would find it useful if fread had a skip= argument as in read.table
>>>>>>> since I have files from time to time that have garbage at the top.
>>>>>>> Another situation I find from time to time is that the header is
>>>>>>> messed up but one can still read the file if one can skip over the
>>>>>>> header and specify header = FALSE.
>>>>>>>
>>>>>>> An extra feature that would be nice but less important would be if one
>>>>>>> could specify skip = "string" and have it skip all lines until it
>>>>>>> found one with "string" in it and then start reading from the matched
>>>>>>> row onward.   Normally the string would be chosen to be a string found
>>>>>>> in the header and not likely found prior to the header. read.xls in
>>>>>>> gdata has a similar feature and I find it quite handy at times.
>>>>>>>
>>>>>>> --
>>>>>>> Statistics & Software Consulting
>>>>>>> GKX Group, GKX Associates Inc.
>>>>>>> tel: 1-877-GKX-GROUP
>>>>>>> email: ggrothendieck at gmail.com
>>>>>>> _______________________________________________
>>>>>>> datatable-help mailing list
>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help