[datatable-help] fread: skip

Mon May 13 00:01:32 CEST 2013

And skip="string" is also now added and gdata credited (nice idea!)

> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
> cat(input)
some,bad,data

some,cols
1,2
3,4


real data:
A,B,C
1,3,5
2,4,6
> fread(input, skip="B,C")
    A B C
1: 1 3 5
2: 2 4 6
> fread(input)   # autostart handles this case already (since the "real data:" line doesn't contain 2 * sep)
    A B C
1: 1 3 5
2: 2 4 6
> fread(input, skip="some,cols")  # using skip="string" to get the middle table
    some cols
1:    1    2
2:    3    4
Warning message:
In fread(input, skip = "some,cols") :
   Stopped reading at empty line, 2 lines after the 'skip' string was found, but text exists afterwards (discarded): real data:

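For anyone who wants to see the idea behind skip="string" without fread, here is a rough base R sketch. It is not how fread does it internally; the helper name read_from_match is made up for illustration. It finds the first line containing the string, keeps lines from there up to the first empty line, and passes them to read.table.

```r
# Hypothetical helper (base R only): illustrates the skip="string" idea,
# not fread's actual implementation.
read_from_match <- function(text, string, sep = ",") {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  hit <- grep(string, lines, fixed = TRUE)
  if (length(hit) == 0L) stop("'", string, "' not found in input")
  # Keep everything from the first matching line onward.
  rest <- lines[hit[1L]:length(lines)]
  # Stop at the first empty line after the match, if there is one.
  blank <- which(rest == "")
  if (length(blank) > 0L) rest <- rest[seq_len(blank[1L] - 1L)]
  read.table(text = paste(rest, collapse = "\n"), sep = sep, header = TRUE)
}

input <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
read_from_match(input, "B,C")        # the A,B,C table at the bottom
read_from_match(input, "some,cols")  # the middle table
```

Unlike fread, this sketch silently drops whatever follows the empty line rather than warning about it.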

A further example:

> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 4\n"
> cat(input)
some,bad,data

some,cols
1,2
3,4

real data:
A B
1 3
2 4
> fread(input)    # with space as separator autostart can't distinguish the "real data:" line.  header wouldn't help here.
    real data:
1:    A     B
2:    1     3
3:    2     4
> fread(input, skip="B")   # skip="string" needed (skip=n onerous). Nice!
    A B
1: 1 3
2: 2 4
>
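To make it clearer why the space-separated case trips up autostart, here is a simplified base R sketch of the search-upwards idea. It is only an approximation of the behaviour described in ?fread, not the real code: the function name first_table_line is hypothetical, and fields are counted with a plain strsplit that ignores quoting.

```r
# Hypothetical sketch (base R only) of the "search upwards from autostart" idea.
# Returns the line number where the regular block of data appears to start.
first_table_line <- function(text, sep, autostart = 30L) {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  nfield <- function(x) length(strsplit(x, sep, fixed = TRUE)[[1]])
  # Start at line autostart, or at the last line if the input is shorter.
  i <- min(autostart, length(lines))
  n <- nfield(lines[i])
  # Walk upwards while the line above has the same number of fields.
  while (i > 1L && nfield(lines[i - 1L]) == n) i <- i - 1L
  i
}

comma <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
space <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 4\n"
first_table_line(comma, ",")  # line 9, the "A,B,C" line
first_table_line(space, " ")  # line 7, the "real data:" line
```

With sep="," the "real data:" line has one field instead of three, so the search stops below it. With sep=" " it splits into two fields, the same as "A B", so it gets taken as the column names.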

Matthew


On 12.05.2013 18:33, Matthew Dowle wrote:
> Since I'm in the fread code at the moment I added 'skip' (rev 864).
> 4 tests added :
>
>> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"
>> fread(input)
>    some bad data
> 1:    A   B    C
> 2:    1   3    5
> 3:    2   4    6
>> fread(input, skip=1)
>    A B C
> 1: 1 3 5
> 2: 2 4 6
>> fread(input, skip=2)
>    V1 V2 V3
> 1:  1  3  5
> 2:  2  4  6
>> fread(input, skip=2, header=TRUE)
>    1 3 5
> 1: 2 4 6
>>
>
>
> On 12.05.2013 14:24, Gabor Grothendieck wrote:
>> Sorry, I did indeed miss the portion of the reply at the very bottom.
>> Yes, that seems good.
>>
>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle
>> <mdowle at mdowle.plus.com> wrote:
>>>
>>> Hi,
>>>
>>> I suspect you may not have scrolled further down in my reply where I wrote
>>> more?
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>>>
>>>> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
>>>> from R-Forge, and the sample csv I was using does indeed work,
>>>> attempting to do the best it can with the mucked up header.  Maybe
>>>> this is sufficient and a skip is not needed, but the fact is that there
>>>> is no facility to skip over the bad header had I wanted to.
>>>>
>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>
>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>>>
>>>>>>
>>>>>> Not with the csv I tried.  The header is messed up (most of the header
>>>>>> fields are missing) and it misconstrues it as data.
>>>>>
>>>>>
>>>>>
>>>>> That was fixed a while ago in v1.8.9, from NEWS :
>>>>>
>>>>> "  [fread] If some column names are blank they are now given default names
>>>>>    rather than causing the header row to be read as a data row "
>>>>>
>>>>>
>>>>>> The automation is great but some way to force its behavior when you
>>>>>> know what it should do seems essential since heuristics can't be
>>>>>> expected to work in all cases.
>>>>>
>>>>>
>>>>>
>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but
>>>>> ok, point taken.
>>>>>
>>>>> fread allows control of 'autostart' already. This is a line number (default
>>>>> 30) within the regular data block, used to detect the separator and as the
>>>>> point to search upwards from to find the first data row and/or column names.
>>>>>
>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning off
>>>>> the search upwards part. Line skip+1 will be used to detect the separator
>>>>> when sep="auto" and used as column names according to
>>>>> header="auto"|TRUE|FALSE as usual.  It'll be an error to specify both
>>>>> autostart and skip in the same call.  If that sounds ok?
>>>>>
>>>>> Matthew
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Does the auto skip feature of fread cover both of those?  From ?fread :
>>>>>>>
>>>>>>>   " Once the separator is found on line autostart, the number of columns is
>>>>>>> determined. Then the file is searched backwards from autostart until a row
>>>>>>> is found that doesn't have that number of columns, or the start of file is
>>>>>>> reached. Thus, the first data row is found and any human readable banners
>>>>>>> are automatically skipped. This feature can be particularly useful for
>>>>>>> loading a set of files which may not all have consistently sized banners.
>>>>>>> "
>>>>>>>
>>>>>>> There were also some issues with header=FALSE in the first release (1.8.8)
>>>>>>> which have since been fixed in 1.8.9.
>>>>>>>
>>>>>>> Matthew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I would find it useful if fread had a skip= argument as in read.table
>>>>>>>> since I have files from time to time that have garbage at the top.
>>>>>>>> Another situation I find from time to time is that the header is
>>>>>>>> messed up but one can still read the file if one can skip over the
>>>>>>>> header and specify header = FALSE.
>>>>>>>>
>>>>>>>> An extra feature that would be nice but less important would be if one
>>>>>>>> could specify skip = "string" and have it skip all lines until it
>>>>>>>> found one with "string" in it and then start reading from the matched
>>>>>>>> row onward.   Normally the string would be chosen to be a string found
>>>>>>>> in the header and not likely found prior to the header. read.xls in
>>>>>>>> gdata has a similar feature and I find it quite handy at times.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Statistics & Software Consulting
>>>>>>>> GKX Group, GKX Associates Inc.
>>>>>>>> tel: 1-877-GKX-GROUP
>>>>>>>> email: ggrothendieck at gmail.com
>>>>>>>> _______________________________________________
>>>>>>>> datatable-help mailing list
>>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help