[datatable-help] fread: skip

Gabor Grothendieck ggrothendieck at gmail.com
Mon May 13 00:19:04 CEST 2013


Looks really nice.

On Sun, May 12, 2013 at 6:01 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> And skip="string" is also now added and gdata credited (nice idea!)
>
>> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
>> cat(input)
>
> some,bad,data
>
> some,cols
> 1,2
> 3,4
>
>
> real data:
> A,B,C
> 1,3,5
> 2,4,6
>>
>> fread(input, skip="B,C")
>
>    A B C
> 1: 1 3 5
> 2: 2 4 6
>>
>> fread(input)   # autostart handles this case already (since the "real data:" line doesn't contain 2 * sep)
>
>    A B C
> 1: 1 3 5
> 2: 2 4 6
>>
>> fread(input, skip="some,cols")  # using skip="string" to get the middle table
>
>    some cols
> 1:    1    2
> 2:    3    4
> Warning message:
> In fread(input, skip = "some,cols") :
>   Stopped reading at empty line, 2 lines after the 'skip' string was found,
> but text exists afterwards (discarded): real data:
>
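
For anyone wanting the same behaviour without fread, here is a minimal base R sketch of the skip="string" idea shown above. The helper name read_from_match is hypothetical and this is not how fread implements it; it assumes the table starts at the first line containing the string and ends at the next empty line, and it silently drops any later text rather than warning.

```r
# Hypothetical helper, not part of data.table: emulate skip="string" in base R.
read_from_match <- function(text, pattern, sep = ",") {
  lines <- strsplit(text, "\n", fixed = TRUE)[[1]]
  # First line that contains the literal string (no regex matching).
  start <- grep(pattern, lines, fixed = TRUE)[1]
  if (is.na(start)) stop("skip string not found: ", pattern)
  rest <- lines[start:length(lines)]
  # Assume the table ends just before the first empty line, if there is one.
  blank <- which(rest == "")
  end <- if (length(blank)) blank[1] - 1 else length(rest)
  read.table(text = rest[seq_len(end)], sep = sep, header = TRUE)
}

input <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
middle <- read_from_match(input, "some,cols")  # the middle table
last <- read_from_match(input, "B,C")          # the final table
```

The fixed = TRUE match keeps characters such as "." or "(" in the skip string from being read as regular expression syntax.
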
>
> Further example:
>
>> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 4\n"
>> cat(input)
>
> some,bad,data
>
> some,cols
> 1,2
> 3,4
>
> real data:
> A B
> 1 3
> 2 4
>>
>> fread(input)    # with space as separator autostart can't distinguish the "real data:" line.  header wouldn't help here.
>
>    real data:
> 1:    A     B
> 2:    1     3
> 3:    2     4
>>
>> fread(input, skip="B")   # skip="string" needed (skip=n onerous). Nice!
>
>    A B
> 1: 1 3
> 2: 2 4
>>
>>
>
> Matthew
>
>
>
> On 12.05.2013 18:33, Matthew Dowle wrote:
>>
>> Since I'm in the fread code at the moment I added 'skip' (rev 864).
>> 4 tests added :
>>
>>> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"
>>> fread(input)
>>
>>    some bad data
>> 1:    A   B    C
>> 2:    1   3    5
>> 3:    2   4    6
>>>
>>> fread(input, skip=1)
>>
>>    A B C
>> 1: 1 3 5
>> 2: 2 4 6
>>>
>>> fread(input, skip=2)
>>
>>    V1 V2 V3
>> 1:  1  3  5
>> 2:  2  4  6
>>>
>>> fread(input, skip=2, header=TRUE)
>>
>>    1 3 5
>> 1: 2 4 6
>>>
>>>
>>
>>
>> On 12.05.2013 14:24, Gabor Grothendieck wrote:
>>>
>>> Sorry, I did indeed miss the portion of the reply at the very bottom.
>>> Yes, that seems good.
>>>
>>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle
>>> <mdowle at mdowle.plus.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I suspect you may not have scrolled further down in my reply where I
>>>> wrote more?
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>>>>
>>>>>
>>>>> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
>>>>> from R-Forge, and the sample csv I was using does indeed work: fread
>>>>> does the best it can with the mucked-up header.   Maybe this is
>>>>> sufficient and a skip is not needed, but the fact is that there is no
>>>>> facility to skip over the bad header had I wanted to.
>>>>>
>>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Not with the csv I tried.  The header is messed up (most of the
>>>>>>> header fields are missing) and it misconstrues it as data.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> That was fixed a while ago in v1.8.9, from NEWS :
>>>>>>
>>>>>> "  [fread] If some column names are blank they are now given default
>>>>>>    names rather than causing the header row to be read as a data row "
>>>>>>
>>>>>>
>>>>>>> The automation is great but some way to force its behavior when you
>>>>>>> know what it should do seems essential since heuristics can't be
>>>>>>> expected to work in all cases.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far,
>>>>>> but ok, point taken.
>>>>>>
>>>>>> fread allows control of 'autostart' already. This is a line number
>>>>>> (default 30) within the regular data block; fread detects the separator
>>>>>> on that line and searches upwards from it to find the first data row
>>>>>> and/or column names.
>>>>>>
>>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but
>>>>>> turning off the search upwards part. Line skip+1 will be used to detect
>>>>>> the separator when sep="auto" and used as column names according to
>>>>>> header="auto"|TRUE|FALSE as usual.  It'll be an error to specify both
>>>>>> autostart and skip in the same call.  If that sounds ok?
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Does the auto skip feature of fread cover both of those?  From
>>>>>>>> ?fread :
>>>>>>>>
>>>>>>>>   " Once the separator is found on line autostart, the number of
>>>>>>>> columns is determined. Then the file is searched backwards from
>>>>>>>> autostart until a row is found that doesn't have that number of
>>>>>>>> columns, or the start of file is reached. Thus, the first data row is
>>>>>>>> found and any human readable banners are automatically skipped. This
>>>>>>>> feature can be particularly useful for loading a set of files which
>>>>>>>> may not all have consistently sized banners. "
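
The backwards search quoted above from ?fread can be pictured with a rough base R sketch. The function find_first_row is hypothetical and is not fread's actual code; it assumes a known single-character separator, counts fields naively (no quoted fields, and a trailing empty field is not counted), and falls back to the last line when the input has fewer than autostart lines.

```r
# Hypothetical illustration of the upward search; not fread's implementation.
find_first_row <- function(lines, sep = ",", autostart = 30) {
  autostart <- min(autostart, length(lines))
  # Naive field count per line: split on the separator and count the pieces.
  ncols <- lengths(strsplit(lines, sep, fixed = TRUE))
  first <- autostart
  # Walk upwards while the previous line has the same number of fields.
  while (first > 1 && ncols[first - 1] == ncols[autostart]) {
    first <- first - 1
  }
  first
}

lines <- c("some,bad,data", "", "real data:", "A,B,C", "1,3,5", "2,4,6")
first_row <- find_first_row(lines)  # line number of the header row "A,B,C"
```

Here the banner line "real data:" has a different field count from the data rows, so the search stops just below it. With a space separator, as in the later example in this thread, the banner can have the same field count as the data and this kind of search cannot tell them apart.
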
>>>>>>>>
>>>>>>>> There were also some issues with header=FALSE in the first release
>>>>>>>> (1.8.8) which have since been fixed in 1.8.9.
>>>>>>>>
>>>>>>>> Matthew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would find it useful if fread had a skip= argument as in
>>>>>>>>> read.table since I have files from time to time that have garbage at
>>>>>>>>> the top. Another situation I find from time to time is that the
>>>>>>>>> header is messed up but one can still read the file if one can skip
>>>>>>>>> over the header and specify header = FALSE.
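
For reference, this is the read.table behaviour being asked for, shown on a small made-up input (the same one used in the fread tests earlier in this thread). It uses only base R and is meant as an illustration of the two situations described, not as a statement about fread.

```r
input <- "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"

# Garbage at the top: skip one line and treat the next line as the header.
with_header <- read.table(text = input, sep = ",", skip = 1, header = TRUE)

# Messed-up header: skip past it too and let R assign default names V1, V2, V3.
no_header <- read.table(text = input, sep = ",", skip = 2, header = FALSE)
```
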
>>>>>>>>>
>>>>>>>>> An extra feature that would be nice but less important would be if
>>>>>>>>> one could specify skip = "string" and have it skip all lines until it
>>>>>>>>> found one with "string" in it and then start reading from the matched
>>>>>>>>> row onward.   Normally the string would be chosen to be a string
>>>>>>>>> found in the header and not likely found prior to the header.
>>>>>>>>> read.xls in gdata has a similar feature and I find it quite handy at
>>>>>>>>> times.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Statistics & Software Consulting
>>>>>>>>> GKX Group, GKX Associates Inc.
>>>>>>>>> tel: 1-877-GKX-GROUP
>>>>>>>>> email: ggrothendieck at gmail.com
>>>>>>>>> _______________________________________________
>>>>>>>>> datatable-help mailing list
>>>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




