[datatable-help] fread: skip
Matthew Dowle
mdowle at mdowle.plus.com
Mon May 13 00:01:32 CEST 2013
And skip="string" is also now added and gdata credited (nice idea!)
> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
> cat(input)
some,bad,data

some,cols
1,2
3,4


real data:
A,B,C
1,3,5
2,4,6
> fread(input, skip="B,C")
   A B C
1: 1 3 5
2: 2 4 6
> fread(input) # autostart handles this case already (since the "real data:" line doesn't contain 2 * sep)
   A B C
1: 1 3 5
2: 2 4 6
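
For readers unfamiliar with autostart, here is a rough base-R sketch of the upward search that ?fread describes (quoted further down in this thread). It is not fread's actual code: find_first_row is a hypothetical helper, the separator is passed in rather than detected, and quoted fields are ignored.

```r
# Hypothetical helper, not data.table's implementation. Assumes a known
# separator and ignores quoted fields.
find_first_row <- function(lines, sep = ",", autostart = 30L) {
  # Count separators on every line; a line with no match counts as 0.
  n_sep <- lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
  # Start at 'autostart', or at the last line for shorter inputs.
  start <- min(autostart, length(lines))
  first <- start
  # Walk upwards while the previous line has the same number of separators.
  while (first > 1L && n_sep[first - 1L] == n_sep[start]) {
    first <- first - 1L
  }
  first
}

input <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
lines <- strsplit(input, "\n", fixed = TRUE)[[1]]
first <- find_first_row(lines)
tbl <- read.csv(text = paste(lines[first:length(lines)], collapse = "\n"))
print(tbl)
```

The walk stops at the "real data:" line because it contains no comma, which is the reason given above for autostart coping with this input.
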
> fread(input, skip="some,cols") # using skip="string" to get the middle table
   some cols
1:    1    2
2:    3    4
Warning message:
In fread(input, skip = "some,cols") :
  Stopped reading at empty line, 2 lines after the 'skip' string was found, but text exists afterwards (discarded): real data:
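
To make the middle-table behaviour concrete, here is a minimal base-R sketch of the same idea: start at the first line containing the string, stop at the next empty line, and warn if text after it is dropped. read_block is a hypothetical helper written for illustration only; it does not reproduce fread's code or its exact warning text.

```r
# Hypothetical helper in base R, not fread's implementation.
read_block <- function(input, skip_string, sep = ",") {
  lines <- strsplit(input, "\n", fixed = TRUE)[[1]]
  # The first line containing the literal string becomes the header row.
  hit <- grep(skip_string, lines, fixed = TRUE)
  if (length(hit) == 0L) stop("skip string not found: ", skip_string)
  start <- hit[1L]
  # The block ends just before the first empty line after 'start', if any.
  empty <- which(lines == "")
  empty <- empty[empty > start]
  end <- if (length(empty) > 0L) empty[1L] - 1L else length(lines)
  # Warn when non-empty text after the block is being dropped.
  if (end < length(lines) && any(nzchar(lines[(end + 1L):length(lines)]))) {
    warning("stopped at empty line; text after it was discarded")
  }
  read.table(text = paste(lines[start:end], collapse = "\n"),
             sep = sep, header = TRUE)
}

input <- "some,bad,data\n\nsome,cols\n1,2\n3,4\n\n\nreal data:\nA,B,C\n1,3,5\n2,4,6\n"
mid_tbl <- read_block(input, "some,cols")  # emits the warning
print(mid_tbl)
```

Calling read_block(input, "B,C") on the same input reads the last table and raises no warning, since no empty line follows it.
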
Further example:
> input = "some,bad,data\n\nsome,cols\n1,2\n3,4\n\nreal data:\nA B\n1 3\n2 4\n"
> cat(input)
some,bad,data

some,cols
1,2
3,4

real data:
A B
1 3
2 4
> fread(input) # with space as separator autostart can't distinguish the "real data:" line. header wouldn't help here.
   real data:
1:    A     B
2:    1     3
3:    2     4
> fread(input, skip="B") # skip="string" needed (skip=n onerous). Nice!
   A B
1: 1 3
2: 2 4
>
Matthew
On 12.05.2013 18:33, Matthew Dowle wrote:
> Since I'm in the fread code at the moment I added 'skip' (rev 864).
> 4 tests added :
>
>> input = "some,bad,data\nA,B,C\n1,3,5\n2,4,6\n"
>> fread(input)
>    some bad data
> 1:    A   B    C
> 2:    1   3    5
> 3:    2   4    6
>> fread(input, skip=1)
>    A B C
> 1: 1 3 5
> 2: 2 4 6
>> fread(input, skip=2)
>    V1 V2 V3
> 1:  1  3  5
> 2:  2  4  6
>> fread(input, skip=2, header=TRUE)
>    1 3 5
> 1: 2 4 6
>>
>
>
> On 12.05.2013 14:24, Gabor Grothendieck wrote:
>> Sorry, I did indeed miss the portion of the reply at the very bottom.
>> Yes, that seems good.
>>
>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle
>> <mdowle at mdowle.plus.com> wrote:
>>>
>>> Hi,
>>>
>>> I suspect you may not have scrolled further down in my reply where I
>>> wrote more?
>>>
>>> Matthew
>>>
>>>
>>>
>>> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>>>
>>>> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
>>>> from R-Forge, and the sample csv I was using does indeed work, doing the
>>>> best it can with the mucked up header. Maybe this is sufficient and a
>>>> skip is not needed, but the fact is that there is no facility to skip
>>>> over the bad header had I wanted to.
>>>>
>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>
>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>>>
>>>>>>
>>>>>> Not with the csv I tried. The header is messed up (most of the header
>>>>>> fields are missing) and it misconstrues it as data.
>>>>>
>>>>>
>>>>>
>>>>> That was fixed a while ago in v1.8.9, from NEWS :
>>>>>
>>>>> " [fread] If some column names are blank they are now given default
>>>>> names rather than causing the header row to be read as a data row "
>>>>>
>>>>>
>>>>>> The automation is great but some way to force its behavior when you
>>>>>> know what it should do seems essential since heuristics can't be
>>>>>> expected to work in all cases.
>>>>>
>>>>>
>>>>>
>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far,
>>>>> but ok, point taken.
>>>>>
>>>>> fread allows control of 'autostart' already. This is a line number
>>>>> (default 30) within the regular data block. It is used to detect the
>>>>> separator, and the search for the first data row and/or column names
>>>>> proceeds upwards from it.
>>>>>
>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but
>>>>> turning off the search upwards part. Line skip+1 will be used to detect
>>>>> the separator when sep="auto" and used as column names according to
>>>>> header="auto"|TRUE|FALSE as usual. It'll be an error to specify both
>>>>> autostart and skip in the same call. If that sounds ok?
>>>>>
>>>>> Matthew
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Does the auto skip feature of fread cover both of those? From
>>>>>>> ?fread :
>>>>>>>
>>>>>>> " Once the separator is found on line autostart, the number of columns
>>>>>>> is determined. Then the file is searched backwards from autostart until
>>>>>>> a row is found that doesn't have that number of columns, or the start of
>>>>>>> file is reached. Thus, the first data row is found and any human
>>>>>>> readable banners are automatically skipped. This feature can be
>>>>>>> particularly useful for loading a set of files which may not all have
>>>>>>> consistently sized banners. "
>>>>>>>
>>>>>>> There were also some issues with header=FALSE in the first release
>>>>>>> (1.8.8) which have since been fixed in 1.8.9.
>>>>>>>
>>>>>>> Matthew
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I would find it useful if fread had a skip= argument as in read.table
>>>>>>>> since I have files from time to time that have garbage at the top.
>>>>>>>> Another situation I find from time to time is that the header is
>>>>>>>> messed up but one can still read the file if one can skip over the
>>>>>>>> header and specify header = FALSE.
>>>>>>>>
>>>>>>>> An extra feature that would be nice but less important would be if one
>>>>>>>> could specify skip = "string" and have it skip all lines until it
>>>>>>>> found one with "string" in it and then start reading from the matched
>>>>>>>> row onward. Normally the string would be chosen to be a string found
>>>>>>>> in the header and not likely found prior to the header. read.xls in
>>>>>>>> gdata has a similar feature and I find it quite handy at times.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Statistics & Software Consulting
>>>>>>>> GKX Group, GKX Associates Inc.
>>>>>>>> tel: 1-877-GKX-GROUP
>>>>>>>> email: ggrothendieck at gmail.com
>>>>>>>> _______________________________________________
>>>>>>>> datatable-help mailing list
>>>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>>>>
>>>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help