[datatable-help] fread: skip

Gabor Grothendieck ggrothendieck at gmail.com
Sun May 12 15:24:47 CEST 2013


Sorry, I did indeed miss the portion of the reply at the very bottom.
Yes, that seems good.
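
To make concrete the two cases I originally had in mind, here is a minimal
sketch using base R's read.table rather than fread. The sample data are made
up, and skip_to() is a hypothetical helper of my own, not a function in any
package; the string case is only an illustration of the behaviour I was
describing, not a proposal for how fread should implement it.

```r
# Made-up file contents: a banner, a junk line, then a damaged header.
bad <- c("report generated 2013-05-11", "garbage",
         "id,,", "1,2.5,a", "2,3.5,b")

# Case 1: skip a known number of lines (banner plus bad header) and
# read the remainder without a header.
d1 <- read.table(text = bad, sep = ",", skip = 3, header = FALSE,
                 stringsAsFactors = FALSE)

# Made-up file contents with a usable header below the banner.
good <- c("report generated 2013-05-11", "garbage",
          "id,value,label", "1,2.5,a", "2,3.5,b")

# Case 2: skip every line before the first one containing a given string.
# skip_to() is hypothetical; it returns the number of lines to skip so
# that reading starts at the matched line.
skip_to <- function(lines, string) {
  hit <- grep(string, lines, fixed = TRUE)
  if (length(hit) == 0L) stop("string not found: ", string)
  hit[1L] - 1L
}
d2 <- read.table(text = good, sep = ",", skip = skip_to(good, "id,value"),
                 header = TRUE, stringsAsFactors = FALSE)
```

In the second case the string is taken from the header, so the matched line
is read as the column names.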

What about colClasses too?  I would think that there would be cases
where an automatic approach might not give the result wanted.  For
example, order numbers might all be numeric but you would want to
store them as character in case there are leading zeros.  In other
cases similar fields might validly have leading zeros but you would
want them regarded as numeric.  There is no way to distinguish the
two cases except by having the user indicate their intention.
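
A small sketch of the leading-zero problem, using base R's read.csv since
its colClasses argument already works this way. The column names and values
are invented for illustration.

```r
# Made-up data: both 'order' and 'zip' are zero-padded.
csv <- "order,zip,qty\n00123,02139,5\n00456,10001,7"

# Automatic type detection: the zero-padded columns are read as numbers,
# so the leading zeros are lost.
auto <- read.csv(text = csv)

# Stated intent: keep 'order' as character; the other columns are still
# detected automatically.
forced <- read.csv(text = csv, colClasses = c(order = "character"))
```

Here the user wants order kept as text but is happy for zip to be numeric,
which no heuristic could infer from the data alone.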

Also, there exist cases where
- fields are unquoted,
- fields are quoted and a doubled quote is used to indicate an
actual quote, and
- fields are quoted but a backslash-escaped quote is used to denote an
actual quote.
Ideally all these situations could be handled through some combination
of automatic and specified arguments.  R's read.table cannot handle the
backslash-escaped quote case but handles the others mentioned.
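
The three quoting styles can be sketched with base R's read.csv on made-up
input. The gsub() step for the third style is only a workaround I am
assuming for illustration: it rewrites each backslash-quote pair as a
doubled quote before parsing, and it would be wrong for data in which a
literal backslash can precede a quote.

```r
# Made-up inputs, one per quoting style.
unquoted  <- 'name,n\nsay hi,1'
doubled   <- 'name,n\n"say ""hi""",1'
backslash <- 'name,n\n"say \\"hi\\"",1'   # file text contains \" pairs

# The first two styles are read directly.
r1 <- read.csv(text = unquoted, stringsAsFactors = FALSE)
r2 <- read.csv(text = doubled,  stringsAsFactors = FALSE)

# Workaround for the third style: turn \" into "" and then parse as usual.
# fixed = TRUE makes the pattern a literal backslash followed by a quote.
converted <- gsub('\\"', '""', backslash, fixed = TRUE)
r3 <- read.csv(text = converted, stringsAsFactors = FALSE)
```

After the conversion the second and third inputs produce the same field
value, with the embedded quotes preserved.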


On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
>
> Hi,
>
> I suspect you may not have scrolled further down in my reply where I wrote
> more?
>
> Matthew
>
>
>
> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>
>> 1.8.8 is the most recent version on CRAN, so I have now installed 1.8.9
>> from R-Forge, and the sample csv I was using does indeed work, with fread
>> doing the best it can with the mucked-up header.  Maybe this is
>> sufficient and a skip is not needed, but the fact is that there is no
>> facility to skip over the bad header had I wanted to.
>>
>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
>> <mdowle at mdowle.plus.com> wrote:
>>>
>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>
>>>>
>>>> Not with the csv I tried.  The header is messed up (most of the header
>>>> fields are missing) and it misconstrues it as data.
>>>
>>>
>>>
>>> That was fixed a while ago in v1.8.9, from NEWS :
>>>
>>> "  [fread] If some column names are blank they are now given default
>>>    names rather than causing the header row to be read as a data row "
>>>
>>>
>>>> The automation is great but some way to force its behavior when you
>>>> know what it should do seems essential since heuristics can't be
>>>> expected to work in all cases.
>>>
>>>
>>>
>>> I suspect the heuristics in v1.8.9 work on all your examples so far, but
>>> ok, point taken.
>>>
>>> fread allows control of 'autostart' already. This is a line number
>>> (default 30) within the regular data block. It is used to detect the
>>> separator, and fread searches upwards from it to find the first data
>>> row and/or column names.
>>>
>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but turning
>>> off the search upwards part. Line skip+1 will be used to detect the
>>> separator when sep="auto" and used as column names according to
>>> header="auto"|TRUE|FALSE as usual.  It'll be an error to specify both
>>> autostart and skip in the same call.  If that sounds ok?
>>>
>>> Matthew
>>>
>>>
>>>
>>>>
>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Does the auto skip feature of fread cover both of those?  From ?fread :
>>>>>
>>>>>   " Once the separator is found on line autostart, the number of
>>>>> columns is determined. Then the file is searched backwards from
>>>>> autostart until a row is found that doesn't have that number of
>>>>> columns, or the start of file is reached. Thus, the first data row is
>>>>> found and any human readable banners are automatically skipped. This
>>>>> feature can be particularly useful for loading a set of files which
>>>>> may not all have consistently sized banners. "
>>>>>
>>>>> There were also some issues with header=FALSE in the first release
>>>>> (1.8.8) which have since been fixed in 1.8.9.
>>>>>
>>>>> Matthew
>>>>>
>>>>>
>>>>>
>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> I would find it useful if fread had a skip= argument as in read.table
>>>>>> since I have files from time to time that have garbage at the top.
>>>>>> Another situation I find from time to time is that the header is
>>>>>> messed up but one can still read the file if one can skip over the
>>>>>> header and specify header = FALSE.
>>>>>>
>>>>>> An extra feature that would be nice but less important would be if one
>>>>>> could specify skip = "string" and have it skip all lines until it
>>>>>> found one with "string" in it and then start reading from the matched
>>>>>> row onward.  Normally the string would be chosen to be a string found
>>>>>> in the header and not likely found prior to the header. read.xls in
>>>>>> gdata has a similar feature and I find it quite handy at times.
>>>>>>
>>>>>> --
>>>>>> Statistics & Software Consulting
>>>>>> GKX Group, GKX Associates Inc.
>>>>>> tel: 1-877-GKX-GROUP
>>>>>> email: ggrothendieck at gmail.com
>>>>>> _______________________________________________
>>>>>> datatable-help mailing list
>>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>



-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

