[datatable-help] fread: skip

Matthew Dowle mdowle at mdowle.plus.com
Sun May 12 17:20:44 CEST 2013


For that I think all that needs to be done (now) is adding something 
very similar to these few lines (from read.table) into fread at R level 
after the data has been read in :

        if (colClasses[i] == "factor")
            as.factor(data[[i]])
        else if (colClasses[i] == "Date")
            as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct")
            as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])

Although I don't quite see why read.table explicitly deals with factor, 
Date and POSIXct separately, rather than leaving them to the methods::as 
catch-all at the end.

But reading dates (for example) as character and then converting to 
Date at R level is going to be relatively slow due to the intermediate 
character vector and adding all the unique strings to R's global cache. 
Direct reading of dates (e.g. by using Simon U's fasttime package) could 
be built in at C level at a later date just for speed, without breaking 
syntax or output types. In the meantime it would work at least. That's 
the thinking, anyway.
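
To make the idea concrete, here is a minimal sketch in base R of that kind of post-read coercion. It is only an illustration under assumptions: coerce_columns is a hypothetical helper (not anything in fread or data.table), the input is a plain data.frame whose columns were all read as character, and tz = "UTC" for POSIXct is a choice made here rather than something read.table does.

```r
# Hypothetical helper (not part of fread): coerce columns that were read as
# character to the classes requested in a named colClasses vector.
coerce_columns <- function(data, colClasses) {
  for (nm in names(colClasses)) {
    cls <- colClasses[[nm]]
    data[[nm]] <- switch(cls,
      factor  = as.factor(data[[nm]]),
      Date    = as.Date(data[[nm]]),
      # tz = "UTC" is an assumption made for this sketch
      POSIXct = as.POSIXct(data[[nm]], tz = "UTC"),
      # catch-all for any other class name
      methods::as(data[[nm]], cls)
    )
  }
  data
}

# Everything starts out as character, as if read that way from the file.
raw <- data.frame(
  id  = c("01", "002"),
  day = c("2013-05-11", "2013-05-12"),
  grp = c("a", "b"),
  stringsAsFactors = FALSE
)

out <- coerce_columns(raw, c(day = "Date", grp = "factor", id = "integer"))
str(out)
```

The intermediate character vector is exactly the cost mentioned above; the sketch trades speed for simplicity.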

I found some discussion in R News 4.1 about Excel dates and times, but 
not on colClasses or that mapping specifically.  Currently in fread, if 
a colClasses name isn't recognised as a basic type like 
integer|numeric|double|integer64|character, then the column is read as 
character and (to be done), as long as there's an as() method for that 
class, that'll take care of the conversion.  Reading numbers (such as 
offset from epoch) and then calling as() on that numeric|integer column 
isn't something I'd considered before (is that what you mean?)
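
For reference, a minimal sketch of what converting Excel serial numbers at R level could look like. The function names are hypothetical, and two things are assumed: the workbook uses the Windows "1900" date system (origin 1899-12-30, as in the ?as.Date examples), and the times are treated as UTC since Excel stores no time zone.

```r
# Assumption: Windows Excel "1900" date system, where serial 0 corresponds to
# 1899-12-30. Workbooks using the 1904 date system need a different origin.
excel_to_date <- function(x) {
  as.Date(x, origin = "1899-12-30")
}

# Serial 25569 is 1970-01-01, so subtracting it gives days since the Unix
# epoch; multiplying by 86400 turns days into seconds.
excel_to_posixct <- function(x, tz = "UTC") {
  as.POSIXct((x - 25569) * 86400, origin = "1970-01-01", tz = tz)
}

serials <- c(35981, 41406.5)   # the fractional part is the time of day
excel_to_date(serials)
format(excel_to_posixct(serials), "%Y-%m-%d %H:%M:%S")
```

Here the column would be read as numeric first and converted afterwards, rather than going through character.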

Matthew


On 12.05.2013 15:44, Gabor Grothendieck wrote:
> That looks great.  It occurred to me in looking at this that one thing
> that might be useful would be to provide some conversion routines that
> can be specified as classes in the colClasses vector that will convert
> numbers from Excel representing Dates or date/times to Date and
> POSIXct class respectively.  (The mapping is discussed in R News 4/1.)
>
> On Sun, May 12, 2013 at 10:14 AM, Matthew Dowle
> <mdowle at mdowle.plus.com> wrote:
>>
>> Agreed too. colClasses was committed yesterday as luck would have it.
>>
>> ?fread now has :
>>
>>    colClasses : A character vector of classes (named or unnamed), as
>> read.csv. Or, type list enables setting ranges of columns by numeric
>> position. colClasses in fread is intended for rare overrides, not for
>> routine use. fread will only promote a column to a higher type if
>> colClasses requests it. It won't downgrade a column to a lower type since
>> NAs would result. You have to coerce such columns afterwards yourself, if
>> you really require data loss.
>>
>> The tests so far are as follows :
>>
>> input = 'A,B,C\n01,foo,3.140\n002,bar,6.28000\n'
>>
>> test(952, fread(input, colClasses=c(C="character")),
>>      data.table(A=1:2,B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(953, fread(input, colClasses=c(C="character",A="numeric")),
>>      data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(954, fread(input, colClasses=c(C="character",A="double")),
>>      data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(955, fread(input, colClasses=list(character="C",double="A")),
>>      data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(956, fread(input, colClasses=list(character=2:3,double="A")),
>>      data.table(A=c(1.0,2.0),B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(957, fread(input, colClasses=list(character=1:3)),
>>      data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(958, fread(input, colClasses="character"),
>>      data.table(A=c("01","002"),B=c("foo","bar"),C=c("3.140","6.28000")))
>> test(959, fread(input, colClasses=c("character","double","numeric")),
>>      data.table(A=c("01","002"),B=c("foo","bar"),C=c(3.14,6.28)))
>>
>> test(960, fread(input, colClasses=c("character","double")),
>>      error="colClasses is unnamed and length 2 but there are 3 columns. See")
>> test(961, fread(input, colClasses=1:3),
>>      error="colClasses is not type list or character vector")
>> test(962, fread(input, colClasses=list(1:3)),
>>      error="colClasses is type list but has no names")
>> test(963, fread(input, colClasses=list(character="D")),
>>      error="Column name 'D' in colClasses not found in data")
>> test(964, fread(input, colClasses=c(D="character")),
>>      error="Column name 'D' in colClasses not found in data")
>> test(965, fread(input, colClasses=list(character=0)),
>>      error="Column number 0 (colClasses..1...1.) is out of range .1,ncol=3.")
>> test(966, fread(input, colClasses=list(character=2:4)),
>>      error="Column number 4 (colClasses..1...3.) is out of range .1,ncol=3.")
>>
>> More detailed/trace info is provided when verbose=TRUE.
>>
>>
>> On embedded quotes there are known and documented problems still to
>> resolve. The issue there is subtle: when reading character columns, part
>> of fread's speed comes from pointing mkCharLen() directly to the field in
>> the memory-mapped region of RAM, i.e. the field isn't copied into any
>> intermediate buffer at all. But for embedded quotes (either doubled or
>> escaped) we do need to copy to a buffer so we can remove the doubled
>> quote, or escape character (i.e. change the field) before calling
>> mkCharLen().  That's not a problem per se, but just a new twist to the C
>> code to implement. In order to not slow down, it need only copy that
>> field to a buffer if a doubled or escaped quote was actually present in
>> that particular field.
>>
>> Matthew
>>
>>
>>
>> On 12.05.2013 14:24, Gabor Grothendieck wrote:
>>>
>>> Sorry, I did indeed miss the portion of the reply at the very bottom.
>>> Yes, that seems good.
>>>
>>> What about colClasses too?   I would think that there would be cases
>>> where an automatic approach might not give the result wanted.  For
>>> example, order numbers might all be numeric but you would want to
>>> store them as character in case there are leading zeros.  In other
>>> cases similar fields might validly have leading zeros but you would
>>> want them regarded as numeric so there is no way to distinguish the
>>> two cases except by having the user indicate their intention.
>>>
>>> Also, there exist cases where
>>> - fields are unquoted,
>>> - fields are quoted and doubling the quotes is used to indicate an
>>> actual quote and
>>> - fields are quoted but a backslash quote is used to denote an
>>> actual quote.
>>> Ideally all these situations could be handled through some combination
>>> of automatic and specified arguments.  In the case of R's read.table
>>> it cannot handle the backslashed quote case but handles the others
>>> mentioned.
>>>
>>>
>>> On Sun, May 12, 2013 at 9:01 AM, Matthew Dowle
>>> <mdowle at mdowle.plus.com> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I suspect you may not have scrolled further down in my reply where I
>>>> wrote more?
>>>>
>>>> Matthew
>>>>
>>>>
>>>>
>>>> On 12.05.2013 13:26, Gabor Grothendieck wrote:
>>>>>
>>>>>
>>>>> 1.8.8 is the most recent version on CRAN, so I have now installed
>>>>> 1.8.9 from R-Forge, and the sample csv I was using does indeed work,
>>>>> with fread doing the best it can with the mucked-up header.  Maybe
>>>>> this is sufficient and a skip is not needed, but the fact is that
>>>>> there is no facility to skip over the bad header had I wanted to.
>>>>>
>>>>> On Sun, May 12, 2013 at 6:29 AM, Matthew Dowle
>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>
>>>>>>
>>>>>> On 12.05.2013 00:47, Gabor Grothendieck wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Not with the csv I tried.  The header is messed up (most of the
>>>>>>> header fields are missing) and it misconstrues it as data.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> That was fixed a while ago in v1.8.9, from NEWS :
>>>>>>
>>>>>> "  [fread] If some column names are blank they are now given default
>>>>>>    names rather than causing the header row to be read as a data row "
>>>>>>
>>>>>>
>>>>>>> The automation is great but some way to force its behavior when you
>>>>>>> know what it should do seems essential since heuristics can't be
>>>>>>> expected to work in all cases.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> I suspect the heuristics in v1.8.9 work on all your examples so far,
>>>>>> but ok, point taken.
>>>>>>
>>>>>> fread allows control of 'autostart' already. This is a line number
>>>>>> (default 30) within the regular data block; it is used to detect the
>>>>>> separator, and fread searches upwards from it to find the first data
>>>>>> row and/or column names.
>>>>>>
>>>>>> Will add 'skip' then. It'll be like setting autostart=skip+1 but
>>>>>> turning off the search upwards part. Line skip+1 will be used to
>>>>>> detect the separator when sep="auto" and used as column names
>>>>>> according to header="auto"|TRUE|FALSE as usual.  It'll be an error to
>>>>>> specify both autostart and skip in the same call.  If that sounds ok?
>>>>>>
>>>>>> Matthew
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Sat, May 11, 2013 at 6:35 PM, Matthew Dowle
>>>>>>> <mdowle at mdowle.plus.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Does the auto skip feature of fread cover both of those?  From
>>>>>>>> ?fread :
>>>>>>>>
>>>>>>>>   " Once the separator is found on line autostart, the number of
>>>>>>>> columns is determined. Then the file is searched backwards from
>>>>>>>> autostart until a row is found that doesn't have that number of
>>>>>>>> columns, or the start of file is reached. Thus, the first data row
>>>>>>>> is found and any human readable banners are automatically skipped.
>>>>>>>> This feature can be particularly useful for loading a set of files
>>>>>>>> which may not all have consistently sized banners. "
>>>>>>>>
>>>>>>>> There were also some issues with header=FALSE in the first release
>>>>>>>> (1.8.8) which have since been fixed in 1.8.9.
>>>>>>>>
>>>>>>>> Matthew
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 11.05.2013 23:16, Gabor Grothendieck wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I would find it useful if fread had a skip= argument as in
>>>>>>>>> read.table since I have files from time to time that have garbage
>>>>>>>>> at the top. Another situation I find from time to time is that the
>>>>>>>>> header is messed up but one can still read the file if one can
>>>>>>>>> skip over the header and specify header = FALSE.
>>>>>>>>>
>>>>>>>>> An extra feature that would be nice but less important would be if
>>>>>>>>> one could specify skip = "string" and have it skip all lines until
>>>>>>>>> it found one with "string" in it and then start reading from the
>>>>>>>>> matched row onward.   Normally the string would be chosen to be a
>>>>>>>>> string found in the header and not likely found prior to the
>>>>>>>>> header. read.xls in gdata has a similar feature and I find it
>>>>>>>>> quite handy at times.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Statistics & Software Consulting
>>>>>>>>> GKX Group, GKX Associates Inc.
>>>>>>>>> tel: 1-877-GKX-GROUP
>>>>>>>>> email: ggrothendieck at gmail.com
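
As a footnote to the thread, the skip = "string" idea from the original request, combined with the leading-zeros point about colClasses, can be sketched in base R roughly as follows. This is only an illustration under assumptions: read_from_marker is a hypothetical helper, not fread's implementation, it takes lines already in memory, and it delegates the parsing to read.csv.

```r
# Hypothetical helper: find the first line containing `marker` (matched as a
# fixed string, not a regex) and start reading the CSV from that line.
read_from_marker <- function(lines, marker, ...) {
  hit <- grep(marker, lines, fixed = TRUE)
  if (length(hit) == 0L) {
    stop("marker '", marker, "' not found in input")
  }
  utils::read.csv(text = lines[hit[1L]:length(lines)], ...)
}

# Two banner lines of "garbage" above the real header.
lines <- c(
  "Report generated 2013-05-11",
  "-- internal use only --",
  "id,value",
  "01,3.14",
  "002,6.28"
)

# Keep id as character so the leading zeros survive; value is left to
# read.csv's automatic type detection.
dat <- read_from_marker(lines, "id,value", colClasses = c(id = "character"))
str(dat)
```

The marker is chosen from the header row, as suggested in the original request, so the matched line becomes the column names.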

