[datatable-help] Odd problem using fread to read in a csv file: no data, just headers

carrieromichele carrieromichele at gmail.com
Thu Mar 6 13:43:12 CET 2014


I quickly read the last mail. Is this the test you needed, guys?

> fread("http://www.cdc.gov/growthcharts/data/zscore/statage.csv",
verbose=FALSE)
trying URL 'http://www.cdc.gov/growthcharts/data/zscore/statage.csv'
Content type 'application/octet-stream' length 66087 bytes (64 Kb)
opened URL
downloaded 64 Kb

Empty data.table (0 rows) of 14 cols: Sex,Agemos,L,M,S,P3...
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.9.3

loaded via a namespace (and not attached):
[1] plyr_1.8.1     Rcpp_0.11.0    reshape2_1.2.2 Rook_1.0-9     stringr_0.6.2  tools_3.0.2

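A side note on the format string Matt mentions in the quoted message below: the Windows branch used "%.3 GB" where the other branch used "%.3f GB". Here is a minimal C sketch of the corrected call, wrapped in a hypothetical helper called format_filesize (my own name, not anything in data.table). It only illustrates the format string; it is not fread's code and it does not show that the missing "f" is what causes the empty result.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only (not part of data.table).
   Writes the verbose size message into buf using the corrected
   "%.3f GB" format and returns snprintf's result. */
int format_filesize(char *buf, size_t buflen, long long filesize)
{
    assert(buf != NULL && buflen > 0);
    /* Multiplying by 1.0 first forces the division into floating point. */
    return snprintf(buf, buflen, "File opened, filesize is %.3f GB",
                    1.0 * filesize / (1024 * 1024 * 1024));
}
```

Calling format_filesize(buf, sizeof buf, 66087) for this 66087-byte file fills buf with "File opened, filesize is 0.000 GB", which is the line Kevin and Matt report on non-Windows.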

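Matt also guesses that the reader thought it had reached the end of the file before seeing the \n after the \r. The following is a rough sketch of how that could turn a CRLF file into a "\r only" verdict. The detect_eol function and the eol_kind enum are hypothetical names of mine, and this is not data.table's actual detection code, just the general idea of looking one byte past the first \r.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only; names and logic are assumptions, not fread's source. */
typedef enum { EOL_NONE, EOL_LF, EOL_CRLF, EOL_CR } eol_kind;

/* Classify the first line ending found in buf[0..len-1]. */
eol_kind detect_eol(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') return EOL_LF;
        if (buf[i] == '\r') {
            /* Only look at the next byte if it is still inside the buffer. */
            if (i + 1 < len && buf[i + 1] == '\n') return EOL_CRLF;
            return EOL_CR;
        }
    }
    return EOL_NONE;
}
```

With the 10-byte input "a,b\r\n1,2\r\n", detect_eol returns EOL_CRLF when len is 10 but EOL_CR when len is 4, because the \n then lies outside the buffer. That would match the difference between Kevin's "\r\n (CRLF)" line and Farrel's "\r only" line, although whether this is what really happens on Windows is still only Matt's guess.
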
On 6 March 2014 12:34, Matt Dowle <mdowle at mdowle.plus.com> wrote:

>
> Works for me as well on linux,  same output as Kevin's.
>
> I was perplexed as to why Farrel's output has :
>
>    File opened, filesize is 6.2E-05B
> but we see :
>
>    File opened, filesize is 0.000 GB
> That line is switched depending on Windows or not. Comparing them :
>
> // On Windows :
> if (verbose) Rprintf("File opened, filesize is %.3 GB\n",
> 1.0*filesize/(1024*1024*1024));
>
> // On non-Windows :
> if (verbose) Rprintf("File opened, filesize is %.3f GB\n",
> 1.0*filesize/(1024*1024*1024));
>
> So, a missing "f". Just committed a fix for that (r1223). That line is
> part of a block that is necessarily different on Windows because its file
> and mmap commands are different.  The missing 'f' could have feasibly
> corrupted memory somehow (strange that the "G" of "GB" got overwritten) and
> if so would explain why it thought it got to the end of the file before
> seeing the \n after the \r.
>
> Farrel - does v1.9.2 work for you on Windows with verbose=FALSE? If yes,
> then very likely verbose=TRUE will now work with commit 1223.  Best to
> start with a new R session to clear any possible memory corruption and then
> try :
>
>    fread("http://www.cdc.gov/growthcharts/data/zscore/statage.csv",
> verbose=FALSE)
>
> If not, can anyone else reproduce on Windows? If so, I'll need to debug it
> on Windows.
>
> Thanks,
> Matt
>
>
>
> On 06/03/14 05:19, Kevin Ushey wrote:
>
>> I think Matt and Arun will have more information -- IIUC, fread is
>> only now gaining support for reading from URLs on Windows.
>>
>> Something strange: I get different output on the file structure with
>> fread. Posting in case it's useful:
>>
>> > statagecdc <- fread("http://www.cdc.gov/growthcharts/data/zscore/statage.csv", verbose=T)
>> Input contains no \n. Taking this to be a filename to open
>> File opened, filesize is 0.000 GB
>> File is opened and mapped ok
>> Detected eol as \r\n (CRLF) in that order, the Windows standard.
>> Using line 30 to detect sep (the last non blank line in the first
>> 'autostart') ... sep=','
>> Found 14 columns
>> First row with 14 fields occurs on line 1 (either column names or
>> first row of data)
>> All the fields on line 1 are character fields. Treating as the column
>> names.
>> Count of eol after first data row: 437
>> Subtracted 1 for last eol and any trailing empty lines, leaving 436 data
>> rows
>> Type codes: 13333333333333 (first 5 rows)
>> Type codes: 13333333333333 (+middle 5 rows)
>> Type codes: 13333333333333 (+last 5 rows)
>> Type codes: 13333333333333 (after applying colClasses and integer64)
>> Type codes: 13333333333333 (after applying drop or select (if supplied)
>> Allocating 14 column slots (14 - 0 NULL)
>>     0.000s ( 13%) Memory map (rerun may be quicker)
>>     0.000s (  4%) sep and header detection
>>     0.000s ( 13%) Count rows (wc -l)
>>     0.001s ( 49%) Column type detection (first, middle and last 5 rows)
>>     0.000s (  1%) Allocation of 436x14 result (xMB) in RAM
>>     0.000s ( 19%) Reading data
>>     0.000s (  0%) Allocation for type bumps (if any), including gc time
>> if triggered
>>     0.000s (  0%) Coercing data already read in type bumps (if any)
>>     0.000s (  0%) Changing na.strings to NA
>>     0.002s        Total
>>
>> Note that fread sees \r\n as newlines for me.
>>
>> > sessionInfo()
>> R Under development (unstable) (2014-02-12 r64976)
>> Platform: x86_64-apple-darwin13.0.0 (64-bit)
>>
>> locale:
>> [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> other attached packages:
>> [1] data.table_1.9.1     knitr_1.5.15         devtools_1.4.1.99
>> BiocInstaller_1.13.3
>>
>> loaded via a namespace (and not attached):
>>   [1] compiler_3.1.0    digest_0.6.4      evaluate_0.5.1
>> formatR_0.10      httr_0.2          memoise_0.1
>>   [7] parallel_3.1.0    plyr_1.8          Rcpp_0.11.0.3
>> RCurl_1.95-4.1    reshape2_1.3.0.99 stringr_0.6.2
>> [13] tools_3.1.0       whisker_0.3-2
>>
>> Kevin
>>
>> On Wed, Mar 5, 2014 at 9:04 PM, Farrel Buchinsky <fjbuch at gmail.com>
>> wrote:
>>
>>> > sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United
>>> States.1252    LC_MONETARY=English_United States.1252
>>> [4] LC_NUMERIC=C                           LC_TIME=English_United
>>> States.1252
>>>
>>> attached base packages:
>>> [1] grid      stats     graphics  grDevices utils     datasets  methods
>>> base
>>>
>>> other attached packages:
>>> [1] reshape2_1.2.2    data.table_1.9.2  gridExtra_0.9.1   ggplot2_0.9.3.1
>>> RGoogleDocs_0.7-0
>>>
>>> loaded via a namespace (and not attached):
>>>   [1] colorspace_1.2-4   dichromat_2.0-0    digest_0.6.4
>>> gtable_0.1.2
>>> labeling_0.2       MASS_7.3-29        munsell_0.4.2
>>>   [8] plyr_1.8.1         proto_0.3-10       RColorBrewer_1.0-5
>>> Rcpp_0.11.0
>>> RCurl_1.95-4.1     scales_0.2.3       stringr_0.6.2
>>> [15] tools_3.0.2        XML_3.98-1.1
>>>
>>> Farrel Buchinsky
>>> Google Voice Tel: (412) 567-7870
>>>
>>>
>>> On Wed, Mar 5, 2014 at 10:55 PM, Kevin Ushey <kevinushey at gmail.com>
>>> wrote:
>>>
>>>> Works fine for me with data.table 1.9.1 on OS X. What is your
>>>> sessionInfo()?
>>>>
>>>> Kevin
>>>>
>>>> On Wed, Mar 5, 2014 at 7:53 PM, Farrel Buchinsky <fjbuch at gmail.com>
>>>> wrote:
>>>>
>>>>> Any idea why I am getting a data.table with headers only and zero data?
>>>>> How can I get around the problem?
>>>>>
>>>>> fread("http://www.cdc.gov/growthcharts/data/zscore/statage.csv", verbose=T) fails
>>>>> read.csv("http://www.cdc.gov/growthcharts/data/zscore/statage.csv") succeeds
>>>>>
>>>>> > statagecdc <- fread("http://www.cdc.gov/growthcharts/data/zscore/statage.csv", verbose=T)
>>>>> trying URL 'http://www.cdc.gov/growthcharts/data/zscore/statage.csv'
>>>>> Content type 'application/octet-stream' length 66087 bytes (64 Kb)
>>>>> opened URL
>>>>> downloaded 64 Kb
>>>>>
>>>>> Input contains no \n. Taking this to be a filename to open
>>>>> File opened, filesize is  6.2E-05B
>>>>> File is opened and mapped ok
>>>>> Detected eol as \r only (no \n afterwards). An old Mac 9 standard,
>>>>> discontinued in 2002 according to Wikipedia.
>>>>> Using line 1 to detect sep (the last non blank line in the first
>>>>> 'autostart') ... sep=','
>>>>> Found 14 columns
>>>>> First row with 14 fields occurs on line 1 (either column names or first row of data)
>>>>> All the fields on line 1 are character fields. Treating as the column
>>>>> names.
>>>>> Byte after header row is eof or eol, 0 data rows present.
>>>>> Type codes: 00000000000000 (first 5 rows)
>>>>> Type codes: 00000000000000 (after applying colClasses and integer64)
>>>>> Type codes: 00000000000000 (after applying drop or select (if supplied)
>>>>> Allocating 14 column slots (14 - 0 NULL)
>>>>>     0.000s (  0%) Memory map (rerun may be quicker)
>>>>>     0.000s (  0%) sep and header detection
>>>>>     0.001s (100%) Count rows (wc -l)
>>>>>     0.000s (  0%) Column type detection (first, middle and last 5 rows)
>>>>>     0.000s (  0%) Allocation of 0x14 result (xMB) in RAM
>>>>>     0.000s (  0%) Reading data
>>>>>     0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
>>>>>     0.000s (  0%) Coercing data already read in type bumps (if any)
>>>>>     0.000s (  0%) Changing na.strings to NA
>>>>>     0.001s        Total
>>>>>
>>>>>
>>>>> Thanks a lot.
>>>>>
>>>>> Farrel Buchinsky
>>>>> Google Voice Tel: (412) 567-7870
>>>>>
>>>>> _______________________________________________
>>>>> datatable-help mailing list
>>>>> datatable-help at lists.r-forge.r-project.org
>>>>>
>>>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>>>>
>>>>



-- 
PRIVATE T: +44 (0)77 3248 1517 | E: carrieromichele at gmail.com
OFFICE T: +44 (0)20 8236 8992 | E: michele.carriero at evolve-analytics.com
www.evolve-analytics.com