[datatable-help] NA handling not working?

Arunkumar Srinivasan aragorn168b at gmail.com
Mon Mar 16 18:54:25 CET 2015


What’s the issue here? If I understand correctly, the whole read took ~4 seconds. The problem seems to be that your file has a “?” on the data row flagged in the verbose output (row 6840), so fread has to coerce everything already read in those columns to character type first. Handling ‘na.strings’ is on the list - https://github.com/Rdatatable/data.table/issues/504 - but I don’t see why you’d call that choking. 4 seconds isn’t a lot, really.
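
As a possible workaround for the “?” fields, here is a minimal sketch, not tested against 1.9.4: read every column as character, then replace “?” with NA and convert the measurement columns to numeric yourself. The sample text, the column names and the helper to_num() are made up for illustration; only the ‘;’ separator and the “?” marker come from your output.

```r
library(data.table)

# Hypothetical three-row sample in the same ';'-separated layout.
# The column names are illustrative, not taken from the real file.
sample_txt <- paste(
  "Date;Time;Power;Voltage",
  "16/12/2006;17:24:00;4.216;234.840",
  "16/12/2006;17:25:00;?;?",
  "16/12/2006;17:26:00;5.374;233.290",
  sep = "\n"
)

# Read everything as character, so no column has to be bumped mid-file.
dt <- fread(sample_txt, sep = ";", header = TRUE, colClasses = "character")

# Hypothetical helper: map "?" to NA, then convert to numeric.
to_num <- function(x) {
  x[x %in% "?"] <- NA
  as.numeric(x)
}

# Convert the measurement columns by reference; Date and Time stay character.
num_cols <- setdiff(names(dt), c("Date", "Time"))
dt[, (num_cols) := lapply(.SD, to_num), .SDcols = num_cols]
```

For the real file you would pass the file name instead of sample_txt and keep the first two columns as character, as in your colClasses.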

-- 
Arun

On 10 Mar 2015 at 15:13:20, Rivo R (rivokl at gmail.com) wrote:

Hi all,

I tried to load the following (huge) dataset as data.table but fread
seems to choke on NA's.
https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip

Steps:
1- Download and unzip
2-
> packageVersion("data.table")
[1] ‘1.9.4’
3-

> tmp <- fread(dataFile, sep=';', header=TRUE, na.strings=c("NA","'?'", ""),
+ stringsAsFactors=FALSE,
+ colClasses=c(rep("character",2), rep("numeric",7)), verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.121897 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep ';' on line 30 (the last non blank line in
the first 'autostart') ... found ok
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first
row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 2075260
Subtracted 1 for last eol and any trailing empty lines, leaving
2075259 data rows
Type codes ( first 5 rows): 443333333
Type codes (+ middle 5 rows): 443333333
Type codes (+ last 5 rows): 443333333
Type codes: 443333333 (after applying colClasses and integer64)
Type codes: 443333333 (after applying drop or select (if supplied)
Allocating 9 column slots (9 - 0 dropped)
Bumping column 3 from REAL to STR on data row 6840, field contains '?'
Bumping column 4 from REAL to STR on data row 6840, field contains '?'
Bumping column 5 from REAL to STR on data row 6840, field contains '?'
Bumping column 6 from REAL to STR on data row 6840, field contains '?'
Bumping column 7 from REAL to STR on data row 6840, field contains '?'
Bumping column 8 from REAL to STR on data row 6840, field contains '?'
Read 2075259 rows and 9 (of 9) columns from 0.122 GB file in 00:00:04
0.000s ( 0%) Memory map (rerun may be quicker)
0.001s ( 0%) sep and header detection
0.282s ( 7%) Count rows (wc -l)
0.002s ( 0%) Column type detection (first, middle and last 5 rows)
0.627s ( 16%) Allocation of 2075259x9 result (xMB) in RAM
2.525s ( 64%) Reading data
0.298s ( 8%) Allocation for type bumps (if any), including gc time
if triggered
0.123s ( 3%) Coercing data already read in type bumps (if any)
0.059s ( 2%) Changing na.strings to NA
3.917s Total

Any hint??
Kely