[datatable-help] NA handling not working?

Rivo R rivokl at gmail.com
Tue Mar 10 15:12:48 CET 2015


Hi all,

I tried to load the following (huge) dataset as a data.table, but fread
seems to choke on NAs.
https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip

Steps:
1- Download and unzip
2-
> packageVersion("data.table")
[1] ‘1.9.4’
3-

> tmp <- fread(dataFile, sep=';', header=TRUE, na.strings=c("NA","'?'", ""),
+              stringsAsFactors=FALSE,
+              colClasses=c(rep("character",2), rep("numeric",7)), verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.121897 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep ';' on line 30 (the last non blank line in
the first 'autostart') ... found ok
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first
row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 2075260
Subtracted 1 for last eol and any trailing empty lines, leaving
2075259 data rows
Type codes (   first 5 rows): 443333333
Type codes (+ middle 5 rows): 443333333
Type codes (+   last 5 rows): 443333333
Type codes: 443333333 (after applying colClasses and integer64)
Type codes: 443333333 (after applying drop or select (if supplied)
Allocating 9 column slots (9 - 0 dropped)
Bumping column 3 from REAL to STR on data row 6840, field contains '?'
Bumping column 4 from REAL to STR on data row 6840, field contains '?'
Bumping column 5 from REAL to STR on data row 6840, field contains '?'
Bumping column 6 from REAL to STR on data row 6840, field contains '?'
Bumping column 7 from REAL to STR on data row 6840, field contains '?'
Bumping column 8 from REAL to STR on data row 6840, field contains '?'
Read 2075259 rows and 9 (of 9) columns from 0.122 GB file in 00:00:04
   0.000s (  0%) Memory map (rerun may be quicker)
   0.001s (  0%) sep and header detection
   0.282s (  7%) Count rows (wc -l)
   0.002s (  0%) Column type detection (first, middle and last 5 rows)
   0.627s ( 16%) Allocation of 2075259x9 result (xMB) in RAM
   2.525s ( 64%) Reading data
   0.298s (  8%) Allocation for type bumps (if any), including gc time
if triggered
   0.123s (  3%) Coercing data already read in type bumps (if any)
   0.059s (  2%) Changing na.strings to NA
   3.917s        Total
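
A possible workaround, sketched on a tiny made-up sample rather than the
real file: read every column as character so that no REAL to STR bump can
happen, then map a bare "?" to NA and convert by hand. The column names
and values in the sample are assumptions for illustration only, and this
has not been checked against the full dataset or against 1.9.4
specifically.

```r
library(data.table)

# Hypothetical stand-in for the real file: ';'-separated, with a bare '?'
# marking a missing value. Names and numbers are made up for illustration.
sample_text <- paste(
  "Date;Time;Global_active_power;Voltage",
  "16/12/2006;17:24:00;4.216;234.840",
  "16/12/2006;17:25:00;?;?",
  "16/12/2006;17:26:00;5.374;233.290",
  sep = "\n"
)
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_text, sample_file)

# Read everything as character so fread never has to change a column type.
dt <- fread(sample_file, sep = ";", header = TRUE, colClasses = "character")

# Convert the measurement columns by hand: '?' becomes NA, the rest numeric.
num_cols <- setdiff(names(dt), c("Date", "Time"))
for (col in num_cols) {
  values <- dt[[col]]
  values[values == "?"] <- NA_character_
  set(dt, j = col, value = as.numeric(values))
}

print(sapply(dt, class))
```

Note that this sketch compares against a bare "?" (one character), whereas
the na.strings entry in the call above is "'?'" with the single quotes
inside the string, which may or may not be related to the problem.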

Any hints?
Kely

