[datatable-help] NA handling not working?
Rivo R
rivokl at gmail.com
Tue Mar 10 15:12:48 CET 2015
Hi all,
I tried to load the following (large) dataset as a data.table, but fread
seems to choke on the missing values, which the file marks with '?'.
https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
Steps:
1- Download and unzip
2- Check the data.table version:
> packageVersion("data.table")
[1] ‘1.9.4’
3- Read the file with fread (dataFile is the path to the unzipped file):
> tmp <- fread(dataFile, sep=';', header=TRUE, na.strings=c("NA","'?'", ""),
+ stringsAsFactors=FALSE,
+ colClasses=c(rep("character",2), rep("numeric",7)), verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.121897 GB.
Memory mapping ... ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep ';' on line 30 (the last non blank line in
the first 'autostart') ... found ok
Found 9 columns
First row with 9 fields occurs on line 1 (either column names or first
row of data)
'header' changed by user from 'auto' to TRUE
Count of eol after first data row: 2075260
Subtracted 1 for last eol and any trailing empty lines, leaving
2075259 data rows
Type codes ( first 5 rows): 443333333
Type codes (+ middle 5 rows): 443333333
Type codes (+ last 5 rows): 443333333
Type codes: 443333333 (after applying colClasses and integer64)
Type codes: 443333333 (after applying drop or select (if supplied)
Allocating 9 column slots (9 - 0 dropped)
Bumping column 3 from REAL to STR on data row 6840, field contains '?'
Bumping column 4 from REAL to STR on data row 6840, field contains '?'
Bumping column 5 from REAL to STR on data row 6840, field contains '?'
Bumping column 6 from REAL to STR on data row 6840, field contains '?'
Bumping column 7 from REAL to STR on data row 6840, field contains '?'
Bumping column 8 from REAL to STR on data row 6840, field contains '?'
Read 2075259 rows and 9 (of 9) columns from 0.122 GB file in 00:00:04
0.000s ( 0%) Memory map (rerun may be quicker)
0.001s ( 0%) sep and header detection
0.282s ( 7%) Count rows (wc -l)
0.002s ( 0%) Column type detection (first, middle and last 5 rows)
0.627s ( 16%) Allocation of 2075259x9 result (xMB) in RAM
2.525s ( 64%) Reading data
0.298s ( 8%) Allocation for type bumps (if any), including gc time
if triggered
0.123s ( 3%) Coercing data already read in type bumps (if any)
0.059s ( 2%) Changing na.strings to NA
3.917s Total
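One possible stopgap, sketched here on a tiny made-up file rather than on the real dataset, is to read every column as character and convert the measurement columns by hand. The file contents, the column names (PowerA, PowerB) and the demoFile variable are placeholders I invented for illustration; only the ';' separator and the '?' marker come from the run above. I have not checked how this scales to the full file.

```r
library(data.table)

# Tiny made-up file mimicking the layout: ';' separator, '?' for missing values.
# Column names and values are placeholders, not taken from the real dataset.
demoFile <- tempfile(fileext = ".txt")
writeLines(c("Date;Time;PowerA;PowerB",
             "01/01/2000;00:00:00;1.5;0.25",
             "01/01/2000;00:01:00;?;?",
             "01/01/2000;00:02:00;2.5;0.5"),
           demoFile)

# Read every column as character so no type bump is needed while reading.
tmp <- fread(demoFile, sep = ";", header = TRUE, colClasses = "character")

# Convert the measurement columns by hand, turning '?' into NA first.
for (j in 3:ncol(tmp)) {
  v <- tmp[[j]]
  v[v %in% "?"] <- NA_character_
  set(tmp, j = j, value = as.numeric(v))
}

str(tmp)
unlink(demoFile)
```

With this approach the first two columns stay character and the remaining ones end up numeric with NA where the file had '?', at the cost of a second pass over the data.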
Any hints?
Kely