[datatable-help] fread coercion of very small number to character

Mon Sep 2 16:51:41 CEST 2013

Hello,

When reading a file with very small numbers in scientific notation, fread bumps the column type to "character":

> tmp <- fread(files[1], verbose = TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t'
Found 5 columns
First row with 5 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 188308
Subtracted 1 for last eol and any trailing empty lines, leaving 188307 data rows
Type codes: 33302 (first 5 rows)
Type codes: 33302 (+middle 5 rows)
Type codes: 33302 (+last 5 rows)
Bumping column 5 from REAL to STR on data row 361, field contains '1.46761e-313'
   0.000s (  0%) Memory map (rerun may be quicker)
   0.000s (  0%) sep and header detection
   0.020s ( 13%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   0.020s ( 13%) Allocation of 188307x5 result (xMB) in RAM
   0.110s ( 73%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.000s (  0%) Changing na.strings to NA
   0.150s        Total
Warning message:
In fread(files[1], verbose = TRUE) :
  Bumped column 5 to type character on data row 361, field contains '1.46761e-313'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

Perhaps there is some cutoff around e-300, since the preceding number, '3.34402e-299', is read in without problems.
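
As a quick check of where such a cutoff might lie, the sketch below compares both values against the smallest normalized double, .Machine$double.xmin (about 2.2e-308). The value that fails is below that limit and the value that reads fine is above it. Whether this is what actually triggers the bump inside fread is only my assumption; I have not confirmed it against the fread source.

```r
# Smallest positive normalized double on this platform (about 2.225e-308)
xmin <- .Machine$double.xmin

ok  <- as.numeric("3.34402e-299")  # the value fread reads as numeric
bad <- as.numeric("1.46761e-313")  # the value that triggers the bump

ok  > xmin   # TRUE: inside the normalized range
bad < xmin   # TRUE: below the normalized range
```

If that guess is right, the real cutoff would be near e-308 rather than e-300.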

I can get around this by specifying the column as character using the colClasses argument and then coercing it to numeric after the data have been read in. However, it would be better if fread could read the data in as numeric in the first place, as read.table does (though much more slowly in my example).
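
For reference, a minimal sketch of that workaround on a small two-column file like the one in the example that follows. The temporary file path is just for illustration, and I am assuming that colClasses accepts a plain character vector with one entry per column; the accepted forms may differ between data.table versions.

```r
library(data.table)

# Write a small test file (temporary path, for illustration only)
path <- tempfile(fileext = ".txt")
dat <- data.frame(one = LETTERS[1:17], two = 1:17)
dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313")
write.table(dat, file = path, quote = FALSE, row.names = FALSE)

# Read every column as character so that no type bump can occur
tmp <- fread(path, colClasses = c("character", "character"))

# Coerce the affected column to numeric once the data are in memory
tmp[, two := as.numeric(two)]
```

This avoids the warning, at the cost of an extra pass over the column.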

Here is a simple example where the type is detected as numeric and then bumped to character. (Which rows are used as the middle 5? It does not seem to be rows 7-11, as I would expect.)

> dat <- data.frame(one = LETTERS[1:17], two = 1:17)
> ## use strings here to replicate what I have in my data file
> dat$two[c(1, 9)] <- c("3.34402e-299", "1.46761e-313") 
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)

...
Type codes: 32 (first 5 rows)
Type codes: 32 (+middle 5 rows)
Type codes: 32 (+last 5 rows)
Bumping column 2 from REAL to STR on data row 9, field contains '1.46761e-313'
...

Another example, where the type is detected as character from the first 5 rows:

> dat$two[1:2] <- c("3.34402e-299", "1.46761e-313") 
> write.table(dat, file = "test.txt", quote = FALSE, row.names = FALSE)
> fread("test.txt", verbose = TRUE)

...
Type codes: 33 (first 5 rows)
Type codes: 33 (+middle 5 rows)
Type codes: 33 (+last 5 rows)
...

So, aside from the issue of which rows are used for type detection, it does seem that 3.34402e-299 is detected as numeric whilst 1.46761e-313 is detected as character. Compare this with read.table:

> tmp <- read.table("test.txt", header = TRUE)
> lapply(tmp, class)
$one
[1] "factor"

$two
[1] "numeric"

Best wishes,

Heather

---
Package: data.table
 Version: 1.8.9
 Maintainer: Matthew Dowle <mdowle at mdowle.plus.com>
 Built: R 3.0.1; x86_64-pc-linux-gnu; 2013-06-26 21:24:22 UTC; unix

R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] data.table_1.8.9

loaded via a namespace (and not attached):
[1] compiler_3.0.1 tools_3.0.1