[datatable-help] NAs introduced by coercion in rbindlist()

patricknic patricknic at gmail.com
Sat Jan 5 00:18:54 CET 2013


Some output:

## NAs in bound data
> dt <- rbindlist(dtlist)
Warning messages:
1: In rbindlist(dtlist) : NAs introduced by coercion
2: In rbindlist(dtlist) : NAs introduced by coercion
3: In rbindlist(dtlist) : NAs introduced by coercion
4: In rbindlist(dtlist) : NAs introduced by coercion
5: In rbindlist(dtlist) : NAs introduced by coercion
6: In rbindlist(dtlist) : NAs introduced by coercion

## No NAs in list of data.tables
> sapply(dtlist, function(x) sum(is.na(x)))
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[32] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

## Summary of bound data.table
> summary(dt)
  blockfips           land_area           water_area
 Length:11083767    Min.   :0.000e+00   Min.   :0.000e+00
 Class :character   1st Qu.:8.098e+03   1st Qu.:0.000e+00
 Mode  :character   Median :2.478e+04   Median :0.000e+00
                    Mean   :7.470e+05   Mean   :5.782e+04
                    3rd Qu.:1.788e+05   3rd Qu.:0.000e+00
                    Max.   :2.133e+09   Max.   :2.112e+09
                    NA's   :183         NA's   :14
      long              lat
 Min.   :-179.13   Min.   :18.91
 1st Qu.: -99.74   1st Qu.:34.18
 Median : -90.09   Median :38.64
 Mean   : -93.01   Mean   :38.11
 3rd Qu.: -82.07   3rd Qu.:41.73
 Max.   : 179.75   Max.   :71.40



> Many thanks. I'll take a look. If you can find a way to narrow
> down the problem then it might be quicker to resolve. Does it
> happen with the first 2 items passed to rbindlist, the first
> 10, which one causes the NA? If each item is chopped to the
> first 2 rows, does it still happen?
>

> lapply(seq_along(dtlist), function(x) dtlist[[x]][, tab := x])
> dt2 <- rbindlist(dtlist)
Warning messages:
1: In rbindlist(dtlist) : NAs introduced by coercion
2: In rbindlist(dtlist) : NAs introduced by coercion
3: In rbindlist(dtlist) : NAs introduced by coercion
4: In rbindlist(dtlist) : NAs introduced by coercion
5: In rbindlist(dtlist) : NAs introduced by coercion
6: In rbindlist(dtlist) : NAs introduced by coercion
> dt2[which(apply(is.na(dt2), 1, any)), table(tab)]
tab
  2  13  23  45  50
183   1  10   1   2

So most of the NA rows (183) come from the second data.table in the
list; the rest come from items 13, 23, 45 and 50.

> dtlist.first2 <- lapply(dtlist, function(x) x[1:2])
> dtlist.first10 <- lapply(dtlist, function(x) x[1:10])
> dtlist.first100 <- lapply(dtlist, function(x) x[1:100])
> dtlist.first1000 <- lapply(dtlist, function(x) x[1:1000])
> dt.first2 <- rbindlist(dtlist.first2)
> dt.first10 <- rbindlist(dtlist.first10)
> dt.first100 <- rbindlist(dtlist.first100)
Warning message:
In rbindlist(dtlist.first100) : NAs introduced by coercion
> dt.first1000 <- rbindlist(dtlist.first1000)
Warning messages:
1: In rbindlist(dtlist.first1000) : NAs introduced by coercion
2: In rbindlist(dtlist.first1000) : NAs introduced by coercion

So NAs start appearing once each data.table is cut to somewhere between
its first 10 and first 100 rows, which seems really low.



> Also if the list of data.table/data.frame passed to rbindlist
> is called L,  and rbindlist(L) returns an NA column,  does
> lapply(L, sapply, class) reveal any type differences?
>

> do.call("rbind", lapply(dtlist, sapply, class))
      blockfips   land_area water_area long      lat
 [1,] "character" "integer" "integer"  "numeric" "numeric"
 [2,] "character" "numeric" "numeric"  "numeric" "numeric"
 [3,] "character" "integer" "integer"  "numeric" "numeric"
 [4,] "character" "integer" "integer"  "numeric" "numeric"
 [5,] "character" "integer" "integer"  "numeric" "numeric"
 [6,] "character" "integer" "integer"  "numeric" "numeric"
 [7,] "character" "integer" "integer"  "numeric" "numeric"
 [8,] "character" "integer" "integer"  "numeric" "numeric"
 [9,] "character" "integer" "integer"  "numeric" "numeric"
[10,] "character" "integer" "integer"  "numeric" "numeric"
[11,] "character" "integer" "integer"  "numeric" "numeric"
[12,] "character" "integer" "integer"  "numeric" "numeric"
[13,] "character" "numeric" "integer"  "numeric" "numeric"
[14,] "character" "integer" "integer"  "numeric" "numeric"
[15,] "character" "integer" "integer"  "numeric" "numeric"
[16,] "character" "integer" "integer"  "numeric" "numeric"
[17,] "character" "integer" "integer"  "numeric" "numeric"
[18,] "character" "integer" "integer"  "numeric" "numeric"
[19,] "character" "integer" "integer"  "numeric" "numeric"
[20,] "character" "integer" "integer"  "numeric" "numeric"
[21,] "character" "integer" "integer"  "numeric" "numeric"
[22,] "character" "integer" "integer"  "numeric" "numeric"
[23,] "character" "integer" "numeric"  "numeric" "numeric"
[24,] "character" "integer" "integer"  "numeric" "numeric"
[25,] "character" "integer" "integer"  "numeric" "numeric"
[26,] "character" "integer" "integer"  "numeric" "numeric"
[27,] "character" "integer" "integer"  "numeric" "numeric"
[28,] "character" "integer" "integer"  "numeric" "numeric"
[29,] "character" "integer" "integer"  "numeric" "numeric"
[30,] "character" "integer" "integer"  "numeric" "numeric"
[31,] "character" "integer" "integer"  "numeric" "numeric"
[32,] "character" "integer" "integer"  "numeric" "numeric"
[33,] "character" "integer" "integer"  "numeric" "numeric"
[34,] "character" "integer" "integer"  "numeric" "numeric"
[35,] "character" "integer" "integer"  "numeric" "numeric"
[36,] "character" "integer" "integer"  "numeric" "numeric"
[37,] "character" "integer" "integer"  "numeric" "numeric"
[38,] "character" "integer" "integer"  "numeric" "numeric"
[39,] "character" "integer" "integer"  "numeric" "numeric"
[40,] "character" "integer" "integer"  "numeric" "numeric"
[41,] "character" "integer" "integer"  "numeric" "numeric"
[42,] "character" "integer" "integer"  "numeric" "numeric"
[43,] "character" "integer" "integer"  "numeric" "numeric"
[44,] "character" "integer" "integer"  "numeric" "numeric"
[45,] "character" "numeric" "integer"  "numeric" "numeric"
[46,] "character" "integer" "integer"  "numeric" "numeric"
[47,] "character" "integer" "integer"  "numeric" "numeric"
[48,] "character" "integer" "integer"  "numeric" "numeric"
[49,] "character" "integer" "integer"  "numeric" "numeric"
[50,] "character" "integer" "numeric"  "numeric" "numeric"
[51,] "character" "integer" "integer"  "numeric" "numeric"

And there's the problem: in the data.tables that produce NAs (items 2,
13, 23, 45 and 50), column 2 or 3 is numeric instead of integer.
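
Rather than reading the class table by eye, the same check can be
automated. This is only a rough sketch in base R: find_type_mismatches
is a name I made up (it is not part of data.table), and it assumes every
item has the same columns in the same order as the first item.

```r
# Hypothetical helper (not part of data.table): report which items of a
# list of data.frames/data.tables have column classes that differ from
# the classes found in the first item.
find_type_mismatches <- function(L) {
  # One row per item, one column per variable (first class only)
  classes <- do.call(rbind, lapply(L, function(x) {
    sapply(x, function(col) class(col)[1L])
  }))
  reference <- classes[1L, ]
  ref_matrix <- matrix(reference, nrow(classes), ncol(classes), byrow = TRUE)
  # Row/column positions where an item disagrees with the first item
  idx <- which(classes != ref_matrix, arr.ind = TRUE)
  data.frame(
    item     = unname(idx[, "row"]),
    column   = colnames(classes)[idx[, "col"]],
    found    = classes[idx],
    expected = unname(reference[idx[, "col"]]),
    stringsAsFactors = FALSE
  )
}

# Toy data standing in for dtlist
toy <- list(
  data.frame(id = "a", land_area = 1L,  water_area = 2L, stringsAsFactors = FALSE),
  data.frame(id = "b", land_area = 3.0, water_area = 4L, stringsAsFactors = FALSE),
  data.frame(id = "c", land_area = 5L,  water_area = 6L, stringsAsFactors = FALSE)
)
mismatches <- find_type_mismatches(toy)
print(mismatches)
```

On the real list this would be called as find_type_mismatches(dtlist).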



> It does sound like rbindlist should be issuing a warning or
> being more helpful at least, anyway.
> Hm. It seems I put it in but commented it out :
> if (TYPEOF(thiscol) != TYPEOF(target)) {
>      thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>      coerced = TRUE;
>      // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>      // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",
>      //         j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
> }
> Likely that coerce is creating the NA. Types are taken from the first
> item of L.  If a column there is 'numeric' and in a later item of L
> it's character, that'll give rise to an NA.
> Thinking about it, it can probably coerce the target to cope with the
> later item ...
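
For this data the coercion goes from numeric to integer rather than
from character. A likely mechanism (my assumption, not confirmed in this
thread) is that the numeric columns hold values too large for a 32-bit
integer, so forcing them into the integer type of the first item gives
NA. A minimal sketch in plain R, with made-up values, of both that
failure and the alternative of promoting to the wider type:

```r
# Values above .Machine$integer.max (2147483647) cannot be stored as integer
big <- c(100, 3e9)                          # made-up values; 3e9 is out of range
as_int <- suppressWarnings(as.integer(big)) # the real call warns about NAs
print(as_int)

# Promoting the integer side to numeric instead keeps every value
small <- c(1L, 2L)
combined <- c(as.numeric(small), big)
print(combined)
```

That would also fit the summary above, where the largest surviving
land_area value is just under the integer limit.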


> dtlist <- lapply(dtlist, function(x) x[, land_area := as.numeric(land_area)][, water_area := as.numeric(water_area)])
> dt <- rbindlist(dtlist)
> dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols=c("land_area", "water_area")]
   land_area water_area
1:         0          0

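And it's fixed: with both area columns converted to numeric in every
item before binding, rbindlist() no longer introduces NAs.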
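
The same workaround can be written without naming the columns. The
sketch below is for plain data.frames only, and widen_integers is a
hypothetical helper name; for data.tables the := form above is the
natural route, since single-bracket indexing means something different
there.

```r
# Hypothetical helper: widen every integer column to numeric (double)
# so that all items agree on type before they are bound together.
# Written for plain data.frames, not data.tables.
widen_integers <- function(L) {
  lapply(L, function(x) {
    is_int <- vapply(x, is.integer, logical(1))
    if (any(is_int)) {
      x[is_int] <- lapply(x[is_int], as.numeric)
    }
    x
  })
}

# Toy data: the second item has a value that does not fit in an integer
toy <- list(
  data.frame(land_area = 1L,  water_area = 2L),
  data.frame(land_area = 3e9, water_area = 4L)
)
bound <- do.call(rbind, widen_integers(toy))
print(sapply(bound, class))
print(colSums(is.na(bound)))
```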



Thanks,
Patrick


On Fri, Jan 4, 2013 at 4:52 AM, Matthew Dowle [via R] <
ml-node+s789695n4654623h37 at n4.nabble.com> wrote:

>
> On 03.01.2013 20:30, patricknic wrote:
>
> > Apologies, I forgot to switch the directories in the code. Corrected on
> > nabble and below.
> >
> > # Directories
> > tempwd <- tempdir()
> > setwd(tempwd)
> >
> > # Packages
> > library(dataframe)
> > library(data.table)
> > library(foreign)  # provides read.dbf()
> >
> > # Get blocks and coordinates
> > state.fips <- as.character(c(paste0(0, c(1:2, 4:6, 8:9)), 10:13, 15:42, 44:51, 53:56))
> > tmpf <- tempfile(fileext=".zip")
> > dtlist <- lapply(state.fips, function(fips) {
> >   cat("State", fips, ":\t")
> >   nm <- paste0("tl_2011_", fips, "_tabblock")
> >   dbfname <- paste0(nm, ".dbf")
> >   # Download and unzip only if the .dbf is not already present
> >   if (!file.exists(file.path(tempwd, dbfname))) {
> >     cat("Downloading...\t")
> >     url <- paste0("http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/", nm, ".zip")
> >     download.file(url, destfile=tmpf, quiet=FALSE)
> >     unzip(tmpf, exdir=tempwd)
> >   }
> >   # Remove the extracted files that are not the .dbf
> >   del <- dir(tempwd, pattern=nm)
> >   invisible(lapply(del[grep("dbf", del, invert=TRUE)], file.remove))
> >   cat("Reading...\t")
> >   df <- read.dbf(dbfname, as.is=TRUE)
> >   dt <- as.data.table(df)
> >   cat("Done\n")
> >   dt[, list(blockfips = GEOID, land_area = ALAND, water_area = AWATER,
> >             long = as.numeric(INTPTLON), lat = as.numeric(INTPTLAT))]
> > })
> > b <- rbindlist(dtlist)
> >
> > ### No NA problem:
> > dtlist2 <- lapply(dtlist, as.data.frame)
> > b2 <- do.call("rbind", dtlist2)
> >
> >
> >




--
View this message in context: http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654696.html
Sent from the datatable-help mailing list archive at Nabble.com.