[datatable-help] NAs introduced by coercion in rbindlist()
patricknic
patricknic at gmail.com
Sat Jan 5 00:18:54 CET 2013
Some output:
## NAs in bound data
> dt <- rbindlist(dtlist)
Warning messages:
1: In rbindlist(dtlist) : NAs introduced by coercion
2: In rbindlist(dtlist) : NAs introduced by coercion
3: In rbindlist(dtlist) : NAs introduced by coercion
4: In rbindlist(dtlist) : NAs introduced by coercion
5: In rbindlist(dtlist) : NAs introduced by coercion
6: In rbindlist(dtlist) : NAs introduced by coercion
## No NAs in list of data.tables
> sapply(dtlist, function(x) sum(is.na(x)))
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[32] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Summary of bound data.table
> summary(dt)
  blockfips           land_area           water_area
 Length:11083767    Min.   :0.000e+00   Min.   :0.000e+00
 Class :character   1st Qu.:8.098e+03   1st Qu.:0.000e+00
 Mode  :character   Median :2.478e+04   Median :0.000e+00
                    Mean   :7.470e+05   Mean   :5.782e+04
                    3rd Qu.:1.788e+05   3rd Qu.:0.000e+00
                    Max.   :2.133e+09   Max.   :2.112e+09
                    NA's   :183         NA's   :14
      long               lat
 Min.   :-179.13   Min.   :18.91
 1st Qu.: -99.74   1st Qu.:34.18
 Median : -90.09   Median :38.64
 Mean   : -93.01   Mean   :38.11
 3rd Qu.: -82.07   3rd Qu.:41.73
 Max.   : 179.75   Max.   :71.40
> Many thanks. I'll take a look. If you can find a way to narrow
> down the problem then it might be quicker to resolve. Does it
> happen with the first 2 items passed to rbindlist, the first
> 10, which one causes the NA? If each item is chopped to the
> first 2 rows, does it still happen?
>
> lapply(seq_along(dtlist), function(x) dtlist[[x]][, tab := x])
> dt2 <- rbindlist(dtlist)
Warning messages:
1: In rbindlist(dtlist) : NAs introduced by coercion
2: In rbindlist(dtlist) : NAs introduced by coercion
3: In rbindlist(dtlist) : NAs introduced by coercion
4: In rbindlist(dtlist) : NAs introduced by coercion
5: In rbindlist(dtlist) : NAs introduced by coercion
6: In rbindlist(dtlist) : NAs introduced by coercion
> dt2[which(apply(is.na(dt2), 1, any)), table(tab)]
tab
2 13 23 45 50
183 1 10 1 2
So, for the most part, the NAs come from the second data.table in the list.
> dtlist.first2 <- lapply(dtlist, function(x) x[1:2])
> dtlist.first10 <- lapply(dtlist, function(x) x[1:10])
> dtlist.first100 <- lapply(dtlist, function(x) x[1:100])
> dtlist.first1000 <- lapply(dtlist, function(x) x[1:1000])
> dt.first2 <- rbindlist(dtlist.first2)
> dt.first10 <- rbindlist(dtlist.first10)
> dt.first100 <- rbindlist(dtlist.first100)
Warning message:
In rbindlist(dtlist.first100) : NAs introduced by coercion
> dt.first1000 <- rbindlist(dtlist.first1000)
Warning messages:
1: In rbindlist(dtlist.first1000) : NAs introduced by coercion
2: In rbindlist(dtlist.first1000) : NAs introduced by coercion
NAs start to appear somewhere between the 10-row and the 100-row
data.tables, which seems really low.
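
The narrowing-down can also be done per item, without binding anything. The sketch below assumes (as the reply quoted further down suggests) that each item's columns get coerced to the types found in the first item, and flags the items that would gain NAs under that coercion. `lossy_items` is a hypothetical helper name, the toy list is made up, and plain data.frames stand in for the data.tables so that only base R is needed.

```r
# Hypothetical helper: indices of items that would gain NAs if every column
# were coerced to the storage type of the matching column in the first item.
# Assumes all items have the same columns in the same order.
lossy_items <- function(L) {
  target <- L[[1L]]
  which(vapply(L, function(x) {
    any(mapply(function(col, ref) {
      # Coerce to the first item's type, ignoring the coercion warning
      coerced <- suppressWarnings(as.vector(col, mode = typeof(ref)))
      # TRUE when the coercion created NAs that were not there before
      sum(is.na(coerced)) > sum(is.na(col))
    }, x, target))
  }, logical(1)))
}

# Made-up stand-in for dtlist: item 2 holds a value too large for an integer
toy <- list(
  data.frame(blockfips = "a", land_area = 1L,  stringsAsFactors = FALSE),
  data.frame(blockfips = "b", land_area = 3e9, stringsAsFactors = FALSE),
  data.frame(blockfips = "c", land_area = 5L,  stringsAsFactors = FALSE)
)
flagged <- lossy_items(toy)
print(flagged)
```

Unlike chopping every item to its first n rows, this looks at all rows of each item, so an offending value far down a table is not missed.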
> Also if the list of data.table/data.frame passed to rbindlist
> is called L, and rbindlist(L) returns an NA column, does
> lapply(L, sapply, class) reveal any type differences?
>
> do.call("rbind", lapply(dtlist, sapply, class))
blockfips land_area water_area long lat
[1,] "character" "integer" "integer" "numeric" "numeric"
[2,] "character" "numeric" "numeric" "numeric" "numeric"
[3,] "character" "integer" "integer" "numeric" "numeric"
[4,] "character" "integer" "integer" "numeric" "numeric"
[5,] "character" "integer" "integer" "numeric" "numeric"
[6,] "character" "integer" "integer" "numeric" "numeric"
[7,] "character" "integer" "integer" "numeric" "numeric"
[8,] "character" "integer" "integer" "numeric" "numeric"
[9,] "character" "integer" "integer" "numeric" "numeric"
[10,] "character" "integer" "integer" "numeric" "numeric"
[11,] "character" "integer" "integer" "numeric" "numeric"
[12,] "character" "integer" "integer" "numeric" "numeric"
[13,] "character" "numeric" "integer" "numeric" "numeric"
[14,] "character" "integer" "integer" "numeric" "numeric"
[15,] "character" "integer" "integer" "numeric" "numeric"
[16,] "character" "integer" "integer" "numeric" "numeric"
[17,] "character" "integer" "integer" "numeric" "numeric"
[18,] "character" "integer" "integer" "numeric" "numeric"
[19,] "character" "integer" "integer" "numeric" "numeric"
[20,] "character" "integer" "integer" "numeric" "numeric"
[21,] "character" "integer" "integer" "numeric" "numeric"
[22,] "character" "integer" "integer" "numeric" "numeric"
[23,] "character" "integer" "numeric" "numeric" "numeric"
[24,] "character" "integer" "integer" "numeric" "numeric"
[25,] "character" "integer" "integer" "numeric" "numeric"
[26,] "character" "integer" "integer" "numeric" "numeric"
[27,] "character" "integer" "integer" "numeric" "numeric"
[28,] "character" "integer" "integer" "numeric" "numeric"
[29,] "character" "integer" "integer" "numeric" "numeric"
[30,] "character" "integer" "integer" "numeric" "numeric"
[31,] "character" "integer" "integer" "numeric" "numeric"
[32,] "character" "integer" "integer" "numeric" "numeric"
[33,] "character" "integer" "integer" "numeric" "numeric"
[34,] "character" "integer" "integer" "numeric" "numeric"
[35,] "character" "integer" "integer" "numeric" "numeric"
[36,] "character" "integer" "integer" "numeric" "numeric"
[37,] "character" "integer" "integer" "numeric" "numeric"
[38,] "character" "integer" "integer" "numeric" "numeric"
[39,] "character" "integer" "integer" "numeric" "numeric"
[40,] "character" "integer" "integer" "numeric" "numeric"
[41,] "character" "integer" "integer" "numeric" "numeric"
[42,] "character" "integer" "integer" "numeric" "numeric"
[43,] "character" "integer" "integer" "numeric" "numeric"
[44,] "character" "integer" "integer" "numeric" "numeric"
[45,] "character" "numeric" "integer" "numeric" "numeric"
[46,] "character" "integer" "integer" "numeric" "numeric"
[47,] "character" "integer" "integer" "numeric" "numeric"
[48,] "character" "integer" "integer" "numeric" "numeric"
[49,] "character" "integer" "integer" "numeric" "numeric"
[50,] "character" "integer" "numeric" "numeric" "numeric"
[51,] "character" "integer" "integer" "numeric" "numeric"
And there's the problem: in the affected data.tables (items 2, 13, 23, 45
and 50, the same ones flagged by the tab counts above), column 2 or 3
(land_area or water_area) is numeric instead of integer.
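
With 51 items it is easy to miss a row in that matrix, so here is a minimal sketch that prints only the columns whose class differs between items. `find_mixed_types` is a hypothetical helper name, the toy list is made up, and it assumes every item has the same columns in the same order. Base data.frames are used so nothing beyond base R is needed.

```r
# Hypothetical helper: return the class matrix restricted to columns whose
# class is not the same in every list item.
find_mixed_types <- function(L) {
  # One row per item, one column per variable (first class name only)
  classes <- do.call(rbind, lapply(L, function(x) {
    vapply(x, function(col) class(col)[1L], character(1))
  }))
  # Keep the variables that show more than one distinct class
  mixed <- apply(classes, 2L, function(k) length(unique(k)) > 1L)
  classes[, mixed, drop = FALSE]
}

# Made-up stand-in for dtlist: the second item stores land_area as numeric
toy <- list(
  data.frame(blockfips = "a", land_area = 1L,  water_area = 0L,
             stringsAsFactors = FALSE),
  data.frame(blockfips = "b", land_area = 3e9, water_area = 0L,
             stringsAsFactors = FALSE)
)
mixed <- find_mixed_types(toy)
print(mixed)
```

Rows of the result correspond to list positions, so the row numbers identify the items to look at.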
> It does sound like rbindlist should be issuing a warning or
> being more helpful at least, anyway.
> Hm. It seems I put it in but commented it out :
> if (TYPEOF(thiscol) != TYPEOF(target)) {
>     thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>     coerced = TRUE;
>     // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>     // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",
>     //         j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
> }
> Likely that coerce is creating the NA. Types are taken from the first
> item of L. If a column there is 'numeric' then in a later item L it's
> character, that'll give rise to an NA.
> Thinking about it, it can probably coerce the target to cope with the
> later item ...
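
To illustrate how a coercion like this can create NAs even when no character data is involved: R integers are 32-bit, so a double above .Machine$integer.max has no integer representation and becomes NA. That would be consistent with the land_area and water_area maxima of about 2.1e9 in the summary above, although nothing in this thread confirms it for these files. The values below are made up for the illustration.

```r
# Made-up values: the first fits in a 32-bit integer, the second does not.
x <- c(2133000000, 3000000000)

# Largest representable integer in R
int_max <- .Machine$integer.max

# as.integer() warns and yields NA for the out-of-range value;
# the warning is suppressed here only to keep the output short.
y <- suppressWarnings(as.integer(x))
print(is.na(y))

# The opposite direction, integer to double, never loses values.
z <- as.numeric(c(1L, int_max))
print(anyNA(z))
```

This is why converting the integer columns up to numeric is the safe direction, while converting numeric down to integer can lose rows.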
> dtlist <- lapply(dtlist, function(x)
+   x[, land_area := as.numeric(land_area)][, water_area := as.numeric(water_area)])
> dt <- rbindlist(dtlist)
> dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols=c("land_area", "water_area")]
land_area water_area
1: 0 0
And it's fixed.
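
For anyone who would rather not name the columns by hand, the same workaround can be generalised. The sketch below promotes every integer column to numeric in each item before binding. `promote_integers` is a hypothetical helper name and the toy data are made up. It is written for plain data.frames and uses `do.call(rbind, ...)` in place of `rbindlist()` so that it runs with base R only; with data.tables the `:=` form shown above is the more natural way to do the conversion.

```r
# Hypothetical helper: convert every integer column of every item to numeric,
# so that all items agree on the wider type before they are bound.
promote_integers <- function(L) {
  lapply(L, function(x) {
    is_int <- vapply(x, is.integer, logical(1))
    # Convert column by column; does nothing when no column is integer
    for (nm in names(x)[is_int]) {
      x[[nm]] <- as.numeric(x[[nm]])
    }
    x
  })
}

# Made-up stand-in for dtlist with mixed integer/numeric land_area
toy <- list(
  data.frame(blockfips = "a", land_area = 1L,  stringsAsFactors = FALSE),
  data.frame(blockfips = "b", land_area = 3e9, stringsAsFactors = FALSE)
)
fixed <- promote_integers(toy)
bound <- do.call(rbind, fixed)
print(vapply(bound, class, character(1)))
print(sum(is.na(bound$land_area)))
```

Promoting to numeric costs some memory on 11 million rows, but it avoids depending on which item happens to come first in the list.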
Thanks,
Patrick
On Fri, Jan 4, 2013 at 4:52 AM, Matthew Dowle [via R] <
ml-node+s789695n4654623h37 at n4.nabble.com> wrote:
> On 03.01.2013 20:30, patricknic wrote:
>
> > Apologies, I forgot to switch the directories in the code. Corrected
> > on nabble and below.
> >
> > # Directories
> > tempwd <- tempdir()
> > setwd(tempwd)
> >
> > # Packages
> > library(dataframe)
> > library(data.table)
> > library(foreign)
> >
> > # Get blocks and coordinates
> > state.fips <- as.character(c(paste0(0, c(1:2, 4:6, 8:9)), 10:13, 15:42,
> >                              44:51, 53:56))
> > tmpf <- tempfile(fileext=".zip")
> > dtlist <- lapply(state.fips, function(fips) {
> >   cat("State", fips, ":\t")
> >   nm <- paste0("tl_2011_", fips, "_tabblock")
> >   dbfname <- paste0(nm, ".dbf")
> >   # Download and unzip only if the .dbf is not already present
> >   if (!file.exists(file.path(tempwd, dbfname))) {
> >     cat("Downloading...\t")
> >     url <- paste0("http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/",
> >                   nm, ".zip")
> >     download.file(url, destfile=tmpf, quiet=FALSE)
> >     unzip(tmpf, exdir=tempwd)
> >   }
> >   # Remove everything extracted for this state except the .dbf
> >   del <- dir(tempwd, pattern=nm)
> >   invisible(lapply(del[grep("dbf", del, invert=TRUE)], file.remove))
> >   cat("Reading...\t")
> >   df <- read.dbf(dbfname, as.is=TRUE)
> >   dt <- as.data.table(df)
> >   cat("Done\n")
> >   dt[, list(blockfips = GEOID, land_area = ALAND, water_area = AWATER,
> >             long = as.numeric(INTPTLON), lat = as.numeric(INTPTLAT))]
> > })
> > b <- rbindlist(dtlist)
> >
> > ### No NA problem:
> > dtlist2 <- lapply(dtlist, as.data.frame)
> > b2 <- do.call("rbind", dtlist2)
--
View this message in context: http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654696.html
Sent from the datatable-help mailing list archive at Nabble.com.