[datatable-help] NAs introduced by coercion in rbindlist()
Matthew Dowle
mdowle at mdowle.plus.com
Sat Jan 5 00:54:47 CET 2013
Excellent, thanks for confirming. Thinking about it now, with fresh
eyes, new feature request raised:

FR#2456 rbindlist should choose the highest type per column, not the first
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2456&group_id=240&atid=978

where 'highest' means in this hierarchy: LGLSXP < INTSXP < REALSXP < CPLXSXP < STRSXP

That would be easy and wouldn't hurt performance at all.
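
To illustrate the "highest type per column" idea in plain R, here is a minimal sketch. It is not the rbindlist implementation: `type_rank`, `highest_type` and `bind_highest` are hypothetical names, and the sketch assumes every item has the same atomic columns (no factors or list columns), matched by the names of the first item.

```r
# Hypothetical helpers for illustration; these are not data.table functions.
# Rank of each atomic type, lowest to highest, following the hierarchy
# LGLSXP < INTSXP < REALSXP < CPLXSXP < STRSXP.
type_rank <- c(logical = 1L, integer = 2L, double = 3L,
               complex = 4L, character = 5L)

# Name of the highest-ranked type among a list of atomic vectors.
highest_type <- function(pieces) {
  ranks <- type_rank[vapply(pieces, typeof, character(1))]
  names(type_rank)[max(ranks)]
}

# Bind a list of data.frames row-wise, promoting each column to its
# highest type before concatenating.
bind_highest <- function(L) {
  cols <- names(L[[1]])
  out <- lapply(cols, function(nm) {
    pieces <- lapply(L, function(x) x[[nm]])
    target <- highest_type(pieces)
    unlist(lapply(pieces, as.vector, mode = target))
  })
  names(out) <- cols
  as.data.frame(out, stringsAsFactors = FALSE)
}

# First item is integer, second is double with a value beyond integer range.
L <- list(data.frame(area = 1:2),
          data.frame(area = c(3e9, 0.5)))
res <- bind_highest(L)
typeof(res$area)   # "double"
anyNA(res$area)    # FALSE
```

Because the target type is chosen from all items rather than from the first one, the integer column is promoted to double and no value has to be squeezed into a narrower type.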
On 04.01.2013 23:18, patricknic wrote:
> Some output:
>
> ## NAs in bound data
>> dt
> Warning messages:
> 1: In rbindlist(dtlist) : NAs introduced by coercion
> 2: In rbindlist(dtlist) : NAs introduced by coercion
> 3: In rbindlist(dtlist) : NAs introduced by coercion
> 4: In rbindlist(dtlist) : NAs introduced by coercion
> 5: In rbindlist(dtlist) : NAs introduced by coercion
> 6: In rbindlist(dtlist) : NAs introduced by coercion
> ## No NAs in list of data.tables
>> sapply(dtlist, function(x) sum(is.na(x)))
>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> [32] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ## Summary of bound data.table
>> summary(dt)
>   blockfips          land_area           water_area
>  Length:11083767    Min.   :0.000e+00   Min.   :0.000e+00
>  Class :character   1st Qu.:8.098e+03   1st Qu.:0.000e+00
>  Mode  :character   Median :2.478e+04   Median :0.000e+00
>                     Mean   :7.470e+05   Mean   :5.782e+04
>                     3rd Qu.:1.788e+05   3rd Qu.:0.000e+00
>                     Max.   :2.133e+09   Max.   :2.112e+09
>                     NA's   :183         NA's   :14
>       long              lat
>  Min.   :-179.13   Min.   :18.91
>  1st Qu.: -99.74   1st Qu.:34.18
>  Median : -90.09   Median :38.64
>  Mean   : -93.01   Mean   :38.11
>  3rd Qu.: -82.07   3rd Qu.:41.73
>  Max.   : 179.75   Max.   :71.40
>
>> Many thanks. I'll take a look. If you can find a way to narrow
>> down the problem then it might be quicker to resolve. Does it
>> happen with the first 2 items passed to rbindlist, the first
>> 10, which one causes the NA? If each item is chopped to the
>> first 2 rows, does it still happen?
>
>> lapply(seq_along(dtlist), function(x) dtlist[[x]][, tab := x])
>> dt2
> Warning messages:
> 1: In rbindlist(dtlist) : NAs introduced by coercion
> 2: In rbindlist(dtlist) : NAs introduced by coercion
> 3: In rbindlist(dtlist) : NAs introduced by coercion
> 4: In rbindlist(dtlist) : NAs introduced by coercion
> 5: In rbindlist(dtlist) : NAs introduced by coercion
> 6: In rbindlist(dtlist) : NAs introduced by coercion
>> dt2[which(apply(is.na(dt2), 1, any)), table(tab)]
> tab
>   2  13  23  45  50
> 183   1  10   1   2
> So, for the most part it's coming from the second list data.table.
>> dtlist.first2
>> dtlist.first10
>> dtlist.first100
>> dtlist.first1000
>> dt.first2
>> dt.first10
>> dt.first100
> Warning message:
> In rbindlist(dtlist.first100) : NAs introduced by coercion
>> dt.first1000
> Warning messages:
> 1: In rbindlist(dtlist.first1000) : NAs introduced by coercion
> 2: In rbindlist(dtlist.first1000) : NAs introduced by coercion
> And NAs start getting introduced somewhere between 10 and 100 row
> data.tables, which seems really low.
>
>> Also if the list of data.table/data.frame passed to rbindlist
>> is called L, and rbindlist(L) returns an NA column, does
>> lapply(L, sapply, class) reveal any type differences?
>
>> do.call("rbind", lapply(dtlist, sapply, class))
>       blockfips   land_area water_area long      lat
>  [1,] "character" "integer" "integer"  "numeric" "numeric"
>  [2,] "character" "numeric" "numeric"  "numeric" "numeric"
>  [3,] "character" "integer" "integer"  "numeric" "numeric"
>  [4,] "character" "integer" "integer"  "numeric" "numeric"
>  [5,] "character" "integer" "integer"  "numeric" "numeric"
>  [6,] "character" "integer" "integer"  "numeric" "numeric"
>  [7,] "character" "integer" "integer"  "numeric" "numeric"
>  [8,] "character" "integer" "integer"  "numeric" "numeric"
>  [9,] "character" "integer" "integer"  "numeric" "numeric"
> [10,] "character" "integer" "integer"  "numeric" "numeric"
> [11,] "character" "integer" "integer"  "numeric" "numeric"
> [12,] "character" "integer" "integer"  "numeric" "numeric"
> [13,] "character" "numeric" "integer"  "numeric" "numeric"
> [14,] "character" "integer" "integer"  "numeric" "numeric"
> [15,] "character" "integer" "integer"  "numeric" "numeric"
> [16,] "character" "integer" "integer"  "numeric" "numeric"
> [17,] "character" "integer" "integer"  "numeric" "numeric"
> [18,] "character" "integer" "integer"  "numeric" "numeric"
> [19,] "character" "integer" "integer"  "numeric" "numeric"
> [20,] "character" "integer" "integer"  "numeric" "numeric"
> [21,] "character" "integer" "integer"  "numeric" "numeric"
> [22,] "character" "integer" "integer"  "numeric" "numeric"
> [23,] "character" "integer" "numeric"  "numeric" "numeric"
> [24,] "character" "integer" "integer"  "numeric" "numeric"
> [25,] "character" "integer" "integer"  "numeric" "numeric"
> [26,] "character" "integer" "integer"  "numeric" "numeric"
> [27,] "character" "integer" "integer"  "numeric" "numeric"
> [28,] "character" "integer" "integer"  "numeric" "numeric"
> [29,] "character" "integer" "integer"  "numeric" "numeric"
> [30,] "character" "integer" "integer"  "numeric" "numeric"
> [31,] "character" "integer" "integer"  "numeric" "numeric"
> [32,] "character" "integer" "integer"  "numeric" "numeric"
> [33,] "character" "integer" "integer"  "numeric" "numeric"
> [34,] "character" "integer" "integer"  "numeric" "numeric"
> [35,] "character" "integer" "integer"  "numeric" "numeric"
> [36,] "character" "integer" "integer"  "numeric" "numeric"
> [37,] "character" "integer" "integer"  "numeric" "numeric"
> [38,] "character" "integer" "integer"  "numeric" "numeric"
> [39,] "character" "integer" "integer"  "numeric" "numeric"
> [40,] "character" "integer" "integer"  "numeric" "numeric"
> [41,] "character" "integer" "integer"  "numeric" "numeric"
> [42,] "character" "integer" "integer"  "numeric" "numeric"
> [43,] "character" "integer" "integer"  "numeric" "numeric"
> [44,] "character" "integer" "integer"  "numeric" "numeric"
> [45,] "character" "numeric" "integer"  "numeric" "numeric"
> [46,] "character" "integer" "integer"  "numeric" "numeric"
> [47,] "character" "integer" "integer"  "numeric" "numeric"
> [48,] "character" "integer" "integer"  "numeric" "numeric"
> [49,] "character" "integer" "integer"  "numeric" "numeric"
> [50,] "character" "integer" "numeric"  "numeric" "numeric"
> [51,] "character" "integer" "integer"  "numeric" "numeric"
> And there's the problem: in the problem list data.tables column 2 or 3
> is numeric instead of integer.
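
A plausible mechanism for the NAs, sketched in base R: the first item's area columns are integer, so a later numeric column gets coerced to integer, and any double beyond the integer range has no integer representation. The values below are made up for illustration and are not taken from the census data.

```r
# Doubles beyond the integer range cannot be represented as integers.
# 3e9 is larger than .Machine$integer.max (2147483647).
big <- c(100, 3e9)

# as.integer() warns and returns NA for the out-of-range value;
# the warning is suppressed here only to keep the example quiet.
as_int <- suppressWarnings(as.integer(big))
is.na(as_int)   # FALSE TRUE
```

This would also fit the summary output, where the largest surviving land_area (2.133e+09) sits just below the integer maximum.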
>
>> It does sound like rbindlist should be issuing a warning or
>> being more helpful at least, anyway.
>> Hm. It seems I put it in but commented it out :
>> if (TYPEOF(thiscol) != TYPEOF(target)) {
>>     thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>>     coerced = TRUE;
>>     // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>>     // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
>> }
>> Likely that coerce is creating the NA. Types are taken from the first
>> item of L. If a column there is 'numeric' then in a later item L it's
>> character, that'll give rise to an NA.
>> Thinking about it, it can probably coerce the target to cope with the
>> later item ...
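
Until such a warning exists, the same check can be approximated at the R level. The following is a rough sketch, assuming every item of L has the same columns in the same order; `type_mismatches` is a hypothetical helper, not a data.table function. It uses typeof(), so a column reported as "numeric" by class() shows up as "double" here.

```r
# Hypothetical helper, not part of data.table: report every column whose
# storage type differs from the same column in the first item of L.
type_mismatches <- function(L) {
  first <- vapply(L[[1]], typeof, character(1))
  hits <- lapply(seq_along(L), function(i) {
    this <- vapply(L[[i]], typeof, character(1))
    j <- which(this != first)
    if (length(j) == 0L) return(NULL)
    data.frame(item = i, column = names(first)[j],
               type = unname(this[j]), first_type = unname(first[j]),
               stringsAsFactors = FALSE)
  })
  # NULL entries are dropped by rbind; the result is NULL if nothing differs.
  do.call(rbind, hits)
}

# Toy input: land_area is integer in item 1 but double in item 2.
L <- list(data.frame(land_area = 1:2, water_area = 3:4),
          data.frame(land_area = c(1.5, 2), water_area = 5:6))
m <- type_mismatches(L)
m
```

Each row of the result names the item and column that would be coerced to the first item's type, which is the information the commented-out warning was meant to carry.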
>>> dtlist
>>> dt
>>
>>> dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols=c("land_area", "water_area")]
>    land_area water_area
> 1:         0          0
> And it's fixed.
>
> Thanks,
> Patrick
>
>
> On Fri, Jan 4, 2013 at 4:52 AM, Matthew Dowle [via R] <[hidden email]> wrote:
>
>> Many thanks. I'll take a look. If you can find a way to narrow
>> down the problem then it might be quicker to resolve. Does it
>> happen with the first 2 items passed to rbindlist, the first
>> 10, which one causes the NA? If each item is chopped to the
>> first 2 rows, does it still happen?
>>
>> Also if the list of data.table/data.frame passed to rbindlist
>> is called L, and rbindlist(L) returns an NA column, does
>> lapply(L, sapply, class) reveal any type differences?
>>
>> It does sound like rbindlist should be issuing a warning or
>> being more helpful at least, anyway.
>>
>> Hm. It seems I put it in but commented it out :
>>
>> if (TYPEOF(thiscol) != TYPEOF(target)) {
>>     thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>>     coerced = TRUE;
>>     // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>>     // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
>> }
>>
>> Likely that coerce is creating the NA. Types are taken from the first
>> item of L. If a column there is 'numeric' then in a later item L it's
>> character, that'll give rise to an NA.
>>
>> Thinking about it, it can probably coerce the target to cope with the
>> later item ...
>>
>> On 03.01.2013 20:30, patricknic wrote:
>>
>>> Apologies, I forgot to switch the directories in the code. Corrected
>>> on nabble and below.
>>>
>>>
>>>
>>>
>>> # Directories
>>> tempwd > setwd(tempwd)
>>>
>>> # Packages
>>> library(dataframe)
>>> library(data.table)
>>> library(foreign)
>>>
>>> # Get blocks and coordinates
>>> state.fips > 15:42,
>>> 44:51, 53:56))
>>> tmpf > dtlist > cat("State", fips, ":\t")
>>> nm > dbfname > if (!file.exists(file.path(tempwd, dbfname))) {
>>> cat("Downloading...\t")
>>> url > paste0("http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/",
>>> nm, ".zip")
>>> download.file(url, destfile=tmp, quiet=FALSE)
>>> unzip(tmp, exdir=tempwd)
>>> }
>>> del > invisible(lapply(del[grep("dbf", del, invert=TRUE)], file.remove))
>>> cat("Reading...\t")
>>> df as.is=TRUE)
>>> dt > cat("Done\n")
>>> dt[, list(blockfips = GEOID, land_area = ALAND, water_area = AWATER,
>>>           long = as.numeric(INTPTLON),
>>>           lat = as.numeric(INTPTLAT))]
>>> })
>>> b >
>>> ### No NA problem:
>>> dtlist2 > b2 >
>>>
>>>
>>>