[datatable-help] NAs introduced by coercion in rbindlist()

Matthew Dowle mdowle at mdowle.plus.com
Sat Jan 5 00:54:47 CET 2013


 

Excellent, thanks for confirming. Thinking about it now with fresh
eyes, I've raised a new feature request:

  FR#2456 rbindlist should choose the highest type per column, not the first

  https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2456&group_id=240&atid=978

where 'highest' means in this hierarchy: LGLSXP < INTSXP < REALSXP <
CPLXSXP < STRSXP

That would be easy and wouldn't hurt performance at all.
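Until something like FR#2456 is in place, the same idea can be roughed out
at R level: find the highest type per column across all items, coerce every
item to it, then bind. This is only a sketch. The names type_rank and
promote_columns are made up for illustration and are not part of data.table,
and the sample values are invented. It assumes every item has the same
columns in the same order and that all columns are plain atomic vectors (no
factors or list columns). In typeof() terms the hierarchy reads
logical < integer < double < complex < character.

```r
# Hypothetical helper, not part of data.table.
# Rank of each atomic type, mirroring LGLSXP < INTSXP < REALSXP < CPLXSXP < STRSXP.
type_rank <- c(logical = 1L, integer = 2L, double = 3L,
               complex = 4L, character = 5L)

promote_columns <- function(L) {
  for (j in seq_along(L[[1L]])) {
    # Type of column j in every item of the list
    types <- vapply(L, function(x) typeof(x[[j]]), character(1))
    stopifnot(all(types %in% names(type_rank)))
    # Highest type seen anywhere in this column
    highest <- names(type_rank)[max(type_rank[types])]
    for (i in seq_along(L)) {
      if (types[i] != highest) {
        col <- L[[i]][[j]]
        storage.mode(col) <- highest   # coerce upwards, never downwards
        L[[i]][[j]] <- col
      }
    }
  }
  L
}

# Made-up data: item 1 has an integer land_area, item 2 a double one
L <- list(
  data.frame(blockfips = c("a", "b"), land_area = c(10L, 20L),
             stringsAsFactors = FALSE),
  data.frame(blockfips = "c", land_area = 2.5e9,
             stringsAsFactors = FALSE)
)
promoted <- promote_columns(L)
sapply(promoted, function(x) typeof(x$land_area))
```

After promotion every item agrees on the column types, so passing the
promoted list to rbindlist() should no longer need to coerce anything down
to the first item's type.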

On 04.01.2013 23:18, patricknic wrote: 

> Some output:
>
> ## NAs in bound data
>> dt
> Warning messages:
> 1: In rbindlist(dtlist) : NAs introduced by coercion
> 2: In rbindlist(dtlist) : NAs introduced by coercion
> 3: In rbindlist(dtlist) : NAs introduced by coercion
> 4: In rbindlist(dtlist) : NAs introduced by coercion
> 5: In rbindlist(dtlist) : NAs introduced by coercion
> 6: In rbindlist(dtlist) : NAs introduced by coercion
> ## No NAs in list of data.tables
>> sapply(dtlist, function(x) sum(is.na(x)))
>  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> [32] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> ## Summary of bound data.table
>> summary(dt)
>   blockfips           land_area          water_area
>  Length:11083767    Min.   :0.000e+00   Min.   :0.000e+00
>  Class :character   1st Qu.:8.098e+03   1st Qu.:0.000e+00
>  Mode  :character   Median :2.478e+04   Median :0.000e+00
>                     Mean   :7.470e+05   Mean   :5.782e+04
>                     3rd Qu.:1.788e+05   3rd Qu.:0.000e+00
>                     Max.   :2.133e+09   Max.   :2.112e+09
>                     NA's   :183         NA's   :14
>       long               lat
>  Min.   :-179.13   Min.   :18.91
>  1st Qu.: -99.74   1st Qu.:34.18
>  Median : -90.09   Median :38.64
>  Mean   : -93.01   Mean   :38.11
>  3rd Qu.: -82.07   3rd Qu.:41.73
>  Max.   : 179.75   Max.   :71.40
>
>> Many thanks. I'll take a look. If you can find a way to narrow
>> down the problem then it might be quicker to resolve. Does it
>> happen with the first 2 items passed to rbindlist, the first
>> 10, which one causes the NA? If each item is chopped to the
>> first 2 rows, does it still happen?
> 
>> lapply(seq_along(dtlist), function(x) dtlist[[x]][, tab := x])
>> dt2
> Warning messages:
> 1: In rbindlist(dtlist) : NAs introduced by coercion
> 2: In rbindlist(dtlist) : NAs introduced by coercion
> 3: In rbindlist(dtlist) : NAs introduced by coercion
> 4: In rbindlist(dtlist) : NAs introduced by coercion
> 5: In rbindlist(dtlist) : NAs introduced by coercion
> 6: In rbindlist(dtlist) : NAs introduced by coercion
>> dt2[which(apply(is.na(dt2), 1, any)), table(tab)]
> tab
>   2  13  23  45  50
> 183   1  10   1   2
> So, for the most part it's coming from the second list data.table.

>> dtlist.first2
>> dtlist.first10
>> dtlist.first100
>> dtlist.first1000
>> dt.first2
>> dt.first10
>> dt.first100
> Warning message:
> In rbindlist(dtlist.first100) : NAs introduced by coercion
>> dt.first1000
> Warning messages:
> 1: In rbindlist(dtlist.first1000) : NAs introduced by coercion
> 2: In rbindlist(dtlist.first1000) : NAs introduced by coercion
> And NAs start getting introduced somewhere between 10 and 100 row
> data.tables, which seems really low.
>
>> Also if the list of data.table/data.frame passed to rbindlist
>> is called L, and rbindlist(L) returns an NA column, does
>> lapply(L, sapply, class) reveal any type differences?
>
>> do.call("rbind", lapply(dtlist, sapply, class))
>       blockfips   land_area water_area long      lat
>  [1,] "character" "integer" "integer"  "numeric" "numeric"
>  [2,] "character" "numeric" "numeric"  "numeric" "numeric"
>  [3,] "character" "integer" "integer"  "numeric" "numeric"
>  [4,] "character" "integer" "integer"  "numeric" "numeric"
>  [5,] "character" "integer" "integer"  "numeric" "numeric"
>  [6,] "character" "integer" "integer"  "numeric" "numeric"
>  [7,] "character" "integer" "integer"  "numeric" "numeric"
>  [8,] "character" "integer" "integer"  "numeric" "numeric"
>  [9,] "character" "integer" "integer"  "numeric" "numeric"
> [10,] "character" "integer" "integer"  "numeric" "numeric"
> [11,] "character" "integer" "integer"  "numeric" "numeric"
> [12,] "character" "integer" "integer"  "numeric" "numeric"
> [13,] "character" "numeric" "integer"  "numeric" "numeric"
> [14,] "character" "integer" "integer"  "numeric" "numeric"
> [15,] "character" "integer" "integer"  "numeric" "numeric"
> [16,] "character" "integer" "integer"  "numeric" "numeric"
> [17,] "character" "integer" "integer"  "numeric" "numeric"
> [18,] "character" "integer" "integer"  "numeric" "numeric"
> [19,] "character" "integer" "integer"  "numeric" "numeric"
> [20,] "character" "integer" "integer"  "numeric" "numeric"
> [21,] "character" "integer" "integer"  "numeric" "numeric"
> [22,] "character" "integer" "integer"  "numeric" "numeric"
> [23,] "character" "integer" "numeric"  "numeric" "numeric"
> [24,] "character" "integer" "integer"  "numeric" "numeric"
> [25,] "character" "integer" "integer"  "numeric" "numeric"
> [26,] "character" "integer" "integer"  "numeric" "numeric"
> [27,] "character" "integer" "integer"  "numeric" "numeric"
> [28,] "character" "integer" "integer"  "numeric" "numeric"
> [29,] "character" "integer" "integer"  "numeric" "numeric"
> [30,] "character" "integer" "integer"  "numeric" "numeric"
> [31,] "character" "integer" "integer"  "numeric" "numeric"
> [32,] "character" "integer" "integer"  "numeric" "numeric"
> [33,] "character" "integer" "integer"  "numeric" "numeric"
> [34,] "character" "integer" "integer"  "numeric" "numeric"
> [35,] "character" "integer" "integer"  "numeric" "numeric"
> [36,] "character" "integer" "integer"  "numeric" "numeric"
> [37,] "character" "integer" "integer"  "numeric" "numeric"
> [38,] "character" "integer" "integer"  "numeric" "numeric"
> [39,] "character" "integer" "integer"  "numeric" "numeric"
> [40,] "character" "integer" "integer"  "numeric" "numeric"
> [41,] "character" "integer" "integer"  "numeric" "numeric"
> [42,] "character" "integer" "integer"  "numeric" "numeric"
> [43,] "character" "integer" "integer"  "numeric" "numeric"
> [44,] "character" "integer" "integer"  "numeric" "numeric"
> [45,] "character" "numeric" "integer"  "numeric" "numeric"
> [46,] "character" "integer" "integer"  "numeric" "numeric"
> [47,] "character" "integer" "integer"  "numeric" "numeric"
> [48,] "character" "integer" "integer"  "numeric" "numeric"
> [49,] "character" "integer" "integer"  "numeric" "numeric"
> [50,] "character" "integer" "numeric"  "numeric" "numeric"
> [51,] "character" "integer" "integer"  "numeric" "numeric"
> And there's the problem: in the problem list data.tables,
> column 2 or 3 is numeric instead of integer.
>
>> It does sound like rbindlist should be issuing a warning or
>> being more helpful at least, anyway.
>>
>> Hm. It seems I put it in but commented it out :
>> if (TYPEOF(thiscol) != TYPEOF(target)) {
>>     thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>>     coerced = TRUE;
>>     // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>>     // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
>> }
>> Likely that coerce is creating the NA. Types are taken from the first
>> item of L. If a column there is 'numeric' then in a later item L it's
>> character, that'll give rise to an NA.
>> Thinking about it, it can probably coerce the target to cope with the
>> later item ...
>>> dtlist 
>>> dt 
>> 
>>> dt[, lapply(.SD, function(x) sum(is.na(x))), .SDcols=c("land_area", "water_area")]
>    land_area water_area
> 1:         0          0
> And it's fixed. 
> 
> Thanks, 
> Patrick 
> 
>
> On Fri, Jan 4, 2013 at 4:52 AM, Matthew Dowle [via R] <[hidden email]> wrote:
>
>> Many thanks. I'll take a look. If you can find a way to narrow
>> down the problem then it might be quicker to resolve. Does it
>> happen with the first 2 items passed to rbindlist, the first
>> 10, which one causes the NA? If each item is chopped to the
>> first 2 rows, does it still happen?
>>
>> Also if the list of data.table/data.frame passed to rbindlist
>> is called L, and rbindlist(L) returns an NA column, does
>> lapply(L, sapply, class) reveal any type differences?
>>
>> It does sound like rbindlist should be issuing a warning or
>> being more helpful at least, anyway.
>>
>> Hm. It seems I put it in but commented it out :
>>
>> if (TYPEOF(thiscol) != TYPEOF(target)) {
>>     thiscol = PROTECT(coerceVector(thiscol, TYPEOF(target)));
>>     coerced = TRUE;
>>     // TO DO: options(datatable.pedantic=TRUE) to issue this warning :
>>     // warning("Column %d of item %d is type '%s', inconsistent with column %d of item %d's type ('%s')",j+1,i+1,type2char(TYPEOF(thiscol)),j+1,first+1,type2char(TYPEOF(target)));
>> }
>>
>> Likely that coerce is creating the NA. Types are taken from the first
>> item of L. If a column there is 'numeric' then in a later item L it's
>> character, that'll give rise to an NA.
>>
>> Thinking about it, it can probably coerce the target to cope with the
>> later item ...
>>
>> On 03.01.2013 20:30, patricknic wrote:
>>
>>> Apologies, I forgot to switch the directories in the code. Corrected
>>> on nabble and below.
>>>
>>> # Directories
>>> tempwd > setwd(tempwd)
>>>
>>> # Packages
>>> library(dataframe)
>>> library(data.table)
>>> library(foreign)
>>>
>>> # Get blocks and coordinates
>>> state.fips > 15:42,
>>> 44:51, 53:56))
>>> tmpf > dtlist > cat("State", fips, ":\t")
>>> nm > dbfname > if (!file.exists(file.path(tempwd, dbfname))) {
>>> cat("Downloading...\t")
>>> url > paste0("http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/",
>>> nm, ".zip")
>>> download.file(url, destfile=tmp, quiet=FALSE)
>>> unzip(tmp, exdir=tempwd)
>>> }
>>> del > invisible(lapply(del[grep("dbf", del, invert=TRUE)], file.remove))
>>> cat("Reading...\t")
>>> df as.is=TRUE)
>>> dt > cat("Done\n")
>>> dt[, list(blockfips = GEOID, land_area = ALAND, water_area = AWATER,
>>> long = as.numeric(INTPTLON),
>>> lat = as.numeric(INTPTLAT))]
>>> })
>>> b >
>>> ### No NA problem:
>>> dtlist2 > b2 >
>>>
>>> --
>>> View this message in context:
>>> http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654577.html [2]
>>> Sent from the datatable-help mailing list archive at Nabble.com.
>>> _______________________________________________
>>> datatable-help mailing list
>>> [hidden email] [3]
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [4]
>> _______________________________________________
>> datatable-help mailing list
>> [hidden email] [5]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help [6]
>>




Links:
------
[1]
http://www2.census.gov/geo/tiger/TIGER2011/TABBLOCK/
[2]
http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654577.html
[3]
http://user/SendEmail.jtp?type=node&node=4654623&i=0
[4]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[5]
http://user/SendEmail.jtp?type=node&node=4654623&i=1
[6]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[7]
http://r.789695.n4.nabble.com/NAs-introduced-by-coercion-in-rbindlist-tp4654576p4654623.html
[8]
http://r.789695.n4.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
[9]
http://is.na
[10] http://is.na
[11]
http://r.789695.n4.nabble.com/datatable-help-f2315188.html