[datatable-help] Column-wise value replacement

Matthew Dowle mdowle at mdowle.plus.com
Tue Apr 17 17:53:06 CEST 2012


Hi,

How about this :

http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table

Skip to the end marked EDIT and reversing 0 and NA for your case:

DT = data.table(a=c(1L,0L,3L),b=c(0L,5:6),c=c(7:8,0L))
     a b c
[1,] 1 0 7
[2,] 0 5 8
[3,] 3 6 0

for (i in names(DT)[2:3])
    DT[get(i)==0L,i:=NA_integer_,with=FALSE]

> DT
     a  b  c
[1,] 1 NA  7
[2,] 0  5  8
[3,] 3  6 NA
>

If you have 000's of columns, then this sort of thing is just what the new
set() is for, to avoid the overhead of repeatedly calling [.data.table,
e.g. :

for (i in 2:3)
    set(DT,which(DT[[i]]==0L),i,NA_integer_)

Without the 'which' you get an error "i is type 'logical'. Must be
integer, or numeric is coerced with warning.". I'll add to that message
something like "logical isn't accepted as i for speed since set() is
intended for inside loops; checking and coercing logical to integer row
positions takes time. Wrap logical i with which() if required".

One reason for liking for() loops with data.table is that working on one
column at a time makes sense for large tables (less working memory
needed). I thought about constructing a list() RHS of := with a vector of
column names on the LHS of :=, but that approach doesn't scale as the
number of columns grows, due to needed space for the entire RHS.  The
for() loop above is much better for that reason.  [Multiple LHS of := is
more for when the RHS is a single value repeated for all the columns, such
as 0L or NA, or, NULL to delete multiple columns in one step.]

Finally, the .SD approach should work when bug #1732 is fixed (".SD, .N
and .BY should be available when by="" and by=NULL"), but still not as
efficient as the for loop above.

Matthew


> Dear all,
>
> I have a large data.table and I am simply trying to replace all zeros
> with NAs in a subset of columns. So I tried first:
>
>> dt[, lapply(.SD, function(x) x := ifelse(x==0, NA, x)), .SDcols=3:30]
> Error in lapply(.SD, function(x) `:=`(x, ifelse(x == 0, NA, x))) :
>    object '.SD' not found
>
> Clearly that's not the right approach...
>
> Then I tried:
>
>> for (i in names(dt)[3:30]) {
>    eval(parse(text=paste("dt[`", i, "`==0, `", i, "` := NA]", sep="")))
> }
> That worked but is rather ugly.
>
> would you recommend any better way to avoid the eval(parse()) to
> perform such simple tasks?
>
> Thanks in advance, --Mel.
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>




More information about the datatable-help mailing list