[datatable-help] Column-wise value replacement

Matthew Dowle mdowle at mdowle.plus.com
Wed Apr 18 10:55:48 CEST 2012


Melanie Bacou <mel <at> mbacou.com> writes:

> 
> Matthew,
> Thanks for the tips and comprehensive explanation!
> I like the set() approach, it's not very R-like but more elegant than eval()
> and get() -- I believe most data.table users might feel like the .SD option is
> more intuitive though (and many R users would tend to shy away from using
> loops)?

Maybe. But loops aren't always bad per se. Even some base functions use for() 
loops. Vectorization can sometimes make R code difficult to follow: the tail 
wags the dog. In this case the for() loop ticks all boxes: short code, 
intuitive, readable _and_ fast.  This isn't to say for() loops are back in 
vogue, just that they have a place, sometimes and in particular with data.table 
because assignment by reference with no copies at all allows for() loops to be 
fast.
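
A minimal sketch of the kind of loop meant here, assuming a toy table; the
column names and the `cols` vector are hypothetical and only illustrate the
pattern of touching one column per iteration:

```r
library(data.table)

# Toy table; names and values are illustrative only
DT <- data.table(id = 1:4, x = c(0L, 2L, 0L, 4L), y = c(5L, 0L, 7L, 0L))

cols <- c("x", "y")  # hypothetical subset of columns to clean
for (j in cols) {
  # which() converts the logical test into integer row positions,
  # which is what set() expects for its i argument
  set(DT, i = which(DT[[j]] == 0L), j = j, value = NA_integer_)
}
DT
```

Each iteration touches a single column, so the loop body stays the same size
however many columns are listed in `cols`.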

> I wonder if there's a rationale for using set() behind the scenes, once
> the .SD-without-by feature is implemented...

Yes. set() changes the columns (the existing vectors) in place, by reference.
This is in contrast to any method using ifelse() or similar, which allocates
and populates a new vector for every column and then replaces the whole
vector. The memory usage profile is quite different. Worse, if the RHS of
whichever assignment operator is used contains all the columns, then memory
for a copy of all those columns needs to be allocated first, before the
assignment can start (the dreaded out of memory error). A for loop using set()
with data.table uses far less "working" memory as it loops through each column
by reference.
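
One way to see the difference is to compare the address of a column vector
before and after each kind of assignment. The sketch below assumes
data.table's address() helper and small toy tables; the variable names are
made up for illustration:

```r
library(data.table)

# In-place update with set(): the column vector itself is modified
DT <- data.table(a = c(1L, 0L, 3L), b = c(0L, 5L, 6L))
before <- address(DT$b)
set(DT, i = which(DT$b == 0L), j = "b", value = NA_integer_)
same_vector <- identical(before, address(DT$b))

# Whole-column replacement: ifelse() builds a new vector for the column
DT2 <- data.table(a = c(1L, 0L, 3L), b = c(0L, 5L, 6L))
before2 <- address(DT2$b)
DT2[, b := ifelse(b == 0L, NA_integer_, b)]
new_vector <- !identical(before2, address(DT2$b))

same_vector  # set() kept the existing vector
new_vector   # the ifelse() route is expected to install a different vector
```

Both tables end up with the same values; only the amount of allocation along
the way differs.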

> By the way I also vote in favor of the ":= with by" feature, that seems like
> a worthwhile addition to data.table. Again I'm finding that syntax very
> intuitive.

Ok, great. Will get to it soon.

> --Mel.
>  
> On 2012-04-17 08:53, Matthew Dowle wrote:
> 
> Hi,
> 
> How about this :
> 
> http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-table
>
> Skip to the end marked EDIT and reverse 0 and NA for your case:
> 
> DT = data.table(a=c(1L,0L,3L),b=c(0L,5:6),c=c(7:8,0L))
>      a b c
> [1,] 1 0 7
> [2,] 0 5 8
> [3,] 3 6 0
> 
> for (i in names(DT)[2:3])
>     DT[get(i)==0L,i:=NA_integer_,with=FALSE]
> DT
>      a  b  c
> [1,] 1 NA  7
> [2,] 0  5  8
> [3,] 3  6 NA
>
> If you have 000's of columns, then this sort of thing is just what the new
> set() is for, to avoid the overhead of repeatedly calling [.data.table,
> e.g. :
> 
> for (i in 2:3)
>     set(DT,which(DT[[i]]==0L),i,NA_integer_)
> 
> Without the 'which' you get an error "i is type 'logical'. Must be
> integer, or numeric is coerced with warning.". I'll add to that message
> something like "logical isn't accepted as i for speed since set() is
> intended for inside loops; checking and coercing logical to integer row
> positions takes time. Wrap logical i with which() if required".
> 
> One reason for liking for() loops with data.table is that working on one
> column at a time makes sense for large tables (less working memory
> needed). I thought about constructing a list() RHS of := with a vector of
> column names on the LHS of :=, but that approach doesn't scale as the
> number of columns grows, due to needed space for the entire RHS.  The
> for() loop above is much better for that reason.  [Multiple LHS of := is
> more for when the RHS is a single value repeated for all the columns, such
> as 0L or NA, or, NULL to delete multiple columns in one step.]
> 
> Finally, the .SD approach should work when bug #1732 is fixed (".SD, .N
> and .BY should be available when by="" and by=NULL"), but still not as
> efficient as the for loop above.
> 
> Matthew
>
> Dear all,
> I have a large data.table and I am simply trying to replace all zeros with
> NAs in a subset of columns. So I tried first:
>
> dt[, lapply(.SD, function(x) x := ifelse(x==0, NA, x)), .SDcols=3:30]
> Error in lapply(.SD, function(x) `:=`(x, ifelse(x == 0, NA, x))) :
>   object '.SD' not found
>
> Clearly that's not the right approach... Then I tried:
>
> for (i in names(dt)[3:30]) {
>   eval(parse(text=paste("dt[`", i, "`==0, `", i, "` := NA]", sep="")))
> }
>
> That worked but is rather ugly. Would you recommend any better way to avoid
> the eval(parse()) to perform such simple tasks?
> Thanks in advance,
> --Mel.
>
> _______________________________________________
> datatable-help mailing list
> datatable-help <at> lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help