[datatable-help] Column-wise value replacement

Melanie Bacou mel at mbacou.com
Wed Apr 18 09:10:01 CEST 2012


 

Matthew, 

Thanks for the tips and comprehensive explanation! 

I
like the set() approach, it's not very R-like but more elegant than
eval() and get() -- I believe most data.table users might feel like the
.SD option is more intuitive though (and many R users would tend to shy
away from using loops)? 

I wonder if there's a rational for using set()
behind the scenes, once the .SD-without-by feature is implemented...


By the way I also vote in favor of the ":= with by" feature, that
seems like a worthwhile addition to data.table. Again I'm finding that
syntax very intuitive. 

--Mel. 

On 2012-04-17 08:53, Matthew Dowle
wrote: 

> Hi,
> 
> How about this :
> 
>
http://stackoverflow.com/questions/7235657/fastest-way-to-replace-nas-in-a-large-data-tableSkip
to the end marked EDIT and reversing 0 and NA for your case:
> 
> DT =
data.table(a=c(1L,0L,3L),b=c(0L,5:6),c=c(7:8,0L))
> a b c
> [1,] 1 0 7
>
[2,] 0 5 8
> [3,] 3 6 0
> 
> for (i in names(DT)[2:3])
>
DT[get(i)==0L,i:=NA_integer_,with=FALSE]
> 
>> DT
> 
> a b c
> [1,] 1 NA
7
> [2,] 0 5 8
> [3,] 3 6 NAIf you have 000's of columns, then this sort
of thing is just what the new
> set() is for, to avoid the overhead of
repeatedly calling [.data.table,
> e.g. :
> 
> for (i in 2:3)
>
set(DT,which(DT[[i]]==0L),i,NA_integer_)
> 
> Without the 'which' you
get an error "i is type 'logical'. Must be
> integer, or numeric is
coerced with warning.". I'll add to that message
> something like
"logical isn't accepted as i for speed since set() is
> intended for
inside loops; checking and coercing logical to integer row
> positions
takes time. Wrap logical i with which() if required".
> 
> One reason
for liking for() loops with data.table is that working on one
> column
at a time makes sense for large tables (less working memory
> needed). I
thought about constructing a list() RHS of := with a vector of
> column
names on the LHS of :=, but that approach doesn't scale as the
> number
of columns grows, due to needed space for the entire RHS. The
> for()
loop above is much better for that reason. [Multiple LHS of := is
> more
for when the RHS is a single value repeated for all the columns, such
>
as 0L or NA, or, NULL to delete multiple columns in one step.]
> 
>
Finally, the .SD approach should work when bug #1732 is fixed (".SD,
.N
> and .BY should be available when by="" and by=NULL"), but still not
as
> efficient as the for loop above.
> 
> Matthew
> 
>> Dear all, I
have a large data.table and I am simply trying to replace all zeros with
NAs in a subset of columns. So I tried first: 
>> 
>>> dt[, lapply(.SD,
function(x) x := ifelse(x==0, NA, x)), .SDcols=3:30]
>> Error in
lapply(.SD, function(x) `:=`(x, ifelse(x == 0, NA, x))) : object '.SD'
not found Clearly that's not the right approach... Then I tried: 
>>

>>> for (i in names(dt)[3:30]) {
>> eval(parse(text=paste("dt[`", i,
"`==0, `", i, "` := NA]", sep=""))) } That worked but is rather ugly.
would you recommend any better way to avoid the eval(parse()) to perform
such simple tasks? Thanks in advance, --Mel.
_______________________________________________ datatable-help mailing
list datatable-help at lists.r-forge.r-project.org [1]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
[2]

 

Links:
------
[1]
mailto:datatable-help at lists.r-forge.r-project.org
[2]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20120418/6474ddd6/attachment-0001.html>


More information about the datatable-help mailing list