[datatable-help] Wonder whether there is an easier way to change part of data.table values

Matthew Dowle mdowle at mdowle.plus.com
Wed Aug 11 02:18:42 CEST 2010


On Fri, 2010-08-06 at 01:13 -0700, Harish wrote:
> Thought #1 -- Consistency
> ----------
> Quick thought -- would it be more consistent if the interface took in the name of the variable directly rather than a string?
> In other words, would
>    dt[,a] <- "b"
> be more consistent with the existing data.table interface than
>    dt[,"a"] <- "b"
> ?
> Then again, taking in a string of the column name avoids the temptation to use complex expressions (like a+23*c) as the name of the column.  Although I just argued against my own point, I am curious what other people think.

I agree that dt[,"a"] is easier, which is how it is currently
implemented. That way a variable LOC, holding either a character column
name or an integer column position, can be used in the usual data.frame
way: dt[,LOC] <- "b".
If we took dt[,a] instead, we might need dt[,eval(LOC)] <- "b", which is
getting onerous.
As you say, I can't think why we would want an expression in the j on
the LHS of [<-.
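
A minimal sketch of the dt[,LOC] form, assuming an installed data.table
whose [<- method accepts either a character column name or an integer
column position in j, as described here. The table and its column names
are hypothetical, chosen only for illustration.

```r
library(data.table)

# Hypothetical example table; the column names are illustrative only.
dt <- data.table(a = c("x", "y", "z"), n = 1:3)

# LOC holds a character column name ...
LOC <- "a"
dt[, LOC] <- "b"

# ... or an integer column position (here the second column, n).
LOC <- 2L
dt[, LOC] <- 0L

print(dt)
```

Because j is an ordinary value on the LHS, the same line of code serves
both cases and no eval() wrapper is needed.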

> 
> 
> Thought #2 -- Alternative interfaces suggested in past
> ----------
> 
> Also, there were some other messages in the past discussing having an interface similar to:
>    dt[,{a = "b"}]
>    dt[,{T(temp_a) = "b"; a = temp_a}]
>    etc.

That was here :
http://r.789695.n4.nabble.com/convenience-function-for-transforming-variables-and-adding-them-to-the-table-tp2315324p2315326.html

> 
> What are the merits of having the ability to set values integrated into the "[.data.table" function?

Pros :
1) what you assign to the column can be an expression of columns too, in
the usual way.  When the expression is on the RHS of [<-, it doesn't see
the columns as variables.
2) you could do several column updates in one query, different
expressions for each column
3) that multi-column update could be 'by' group
4) the assignments appearing later in the j expression could use results
from the previous assignments, and do that within group.
5) could update very quickly a small subset of rows or groups, using i
and mult='all'; => "update by without by"
6) consistency with SQL 'update'. An update is very similar to a select,
in syntax. So j seems a natural place to put the update.
7) potentially much less working memory would be needed, as the
assignment would be done within by rather than building up a large
result to be assigned to the entire column(s).
8) Combination of update and select in one simple short query; e.g.,
DT[,{z=x*mean(y);sum(z)},by=id]
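
A rough sketch of what the query in 8) involves, under stated
assumptions. The table is hypothetical. The first half shows how the
same j expression behaves without update syntax: z is only a local
variable, so the table is not changed. The second half assumes a
data.table release that provides the := operator for assignment by
reference; := is not the syntax proposed here, just one way the
by-group update could be written where it is available.

```r
library(data.table)

# Hypothetical table; id, x and y mirror the names used in pro 8.
DT <- data.table(id = c("a", "a", "b", "b"),
                 x  = 1:4,
                 y  = c(10, 20, 30, 40))

# Without update syntax, z is local to the { } block, so this is a
# pure select returning one sum per group.
res <- DT[, {z = x * mean(y); sum(z)}, by = id]

# DT itself is unchanged at this point: no column z has been added.
had_z <- "z" %in% names(DT)

# Assumption: a data.table version that provides `:=`.
# The update half, done by group and by reference ...
DT[, z := x * mean(y), by = id]
# ... followed by the select half as a second query.
res2 <- DT[, sum(z), by = id]

print(res)
print(DT)
```

In this sketch the update and the select are two queries; the point of
8) is that they could be combined into one.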

Cons :
1) It isn't the norm for "=" within seemingly local scope {} to go and
change a table by reference, by default. The T() syntax is ugly and
potentially confusing.
2) It may be getting quite far from R style programming.
3) It might be tricky to implement
4) Might be too easy to change data by accident. To tackle that, tables
could have a read-only attribute, or you could only update tables in
this way when they are in local scope, perhaps.

Note that update syntax within j expressions would be additional to [<-
syntax; the user could choose which one they preferred in different
situations. It could be an optional feature too, off by default, if some
users were worried about the cons.

I'm interested in views on this too. Especially any aspects missed above
and potential pitfalls. Feel free to throw in thoughts.

Matthew

> 
> Regards,
> Harish
> 
> 
> --- On Mon, 8/2/10, Short, Tom <TShort at epri.com> wrote:
> 
> > From: Short, Tom <TShort at epri.com>
> > Subject: Re: [datatable-help] Wonder whether there is an easier way to change part of data.table values
> > To: "Branson Owen" <branson.owen at gmail.com>, datatable-help at lists.r-forge.r-project.org
> > Date: Monday, August 2, 2010, 7:28 PM
> > I've just checked in versions of
> > [<-.data.table and $<-.data.table that
> > check for the columns adjusted and reset the key if
> > appropriate. This
> > brings up some incompatibilities:
> > 
> > (*) KEYS -- Before, you could do:
> > 
> > dt$key_column = anything
> > 
> > And it wouldn't change the status of the key. Now, the key
> > will be
> > nullified.
> > 
> > (*) ASSIGNMENT DIFFERENCE
> > 
> > Before: dt["a"] <- "b" meant change column a.
> > Now:    dt[,"a"] <- "b" 
> > 
> > Now, you can do
> > dt[J("a"), "somecol"] <- 33
> > which assigns 33 to the column "somecol" in the rows where the
> > key is equal to "a".
> > 
> > (*) QUESTIONS
> > 
> > - Do we need a "keep.key" argument for cases where we don't
> > want the key
> > nullified (the user knows the order is unaffected)? This
> > isn't really
> > possible for dt$a[1:4] <- something.
> > 
> > - Is it a good idea to use data.table-style indexing for
> > the i part of
> > [<-.data.table? I was skeptical when Branson first asked
> > (preferring
> > data.frame compatibility), but it makes more sense now that
> > I think
> > about it.
> > 
> > Finally, we should put a warning in the documentation
> > somewhere that
> > functions that re-arrange or assign values to a data frame
> > may "corrupt"
> > the key of a data.table. Data.table-aware functions should
> > account for
> > this, but other functions may not.
> > 
> > - Tom
> >  
> 
> 
> 
>       
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help



