[datatable-help] Wonder whether there is an easier way to change part of data.table values

Short, Tom TShort at epri.com
Wed Aug 11 02:45:26 CEST 2010


> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org 
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org] 
> On Behalf Of Matthew Dowle
> Sent: Tuesday, August 10, 2010 20:19
> To: Harish
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Wonder whether there is an 
> easier way to change part of data.table values
> 
> 
> On Fri, 2010-08-06 at 01:13 -0700, Harish wrote:
> > Thought #1 -- Consistency
> > ----------
> > Quick thought -- would it be more consistent if the interface took in
> > the name of the variable directly rather than a string?
> > In other words, would
> >    dt[,a] <- "b"
> > be more consistent with the existing data.table interface than
> >    dt[,"a"] <- "b"
> > ?
> > Then again, taking in a string of the column name avoids the temptation
> > to have complex expressions (like a+23*c) as the name of the column.
> > Although I just argued against my own point, I am curious what other
> > people think.
> 
> My current thought is to agree that dt[,"a"] is easier, the 
> way it is currently implemented. This way a variable LOC can 
> be used, holding either a character column name or integer 
> location, in the usual data.frame way :  dt[,LOC]<-"b".
> If it were dt[,a], then we might need dt[,eval(LOC)]<-"b", which is 
> getting onerous.
> As you say, I can't think why we might want an expression in 
> the j in the LHS of [<-.

I like the existing notation better: dt[,"a"] <- "b"
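
A minimal sketch of the LOC point above, assuming a current data.table
install; the table and its columns are made up for illustration:

```r
library(data.table)

# Illustrative table; the column names are assumptions, not from the thread.
dt <- data.table(a = c("x", "y", "z"), n = 1:3)

# LOC holds a character column name, as with a data.frame.
LOC <- "a"
dt[, LOC] <- "b"

# LOC holds an integer column position instead.
LOC <- 2L
dt[, LOC] <- 0L

print(dt)
```

Because j is an ordinary value here, the same line works for either kind
of LOC without any eval().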
 
> > 
> > 
> > Thought #2 -- Alternative interfaces suggested in past
> > ----------
> > 
> > Also, there were some other messages in the past discussing having an
> > interface similar to:
> >    dt[,{a = "b"}]
> >    dt[,{T(temp_a) = "b"; a = temp_a}]
> >    etc.
> 
> That was here :
> http://r.789695.n4.nabble.com/convenience-function-for-transforming-variables-and-adding-them-to-the-table-tp2315324p2315326.html
> 
> > 
> > What are the merits of having the ability to set values integrated
> > into the "[.data.table" function?
> 
> Pros :
> 1) what you assign to the column can be an expression of 
> columns too, in the usual way.  When the value sits on the RHS of 
> [<-, it doesn't see the columns as variables.
> 2) you could do several column updates in one query, 
> different expressions for each column
> 3) that multi-column update could be 'by' group
> 4) the assignments appearing later in the j expression could 
> use results from the previous assignments, and do that within group.
> 5) could update very quickly a small subset of rows or 
> groups, using i and mult='all'; => "update by without by"
> 6) consistency with SQL 'update'. An update is very similar 
> to a select, in syntax. So j seems a natural place to put the update.
> 7) potentially much less working memory would be needed, as 
> the assignment would be done within by rather than building 
> up a large result to be assigned to the entire column(s).
> 8) Combination of update and select in one simple short 
> query; e.g., DT[,{z=x*mean(y);sum(z)},by=id]
> 
> Cons :
> 1) It isn't the norm for "=" within seemingly local scope {} 
> to go and change a table by reference, by default. The T() 
> syntax is ugly and potentially confusing.
> 2) It may be getting quite far from R style programming.
> 3) It might be tricky to implement
> 4) Might be too easy to change data by accident. To tackle 
> that, tables could have a read-only attribute, or you could 
> only update tables in this way when they are in local scope, perhaps.
> 
> Note that update syntax within j expressions would be 
> additional to [<- syntax; the user could choose which one 
> they preferred in different situations. It could be an 
> optional feature too, off by default, if some users were 
> worried about the cons.
> 
> I'm interested in views on this too. Especially any aspects 
> missed above and potential pitfalls. Feel free to throw in thoughts.
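
To make pro 1 concrete, here is a small sketch, assuming a current
data.table install; the table and column names are invented for the
example:

```r
library(data.table)

# Illustrative table; x and y are assumed column names.
dt <- data.table(x = 1:3, y = c(10, 20, 30))

# On the RHS of [<-, x and y are looked up in the calling environment,
# not in dt, so the columns have to be qualified explicitly.
dt[, "z"] <- dt$x * dt$y

# Inside j of a plain query, the columns are visible as variables.
total <- dt[, sum(x * y)]

print(dt)
print(total)
```

Writing the first assignment as dt[, "z"] <- x * y would only work if x
and y also happened to exist outside the table.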

For me, data tables are mostly read only except for adding columns. I
don't like the idea of accidentally changing a portion of my data table
that took an hour to generate. [.data.table is pretty complex already,
and this will just make it worse. One way to add this type of capability
may be to create a new class, say "writeable.data.table" that inherits
from both data.table and data.frame. Another option is to write a
function that does something like transform that will operate on the
data table in place. I'd prefer one of these to messing with the
[.data.table syntax.
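
A rough sketch of the second option, assuming the set() function found
in current data.table releases. transform_in_place is a hypothetical
name, not something in the package, and this ignores i and by entirely:

```r
library(data.table)

# Hypothetical helper: evaluates each named expression with the table's
# columns visible as variables, then writes the result into the table
# by reference via set().
transform_in_place <- function(dt, ...) {
  caller <- parent.frame()
  exprs <- as.list(substitute(list(...)))[-1L]
  for (nm in names(exprs)) {
    value <- eval(exprs[[nm]], envir = dt, enclos = caller)
    set(dt, j = nm, value = value)
  }
  invisible(dt)
}

# Illustrative table; x and y are assumed column names.
dt <- data.table(x = 1:3, y = c(10, 20, 30))
transform_in_place(dt, z = x * y)

print(dt)
```

The point of keeping it as a separate function is that changing the
table is always an explicit call, and [.data.table stays read-only.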

- Tom


