[datatable-help] Wonder whether there is an easier way to change part of data.table values
Short, Tom
TShort at epri.com
Wed Aug 11 02:45:26 CEST 2010
> -----Original Message-----
> From: datatable-help-bounces at lists.r-forge.r-project.org
> [mailto:datatable-help-bounces at lists.r-forge.r-project.org]
> On Behalf Of Matthew Dowle
> Sent: Tuesday, August 10, 2010 20:19
> To: Harish
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] Wonder whether there is an
> easier way to change part of data.table values
>
>
> On Fri, 2010-08-06 at 01:13 -0700, Harish wrote:
> > Thought #1 -- Consistency
> > ----------
> > Quick thought -- would it be more consistent if the interface took
> > in the name of the variable directly rather than a string?
> > In other words, would
> > dt[,a] <- "b"
> > be more consistent with the existing data.table interface than
> > dt[,"a"] <- "b"
> > ?
> > Then again, taking in a string of the column name avoids the
> > temptation to have complex expressions (like a+23*c) as the name
> > of the column. Although I just argued against my own point, I am
> > curious what other people think.
>
> My current thought is to agree that dt[,"a"] is easier, the
> way it is currently implemented. This way a variable LOC can
> be used, holding either a character column name or integer
> location, in the usual data.frame way : dt[,LOC]<-"b".
> If it were dt[,a], we might then need dt[,eval(LOC)]<-"b", which is
> getting onerous.
> As you say, I can't think why we might want an expression in
> the j in the LHS of [<-.
I like the existing notation better: dt[,"a"] <- "b"
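As a small illustration of the LOC point above, here is a sketch on a plain data.frame using base R only. The toy data and column names are made up, and data.table's own [<- method may differ in detail.

```r
# Toy data.frame; column names are assumptions for illustration only.
df <- data.frame(a = c("x", "y", "z"), n = 1:3, stringsAsFactors = FALSE)

LOC <- "a"        # column chosen by name
df[, LOC] <- "b"  # LOC is evaluated as an ordinary variable

LOC <- 2L         # column chosen by integer position
df[, LOC] <- 0L

print(df)
```

Because j is evaluated normally here, the same line works whether LOC holds a name or a position, with no eval() needed.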
> >
> >
> > Thought #2 -- Alternative interfaces suggested in past
> > ----------
> >
> > Also, there were some other messages in the past discussing having
> > an interface similar to:
> > dt[,{a = "b"}]
> > dt[,{T(temp_a) = "b"; a = temp_a}]
> > etc.
>
> That was here :
> http://r.789695.n4.nabble.com/convenience-function-for-transforming-variables-and-adding-them-to-the-table-tp2315324p2315326.html
>
> >
> > What are the merits of having the ability to set values integrated
> > into the "[.data.table" function?
>
> Pros :
> 1) what you assign to the column can be an expression of
> columns too, in the usual way. When the value is on the RHS of
> [<-, it doesn't see the columns as variables.
> 2) you could do several column updates in one query,
> different expressions for each column
> 3) that multi-column update could be 'by' group
> 4) the assignments appearing later in the j expression could
> use results from the previous assignments, and do that within group.
> 5) could update very quickly a small subset of rows or
> groups, using i and mult='all'; => "update by without by"
> 6) consistency with SQL 'update'. An update is very similar
> to a select, in syntax. So j seems a natural place to put the update.
> 7) potentially much less working memory would be needed, as
> the assignment would be done within by rather than building
> up a large result to be assigned to the entire column(s).
> 8) Combination of update and select in one simple short
> query; e.g., DT[,{z=x*mean(y);sum(z)},by=id]
>
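For comparison with pros 1 and 8, here is a rough base R sketch of what the query DT[,{z=x*mean(y);sum(z)},by=id] would compute, written against a plain data.frame. The data is made up, and this is only one way to spell it, not how data.table would implement it.

```r
# Toy data; values and column names are assumptions for illustration only.
df <- data.frame(id = c(1, 1, 2, 2),
                 x  = c(1, 2, 3, 4),
                 y  = c(10, 20, 30, 50))

# The RHS of [<- is ordinary R, so columns must be qualified with df$.
# ave() returns the per-group mean of y (its default FUN is mean).
df[, "z"] <- df$x * ave(df$y, df$id)

# The "select" half: sum of z within each id.
sums <- tapply(df$z, df$id, sum)
print(sums)
```

Note that the full z column is built before it is assigned, which is the working-memory cost pro 7 refers to.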
> Cons :
> 1) It isn't the norm for "=" within seemingly local scope {}
> to go and change a table by reference, by default. The T()
> syntax is ugly and potentially confusing.
> 2) It may be getting quite far from R style programming.
> 3) It might be tricky to implement
> 4) Might be too easy to change data by accident. To tackle
> that, tables could have a read-only attribute, or you could
> only update tables in this way when they are in local scope, perhaps.
>
> Note that update syntax within j expressions would be
> additional to [<- syntax; the user could choose which one
> they preferred in different situations. It could be an
> optional feature too, off by default, if some users were
> worried about the cons.
>
> I'm interested in views on this too. Especially any aspects
> missed above and potential pitfalls. Feel free to throw in thoughts.
For me, data tables are mostly read only except for adding columns. I
don't like the idea of accidentally changing a portion of my data table
that took an hour to generate. [.data.table is pretty complex already,
and this will just make it worse. One way to add this type of capability
may be to create a new class, say "writeable.data.table" that inherits
from both data.table and data.frame. Another option is to write a
function that does something like transform that will operate on the
data table in place. I'd prefer one of these to messing with the
[.data.table syntax.
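To show the behaviour I mean by "something like transform", here is a minimal base R sketch on a plain data.frame (toy data, made-up column names). transform() returns a modified copy, so an in-place version would be a new function; none is shown here.

```r
# Toy data; column names are assumptions for illustration only.
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# transform() sees the columns as variables and returns a new object.
df2 <- transform(df, z = x * 2)

# The original is untouched unless the result is assigned back to it.
print(names(df))   # still only x and y
print(df2)
```

The point is that nothing changes unless you explicitly assign the result, which is the safety I would like to keep.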
- Tom