[datatable-help] Unexpected behavior in setnames()

Fri Nov 8 16:09:12 CET 2013

Simon, 
I've replied to your last post inline:

> 1) data.table ignores dupe names and uses the first such matching column up
> to the number of times that name appears and gives no warning (as I
> understand it, current behaviour and probably least desirable IMHO).

FYI, this is what data.frame does:

DF <- data.frame(x=1:5, x=6:10, check.names=FALSE)
DF[, c("x")]
DF[, c("x", "x")]

In fact, while doing this subsetting, it automatically makes the column names unique.

*Admittedly, DF[, 1:2] gives the right columns, but still the names are made unique.*
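
For reference, a minimal sketch of that data.frame behaviour, using base R only. The names `by_name` and `by_pos` are made up for illustration.

```r
# Same data.frame as in the example above.
DF <- data.frame(x = 1:5, x = 6:10, check.names = FALSE)

by_name <- DF[, c("x", "x")]   # character index: the first "x" is matched twice
by_pos  <- DF[, 1:2]           # numeric index: both columns are picked

names(by_name)                     # names are made unique in the result
identical(by_name[[2]], DF[[1]])   # second result column is a copy of the first "x"
identical(by_pos[[2]], DF[[2]])    # positional subsetting keeps the real second column
names(by_pos)                      # but its names are made unique as well
```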

> 3) both columns are used piece-wise from left to right and have a unique
> suffix appended with a warning that this occurred due to duplicate column
> names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however
> there is the complication that you then need to ensure you are not creating
> a new duplicate from an existing column name). This precludes you from
> referring to a specific column name in the j function though (but this could
> be part of the warning forcing a user to give a column a unique name if
> they want to refer to it directly)
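
As a rough illustration of the complication mentioned in option 3, here is a sketch using base R's `make.unique()`. The name vector `nms` is hypothetical and deliberately contains a name that already looks like a generated suffix; this is not what data.table does internally.

```r
# Hypothetical column names; "a.1" already exists as a real column name.
nms <- c("a", "b", "a", "a.1")

suffixed <- make.unique(nms)

anyDuplicated(nms) > 0         # TRUE: the original names contain duplicates
anyDuplicated(suffixed) == 0   # TRUE: the suffixed names do not clash with each other
suffixed                       # inspect which suffixes were chosen
```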

This will be a problem when evaluating expressions in `j`. Suppose you have:

DT <- data.table(x=1:5, x=6:10, ID=1:5)

And you do: DT[, list(x=x*2), by=ID]. When DT is created, the names are not changed (so far, the consensus is not to change them). So if we were to rename the duplicate names to unique ones during an operation, we would have trouble mapping the expressions in `j` accordingly. Note that even if we didn't, this expression is ill-posed, since `x` could refer to either column.

Also think about the `setkey` function.

> 4) Most work/most flexible(?); On instantiation all columns in a data.table
> have an hidden attribute created that is a unique column name, which may be
> referred to in the j with an accessor function, for example "a" and "a"
> could be differentiated as .(a.1) and .(a.2) but return results under "a"
> and "a". There would also need to be a function to view the mapping of
> printed names to the unique attribute names, e.g. colnames( dt ,
> include.hidden = TRUE ) then returns a list of the column names and the
> underlying unique names allowing a 'power-user' to refer to duplicate column
> names with a unique identifier using the accessor function. IMHO opinion
> this is a huge amount of work, probably unsafe and prone to many bugs. Not
> sure I'd even attempt it, but thought it worth bringing up.

I agree with your conclusion. This isn't even feasible, as the mapping is ill-posed. If the expression in `j` contains only one of the duplicate columns, which one would you map it to: .(a.1) or .(a.2)?

> In conclusion my vote would be for current behaviour but with a warning
> about needing to set unique column names for calculations, or using numeric
> indices, in which case the handling of numeric indices should probably be
> "fixed" (I use that loosely because one might argue that it is not broken it
> just doesn't do what one might intuitively expect!).

In my opinion, duplicate names should be allowed *only* when creating a data.table and when setting names (using `setnames`, `setattr` or the bad form `names(dt) <- `). Other than that, *ALL* operations should fail with an error, and that includes subsetting. `setnames` gives the user the option to set the names back before writing to a file, should they choose to keep the duplicates at the end.
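
A minimal sketch of what such a strict check could look like, written in base R against a data.frame so it runs without data.table. `check_unique_names()` is a hypothetical helper, not an existing data.table function.

```r
# Hypothetical helper: refuse to operate on a table with duplicate column names.
check_unique_names <- function(x) {
  nms <- names(x)
  dups <- unique(nms[duplicated(nms)])
  if (length(dups) > 0L) {
    stop("duplicate column names: ", paste(dups, collapse = ", "),
         "; rename them before continuing", call. = FALSE)
  }
  invisible(x)
}

DF <- data.frame(x = 1:5, x = 6:10, ID = 1:5, check.names = FALSE)

# The check fails while the duplicates are present; capture the message ...
res <- tryCatch(check_unique_names(DF), error = function(e) conditionMessage(e))

# ... and passes once unique names have been set
# (for a data.table you would use setnames() here instead).
names(DF) <- c("x1", "x2", "ID")
check_unique_names(DF)
```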

I think it's much better this way (strict, but avoids confusion). For example, in data.frames, DF$x (when x occurs twice) silently returns only the first column (no warning/error). Also, split(DF$x, DF$x) uses the first column, and so does split(DF, DF$x).
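
The silent data.frame behaviour can be checked with a few lines of base R. The variable names `first`, `second` and `groups` are only for illustration.

```r
DF <- data.frame(x = 1:5, x = 6:10, check.names = FALSE)

first  <- DF$x              # no warning or error; the first "x" is returned
second <- DF[[2]]           # the second "x" is reachable only by position
groups <- split(DF, DF$x)   # grouping also relies on the first "x"

identical(first, DF[[1]])   # TRUE
length(groups)              # one group per distinct value of the first "x"
```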

Arun

On Friday, November 8, 2013 at 3:47 PM, Simon O'Hanlon wrote:

> Steve Lianoglou <lianoglou.steve <at> gene.com> writes:
> 
> > > As a user I'd expect the following to return two columns with the values 2
> > > and 6 respectively:
> > > 
> > > Example:
> > > 
> > > dt <- data.table( 1,2,3,4 )
> > > setnames(dt , rep( c("a", "b") , 2 ) )
> > > a b a b
> > > 1: 1 2 3 4
> > > 
> > > dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
> > > a a
> > > 1: 2 2
> > > 
> > 
> 
> 
> > I agree -- when using numeric columns, this is clearly wrong and I
> > would expect an answer of 2 and 6.
> > 
> > I'm curious what you think, however, when you use the names of the
> > columns in .SDcols
> > 
> > If you ask .SDcols="a" would you expect the first "a" column to be
> > used, or all of them? To use all of them, would you expect to use
> > .SDcols=c('a', 'a')?
> > 
> > -steve
> 
> Hi Steve,
> That I guess is the big question. Approaching it from the point of view that 
> duplicate column names are allowed... If I use from the above example, 
> .SDcols = "a" there are a number of things that *could* happen:
> 
> 1) data.table ignores dupe names and uses the first such matching column up 
> to the number of times that name appears and gives no warning (as I 
> understand it, current behaviour and probably least desirable IMHO).
> 
> 2) As above with a warning - least work from a developer standpoint I guess!
> 
> 3) both columns are used piece-wise from left to right and have a unique 
> suffix appended with a warning that this occurred due to duplicate column 
> names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however 
> there is the complication that you then need to ensure you are not creating 
> a new duplicate from an existing column name). This precludes you from 
> referring to a specific column name in the j function though (but this could 
> be part of the warning forcing a user to give a column a unique name if 
> they want to refer to it directly)
> 
> 4) Most work/most flexible(?); On instantiation all columns in a data.table 
> have an hidden attribute created that is a unique column name, which may be 
> referred to in the j with an accessor function, for example "a" and "a" 
> could be differentiated as .(a.1) and .(a.2) but return results under "a" 
> and "a". There would also need to be a function to view the mapping of 
> printed names to the unique attribute names, e.g. colnames( dt , 
> include.hidden = TRUE ) then returns a list of the column names and the 
> underlying unique names allowing a 'power-user' to refer to duplicate column 
> names with a unique identifier using the accessor function. IMHO opinion 
> this is a huge amount of work, probably unsafe and prone to many bugs. Not 
> sure I'd even attempt it, but thought it worth bringing up.
> 
> In conclusion my vote would be for current behaviour but with a warning 
> about needing to set unique column names for calculations, or using numeric 
> indices, in which case the handling of numeric indices should probably be 
> "fixed" (I use that loosely because one might argue that it is not broken it 
> just doesn't do what one might intuitively expect!).
> 
