[datatable-help] Unexpected behavior in setnames()

Fri Nov 8 15:47:13 CET 2013

Steve Lianoglou <lianoglou.steve <at> gene.com> writes:

> > As a user I'd expect the following to return two columns with the values
> > 2 and 6 respectively:
> >
> > Example:
> >
> > dt <- data.table( 1,2,3,4 )
> > setnames(dt , rep( c("a", "b") , 2 ) )
> >    a b a b
> > 1: 1 2 3 4
> >
> > dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
> >    a a
> > 1: 2 2

> I agree -- when using numeric columns, this is clearly wrong and I
> would expect an answer of 2 and 6.
> 
> I'm curious what you think, however, when you use the names of the
> columns in .SDcols
> 
> If you ask .SDcols="a" would you expect the first "a" column to be
> used, or all of them? To use all of them, would you expect to use
> .SDcols=c('a', 'a')?
> 
> -steve

Hi Steve,
That, I guess, is the big question. Approaching it from the point of view
that duplicate column names are allowed: if I use .SDcols = "a" with the
above example, there are a number of things that *could* happen:

1) data.table ignores duplicate names, uses the first matching column for
each occurrence of that name, and gives no warning (as I understand it, the
current behaviour, and probably the least desirable IMHO).

2) As above, but with a warning. Least work from a developer standpoint, I
guess!

3) Both columns are used piece-wise from left to right and have a unique
suffix appended, with a warning that this occurred due to duplicate column
names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however,
there is the complication that you then need to ensure you are not creating
a new duplicate of an existing column name). This precludes you from
referring to a specific column name in the j function, though (but this
could be part of the warning, forcing a user to give a column a unique name
if they want to refer to it directly).

4) Most work/most flexible(?): on instantiation, every column in a
data.table gets a hidden attribute holding a unique column name, which may
be referred to in the j with an accessor function. For example, "a" and "a"
could be differentiated as .(a.1) and .(a.2) but return results under "a"
and "a". There would also need to be a function to view the mapping of
printed names to the unique attribute names, e.g. colnames( dt ,
include.hidden = TRUE ), which returns a list of the column names and the
underlying unique names, allowing a 'power-user' to refer to duplicate
column names with a unique identifier using the accessor function. IMHO
this is a huge amount of work, probably unsafe and prone to many bugs. Not
sure I'd even attempt it, but thought it worth bringing up.
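
To make option 3 a bit more concrete, here is a rough sketch of the suffixing
idea using base R's make.unique(), which is only one way it could be done and
is not what data.table does internally. Note that make.unique() leaves the
first occurrence of a name untouched, so the result differs slightly from the
"a.1" "a.2" naming suggested above. The commented-out lines are a hypothetical
follow-up that assumes data.table is loaded and dt is the table from the
original example.

```r
# Column names from the example above, including the duplicates.
nms <- rep(c("a", "b"), 2)

# make.unique() appends a numeric suffix to repeated names only;
# the first occurrence of each name is kept as-is.
unique_nms <- make.unique(nms, sep = ".")
print(unique_nms)

# Hypothetical follow-up (assumes library(data.table) and the dt from above):
# setnames(dt, unique_nms)
# dt[, lapply(.SD, function(x) x * 2), .SDcols = c("a", "a.1")]
```

With unique names in place, each column can be addressed unambiguously by
name in .SDcols.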

In conclusion, my vote would be for the current behaviour but with a warning
about needing to set unique column names for calculations, or to use numeric
indices, in which case the handling of numeric indices should probably be
"fixed" (I use that loosely because one might argue that it is not broken;
it just doesn't do what one might intuitively expect!).
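
As an illustration of the kind of warning I have in mind, here is a minimal
sketch in base R. The function name warn_on_duplicate_names and the wording
of the message are made up for this example; nothing like it exists in
data.table as far as I know.

```r
# Hypothetical helper (not part of data.table): warn when column names repeat.
warn_on_duplicate_names <- function(nms) {
  # Names that occur more than once, each listed a single time.
  dupes <- unique(nms[duplicated(nms)])
  if (length(dupes) > 0L) {
    warning("Duplicate column names: ", paste(dupes, collapse = ", "),
            "; set unique names or use numeric indices.")
  }
  # Return the duplicated names invisibly so callers can inspect them.
  invisible(dupes)
}

# Column names from the original example.
dupes <- warn_on_duplicate_names(rep(c("a", "b"), 2))
print(dupes)
```

A check along these lines could run whenever .SDcols is given as names, while
numeric indices would bypass it.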