[datatable-help] Unexpected behavior in setnames()
Simon O'Hanlon
simon.ohanlon at imperial.ac.uk
Fri Nov 8 15:47:13 CET 2013
Steve Lianoglou <lianoglou.steve <at> gene.com> writes:
> > As a user I'd expect the following to return two columns with the values
> > 2 and 6 respectively:
> >
> > Example:
> >
> > dt <- data.table( 1,2,3,4 )
> > setnames(dt , rep( c("a", "b") , 2 ) )
> >    a b a b
> > 1: 1 2 3 4
> >
> > dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
> >    a a
> > 1: 2 2
> I agree -- when using numeric columns, this is clearly wrong and I
> would expect an answer of 2 and 6.
>
> I'm curious what you think, however, when you use the names of the
> columns in .SDcols
>
> If you ask .SDcols="a" would you expect the first "a" column to be
> used, or all of them? To use all of them, would you expect to use
> .SDcols=c('a', 'a')?
>
> -steve
Hi Steve,
That, I guess, is the big question. Approaching it from the point of view that
duplicate column names are allowed: if, in the above example, I use
.SDcols = "a", there are a number of things that *could* happen:
1) data.table ignores dupe names and uses the first such matching column up
to the number of times that name appears and gives no warning (as I
understand it, current behaviour and probably least desirable IMHO).
2) As above with a warning - least work from a developer standpoint I guess!
3) both columns are used piece-wise from left to right and have a unique
suffix appended with a warning that this occurred due to duplicate column
names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however
there is the complication that you then need to ensure you are not creating
a new duplicate from an existing column name). This precludes you from
referring to a specific column name in the j function though (but this could
be part of the warning forcing a user to give a column a unique name if
they want to refer to it directly)
4) Most work/most flexible(?); On instantiation all columns in a data.table
have a hidden attribute created that is a unique column name, which may be
referred to in the j with an accessor function, for example "a" and "a"
could be differentiated as .(a.1) and .(a.2) but return results under "a"
and "a". There would also need to be a function to view the mapping of
printed names to the unique attribute names, e.g. colnames( dt ,
include.hidden = TRUE ) then returns a list of the column names and the
underlying unique names allowing a 'power-user' to refer to duplicate column
names with a unique identifier using the accessor function. IMHO
this is a huge amount of work, probably unsafe and prone to many bugs. Not
sure I'd even attempt it, but thought it worth bringing up.
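To make the renaming step in option 3 concrete, here is a minimal sketch using
base R's make.unique(). As far as I know this gives the same style of suffix
that data.frame() applies by default. Note that it leaves the first occurrence
untouched, so you get "a" and "a.1" rather than the "a.1" and "a.2" I wrote
above. It only illustrates the renaming, not how data.table would or should
implement it.

```r
# Duplicate names as in the example above
nms <- rep(c("a", "b"), 2)

# make.unique() keeps the first occurrence and suffixes the later ones
unique_nms <- make.unique(nms)
print(unique_nms)  # "a" "b" "a.1" "b.1"

# A name that already looks like a suffix ("a.1") should not end up
# duplicated by the renaming; check that the result is still unique
tricky <- make.unique(c("a", "a", "a.1"))
stopifnot(anyDuplicated(tricky) == 0L)
```

The second check covers the complication mentioned in option 3, i.e. not
creating a new duplicate from an existing column name.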
In conclusion my vote would be for current behaviour but with a warning
about needing to set unique column names for calculations, or using numeric
indices, in which case the handling of numeric indices should probably be
"fixed" (I use that loosely because one might argue that it is not broken; it
just doesn't do what one might intuitively expect!).
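The kind of warning I have in mind could be as simple as the check sketched
below. check_unique_names() is a made-up helper for illustration only; it is
not an existing data.table function and it uses nothing beyond base R.

```r
# Hypothetical helper: warn when a vector of column names has duplicates.
# Returns the duplicated names invisibly (empty if all names are unique).
check_unique_names <- function(nms) {
  dupes <- unique(nms[duplicated(nms)])
  if (length(dupes) > 0L) {
    warning("duplicate column names: ", paste(dupes, collapse = ", "),
            "; set unique names or use numeric indices", call. = FALSE)
  }
  invisible(dupes)
}

# Capture the warning text for the names used in the example above
msg <- tryCatch(check_unique_names(rep(c("a", "b"), 2)),
                warning = function(w) conditionMessage(w))
print(msg)
```

With unique names the helper stays silent and returns an empty character
vector, so it could be called before any calculation on .SD without getting in
the way.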