[datatable-help] Unexpected behavior in setnames()

Steve Lianoglou lianoglou.steve at gene.com
Fri Nov 8 21:02:07 CET 2013


Hi,

I wanted to point out that I'm in Arun's camp on this one:

On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:

> In my opinion, the dup-names should be allowed *only* during creation of
> data.table, and setting names (using `setnames`, `setattr` or the bad form
> `names(dt) <- `). Other than that, *ALL* operations should fail (end up in
> error), and that includes subsetting operation. The `setnames` gives the
> option for the user to set the names back before writing to a file, should
> he choose to keep it at the end.
>
> I think it's much better this way (strict, but avoids confusion). For
> example, in data.frames, doing DF$x (when x occurs twice) implicitly prints
> only the first (no warning/error). Also, split(DF$x, DF$x) uses the first
> column and so does split(DF, DF$x).
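
For anyone who wants to see the data.frame behaviour Arun describes,
here is a minimal base-R sketch. The object name `DF` and its values
are made up for illustration; they are not from the thread.

```r
# Build a data.frame with two columns that are both named "x".
# check.names = FALSE stops data.frame() from renaming the second to "x.1".
DF <- data.frame(x = 1:3, x = 4:6, check.names = FALSE)

col_names <- names(DF)   # both entries are "x"
first <- DF$x            # silently picks the first "x": no warning, no error

identical(first, 1:3)    # the second "x" (4:6) is never reached by name
```

The second column can only be reached by position (e.g. `DF[[2]]`),
which is exactly the kind of ambiguity being discussed.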

As an opinionated footnote: I can concede that since data.frames
allow duplicated column names, I *guess* data.table should *allow*
them. However, as is clear (to me) from this long chain of
"possibilities" for what one could do, I strongly feel that computing
over a data.table w/ duplicated column names is a fundamentally broken
idea, because it is ambiguous what the right behavior should be ...
never mind the (surely fun) book-keeping code required to make it
happen.

You want to import a table with duplicate names? Fine (we should warn
on import if it was `fread` or `as.data.table`d).

You want to set some names to duplicates? Fine -- warn there too.

Want to do any computation inside the data.table via `j` or as a
column in `by`? Throw an error and punt the problem to the user to
figure out how they would like to disambiguate the first column named
"a" from the 10th one -- I don't think we need another FAQ explaining
what "the right" way to do this is, and why we picked it.
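
To make the "warn when setting, error when computing" idea concrete,
here is a rough base-R sketch. The helpers `warn_if_dup_names()` and
`stop_if_dup_names()` are hypothetical names I am inventing for
illustration; they are not part of data.table and this is not how
data.table is implemented. A plain list stands in for the table.

```r
# Hypothetical helpers, NOT part of data.table: they only illustrate the policy.

# Warn (but carry on) when a set of names containing duplicates is being set.
warn_if_dup_names <- function(nms) {
  dups <- unique(nms[duplicated(nms)])
  if (length(dups) > 0L) {
    warning("duplicated column names: ", paste(dups, collapse = ", "))
  }
  invisible(nms)
}

# Refuse to compute on a table whose column names are ambiguous.
stop_if_dup_names <- function(x) {
  dups <- unique(names(x)[duplicated(names(x))])
  if (length(dups) > 0L) {
    stop("cannot compute with duplicated column names: ",
         paste(dups, collapse = ", "),
         "; rename them first")
  }
  invisible(x)
}

# Stand-in for a table with a duplicated "a" column.
tbl <- list(a = 1:3, b = 4:6, a = 7:9)

# Capture the error message instead of aborting the script.
res <- tryCatch(stop_if_dup_names(tbl),
                error = function(e) conditionMessage(e))
```

The point of the sketch is only that the check is cheap and the error
leaves the decision with the user, who would then rename the offending
columns (e.g. via `setnames`) before computing.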

Or if you really want to compute over a data.table with duplicate
names, you might be better served by having the table in "long" format
-- perhaps that's why there are duplicate column names to begin with
(I'm guessing -- I still don't think I would ever want to have duped
names on purpose).

My two cents,

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

