[datatable-help] Unexpected behavior in setnames()

Eduard Antonyan eduard.antonyan at gmail.com
Fri Nov 8 21:08:05 CET 2013


Ditto: allowing dups, but raising an error on every ambiguous operation,
seems like a robust strategy.
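
A minimal sketch of that strategy in base R, for illustration only. The
helper `stop_if_dup_names` is hypothetical and is not part of data.table; it
only shows what "allow the names, fail on ambiguous use" could look like,
next to the silent first-match behavior of `DF$x` mentioned in the quoted
discussion below.

```r
# Hypothetical guard (an assumption for illustration, not a data.table API):
# errors when an object has duplicated column names.
stop_if_dup_names <- function(x) {
  nm <- names(x)
  dups <- unique(nm[duplicated(nm)])
  if (length(dups) > 0L) {
    stop("ambiguous operation, duplicated column names: ",
         paste(dups, collapse = ", "), call. = FALSE)
  }
  invisible(x)
}

# Creating a table with duplicated names is allowed.
DF <- data.frame(x = 1:3, x = 4:6, check.names = FALSE)

# Base R picks the first matching column without any warning.
first_x <- DF$x

# With the guard, the same ambiguity is surfaced as an error instead.
msg <- tryCatch(
  {
    stop_if_dup_names(DF)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

A caller would run such a check before any computation on the columns, and
rename the columns first if it fails.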


On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:

> Hi,
>
> I wanted to point out that I'm in Arun's camp on this one:
>
> On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>
> > In my opinion, the dup-names should be allowed *only* during creation of
> > data.table, and setting names (using `setnames`, `setattr` or the bad
> > form `names(dt) <- `). Other than that, *ALL* operations should fail (end
> > up in error), and that includes subsetting operation. The `setnames`
> > gives the option for the user to set the names back before writing to a
> > file, should he choose to keep it at the end.
> >
> > I think it's much better this way (strict, but avoids confusion). For
> > example, in data.frames, doing DF$x (when x occurs twice) implicitly
> > prints only the first (no warning/error). Also, split(DF$x, DF$x) uses
> > the first column and so does split(DF, DF$x).
>
> As an opinionated footnote: I can acquiesce that since data.frames
> allow duplicated column names, I *guess* data.table should *allow*
> them, however as is clear (to me) from this long chain of
> "possibilities" that one can do, I strongly feel that computing over a
> data.table w/ duplicated columns is a fundamentally broken idea as it
> is ambiguous as to what the right behavior should be ... forget about
> even the (surely fun) book-keeping code required to make it happen.
>
> You want to import a table with duplicate names? Fine (we should warn
> on import if it was `fread` or `as.data.table`d).
>
> You want to set some names to duplicates? Fine -- warn there too.
>
> Want to do any computation inside the data.table via `j` or as a
> column in `by`? Throw an error and punt the problem to the user to
> figure out how they would like to disambiguate the first column named
> "a" from the 10th one -- I don't think we need another FAQ explaining
> what "the right" way that this should be done is, and why we picked
> it.
>
> Or if you really want to compute over a data.table with duplicate
> names, you might be better served by having the table in "long" format
> -- perhaps that's why there are duplicate column names to begin with
> (I'm guessing -- I still don't think I would ever want to have duped
> names on purpose)
>
> My two cents,
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>