[datatable-help] Unexpected behavior in setnames()

Steve Lianoglou lianoglou.steve at gene.com
Fri Nov 8 21:16:05 CET 2013


Wow ... did we just reach a consensus? :-)

-steve

On Fri, Nov 8, 2013 at 12:08 PM, Eduard Antonyan
<eduard.antonyan at gmail.com> wrote:
> Ditto - allowing dups, but throwing an error on all ambiguous operations,
> seems like a robust strategy.
>
>
> On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou <lianoglou.steve at gene.com>
> wrote:
>>
>> Hi,
>>
>> I wanted to point out that I'm in Arun's camp on this one:
>>
>> On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
>> <aragorn168b at gmail.com> wrote:
>>
>> > In my opinion, dup-names should be allowed *only* during creation of a
>> > data.table and when setting names (using `setnames`, `setattr` or the
>> > bad form `names(dt) <- `). Other than that, *ALL* operations should
>> > fail (end up in error), and that includes subsetting operations.
>> > `setnames` gives the user the option to set the names back before
>> > writing to a file, should he choose to keep them at the end.
>> >
>> > I think it's much better this way (strict, but avoids confusion). For
>> > example, in data.frames, doing DF$x (when x occurs twice) silently
>> > returns only the first column (no warning/error). Also,
>> > split(DF$x, DF$x) uses the first column and so does split(DF, DF$x).
>>
>> As an opinionated footnote: I can concede that since data.frames
>> allow duplicated column names, I *guess* data.table should *allow*
>> them. However, as is clear (to me) from this long chain of
>> "possibilities", I strongly feel that computing over a data.table
>> w/ duplicated column names is a fundamentally broken idea, as it is
>> ambiguous what the right behavior should be ... forget about even
>> the (surely fun) book-keeping code required to make it happen.
>>
>> You want to import a table with duplicate names? Fine (we should warn
>> on import if it was `fread` or `as.data.table`d).
>>
>> You want to set some names to duplicates? Fine -- warn there too.
>>
>> Want to do any computation inside the data.table via `j` or as a
>> column in `by`? Throw an error and punt the problem to the user to
>> figure out how they would like to disambiguate the first column named
>> "a" from the 10th one -- I don't think we need another FAQ explaining
>> what "the right" way to do this is, and why we picked it.
>>
>> Or if you really want to compute over a data.table with duplicate
>> names, you might be better served by having the table in "long" format
>> -- perhaps that's why there are duplicate column names to begin with
>> (I'm guessing -- I still don't think I would ever want to have duped
>> names on purpose)
>>
>> My two cents,
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
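
For anyone who wants to check the data.frame behaviour Arun describes, and
to see what "error on all ambiguous operations" could look like, here is a
minimal base R sketch. `stop_if_dup_names()` is a hypothetical helper made
up for illustration; it is not part of data.table and says nothing about
how data.table would actually implement this.

```r
# Base R only; data.table is not needed for this illustration.
# check.names = FALSE keeps both columns named "x" instead of renaming one.
DF <- data.frame(x = 1:3, x = 4:6, check.names = FALSE)

names(DF)         # both columns are called "x"
first_x <- DF$x   # silently returns the first "x" column, no warning

# Hypothetical guard (NOT a data.table function): refuse to compute when
# column names are ambiguous, and report which names are duplicated.
stop_if_dup_names <- function(x) {
  nm <- names(x)
  dups <- unique(nm[duplicated(nm)])
  if (length(dups) > 0L) {
    stop("duplicated column names: ", paste(dups, collapse = ", "),
         "; rename them before computing")
  }
  invisible(x)
}

# The ambiguous table is rejected; keep the error message for inspection.
res <- tryCatch(stop_if_dup_names(DF), error = function(e) conditionMessage(e))

# Once the names are made unique the same check passes.
names(DF) <- make.unique(names(DF))
stop_if_dup_names(DF)
```

The guard only looks at names(x), so under this assumption it would be cheap
to call at the top of any operation that resolves columns by name.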



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

