<div dir="ltr">Ditto: allowing duplicate names but raising an error on every ambiguous operation seems like a robust strategy.</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou <span dir="ltr">&lt;<a href="mailto:lianoglou.steve@gene.com" target="_blank">lianoglou.steve@gene.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

I wanted to point out that I'm in Arun's camp on this one:<br>

<div class="im"><br>

On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan<br>

&lt;<a href="mailto:aragorn168b@gmail.com">aragorn168b@gmail.com</a>&gt; wrote:<br>

<br>

> In my opinion, the dup-names should be allowed *only* during creation of<br>

> data.table, and setting names (using `setnames`, `setattr` or the bad form<br>

> `names(dt) <- `). Other than that, *ALL* operations should fail (end up in<br>

> error), and that includes subsetting operations. `setnames` gives the<br>

> user the option to set the names back before writing to a file, should<br>

> they choose to keep the duplicates at the end.<br>

><br>

> I think it's much better this way (strict, but avoids confusion). For<br>

> example, in data.frames, doing DF$x (when x occurs twice) silently returns<br>

> only the first (no warning/error). Also, split(DF$x, DF$x) uses the first<br>

> column and so does split(DF, DF$x).<br>
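
<br>
A minimal base-R sketch of the data.frame behaviour described above. The names `DF` and `x` are illustrative assumptions, and `check.names = FALSE` is needed so that data.frame() keeps the duplicate name:<br>
<pre>
```r
# Build a data.frame with two columns that are both named "x".
DF <- data.frame(x = 1:3, x = 4:6, check.names = FALSE)

dup_pos <- anyDuplicated(names(DF))  # position of the first duplicated name, 0 if none
first_x <- DF$x                      # `$` picks the first matching column, silently
also_first <- DF[["x"]]              # `[[` with a name does the same

print(names(DF))
print(first_x)
```
</pre>
The second "x" column (4:6) is only reachable by position here, e.g. DF[[2]].<br>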

<br>

</div>As an opinionated footnote: I can accept that, since data.frames<br>

allow duplicated column names, I *guess* data.table should *allow*<br>

them. However, as is clear (to me) from this long chain of<br>

"possibilities", I strongly feel that computing over a<br>

data.table with duplicated column names is a fundamentally broken idea: it<br>

is ambiguous what the right behavior should be ... never mind<br>

the (surely fun) book-keeping code required to make it happen.<br>

<br>

You want to import a table with duplicate names? Fine (we should warn<br>

on import, whether via `fread` or `as.data.table`).<br>

<br>

You want to set some names to duplicates? Fine -- warn there too.<br>

<br>

Want to do any computation inside the data.table via `j` or as a<br>

column in `by`? Throw an error and punt the problem to the user to<br>

figure out how they would like to disambiguate the first column named<br>

"a" from the 10th one -- I don't think we need another FAQ explaining<br>

what "the right" way to do this is, and why we picked<br>

it.<br>
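
<br>
To make the "throw an error and punt" idea concrete, here is a rough sketch of such a guard in plain R. `assert_unique_names` is a hypothetical helper invented for illustration; it is not a data.table function and says nothing about how data.table would actually implement the check:<br>
<pre>
```r
# Hypothetical helper (not part of data.table): refuse to proceed when any
# column name occurs more than once, and name the offenders in the error.
assert_unique_names <- function(x) {
  nm <- names(x)
  dups <- unique(nm[duplicated(nm)])
  if (length(dups)) {
    stop("ambiguous column name(s): ", paste(dups, collapse = ", "),
         "; rename them before computing", call. = FALSE)
  }
  invisible(x)
}

# Column "a" appears twice, so the guard raises an error.
DF <- data.frame(a = 1:2, b = 3:4, a = 5:6, check.names = FALSE)
res <- tryCatch(assert_unique_names(DF), error = function(e) conditionMessage(e))

# After renaming by position, the guard passes and returns its input invisibly.
names(DF)[3] <- "a2"
ok <- assert_unique_names(DF)
```
</pre>
The point of the sketch is only that the check itself is cheap; the user then decides how to disambiguate, e.g. by renaming with `setnames`.<br>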

<br>

Or if you really want to compute over a data.table with duplicate<br>

names, you might be better served by having the table in "long" format<br>

-- perhaps that's why there are duplicate column names to begin with<br>

(I'm guessing -- I still don't think I would ever want to have duped<br>

names on purpose).<br>

<br>

My two cents,<br>

<div class="im HOEnZb"><br>

-steve<br>

<br>

--<br>

Steve Lianoglou<br>

Computational Biologist<br>

Bioinformatics and Computational Biology<br>

Genentech<br>

</div><div class="HOEnZb"><div class="h5">

</div></div></blockquote></div><br></div>