<div dir="ltr">Ditto: allowing dups, but throwing an error on every ambiguous operation, seems like a robust strategy.</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou <span dir="ltr">&lt;<a href="mailto:lianoglou.steve@gene.com" target="_blank">lianoglou.steve@gene.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
I wanted to point out that I'm in Arun's camp on this one:<br>
<div class="im"><br>
On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan<br>
&lt;<a href="mailto:aragorn168b@gmail.com">aragorn168b@gmail.com</a>&gt; wrote:<br>
<br>
> In my opinion, dup-names should be allowed *only* during creation of a<br>
> data.table, and when setting names (using `setnames`, `setattr` or the bad form<br>
> `names(dt) <- `). Other than that, *ALL* operations should fail (end in an<br>
> error), and that includes subsetting operations. `setnames` gives the<br>
> user the option to set the names back before writing to a file, should<br>
> they choose to keep the duplicates at the end.<br>
><br>
> I think it's much better this way (strict, but avoids confusion). For<br>
> example, in data.frames, doing DF$x (when x occurs twice) silently returns<br>
> only the first column (no warning/error). Also, split(DF$x, DF$x) uses the first<br>
> column and so does split(DF, DF$x).<br>
<br>
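To make the data.frame behaviour described above concrete, here is a minimal base R sketch. The object `DF` and its values are made up for illustration, and `check.names = FALSE` is only there so that data.frame() keeps the duplicated name.<br>

```r
# Illustrative data only: two columns that share the name "x".
DF <- data.frame(x = 1:3, x = 4:6, check.names = FALSE)

names(DF)                      # both columns are called "x"
first_x <- DF$x                # resolves to the first "x" column
identical(first_x, DF[[1]])    # TRUE: same values as the column at position 1

# Grouping by DF$x therefore also groups by the first column only.
groups <- split(DF[[2]], DF$x)
```

The second column can still be reached by position (`DF[[2]]`), which is the kind of disambiguation the user would otherwise have to spell out.<br>
<br>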
</div>As an opinionated footnote: I can concede that, since data.frames<br>
allow duplicated column names, I *guess* data.table should *allow*<br>
them. However, as is clear (to me) from this long chain of<br>
"possibilities", I strongly feel that computing over a<br>
data.table w/ duplicated column names is a fundamentally broken idea, as it<br>
is ambiguous what the right behavior should be ... never mind<br>
the (surely fun) book-keeping code required to make it happen.<br>
<br>
You want to import a table with duplicate names? Fine (we should warn<br>
on import, whether it came in via `fread` or `as.data.table`).<br>
<br>
You want to set some names to duplicates? Fine -- warn there too.<br>
<br>
Want to do any computation inside the data.table via `j`, or use it as a<br>
column in `by`? Throw an error and punt the problem to the user to<br>
figure out how they would like to disambiguate the first column named<br>
"a" from the 10th one -- I don't think we need another FAQ explaining<br>
what "the right" way to do this is, and why we picked<br>
it.<br>
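<br>
A minimal sketch of the "throw an error and punt" rule, in base R. The helper `resolve_column` is hypothetical and is not part of data.table; it only illustrates refusing to pick a column when the requested name occurs more than once.<br>

```r
# Hypothetical helper (not a data.table function): map a column name to its
# position, but refuse to guess when the name is duplicated.
resolve_column <- function(col_names, wanted) {
  hits <- which(col_names == wanted)
  if (length(hits) == 0L) {
    stop("no column named '", wanted, "'")
  }
  if (length(hits) > 1L) {
    stop("column name '", wanted, "' is ambiguous: it matches positions ",
         paste(hits, collapse = ", "), "; rename the columns first")
  }
  hits
}

nms <- c("a", "b", "a")            # made-up column names with a duplicate
pos_b <- resolve_column(nms, "b")  # unique name, resolves to position 2

# The duplicated name is rejected; capture the message instead of stopping.
msg_a <- tryCatch(resolve_column(nms, "a"),
                  error = function(e) conditionMessage(e))
```

The error message names the clashing positions, so the user can decide which column to rename (for example with `setnames`) before retrying.<br>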
<br>
Or, if you really want to compute over a data.table with duplicate<br>
names, you might be better served by having the table in "long" format<br>
-- perhaps that's why there are duplicate column names to begin with<br>
(I'm guessing -- I still don't think I would ever want to have duped<br>
names on purpose).<br>
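<br>
A rough sketch of the "long" format idea, in base R with made-up data. The names `wide`, `long` and `source_col` are assumptions for illustration; the duplicated columns are stacked by position, so nothing relies on the ambiguous name.<br>

```r
# Illustrative wide table with two columns both named "a".
wide <- data.frame(id = 1:2, a = c(10, 20), a = c(30, 40),
                   check.names = FALSE)

# Positions of the duplicated columns.
dup_pos <- which(names(wide) == "a")

# One row per (id, original column); source_col records the origin.
long <- data.frame(
  id         = rep(wide[[1]], times = length(dup_pos)),
  source_col = rep(dup_pos, each = nrow(wide)),
  a          = unlist(lapply(dup_pos, function(p) wide[[p]]))
)
```

In this shape every column name is unique, and grouping by `source_col` replaces any need to tell the two "a" columns apart.<br>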
<br>
My two cents,<br>
<div class="im HOEnZb"><br>
-steve<br>
<br>
--<br>
Steve Lianoglou<br>
Computational Biologist<br>
Bioinformatics and Computational Biology<br>
Genentech<br>
</div><div class="HOEnZb"><div class="h5">_______________________________________________<br>
datatable-help mailing list<br>
<a href="mailto:datatable-help@lists.r-forge.r-project.org">datatable-help@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br>
</div></div></blockquote></div><br></div>