<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style="font-family:arial,sans-serif;font-size:13px">As I said before, I think it's essential to allow duplicate names while loading a file (and therefore for consistency during creation of data.table as well). However, all grouping/aggregating/</span><span style="font-family:arial,sans-serif;font-size:13px">subsetting etc., where ambiguity can arise should end in error. At least this is my stance so far. Are we agreeing on this?</span></blockquote>
<div><br></div><div>Sounds good to me. </div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Nov 6, 2013 at 4:50 PM, Arunkumar Srinivasan <span dir="ltr"><<a href="mailto:aragorn168b@gmail.com" target="_blank">aragorn168b@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div>
Eddi,
</div><div><br></div><div>1) We can still allow duplicate names in "fread" and during creation of a data.table with the data.table() command.</div><div>2) There's no *real* loss of data, since "setnames" can set duplicate names or de-duplicate them (and users still have the original data anyway, as they loaded it into R using fread).</div>
<div>3) The point is to decide where duplicate names are allowed and where they should give an error… </div><div><br></div><div>As I said before, I think it's essential to allow duplicate names while loading a file (and therefore for consistency during creation of data.table as well). However, all grouping/aggregating/subsetting etc., where ambiguity can arise should end in error. At least this is my stance so far. Are we agreeing on this?</div>
<div><div><br></div><div>Arun</div><div><br></div></div><div class="HOEnZb"><div class="h5">
<p style="color:#a0a0a8">On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote:</p>
<blockquote type="cite" style="border-left-style:solid;border-width:1px;margin-left:0px;padding-left:10px">
<span><div><div><div dir="ltr"><div>You mean what would be the problem?</div><div><br></div>Well, if the user fread's that data, then modifies e.g. non-duplicate columns and then tries to write.csv it back - how would the user recover the original names for correctly writing the data back if we renamed the columns?</div>
<div><br><br><div>On Wed, Nov 6, 2013 at 10:10 AM, <span dir="ltr"><<a href="mailto:aragorn168b@gmail.com" target="_blank">aragorn168b@gmail.com</a>></span> wrote:<br><blockquote type="cite"><div>
<div>
Eddi,
</div><div>Nice! But what exactly would happen to that data if we automatically set unique names while loading it with "fread" (and issued a warning)?</div>
<div><div><br></div><span style="font-size:10pt">Arun</span><div><br></div></div><div><div>
<p style="color:#a0a0a8">On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote:</p><blockquote type="cite"><div>
<span><div><div><div dir="ltr">Last comment here has an example of using duplicated names - <a href="http://stackoverflow.com/a/19809942/817778" target="_blank">http://stackoverflow.com/a/19809942/817778</a> - it's very similar to the one I mentioned earlier.<br>
</div><div><br><br><div>On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil <span dir="ltr"><<a href="mailto:chinmay.patil@gmail.com" target="_blank">chinmay.patil@gmail.com</a>></span> wrote:<br><blockquote type="cite">
<div>
<div><div dir="ltr"><div>FWIW, data.frame does allow duplicate names as well. In the light that data.table inherits from data.frame, I would expect that it follows same convention as data.frame.</div>
</div><div>
<br><br><div><div><div>On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan <span dir="ltr"><<a href="mailto:eduard.antonyan@gmail.com" target="_blank">eduard.antonyan@gmail.com</a>></span> wrote:<br>
</div></div><blockquote type="cite"><div><div><div>
<div dir="ltr"><div>@Arun: Ok. Thinking about it a bit - I don't like the continuing enumeration solution because it makes the results too unpredictable, but could live with adding a ".1" etc. Which I assume is the idea anyway for resolving duplicates elsewhere.<br>
<br></div>@Steve: Not sure why you think it doesn't hold much water - I think I can draw a parallel argument that replicates all of the duplicated names concerns with a column that is called e.g. `dt$V1` (imagine forgetting the backticks there and the world of hurt that potentially awaits once you do that). I am also curious what Matthew would think about this. This is something I've encountered and dealt with a lot, so I'm certainly not an unbiased party here.<br>
</div><div><div><div><br><br><div>On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou <span dir="ltr"><<a href="mailto:lianoglou.steve@gene.com" target="_blank">lianoglou.steve@gene.com</a>></span> wrote:<br><blockquote type="cite">
<div>
<div><div>On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan<br>
<<a href="mailto:eduard.antonyan@gmail.com" target="_blank">eduard.antonyan@gmail.com</a>> wrote:<br>
> Tbh I don't see why data presentation and preservation (i.e. if you're<br>
> reading in data with duplicated columns) is not enough of a use case -<br>
> that's the only reason we allow arbitrary symbols in column names.<br>
><br>
> So, instead of giving you another use case, how about you tell me instead<br>
> what do you propose should happen here (instead of what happens now):<br>
><br>
>> dt = data.table(1, 2)<br>
>> dt<br>
> V1 V2<br>
> 1: 1 2<br>
>> dt[, sum(V2), by = V1]<br>
> V1 V1<br>
> 1: 1 2<br>
<br>
</div>Only Matthew could say for sure, but if I were a gambling man I'd bet<br>
that this was likely something that slipped through the cracks and<br>
sleeping dogs were left to lie. I'd be curious to see what his<br>
opinions on this are.<br>
<br>
IMHO the "data presentation" argument doesn't really hold much water.<br>
<br>
As for "data preservation," I rather see it as imposing structure on<br>
it to enable efficient -- and sane/unambiguous -- computation over it.<br>
Further, I don't think this is a preservation issue at all -- no data is<br>
lost. The original data is still there in the file that was loaded<br>
into R. The name of a column is changed when imported (with adequate<br>
warning) into a data.table so that the user can slice and dice it. I'd<br>
also guess the user being warned by the duplicate names would most<br>
likely be happy to receive the warning, but the fact that you disagree<br>
suggests that this isn't an obvious conclusion ;-)<br>
<br>
I'm curious if you would argue for an SQL table to allow duplicate<br>
column names for the same reasons? I do know you can torture SQL to<br>
get two colnames to be the same by aliasing, but this also seems to<br>
have slipped through as an accident:<br>
<br>
<a href="http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf" target="_blank">http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf</a><br>
<br>
(which I found from here):<br>
<a href="http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table" target="_blank">http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table</a><br>
<br>
Perhaps we should email this guy Hugh to see what he thinks about this one :-)<br>
<div><div><br>
-steve<br>
<br>
--<br>
Steve Lianoglou<br>
Computational Biologist<br>
Bioinformatics and Computational Biology<br>
Genentech<br>
</div></div></div></div></blockquote></div><br></div>
</div></div><br></div></div><div>_______________________________________________<br>
datatable-help mailing list<br>
<a href="mailto:datatable-help@lists.r-forge.r-project.org" target="_blank">datatable-help@lists.r-forge.r-project.org</a><br>
<a href="https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help" target="_blank">https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help</a><br></div></div></blockquote></div><br>
</div>
</div></div></blockquote></div><br></div>
</div></div></span>
</div></blockquote><div>
<br>
</div>
</div></div></div></blockquote></div><br></div>
</div></div></span>
</blockquote>
<div>
<br>
</div>
</div></div></blockquote></div><br></div>