[datatable-help] Unexpected behavior in setnames()

Thu Nov 7 00:04:51 CET 2013

>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc., where
> ambiguity can arise should end in error. At least this is my stance so far.
> Are we agreeing on this?

Sounds good to me.
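
To make that stance concrete, here is a minimal sketch in base R. It is only an illustration under assumptions: a plain data.frame stands in for a data.table, and `check_unambiguous` is a hypothetical helper, not a function in data.table. It shows duplicate names being kept at creation, the ".1" style of de-duplication via make.unique(), and an error raised only when a requested name is ambiguous.

```r
# Base R only. `check_unambiguous` is a hypothetical helper for illustration;
# it is not part of data.table.

# A data.frame keeps duplicate names only when check.names = FALSE.
df <- data.frame(V1 = 1, V1 = 2, check.names = FALSE)
names(df)                 # both columns are called "V1"

# make.unique() resolves duplicates by appending ".1", ".2", ...
make.unique(names(df))

# Fail loudly if any requested column name occurs more than once in `x`.
check_unambiguous <- function(x, cols) {
  dup <- unique(names(x)[duplicated(names(x))])
  bad <- intersect(cols, dup)
  if (length(bad) > 0) {
    stop("ambiguous column name(s): ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

df2 <- data.frame(V1 = 1, V1 = 2, V2 = 3, check.names = FALSE)
check_unambiguous(df2, "V2")                             # passes: "V2" is unique
res <- try(check_unambiguous(df2, "V1"), silent = TRUE)  # errors: "V1" is duplicated
inherits(res, "try-error")
```

The check is done per requested name rather than on the whole table, so a table with duplicated columns stays usable as long as the operation only touches unambiguous ones.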

On Wed, Nov 6, 2013 at 4:50 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:

>  Eddi,
>
> 1) We can still allow duplicate names in "fread" and during creation of
> data.table with the data.table() command.
> 2) There's really no loss of data, as we can allow "setnames" to set
> duplicate names or unduplicate them (and users have the data anyway, since
> they loaded it into R using fread). There's therefore no *real* loss of data.
> 3) The point is to decide upon where duplicate names are allowed and where
> it should give an error…
>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc., where
> ambiguity can arise should end in error. At least this is my stance so far.
> Are we agreeing on this?
>
> Arun
>
> On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote:
>
> You mean what would be the problem?
>
> Well, if the user fread's that data, then modifies e.g. non-duplicate
> columns and then tries to write.csv it back - how would the user recover
> the original names for correctly writing the data back if we renamed the
> columns?
>
>
> On Wed, Nov 6, 2013 at 10:10 AM, <aragorn168b at gmail.com> wrote:
>
>  Eddi,
> Nice! But what exactly will happen to that data, if we were to
> automatically set unique names while loading it (using "fread") (and issue
> a warning)?
>
> Arun
>
> On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote:
>
> Last comment here has an example of using duplicated names -
> http://stackoverflow.com/a/19809942/817778 - it's very similar to the one
> I mentioned earlier.
>
>
> On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil <chinmay.patil at gmail.com> wrote:
>
>  FWIW, data.frame does allow duplicate names as well. In the light that
> data.table inherits from data.frame, I would expect that it follows same
> convention as data.frame.
>
>
> On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan <eduard.antonyan at gmail.com> wrote:
>
> @Arun: Ok. Thinking about it a bit - I don't like the continuing
> enumeration solution because it makes the results too unpredictable, but
> could live with adding a ".1" etc. Which I assume is the idea anyway for
> resolving duplicates elsewhere.
>
> @Steve: Not sure why you think it doesn't hold much water - I think I can
> draw a parallel argument that replicates all of the duplicated names
> concerns with a column that is called e.g. `dt$V1` (imagine forgetting the
> backticks there and the world of hurt that potentially awaits once you do
> that). I am also curious what Matthew would think about this. This is
> something I've encountered and dealt with a lot, so I'm certainly not an
> unbiased party here.
>
>
> On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
>
>  On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
> > Tbh I don't see why data presentation and preservation (i.e. if you're
> > reading in data with duplicated columns) is not enough of a use case -
> > that's the only reason we allow arbitrary symbols in column names.
> >
> > So, instead of giving you another use case, how about you tell me
> > what you propose should happen here (instead of what happens now):
> >
> >> dt = data.table(1, 2)
> >> dt
> >    V1 V2
> > 1:  1  2
> >> dt[, sum(V2), by = V1]
> >    V1 V1
> > 1:  1  2
>
> Only Matthew could say for sure, but if I were a gambling man I'd bet
> that this was likely something that slipped through the cracks and
> sleeping dogs were left to lie. I'd be curious to see what his
> opinions on this are.
>
> IMHO the "data presentation" argument doesn't really hold much water.
>
> As for "data preservation," I rather see it as imposing structure on
> it to enable efficient -- and sane/unambiguous -- computation over it.
> Further, I don't think this is a preservation issue at all -- no data is
> lost. The original data is still there in the file that was loaded
> into R. The name of a column is changed when imported (with adequate
> warning) into a data.table so that the user can slice and dice it. I'd
> also guess the user being warned by the duplicate names would most
> likely be happy to receive the warning, but the fact that you disagree
> suggests that this isn't an obvious conclusion ;-)
>
> I'm curious if you would argue for an SQL table to allow duplicate
> column names for the same reasons? I do know you can torture SQL to
> get two colnames to be the same by aliasing, but this also seems to
> have slipped through as an accident:
>
> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
>
> (which I found from here):
>
> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
>
> Perhaps we should email this guy Hugh to see what he thinks about this one
> :-)
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>