[datatable-help] Unexpected behavior in setnames()

Arunkumar Srinivasan aragorn168b at gmail.com
Wed Nov 6 23:50:39 CET 2013


Eddi,  

1) We can still allow duplicate names in "fread" and during creation of data.table with the data.table() command.
2) There's really no loss of data: we can allow "setnames" to set duplicate names or de-duplicate them, and the user still has the data anyway, since it was loaded into R using fread.
3) The point is to decide upon where duplicate names are allowed and where it should give an error…  

As I said before, I think it's essential to allow duplicate names while loading a file (and therefore, for consistency, during creation of a data.table as well). However, all operations where ambiguity can arise (grouping, aggregating, subsetting, etc.) should end in an error. At least this is my stance so far. Do we agree on this?
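
To make the stance concrete, here is a minimal sketch in base R. It is not data.table's actual behaviour: data.frame stands in for data.table, and `check_unambiguous` is a hypothetical helper invented for illustration. Duplicate names are accepted at creation, an operation that refers to a duplicated name stops with an error, and the names can still be made unique afterwards with ".1"-style suffixes.

```r
# Base-R sketch only; data.table is deliberately not used here.
# Duplicate names are accepted at creation time (check.names = FALSE).
df <- data.frame(a = 1:3, a = 4:6, b = 7:9, check.names = FALSE)

# Hypothetical helper: stop if any requested column name is duplicated in x.
check_unambiguous <- function(x, cols) {
  nm <- names(x)
  dup <- intersect(cols, nm[duplicated(nm)])
  if (length(dup) > 0L) {
    stop("ambiguous column name(s): ", paste(dup, collapse = ", "))
  }
  invisible(TRUE)
}

# "b" is unique, so this passes silently.
check_unambiguous(df, "b")

# "a" appears twice, so this raises an error; capture its message.
res <- tryCatch(check_unambiguous(df, "a"),
                error = function(e) conditionMessage(e))

# De-duplicating is still possible afterwards, using ".1"-style suffixes.
unique_names <- make.unique(names(df))
```

The original names stay in `names(df)` until the user chooses to overwrite them, so nothing is lost by deferring the error to the point of ambiguity.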

Arun


On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote:

> You mean what would be the problem?
>  
> Well, if the user fread's that data, then modifies e.g. non-duplicate columns and then tries to write.csv it back - how would the user recover the original names for correctly writing the data back if we renamed the columns?  
>  
>  
> On Wed, Nov 6, 2013 at 10:10 AM, <aragorn168b at gmail.com> wrote:
> > Eddi,  
> > Nice! But what exactly will happen to that data if we were to automatically set unique names while loading it (using "fread") and issue a warning?
> >  
> > Arun
> >  
> >  
> > On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote:
> >  
> > > Last comment here has an example of using duplicated names - http://stackoverflow.com/a/19809942/817778 - it's very similar to the one I mentioned earlier.
> > >  
> > >  
> > > On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil <chinmay.patil at gmail.com> wrote:
> > > > FWIW, data.frame does allow duplicate names as well. Given that data.table inherits from data.frame, I would expect it to follow the same convention as data.frame.
> > > >  
> > > >  
> > > > On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan <eduard.antonyan at gmail.com> wrote:
> > > > > @Arun: Ok. Thinking about it a bit - I don't like the continuing enumeration solution because it makes the results too unpredictable, but could live with adding a ".1" etc. Which I assume is the idea anyway for resolving duplicates elsewhere.
> > > > >  
> > > > > @Steve: Not sure why you think it doesn't hold much water - I think I can draw a parallel argument that replicates all of the duplicated-names concerns with a column that is called e.g. `dt$V1` (imagine forgetting the backticks there and the world of hurt that potentially awaits once you do that). I am also curious what Matthew would think about this. This is something I've encountered and dealt with a lot, so I'm certainly not an unbiased party here.
> > > > >  
> > > > >  
> > > > > On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
> > > > > > On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
> > > > > > <eduard.antonyan at gmail.com> wrote:
> > > > > > > Tbh I don't see why data presentation and preservation (i.e. if you're
> > > > > > > reading in data with duplicated columns) is not enough of a use case -
> > > > > > > that's the only reason we allow arbitrary symbols in column names.
> > > > > > >
> > > > > > > So, instead of giving you another use case, how about you tell me instead
> > > > > > > what do you propose should happen here (instead of what happens now):
> > > > > > >
> > > > > > >> dt = data.table(1, 2)
> > > > > > >> dt
> > > > > > >    V1 V2
> > > > > > > 1:  1  2
> > > > > > >> dt[, sum(V2), by = V1]
> > > > > > >    V1 V1
> > > > > > > 1:  1  2
> > > > > >  
> > > > > > Only Matthew could say for sure, but if I were a gambling man I'd bet
> > > > > > that this was likely something that slipped through the cracks and
> > > > > > sleeping dogs were left to lie. I'd be curious to see what his
> > > > > > opinions on this are.
> > > > > >  
> > > > > > IMHO the "data presentation" argument doesn't really hold much water.
> > > > > >  
> > > > > > As for "data preservation," I rather see it as imposing structure on
> > > > > > it to enable efficient -- and sane/unambiguous -- computation over it.
> > > > > > Further, I don't think this is a preservation issue at all -- no data is
> > > > > > lost. The original data is still there in the file that was loaded
> > > > > > into R. The name of a column is changed when imported (with adequate
> > > > > > warning) into a data.table so that the user can slice and dice it. I'd
> > > > > > also guess the user being warned by the duplicate names would most
> > > > > > likely be happy to receive the warning, but the fact that you disagree
> > > > > > suggests that this isn't an obvious conclusion ;-)
> > > > > >  
> > > > > > I'm curious if you would argue for an SQL table to allow duplicate
> > > > > > column names for the same reasons? I do know you can torture SQL to
> > > > > > get two colnames to be the same by aliasing, but this also seems to
> > > > > > have slipped through as an accident:
> > > > > >  
> > > > > > http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
> > > > > >  
> > > > > > (which I found from here):
> > > > > > http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
> > > > > >  
> > > > > > Perhaps we should email this guy Hugh to see what he thinks about this one :-)
> > > > > >  
> > > > > > -steve
> > > > > >  
> > > > > > --
> > > > > > Steve Lianoglou
> > > > > > Computational Biologist
> > > > > > Bioinformatics and Computational Biology
> > > > > > Genentech
> > > > >  
> > > > >  
> > > > > _______________________________________________
> > > > > datatable-help mailing list
> > > > > datatable-help at lists.r-forge.r-project.org
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > >  
> > >  
> > >  
> > >  
> > >  
> >  
> >  
>  
