[datatable-help] Unexpected behavior in setnames()

Simon O'Hanlon simon.ohanlon at imperial.ac.uk
Fri Nov 8 14:30:55 CET 2013


Eduard Antonyan <eduard.antonyan <at> gmail.com> writes:

> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore, for consistency, during creation of a
> data.table as well). However, all grouping/aggregating/subsetting etc.
> where ambiguity can arise should end in an error. At least this is my
> stance so far. Are we agreeing on this?
> 
> 
> 
> Sounds good to me. 
> 
> 
> On Wed, Nov 6, 2013 at 4:50 PM, Arunkumar Srinivasan
> <aragorn168b <at> gmail.com> wrote:
>
> Eddi,
> 
> 1) We can still allow duplicate names in "fread" and during creation of a
> data.table with the data.table() command.
> 2) There's really no loss of data, as we can allow "setnames" to set
> duplicate names/unduplicate them (and they have the data anyway, as they
> load it into R using fread). There's therefore no *real* loss of data.
>
> 3) The point is to decide where duplicate names are allowed and where
> they should give an error…
> 
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore, for consistency, during creation of a
> data.table as well). However, all grouping/aggregating/subsetting etc.
> where ambiguity can arise should end in an error. At least this is my
> stance so far. Are we agreeing on this?
>
> Arun
>
> On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote:
>
> You mean what would be the problem?
> Well, if the user fread's that data, then modifies e.g. non-duplicate
> columns and then tries to write.csv it back - how would the user recover
> the original names for correctly writing the data back if we renamed the
> columns?
>
> On Wed, Nov 6, 2013 at 10:10 AM, <aragorn168b <at> gmail.com> wrote:
>
> Eddi,
>
> Nice! But what exactly will happen to that data if we were to
> automatically set unique names while loading it (using "fread") and issue
> a warning?
>
> Arun
>
> On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote:
>
> Last comment here has an example of using duplicated names -
> http://stackoverflow.com/a/19809942/817778 - it's very similar to the one
> I mentioned earlier.
>
> On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil
> <chinmay.patil <at> gmail.com> wrote:
>
> FWIW, data.frame does allow duplicate names as well. Given that
> data.table inherits from data.frame, I would expect it to follow the same
> convention as data.frame.
>
> On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan
> <eduard.antonyan <at> gmail.com> wrote:
>
> @Arun: Ok. Thinking about it a bit - I don't like the continuing
> enumeration solution because it makes the results too unpredictable, but
> I could live with adding a ".1" etc., which I assume is the idea anyway
> for resolving duplicates elsewhere.
> @Steve: Not sure why you think it doesn't hold much water - I think I
> can draw a parallel argument that replicates all of the duplicated-names
> concerns with a column that is called e.g. `dt$V1` (imagine forgetting the
> backticks there and the world of hurt that potentially awaits once you do
> that). I am also curious what Matthew would think about this. This is
> something I've encountered and dealt with a lot, so I'm certainly not an
> unbiased party here.
>
> On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou
> <lianoglou.steve <at> gene.com> wrote:
>
> On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
> <eduard.antonyan <at> gmail.com> wrote:
> > Tbh I don't see why data presentation and preservation (i.e. if you're
> > reading in data with duplicated columns) is not enough of a use case -
> > that's the only reason we allow arbitrary symbols in column names.
> >
> > So, instead of giving you another use case, how about you tell me
> > instead what do you propose should happen here (instead of what happens
> > now):
> >
> >> dt = data.table(1, 2)
> >> dt
> >    V1 V2
> > 1:  1  2
> >> dt[, sum(V2), by = V1]
> >    V1 V1
> > 1:  1  2
>
> Only Matthew could say for sure, but if I were a gambling man I'd bet
> that this was likely something that slipped through the cracks and
> sleeping dogs were left to lie. I'd be curious to see what his
> opinions on this are.
>
> IMHO the "data presentation" argument doesn't really hold much water.
> As for "data preservation," I rather see it as imposing structure on
> it to enable efficient -- and sane/unambiguous -- computation over it.
>
> Further, I don't think this is a preservation issue at all -- no data is
> lost. The original data is still there in the file that was loaded
> into R. The name of a column is changed when imported (with adequate
> warning) into a data.table so that the user can slice and dice it. I'd
> also guess the user being warned by the duplicate names would most
> likely be happy to receive the warning, but the fact that you disagree
> suggests that this isn't an obvious conclusion.
>
> I'm curious if you would argue for an SQL table to allow duplicate
> column names for the same reasons? I do know you can torture SQL to
> get two colnames to be the same by aliasing, but this also seems to
> have slipped through as an accident:
> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
> (which I found from here):
> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
>
> Perhaps we should email this guy Hugh to see what he thinks about this one.
>
> -steve
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
> _______________________________________________
> datatable-help mailing list
> datatable-help <at> lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

I am not particularly for or against duplicate column names, although I do
see the issues they create.

I think that whatever you, as custodians of data.table, decide with respect
to column names, the behaviour of numeric indices used in .SDcols to select
the columns included in .SD needs to be fixed when duplicate column names
are present. As a user I'd expect the following to return two columns with
the values 2 and 6 respectively:

Example:

library(data.table)

dt <- data.table( 1,2,3,4 )
setnames(dt , rep( c("a", "b") , 2 ) )
dt
   a b a b
1: 1 2 3 4

# Actual result: the first "a" column is returned twice
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
   a a
1: 2 2
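
One possible workaround in the meantime is sketched below. It is only a sketch under stated assumptions, not a statement of how data.table should resolve this: it temporarily makes the names unique with base R's make.unique(), selects by position, and then restores the original names on the result. The names orig_names, tmp, idx and res are introduced here purely for illustration.

```r
library(data.table)

dt <- data.table(1, 2, 3, 4)
setnames(dt, rep(c("a", "b"), 2))

# Remember the original (duplicated) names so they can be restored later.
orig_names <- names(dt)

# Work on a copy whose names are made unique: "a" "b" "a.1" "b.1".
tmp <- copy(dt)
setnames(tmp, make.unique(orig_names))

# Select by position; with unique names each index refers to its own column.
idx <- c(1, 3)
res <- tmp[, lapply(.SD, function(x) x * 2), .SDcols = idx]

# Put the original (duplicated) names back on the result.
setnames(res, orig_names[idx])
res
```

With unique names in place, the two selected columns should hold 2 and 6, which is the result I expected from the original call. The copy() keeps dt itself untouched; it could be dropped if renaming dt in place is acceptable.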

I hope that contributes in some small way to your decision making process.
This is lifted from a question I asked on Stack Overflow here:

http://stackoverflow.com/questions/19811644/can-data-table-handle-identical-column-names-when-using-sdcols



Thanks,


Simon


