[datatable-help] Unexpected behavior in setnames()
Eduard Antonyan
eduard.antonyan at gmail.com
Sat Nov 2 16:30:17 CET 2013
Thanks Alexandre. I added a (non-committal) FR about this:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5037&group_id=240&atid=978,
which will likely go in the direction this thread goes.
To address your points:
1. If a user decides to have columns with duplicate names, yes, their job
will become harder, but that's the user's decision; everyone else who
doesn't use duplicate names loses no flexibility and doesn't need to
use column numbers or whatnot.
2. I agree that this should be documented better and appropriate
warnings should be added.
One of the cool things about data.table that's very different from
data.frame is that you can have arbitrary column names. Whether they
include spaces, crazy symbols or are duplicate - it'll all be valid. This
is very useful for reading and writing/presenting arbitrary data.
This does mean though that if (and *only* if) you choose to use
non-standard names you'll need to do more work.
Now the issue you ran into is that you didn't realize that you were using
non-standard naming (or even wanted to, but we can't guess what you want
:)). And a warning in the right place can help you out and also let
non-standard users proceed.
Once you understand that there is nothing wrong with duplicate names, it
should be clear that the appropriate warning spot is when you use them
potentially incorrectly, and not when you set them.
For reference there are a *lot* of different ways to get duplicate names,
to name a few besides setnames and creating one straight up - cbinding
similarly named data.tables, merging, having default named columns and
grouping (e.g. dt[, sum(smth), by = V1]), freading, etc.
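Since duplicates can arise in so many ways, a quick check on the names can help. The following is only a minimal sketch: dup_names() is a hypothetical helper, not part of data.table, and it relies solely on base R's duplicated().

```r
library(data.table)

# Hypothetical helper (not part of data.table): return the column names
# that occur more than once in the table.
dup_names <- function(dt) {
  nm <- names(dt)
  unique(nm[duplicated(nm)])
}

# A table created "straight up" with two columns named "x".
dt <- data.table(x = 1:3, x = 4:6, y = c(1, 1, 2))
dup_names(dt)
```

Calling it after a cbind, merge or fread would show whether that step introduced a repeated name.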
My 2 cents here.
There are several reasons why I don’t think, IMHO, allowing multiple
columns with the same name is a good idea:
- It will force the code to use column numbers to access all the data in a
predictable fashion (since depending on your code you might not know which
of the two columns with the same name will be the first), so we’ll lose all
the delicious syntactic sugar painstakingly added to data.table.
- For people learning data.table and having data.frame or even the concept
of a relational table as a reference, this is a definite WTF and will cause
confusion and complicate troubleshooting. I speak from experience on this
matter. :)
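As a rough illustration of the first point, this is what selecting same-named columns by position might look like. It is a sketch only; the variable names x_pos, first_x and second_x are made up for the example.

```r
library(data.table)

# A table with two columns named "x".
dt <- data.table(x = c(1, 1, 2, 2), y = c(1, 2, 3, 4), x = c(2, 2, 1, 1))

# Positions of every column named "x"; the name alone cannot tell them apart.
x_pos <- which(names(dt) == "x")

# Extract each column by its position instead of by its name.
first_x  <- dt[[x_pos[1]]]
second_x <- dt[[x_pos[2]]]
```

The code has to know in advance which position holds which "x", which is exactly the fragility described above.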
Even though there might be some situations where this might be a plus, I
imagine they are few and far between and could be worked around. I could be
wrong, it’s been known to happen :) - but I have never seen and can’t even
imagine a situation where multiple columns with the same name would be
essential. So in the balance I consider keeping this behavior as a bad
trade-off for most users.
Having said that, this is a design decision and it's up to the data.table
demigods to decide. :)
BTW, is there any part of the data.table documentation that covers this? If
you choose to maintain this property, I strongly suggest it be documented
somewhere that most beginners would read.
In my personal example, I ran into this problem after a rather long
troubleshooting of a very esoteric problem that was happening in my code.
I was renaming a column to a name that already existed, and this broke
things in a completely different part of my code. If ‘setnames()’ had at
least warned me that a duplicate column name was created, I would have been
able to find the root cause much faster.
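One way such a warning could be had today is a thin wrapper around setnames(). This is a sketch under assumptions, not data.table's own behavior: setnames_checked() is a hypothetical name, and it only handles character `old` arguments.

```r
library(data.table)

# Hypothetical wrapper (not part of data.table): warn when a rename would
# leave the table with duplicate column names, then rename as usual.
# Handles character `old` only; setnames() still validates its arguments.
setnames_checked <- function(dt, old, new) {
  stopifnot(is.character(old), is.character(new), length(old) == length(new))
  proposed <- names(dt)
  idx <- match(old, proposed)          # first position of each old name
  found <- !is.na(idx)
  proposed[idx[found]] <- new[found]   # names as they would be after renaming
  dups <- unique(proposed[duplicated(proposed)])
  if (length(dups) > 0L) {
    warning("renaming creates duplicate column name(s): ",
            paste(dups, collapse = ", "))
  }
  setnames(dt, old, new)
  invisible(dt)
}

d <- data.table(a = 1, b = 2, c = 3)
setnames_checked(d, "a", "b")   # emits the warning, then renames
```

The rename still goes through, so users who want duplicate names are not blocked; they just get told about it.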
--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor
"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
On November 1, 2013 at 21:10:45, Arunkumar Srinivasan
(aragorn168b at gmail.com) wrote:
Hm, I've not encountered that use myself, so I can't comment there. Then
it should probably be allowed everywhere except where choosing between the
duplicate columns could be an issue? Ex: subsetting/aggregating/grouping/
by-without-by etc. should result in an error (if one has the time, one could
do this by checking whether the duplicate column is actually in use and then
issuing an error/warning).
At the moment, I'm not convinced that it's worth that much trouble to help
data presentation.
Arun
On Saturday, November 2, 2013 at 12:05 AM, Eduard Antonyan wrote:
Because it's very useful for e.g. data presentation purposes.
On Fri, Nov 1, 2013 at 6:02 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
Yes, it chooses the first. But we won't be able to perform any operation
as intended. So why allow duplicate names (ex: in `setnames` as Alexandre
asks)?
Arun
On Friday, November 1, 2013 at 11:57 PM, Eduard Antonyan wrote:
I think currently it chooses the first "x", but it's definitely a good
idea to add a warning there.
On Fri, Nov 1, 2013 at 5:51 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
Ricardo added a bug report here on this topic:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975
But I don't think having duplicate names is an easy-to-implement concept.
For ex:
dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))
dt[, print(.SD), by=y]
   x
1: 1
2: 2
   x
1: 3
.SD loses the second "x". Also, some other questions become difficult to
handle. Ex:
dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))
dt[, list(x=x/x[1], y=y), by=x]
Which "x" should be chosen for which operation?
Arun
On Friday, November 1, 2013 at 10:59 PM, Eduard Antonyan wrote:
Having duplicate names is allowed and not that unusual in data.table
framework, so there is no need to signal anything here.
A different question is whether there should be a warning here:
dt = data.table(a = 1, a = 2)
dt[, a]
and I think that'd be a pretty good FR to have.
On Fri, Nov 1, 2013 at 4:49 PM, Alexandre Sieira <alexandre.sieira at gmail.com> wrote:
I found this behavior during a debugging session:
> d = data.table(a=1, b=2, c=3)
> setnames(d, "a", "b")
> d
   b b c
1: 1 2 3
Shouldn’t setnames() check if the new column names already exist before
renaming, and signal an error or at least a warning if they do?
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help