From alexandre.sieira at gmail.com Fri Nov 1 22:49:07 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Fri, 1 Nov 2013 19:49:07 -0200
Subject: [datatable-help] Unexpected behavior in setnames()
Message-ID:

I found this behavior during a debugging session:

> d = data.table(a=1, b=2, c=3)
> setnames(d, "a", "b")
> d
   b b c
1: 1 2 3

Shouldn't setnames() check if the new column names already exist before
renaming, and signal an error or at least a warning if they do?

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From eduard.antonyan at gmail.com Fri Nov 1 22:59:57 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 1 Nov 2013 16:59:57 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References:
Message-ID:

Having duplicate names is allowed and not that unusual in the data.table
framework, so there is no need to signal anything here.

A different question is whether there should be a warning here:

dt = data.table(a = 1, a = 2)
dt[, a]

and I think that'd be a pretty good FR to have.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From aragorn168b at gmail.com Fri Nov 1 23:51:18 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 1 Nov 2013 23:51:18 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References:
Message-ID: <957B1243714142278898647650EBF386@gmail.com>

Ricardo added a bug report here on this topic:
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975
But I don't think having duplicate names is an easy-to-implement concept.
For ex:

dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))
dt[, print(.SD), by=y]
   x
1: 1
2: 2
   x
1: 3

.SD loses the second "x". Also, some other questions become difficult to
handle. Ex:

dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))
dt[, list(x=x/x[1], y=y), by=x]

Which "x" should be chosen for which operation?

Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From eduard.antonyan at gmail.com Fri Nov 1 23:57:51 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 1 Nov 2013 17:57:51 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <957B1243714142278898647650EBF386@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com>
Message-ID:

I think currently it chooses the first "x", but it's definitely a good
idea to add a warning there.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From aragorn168b at gmail.com Sat Nov 2 00:02:38 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 2 Nov 2013 00:02:38 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com>
Message-ID: <5E98018F047943DE89849EC57A7CF72A@gmail.com>

Yes, it chooses the first. But we won't be able to perform any operation
as intended. So why allow duplicate names (ex: in `setnames` as Alexandre
asks)?
Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From eduard.antonyan at gmail.com Sat Nov 2 00:05:46 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 1 Nov 2013 18:05:46 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <5E98018F047943DE89849EC57A7CF72A@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com>
Message-ID:

Because it's very useful for e.g. data presentation purposes.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From aragorn168b at gmail.com Sat Nov 2 00:10:41 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 2 Nov 2013 00:10:41 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com>
Message-ID:

Hm, I've not encountered that use myself, so I can't comment there.
Probably then it should be allowed everywhere except where deciding which
column to use could be an issue? Ex: subsetting/aggregating/grouping/
by-without-by etc. should result in an error (if one has the time, one
could do this by checking whether the duplicate column is actually in use
and then issuing an error/warning).

At the moment, I'm not convinced that it's worth that much trouble to help
data presentation.

Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From alexandre.sieira at gmail.com Sat Nov 2 13:00:17 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Sat, 2 Nov 2013 10:00:17 -0200
Subject: [datatable-help] datatable-help Digest, Vol 45, Issue 2
In-Reply-To:
References:
Message-ID:

My 2 cents here. There are several reasons why I don't think, IMHO,
allowing multiple columns with the same name is a good idea:

- It will force the code to use column numbers to access all the data in a
  predictable fashion (since depending on your code you might not know
  which of the two columns with the same name will be the first), so we'll
  lose all the delicious syntactic sugar painstakingly added to data.table.
- For people learning data.table and having data.frame or even the concept
  of a relational table as a reference, this is a definite WTF and will
  cause confusion and complicate troubleshooting. I speak from experience
  on this matter. :)

Even though there might be some situations where this might be a plus, I
imagine they are few and far between and could be worked around.
I could be wrong, it's been known to happen :) - but I have never seen and
can't even imagine a situation where multiple columns with the same name
would be essential. So on balance I consider keeping this behavior a bad
trade-off for most users.

Having said that, this is a design decision and it's up to the data.table
demigods to decide. :)

BTW, is there any part of the data.table documentation that covers this?
If you choose to maintain this property, I strongly suggest it be
documented somewhere that most beginners would read.

In my personal example, I ran into this problem after a rather long
troubleshooting of a very esoteric problem that was happening in my code.
I was renaming a column to a name that already existed, and this broke
things in a completely different part of my code. If setnames() had at
least warned me that a duplicate column name was created, I would have
been able to detect the root cause much faster.

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From alexandre.sieira at gmail.com Sat Nov 2 13:10:13 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Sat, 2 Nov 2013 10:10:13 -0200
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com>
Message-ID:

My 2 cents here. There are several reasons why I don't think, IMHO,
allowing multiple columns with the same name is a good idea:

- It will force the code to use column numbers to access all the data in a
  predictable fashion (since depending on your code you might not know
  which of the two columns with the same name will be the first), so we'll
  lose all the delicious syntactic sugar painstakingly added to data.table.
- For people learning data.table and having data.frame or even the concept
  of a relational table as a reference, this is a definite WTF and will
  cause confusion and complicate troubleshooting. I speak from experience
  on this matter. :)

Even though there might be some situations where this might be a plus, I
imagine they are few and far between and could be worked around. I could
be wrong, it's been known to happen :) - but I have never seen and can't
even imagine a situation where multiple columns with the same name would
be essential. So on balance I consider keeping this behavior a bad
trade-off for most users.

Having said that, this is a design decision and it's up to the data.table
demigods to decide. :)

BTW, is there any part of the data.table documentation that covers this?
If you choose to maintain this property, I strongly suggest it be
documented somewhere that most beginners would read.

In my personal example, I ran into this problem after a rather long
troubleshooting of a very esoteric problem that was happening in my code.
I was renaming a column to a name that already existed, and this broke
things in a completely different part of my code. If setnames() had at
least warned me that a duplicate column name was created, I would have
been able to detect the root cause much faster.

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From eduard.antonyan at gmail.com Sat Nov 2 16:30:17 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 10:30:17 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com>
Message-ID:

Thanks Alexandre. I added (a non-committal) FR about this -
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5037&group_id=240&atid=978,
which will likely go in the direction this thread goes.

To address your points:

1. If a user decides to have columns with duplicate names, yes, their job
   will become harder, but that's a user decision, and everyone else who
   doesn't use duplicate names does not lose flexibility and doesn't need
   to use column numbers or whatnot.
2. I agree that this should be documented better and appropriate warnings
   should be added.

One of the cool things about data.table that's very different from
data.frame is that you can have arbitrary column names. Whether they
include spaces, crazy symbols or are duplicate - it'll all be valid. This
is very useful for reading and writing/presenting arbitrary data. This
does mean though that if (and *only* if) you choose to use non-standard
names you'll need to do more work.

Now the issue you ran into is that you didn't realize that you were using
non-standard naming (or even wanted to, but we can't guess what you want
:)).
And a warning in the right place can help you out and also let non-standard users proceed.

Once you understand that there is nothing wrong with duplicate names, it should be clear that the appropriate warning spot is when you use them potentially incorrectly, and not when you set them.

For reference there are a *lot* of different ways to get duplicate names, to name a few besides setnames and creating one straight up - cbinding similarly named data.tables, merging, having default named columns and grouping (e.g. dt[, sum(smth), by = V1]), freading, etc.
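The check Alexandre asks for can be approximated outside data.table by validating the names before renaming. The following is only a minimal sketch under stated assumptions: `checked_names` is a hypothetical helper, not part of data.table, and it stops with an error rather than warning. It works on plain character vectors, so the check itself needs only base R.

```r
# Hypothetical helper, not part of data.table: compute the names a rename
# would produce and stop if any of them would be duplicated.
checked_names <- function(current, old, new) {
  stopifnot(length(old) == length(new))
  idx <- match(old, current)
  if (anyNA(idx)) {
    stop("unknown column(s): ", paste(old[is.na(idx)], collapse = ", "))
  }
  proposed <- current
  proposed[idx] <- new
  dup <- unique(proposed[duplicated(proposed)])
  if (length(dup) > 0) {
    stop("rename would create duplicate column name(s): ",
         paste(dup, collapse = ", "))
  }
  proposed
}

# Usage with data.table, if it is installed: validate first, then rename.
if (requireNamespace("data.table", quietly = TRUE)) {
  d <- data.table::data.table(a = 1, b = 2, c = 3)
  checked_names(names(d), "a", "z")   # passes the check
  data.table::setnames(d, "a", "z")   # so the rename is applied
  # checked_names(names(d), "z", "b") would stop with an error here
}
```

Checking before the rename, rather than after, matters here because setnames() modifies the table by reference, so a failed check leaves the original names untouched.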
From aragorn168b at gmail.com Sat Nov 2 16:41:53 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 2 Nov 2013 16:41:53 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com>
Message-ID: <94F078AB544B4757A58049C7DB7433AB@gmail.com>

> Now the issue you ran into is that you didn't realize that you were using non-standard naming (or even wanted to, but we can't guess what you want :)).

I have to disagree here. This is one of the things `data.table` is *extremely* good at. The error/warning messages are precise and under most circumstances provide the solution (as to what you ought to do) by spotting the mistake exactly.

> Once you understand that there is nothing wrong with duplicate names, it should be clear that the appropriate warning spot is when you use them potentially incorrectly, and not when you set them.

This, I believe, is also not entirely true, at least in this scenario. For example, an error happens when assigning a duplicate column using `:=`:

    > DT <- data.table(x=1:5, y=6:10)
    > DT[, c("y", "y") := 1L]
    Error in `[.data.table`(DT, , `:=`(c("y", "y"), 1L)) :
      Can't assign to the same column twice in the same query (duplicates detected).

So, it's only natural to expect a warning/error in other cases as well. In general, prevention is better - it's nicer to catch it earlier and spit out a warning/error than to let it through only to catch it later.

Overall, I agree keeping duplicate names may help some users. But then, the potential side-effects should be marked with warnings/errors distinctly, in all cases (and preferably documented). Ex: grouping/aggregating is one such scenario (Ricardo's bug report) where we cannot possibly know which column to use.

Arun

From lianoglou.steve at gene.com Sun Nov 3 01:10:07 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Sat, 2 Nov 2013 17:10:07 -0700
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <94F078AB544B4757A58049C7DB7433AB@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

Hi,

On Sat, Nov 2, 2013 at 8:41 AM, Arunkumar Srinivasan wrote:
[snip]
> Overall, I agree keeping duplicate names may help some users. But then, the
> potential side-effects should be marked with warnings/errors distinctly, in
> all cases (and preferably documented).
[/snip]

I guess I must have missed it, but has anyone anywhere (in this thread, a FR or something) actually presented a (concrete) compelling situation where allowing duplicate column names was actually useful?

I'm hard pressed to come up with any situation where (purposefully) keeping duplicate column names in a data.table has more benefit than downside.
Seems to me that if this ever happens, it most certainly would be by mistake.

Can someone help me out here?

In the case of cbinding two data.tables together that end up having two duplicate names, I'd imagine unique-ing the names of the data.tables and firing a warning that this was done would be most useful (uniqueness priority would be from left to right as the data.tables are passed into the cbind call).

-steve

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From aragorn168b at gmail.com Sun Nov 3 01:31:28 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 3 Nov 2013 01:31:28 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

> I guess I must have missed it, but has anyone anywhere (in this
> thread, a FR or something) actually presented a (concrete) compelling
> situation where allowing duplicate column names was actually useful?

True, no really compelling situations so far. The only example I've seen (in this thread) is regarding data presentation purposes (from eddi). I still don't quite know in what way exactly. I can understand, though, that the data itself may sometimes be available in such a format. But one can always make unique names while loading.

> I'm hard pressed to come up with any situation where (purposefully)
> keeping duplicate column names in a data.table has more benefit than
> downside. Seems to me that if this ever happens, it most certainly
> would be by mistake.

I agree.

> In the case of cbinding two data.tables together that end up having
> two duplicate names, I'd imagine unique-ing the names of the
> data.tables and firing a warning that this was done would be most
> useful (uniqueness priority would be from left to right as the
> data.tables are passed into the cbind call)

Unless there's a nice argument why this (unique-ing the names) would be bad, or a case in which keeping duplicate names would be good, I agree with you on this point as well.

Arun
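The "unique the names and warn" idea discussed above can be illustrated with base R's make.unique(), which leaves the first occurrence of a name alone and suffixes later ones, matching the left-to-right priority Steve describes. This is a sketch only: `unique_names` is a hypothetical helper, and nothing in this thread says data.table's cbind does or will behave this way.

```r
# Hypothetical helper: make names unique with left-to-right priority
# (the first occurrence keeps its name) and warn about each change.
unique_names <- function(nms) {
  fixed <- make.unique(nms)
  changed <- fixed != nms
  if (any(changed)) {
    warning("duplicate column names renamed: ",
            paste(nms[changed], "->", fixed[changed], collapse = ", "))
  }
  fixed
}

# Names as they might look after cbinding tables that both have "x".
combined_names <- c("x", "y", "x")
unique_names(combined_names)   # warns and returns "x" "y" "x.1"
```

Eduard's objection still applies to this approach: the renamed column no longer carries the name it arrived with, which is exactly what he wants to avoid for presentation purposes.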
From eduard.antonyan at gmail.com Sun Nov 3 01:31:52 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 19:31:52 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

The main usage case I've personally encountered is data presentation (for either self or others), where I would sometimes organize data like so:

category1 name,colname1,colname2,category2 name,colname1,colname2
....numbersandstuff....

Also, in general there are many cases I brought up above that generate duplicate names, and I definitely don't want either lost columns or renamed columns as a result - both are data loss that I don't appreciate.

From aragorn168b at gmail.com Sun Nov 3 01:36:35 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 3 Nov 2013 01:36:35 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

Eddi,

While loading the data in, maybe, if it is essential to keep names intact, we can probably add an argument, "asis=TRUE" or something like that. But I don't see a reason for doing anything else in `data.table` using duplicate names and trying to catch errors when nothing meaningful can be done with them. Besides data presentation, can you tell any other use with them?

Arun

From eduard.antonyan at gmail.com Sun Nov 3 01:43:56 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 19:43:56 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

Tbh I don't see why data presentation and preservation (i.e.
if you're reading in data with duplicated columns) is not enough of a use case - that's the only reason we allow arbitrary symbols in column names.

So, instead of giving you another use case, how about you tell me instead what do you propose should happen here (instead of what happens now):

    > dt = data.table(1, 2)
    > dt
       V1 V2
    1:  1  2
    > dt[, sum(V2), by = V1]
       V1 V1
    1:  1  2

From aragorn168b at gmail.com Sun Nov 3 01:47:42 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 3 Nov 2013 01:47:42 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

> > dt[, sum(V2), by = V1]
>    V1 V1
> 1:  1  2

Eddi, the simplest explanation is, since we generate auto-names, we should check if the V-series names exist and if so, generate the next one automatically. That is, in this case, my thought process is, "V1" is the grouping column and it's going to be retained. "V2" is in "J", but it has no name. So, we should be able to decide that "V1" is already taken and assign "V2" automatically. At least, that's what I think *should* happen. We can check for the names to "list(...)" argument in "j" to do this, I think, not sure though.

Arun

From lianoglou.steve at gene.com Sun Nov 3 02:15:38 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Sat, 2 Nov 2013 18:15:38 -0700
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan wrote:
> Tbh I don't see why data presentation and preservation (i.e. if you're
> reading in data with duplicated columns) is not enough of a use case -
> that's the only reason we allow arbitrary symbols in column names.
>
> So, instead of giving you another use case, how about you tell me instead
> what do you propose should happen here (instead of what happens now):
>
>> dt = data.table(1, 2)
>> dt
>    V1 V2
> 1:  1  2
>> dt[, sum(V2), by = V1]
>    V1 V1
> 1:  1  2

Only Matthew could say for sure, but if I were a gambling man I'd bet that this was likely something that slipped through the cracks and sleeping dogs were left to lie. I'd be curious to see what his opinions on this are.

IMHO the "data presentation" argument doesn't really hold much water.

As for "data preservation," I rather see it as imposing structure on the data to enable efficient -- and sane/unambiguous -- computation over it. Further, I don't think this is a preservation issue at all -- no data is lost. The original data is still there in the file that was loaded into R. The name of a column is changed when imported (with adequate warning) into a data.table so that the user can slice and dice it. I'd also guess the user being warned about the duplicate names would most likely be happy to receive the warning, but the fact that you disagree suggests that this isn't an obvious conclusion ;-)

I'm curious if you would argue for an SQL table to allow duplicate column names for the same reasons?
I do know you can torture SQL to get two colnames to be the same by aliasing, but this also seems to have slipped through as an accident:

http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf

(which I found from here):

http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table

Perhaps we should email this guy Hugh to see what he thinks about this one :-)

-steve

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com Sun Nov 3 02:43:02 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 20:43:02 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID:

@Arun: Ok. Thinking about it a bit - I don't like the continuing enumeration solution because it makes the results too unpredictable, but could live with adding a ".1" etc. Which I assume is the idea anyway for resolving duplicates elsewhere.

@Steve: Not sure why you think it doesn't hold much water - I think I can draw a parallel argument that replicates all of the duplicated names concerns with a column that is called e.g. `dt$V1` (imagine forgetting the backticks there and the world of hurt that potentially awaits once you do that). I am also curious what Matthew would think about this. This is something I've encountered and dealt with a lot, so I'm certainly not an unbiased party here.

On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou wrote:

> On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
> wrote:
> > Tbh I don't see why data presentation and preservation (i.e. if you're
> > reading in data with duplicated columns) is not enough of a use case -
> > that's the only reason we allow arbitrary symbols in column names.
> > > > So, instead of giving you another use case, how about you tell me instead > > what do you propose should happen here (instead of what happens now): > > > >> dt = data.table(1, 2) > >> dt > > V1 V2 > > 1: 1 2 > >> dt[, sum(V2), by = V1] > > V1 V1 > > 1: 1 2 > > Only Matthew could say for sure, but if I were a gambling man I'd bet > that this was likely something that slipped through the cracks and > sleeping dogs were left to lie. I'd be curious to see what his > opinions on this are. > > IMHO the "data presentation" argument doesn't really hold much water. > > As for "data preservation," I rather see it as imposing structure on > it to enable efficient -- and sane/unambigous -- computation over it. > Further, I don't think is a preservation issue at all -- no data is > lost. The original data is still there in the file that was loaded > into R. The name of a column is changed when imported (with adequate > warning) into a data.table so that the user can slice and dice it. I'd > also guess the user being warned by the duplicate names would most > likely be happy to receive the warning, but the fact that you disagree > suggests that this isn't an obvious conclusion ;-) > > I'm curious if you would argue for an SQL table to allow duplicate > column names for the same reasons? I do know you can torture SQL to > get two colnames to be the same by aliasing, but this also seems to > have slipped through as an accident: > > http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf > > (which I found from here): > > http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table > > Perhaps we should email this guy Hugh to see what he thinks about this one > :-) > > -steve > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From chinmay.patil at gmail.com Mon Nov 4 10:54:27 2013 From: chinmay.patil at gmail.com (Chinmay Patil) Date: Mon, 4 Nov 2013 17:54:27 +0800 Subject: [datatable-help] Unexpected behavior in setnames() In-Reply-To: References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> Message-ID: FWIW, data.frame does allow duplicate names as well. In the light that data.table inherits from data.frame, I would expect that it follows same convention as data.frame. On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan wrote: > @Arun: Ok. Thinking about it a bit - I don't like the continuing > enumeration solution because it makes the results too unpredictable, but > could live with adding a ".1" etc. Which I assume is the idea anyway for > resolving duplicates elsewhere. > > @Steve: Not sure why you think it doesn't hold much water - I think I can > draw a parallel argument that replicates all of the duplicated names > concerns with a column that is called e.g. `dt$V1` (imagine forgetting the > backticks there and the world of hurt that potentially awaits once you do > that). I am also curious what Matthew would think about this. This is smth > I've encountered and dealt with a lot, so I'm certainly not an unbiased > party here. > > > On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou wrote: > >> On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan >> wrote: >> > Tbh I don't see why data presentation and preservation (i.e. if you're >> > reading in data with duplicated columns) is not enough of a use case - >> > that's the only reason we allow arbitrary symbols in column names. 
>> > >> > So, instead of giving you another use case, how about you tell me >> instead >> > what do you propose should happen here (instead of what happens now): >> > >> >> dt = data.table(1, 2) >> >> dt >> > V1 V2 >> > 1: 1 2 >> >> dt[, sum(V2), by = V1] >> > V1 V1 >> > 1: 1 2 >> >> Only Matthew could say for sure, but if I were a gambling man I'd bet >> that this was likely something that slipped through the cracks and >> sleeping dogs were left to lie. I'd be curious to see what his >> opinions on this are. >> >> IMHO the "data presentation" argument doesn't really hold much water. >> >> As for "data preservation," I rather see it as imposing structure on >> it to enable efficient -- and sane/unambigous -- computation over it. >> Further, I don't think is a preservation issue at all -- no data is >> lost. The original data is still there in the file that was loaded >> into R. The name of a column is changed when imported (with adequate >> warning) into a data.table so that the user can slice and dice it. I'd >> also guess the user being warned by the duplicate names would most >> likely be happy to receive the warning, but the fact that you disagree >> suggests that this isn't an obvious conclusion ;-) >> >> I'm curious if you would argue for an SQL table to allow duplicate >> column names for the same reasons? 
I do know you can torture SQL to >> get two colnames to be the same by aliasing, but this also seems to >> have slipped through as an accident: >> >> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf >> >> (which I found from here): >> >> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table >> >> Perhaps we should email this guy Hugh to see what he thinks about this >> one :-) >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Wed Nov 6 17:05:04 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 6 Nov 2013 10:05:04 -0600 Subject: [datatable-help] Unexpected behavior in setnames() In-Reply-To: References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> Message-ID: Last comment here has an example of using duplicated names - http://stackoverflow.com/a/19809942/817778 - it's very similar to the one I mentioned earlier. On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil wrote: > FWIW, data.frame does allow duplicate names as well. In the light that > data.table inherits from data.frame, I would expect that it follows same > convention as data.frame. > > > On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan > wrote: > >> @Arun: Ok. Thinking about it a bit - I don't like the continuing >> enumeration solution because it makes the results too unpredictable, but >> could live with adding a ".1" etc. Which I assume is the idea anyway for >> resolving duplicates elsewhere. 
>> >> @Steve: Not sure why you think it doesn't hold much water - I think I can >> draw a parallel argument that replicates all of the duplicated names >> concerns with a column that is called e.g. `dt$V1` (imagine forgetting the >> backticks there and the world of hurt that potentially awaits once you do >> that). I am also curious what Matthew would think about this. This is smth >> I've encountered and dealt with a lot, so I'm certainly not an unbiased >> party here. >> >> >> On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou > > wrote: >> >>> On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan >>> wrote: >>> > Tbh I don't see why data presentation and preservation (i.e. if you're >>> > reading in data with duplicated columns) is not enough of a use case - >>> > that's the only reason we allow arbitrary symbols in column names. >>> > >>> > So, instead of giving you another use case, how about you tell me >>> instead >>> > what do you propose should happen here (instead of what happens now): >>> > >>> >> dt = data.table(1, 2) >>> >> dt >>> > V1 V2 >>> > 1: 1 2 >>> >> dt[, sum(V2), by = V1] >>> > V1 V1 >>> > 1: 1 2 >>> >>> Only Matthew could say for sure, but if I were a gambling man I'd bet >>> that this was likely something that slipped through the cracks and >>> sleeping dogs were left to lie. I'd be curious to see what his >>> opinions on this are. >>> >>> IMHO the "data presentation" argument doesn't really hold much water. >>> >>> As for "data preservation," I rather see it as imposing structure on >>> it to enable efficient -- and sane/unambigous -- computation over it. >>> Further, I don't think is a preservation issue at all -- no data is >>> lost. The original data is still there in the file that was loaded >>> into R. The name of a column is changed when imported (with adequate >>> warning) into a data.table so that the user can slice and dice it. 
I'd
>>> also guess the user being warned by the duplicate names would most
>>> likely be happy to receive the warning, but the fact that you disagree
>>> suggests that this isn't an obvious conclusion ;-)
>>>
>>> I'm curious if you would argue for an SQL table to allow duplicate
>>> column names for the same reasons? I do know you can torture SQL to
>>> get two colnames to be the same by aliasing, but this also seems to
>>> have slipped through as an accident:
>>>
>>> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
>>>
>>> (which I found from here):
>>>
>>> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
>>>
>>> Perhaps we should email this guy Hugh to see what he thinks about this
>>> one :-)
>>>
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Computational Biologist
>>> Bioinformatics and Computational Biology
>>> Genentech

From aragorn168b at gmail.com Wed Nov 6 17:10:26 2013
From: aragorn168b at gmail.com (aragorn168b at gmail.com)
Date: Wed, 6 Nov 2013 17:10:26 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID: <9F7DC50A9B2C470C952973F162105BC4@gmail.com>

Eddi,
Nice! But what exactly will happen to that data if we were to automatically set unique names while loading it (using "fread"), and issue a warning?
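To make the question concrete, here is a minimal sketch of what "set unique names and warn" could look like. It uses base R's make.unique(), which appends ".1", ".2", ... to repeated names (the ".1" scheme mentioned earlier in the thread). The helper uniquify_names() is hypothetical, made up for this illustration; it is not part of data.table and says nothing about how fread actually behaves. A plain data.frame stands in for the loaded data.

```r
# Hypothetical helper (not a data.table function): rename duplicated
# columns with make.unique() and warn about what was changed.
uniquify_names <- function(df) {
  old <- names(df)
  new <- make.unique(old)            # c("b", "b", "c") -> c("b", "b.1", "c")
  changed <- old != new
  if (any(changed)) {
    warning("renamed duplicated columns: ",
            paste0(old[changed], " -> ", new[changed], collapse = ", "))
  }
  names(df) <- new
  df
}

# check.names = FALSE keeps the duplicated header, as a raw read would
d  <- data.frame(b = 1, b = 2, c = 3, check.names = FALSE)
d2 <- uniquify_names(d)              # emits the warning
names(d2)
# "b" "b.1" "c"
```

The original object d is left untouched here, so the header as loaded is still available if it is needed later.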
Arun On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote: > Last comment here has an example of using duplicated names - http://stackoverflow.com/a/19809942/817778 - it's very similar to the one I mentioned earlier. > > > On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil wrote: > > FWIW, data.frame does allow duplicate names as well. In the light that data.table inherits from data.frame, I would expect that it follows same convention as data.frame. > > > > > > On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan wrote: > > > @Arun: Ok. Thinking about it a bit - I don't like the continuing enumeration solution because it makes the results too unpredictable, but could live with adding a ".1" etc. Which I assume is the idea anyway for resolving duplicates elsewhere. > > > > > > @Steve: Not sure why you think it doesn't hold much water - I think I can draw a parallel argument that replicates all of the duplicated names concerns with a column that is called e.g. `dt$V1` (imagine forgetting the backticks there and the world of hurt that potentially awaits once you do that). I am also curious what Matthew would think about this. This is smth I've encountered and dealt with a lot, so I'm certainly not an unbiased party here. > > > > > > > > > On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou wrote: > > > > On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan > > > > wrote: > > > > > Tbh I don't see why data presentation and preservation (i.e. if you're > > > > > reading in data with duplicated columns) is not enough of a use case - > > > > > that's the only reason we allow arbitrary symbols in column names. 
> > > > > > > > > > So, instead of giving you another use case, how about you tell me instead > > > > > what do you propose should happen here (instead of what happens now): > > > > > > > > > >> dt = data.table(1, 2) > > > > >> dt > > > > > V1 V2 > > > > > 1: 1 2 > > > > >> dt[, sum(V2), by = V1] > > > > > V1 V1 > > > > > 1: 1 2 > > > > > > > > Only Matthew could say for sure, but if I were a gambling man I'd bet > > > > that this was likely something that slipped through the cracks and > > > > sleeping dogs were left to lie. I'd be curious to see what his > > > > opinions on this are. > > > > > > > > IMHO the "data presentation" argument doesn't really hold much water. > > > > > > > > As for "data preservation," I rather see it as imposing structure on > > > > it to enable efficient -- and sane/unambigous -- computation over it. > > > > Further, I don't think is a preservation issue at all -- no data is > > > > lost. The original data is still there in the file that was loaded > > > > into R. The name of a column is changed when imported (with adequate > > > > warning) into a data.table so that the user can slice and dice it. I'd > > > > also guess the user being warned by the duplicate names would most > > > > likely be happy to receive the warning, but the fact that you disagree > > > > suggests that this isn't an obvious conclusion ;-) > > > > > > > > I'm curious if you would argue for an SQL table to allow duplicate > > > > column names for the same reasons? 
I do know you can torture SQL to > > > > get two colnames to be the same by aliasing, but this also seems to > > > > have slipped through as an accident: > > > > > > > > http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf > > > > > > > > (which I found from here): > > > > http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table > > > > > > > > Perhaps we should email this guy Hugh to see what he thinks about this one :-) > > > > > > > > -steve > > > > > > > > -- > > > > Steve Lianoglou > > > > Computational Biologist > > > > Bioinformatics and Computational Biology > > > > Genentech > > > > > > > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eduard.antonyan at gmail.com Wed Nov 6 17:34:18 2013 From: eduard.antonyan at gmail.com (Eduard Antonyan) Date: Wed, 6 Nov 2013 10:34:18 -0600 Subject: [datatable-help] Unexpected behavior in setnames() In-Reply-To: <9F7DC50A9B2C470C952973F162105BC4@gmail.com> References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com> Message-ID: You mean what would be the problem? Well, if the user fread's that data, then modifies e.g. 
non-duplicate columns and then tries to write.csv it back - how would the user recover the original names for correctly writing the data back if we renamed the columns? On Wed, Nov 6, 2013 at 10:10 AM, wrote: > Eddi, > Nice! But what exactly will happen to that data, if we were to > automatically set unique names while loading it (using ?freed?) (and issue > a warning)?? > > Arun > > On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote: > > Last comment here has an example of using duplicated names - > http://stackoverflow.com/a/19809942/817778 - it's very similar to the one > I mentioned earlier. > > > On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil wrote: > > FWIW, data.frame does allow duplicate names as well. In the light that > data.table inherits from data.frame, I would expect that it follows same > convention as data.frame. > > > On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan > wrote: > > @Arun: Ok. Thinking about it a bit - I don't like the continuing > enumeration solution because it makes the results too unpredictable, but > could live with adding a ".1" etc. Which I assume is the idea anyway for > resolving duplicates elsewhere. > > @Steve: Not sure why you think it doesn't hold much water - I think I can > draw a parallel argument that replicates all of the duplicated names > concerns with a column that is called e.g. `dt$V1` (imagine forgetting the > backticks there and the world of hurt that potentially awaits once you do > that). I am also curious what Matthew would think about this. This is smth > I've encountered and dealt with a lot, so I'm certainly not an unbiased > party here. > > > On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou wrote: > > On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan > wrote: > > Tbh I don't see why data presentation and preservation (i.e. if you're > > reading in data with duplicated columns) is not enough of a use case - > > that's the only reason we allow arbitrary symbols in column names. 
> > > > So, instead of giving you another use case, how about you tell me instead > > what do you propose should happen here (instead of what happens now): > > > >> dt = data.table(1, 2) > >> dt > > V1 V2 > > 1: 1 2 > >> dt[, sum(V2), by = V1] > > V1 V1 > > 1: 1 2 > > Only Matthew could say for sure, but if I were a gambling man I'd bet > that this was likely something that slipped through the cracks and > sleeping dogs were left to lie. I'd be curious to see what his > opinions on this are. > > IMHO the "data presentation" argument doesn't really hold much water. > > As for "data preservation," I rather see it as imposing structure on > it to enable efficient -- and sane/unambigous -- computation over it. > Further, I don't think is a preservation issue at all -- no data is > lost. The original data is still there in the file that was loaded > into R. The name of a column is changed when imported (with adequate > warning) into a data.table so that the user can slice and dice it. I'd > also guess the user being warned by the duplicate names would most > likely be happy to receive the warning, but the fact that you disagree > suggests that this isn't an obvious conclusion ;-) > > I'm curious if you would argue for an SQL table to allow duplicate > column names for the same reasons? 
I do know you can torture SQL to
> get two colnames to be the same by aliasing, but this also seems to
> have slipped through as an accident:
>
> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
>
> (which I found from here):
>
> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
>
> Perhaps we should email this guy Hugh to see what he thinks about this one
> :-)
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech

From aragorn168b at gmail.com Wed Nov 6 23:50:39 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 6 Nov 2013 23:50:39 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID:

Eddi,

1) We can still allow duplicate names in "fread" and during creation of a data.table with the data.table() command.
2) There's really no loss of data, as we can allow "setnames" to set duplicate names or unduplicate them (and users have the data anyway, since they load it into R using fread). There's therefore no *real* loss of data.
3) The point is to decide where duplicate names are allowed and where they should give an error.
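Point 2 can be sketched in base R: remember the header as loaded, work with unique names, then put the original names back before writing. This is only an illustration on a plain data.frame with made-up variable names; it is not a description of how fread or setnames() behave.

```r
# A header with a duplicated name; check.names = FALSE keeps it as-is
d <- data.frame(x = 1:3, x = 4:6, y = c(1, 1, 2), check.names = FALSE)

orig_names <- names(d)                # remember the header as loaded
names(d) <- make.unique(orig_names)   # "x", "x.1", "y": safe to compute on

d$y <- d$y * 10                       # modify a non-duplicated column

names(d) <- orig_names                # restore the header before writing back
names(d)
# "x" "x" "y"
```

Keeping orig_names around is the whole trick: the renaming is reversible as long as the column order does not change in between.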
As I said before, I think it's essential to allow duplicate names while loading a file (and therefore, for consistency, during creation of a data.table as well). However, all grouping/aggregating/subsetting etc. where ambiguity can arise should end in an error. At least this is my stance so far. Are we agreeing on this?

Arun

On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote: > You mean what would be the problem? > > Well, if the user fread's that data, then modifies e.g. non-duplicate columns and then tries to write.csv it back - how would the user recover the original names for correctly writing the data back if we renamed the columns? > > > On Wed, Nov 6, 2013 at 10:10 AM, wrote: > > Eddi, > > Nice! But what exactly will happen to that data, if we were to automatically set unique names while loading it (using "fread") (and issue a warning)? > > > > Arun > > > > > > On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote: > > > > > Last comment here has an example of using duplicated names - http://stackoverflow.com/a/19809942/817778 - it's very similar to the one I mentioned earlier. > > > > > > > > > On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil wrote: > > > > FWIW, data.frame does allow duplicate names as well. In the light that data.table inherits from data.frame, I would expect that it follows same convention as data.frame. > > > > > > > > > > > > On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan wrote: > > > > > @Arun: Ok. Thinking about it a bit - I don't like the continuing enumeration solution because it makes the results too unpredictable, but could live with adding a ".1" etc. Which I assume is the idea anyway for resolving duplicates elsewhere. > > > > > > > > > > @Steve: Not sure why you think it doesn't hold much water - I think I can draw a parallel argument that replicates all of the duplicated names concerns with a column that is called e.g.
`dt$V1` (imagine forgetting the backticks there and the world of hurt that potentially awaits once you do that). I am also curious what Matthew would think about this. This is smth I've encountered and dealt with a lot, so I'm certainly not an unbiased party here. > > > > > > > > > > > > > > > On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou wrote: > > > > > > On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan > > > > > > wrote: > > > > > > > Tbh I don't see why data presentation and preservation (i.e. if you're > > > > > > > reading in data with duplicated columns) is not enough of a use case - > > > > > > > that's the only reason we allow arbitrary symbols in column names. > > > > > > > > > > > > > > So, instead of giving you another use case, how about you tell me instead > > > > > > > what do you propose should happen here (instead of what happens now): > > > > > > > > > > > > > >> dt = data.table(1, 2) > > > > > > >> dt > > > > > > > V1 V2 > > > > > > > 1: 1 2 > > > > > > >> dt[, sum(V2), by = V1] > > > > > > > V1 V1 > > > > > > > 1: 1 2 > > > > > > > > > > > > Only Matthew could say for sure, but if I were a gambling man I'd bet > > > > > > that this was likely something that slipped through the cracks and > > > > > > sleeping dogs were left to lie. I'd be curious to see what his > > > > > > opinions on this are. > > > > > > > > > > > > IMHO the "data presentation" argument doesn't really hold much water. > > > > > > > > > > > > As for "data preservation," I rather see it as imposing structure on > > > > > > it to enable efficient -- and sane/unambigous -- computation over it. > > > > > > Further, I don't think is a preservation issue at all -- no data is > > > > > > lost. The original data is still there in the file that was loaded > > > > > > into R. The name of a column is changed when imported (with adequate > > > > > > warning) into a data.table so that the user can slice and dice it. 
I'd > > > > > > also guess the user being warned by the duplicate names would most > > > > > > likely be happy to receive the warning, but the fact that you disagree > > > > > > suggests that this isn't an obvious conclusion ;-) > > > > > > > > > > > > I'm curious if you would argue for an SQL table to allow duplicate > > > > > > column names for the same reasons? I do know you can torture SQL to > > > > > > get two colnames to be the same by aliasing, but this also seems to > > > > > > have slipped through as an accident: > > > > > > > > > > > > http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf > > > > > > > > > > > > (which I found from here): > > > > > > http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table > > > > > > > > > > > > Perhaps we should email this guy Hugh to see what he thinks about this one :-) > > > > > > > > > > > > -steve > > > > > > > > > > > > -- > > > > > > Steve Lianoglou > > > > > > Computational Biologist > > > > > > Bioinformatics and Computational Biology > > > > > > Genentech > > > > > > > > > > > > > > > _______________________________________________ > > > > > datatable-help mailing list > > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From lianoglou.steve at gene.com Thu Nov 7 00:01:05 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Wed, 6 Nov 2013 15:01:05 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: 
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: 

On Wed, Nov 6, 2013 at 2:50 PM, Arunkumar Srinivasan wrote:
> Eddi,
>
> 1) We can still allow duplicate names in "fread" and during creation of
> data.table with the data.table() command.
> 2) There's really no loss of data as we can allow "setnames" to set
> duplicate names/unduplicate them (and they anyways have the data as they
> load that into R using fread). There's therefore no *real* loss of data.
> 3) The point is to decide upon where duplicate names are allowed and where
> it should give an error?
>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc.. where ambiguity
> can arise should end in error. At least this is my stance so far. Are we
> agreeing on this?

Add "evaluation in `j`" to the things you want to throw an error, and I guess I'm ok w/ Arun's stance, too, since I guess we should stay as close to data.frame as possible (even though I think it's still "wrong" to have duplicate column names in principle).

I guess a more clever handling of setnames needs to happen too, as it fails if the target data.table has any duplicate names (I'm assuming this has come up already, but I'm only half-tuned-in to this discussion)

I also think that the output of the aggregation example Eddi used earlier should be changed, ie:

R> x <- data.table(V1=sample(letters[1:3], 10, rep=TRUE), B=rnorm(10))
R> x[, sum(B), by=V1]
   V1         V1
1:  b -0.8581098
2:  a  0.8762710
3:  c  1.3274762

Just feels wrong for the `sum`ed column to also be V1, but maybe this is an FR for another day.

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com Thu Nov 7 00:04:51 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 6 Nov 2013 17:04:51 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: 
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: 

> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc.. where
> ambiguity can arise should end in error. At least this is my stance so far.
> Are we agreeing on this?

Sounds good to me.

On Wed, Nov 6, 2013 at 4:50 PM, Arunkumar Srinivasan wrote:

> Eddi,
>
> 1) We can still allow duplicate names in "fread" and during creation of
> data.table with the data.table() command.
> 2) There's really no loss of data as we can allow "setnames" to set
> duplicate names/unduplicate them (and they anyways have the data as they
> load that into R using fread). There's therefore no *real* loss of data.
> 3) The point is to decide upon where duplicate names are allowed and where
> it should give an error?
>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc.. where
> ambiguity can arise should end in error. At least this is my stance so far.
> Are we agreeing on this?
>
> Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From simon.ohanlon at imperial.ac.uk Fri Nov 8 14:30:55 2013
From: simon.ohanlon at imperial.ac.uk (Simon O'Hanlon)
Date: Fri, 8 Nov 2013 13:30:55 +0000 (UTC)
Subject: [datatable-help] Unexpected behavior in setnames()
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: 

Eduard Antonyan writes:

> > As I said before, I think it's essential to allow duplicate names while loading a file (and therefore for consistency during creation of data.table as well). However, all grouping/aggregating/subsetting etc.. where ambiguity can arise should end in error. At least this is my stance so far. Are we agreeing on this?
>
> Sounds good to me.

I am not particularly opposed or otherwise, to duplicate column names, although I do see the issues that creates.

I think that whatever you, as custodians of data.table decide with respect to column names, the behaviour of numeric indices to indicate columns included in .SD needs to be fixed when duplicate column names are present. As a user I'd expect the following to return two columns with the values 2 and 6 respectively:

Example:

dt <- data.table( 1,2,3,4 )
setnames(dt , rep( c("a", "b") , 2 ) )
   a b a b
1: 1 2 3 4

dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
   a a
1: 2 2

I hope that contributes in some small way to your decision making process. This is lifted from a question I asked on Stack Overflow here:

http://stackoverflow.com/questions/19811644/can-data-table-handle-identical-column-names-when-using-sdcols

Thanks,

Simon

From lianoglou.steve at gene.com Fri Nov 8 15:03:12 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 06:03:12 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: 
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: 

Hi Simon,

On Fri, Nov 8, 2013 at 5:30 AM, Simon O'Hanlon wrote:
> I am not particularly opposed or otherwise, to duplicate column names,
> although I do see the issues that creates.
>
> I think that whatever you, as custodians of data.table decide with respect
> to column names, the behaviour of numeric indices to indicate columns
> included in .SD needs to be fixed when duplicate column names are present.
> As a user I'd expect the following to return two columns with the values 2
> and 6 respectively:
>
> Example:
>
> dt <- data.table( 1,2,3,4 )
> setnames(dt , rep( c("a", "b") , 2 ) )
>    a b a b
> 1: 1 2 3 4
>
> dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
>    a a
> 1: 2 2
>
> I hope that contributes in some small way to your decision making process.
> This is lifted from a question I asked on Stack Overflow here;
>
> http://stackoverflow.com/questions/19811644/can-data-table-handle-identical-column-names-when-using-sdcols

I agree -- when using numeric columns, this is clearly wrong and I would expect an answer of 2 and 6.

I'm curious what you think, however, when you use the names of the columns in .SDcols

If you ask .SDcols="a" would you expect the first "a" column to be used, or all of them? To use all of them, would you expect to use .SDcols=c('a', 'a')?

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From simon.ohanlon at imperial.ac.uk Fri Nov 8 15:47:13 2013
From: simon.ohanlon at imperial.ac.uk (Simon O'Hanlon)
Date: Fri, 8 Nov 2013 14:47:13 +0000 (UTC)
Subject: [datatable-help] Unexpected behavior in setnames()
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: 

Steve Lianoglou writes:

> > As a user I'd expect the following to return two columns with the values 2
> > and 6 respectively:
> >
> > Example:
> >
> > dt <- data.table( 1,2,3,4 )
> > setnames(dt , rep( c("a", "b") , 2 ) )
> >    a b a b
> > 1: 1 2 3 4
> >
> > dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
> >    a a
> > 1: 2 2
>
> I agree -- when using numeric columns, this is clearly wrong and I
> would expect an answer of 2 and 6.
>
> I'm curious what you think, however, when you use the names of the
> columns in .SDcols
>
> If you ask .SDcols="a" would you expect the first "a" column to be
> used, or all of them? To use all of them, would you expect to use
> .SDcols=c('a', 'a')?
>
> -steve

Hi Steve,

That I guess is the big question. Approaching it from the point of view that duplicate column names are allowed...
If I use, from the above example, .SDcols = "a" there are a number of things that *could* happen:

1) data.table ignores dupe names and uses the first such matching column up to the number of times that name appears and gives no warning (as I understand it, current behaviour and probably least desirable IMHO).

2) As above with a warning - least work from a developer standpoint I guess!

3) both columns are used piece-wise from left to right and have a unique suffix appended with a warning that this occurred due to duplicate column names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however there is the complication that you then need to ensure you are not creating a new duplicate from an existing column name). This precludes you from referring to a specific column name in the j function though (but this could be part of the warning forcing a user to give a column a unique name if they want to refer to it directly)

4) Most work/most flexible(?); On instantiation all columns in a data.table have a hidden attribute created that is a unique column name, which may be referred to in the j with an accessor function, for example "a" and "a" could be differentiated as .(a.1) and .(a.2) but return results under "a" and "a". There would also need to be a function to view the mapping of printed names to the unique attribute names, e.g. colnames( dt , include.hidden = TRUE ) then returns a list of the column names and the underlying unique names allowing a 'power-user' to refer to duplicate column names with a unique identifier using the accessor function. IMHO this is a huge amount of work, probably unsafe and prone to many bugs. Not sure I'd even attempt it, but thought it worth bringing up.
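
For illustration, a minimal base R sketch of the suffixing idea in option 3, assuming the suffixes were generated with `make.unique()`. The column names are made up for the example, and note that `make.unique()` leaves the first occurrence unsuffixed, so it gives "a" "a.1" rather than the "a.1" "a.2" suggested above. It also covers the complication mentioned in the parenthesis, since it will not generate a name that already exists:

```r
# Illustrative names only: two "a" columns and two "b" columns.
nms <- c("a", "b", "a", "b")

# make.unique() keeps the first occurrence and suffixes the later ones.
unique_nms <- make.unique(nms)
print(unique_nms)

# A pre-existing "a.1" must not collide with a generated suffix.
tricky <- c("a", "a", "a.1")
unique_tricky <- make.unique(tricky)
print(unique_tricky)

# Whatever suffixes are chosen, the result contains no duplicates.
stopifnot(anyDuplicated(unique_tricky) == 0L)
```

This only shows how unique names could be produced; it does not address the mapping problem for expressions in `j`.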

In conclusion my vote would be for current behaviour but with a warning about needing to set unique column names for calculations, or using numeric indices, in which case the handling of numeric indices should probably be "fixed" (I use that loosely because one might argue that it is not broken it just doesn't do what one might intuitively expect!).

From aragorn168b at gmail.com Fri Nov 8 16:09:12 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 16:09:12 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: 
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: <9996C9517A244BFF81E353BEC9CD130C@gmail.com>

Simon,

I've replied to your last post inline:

> 1) data.table ignores dupe names and uses the first such matching column up
> to the number of times that name appears and gives no warning (as I
> understand it, current behaviour and probably least desirable IMHO).

FYI, this is what data.frame does:

DF <- data.frame(x=1:5, x=6:10, check.names=FALSE)
DF[, c("x")]
DF[, c("x", "x")]

In fact, while doing this subsetting, it automatically makes the column names unique. *Admittedly, DF[, 1:2] gives the right columns, but still the names are made unique.*

> 3) both columns are used piece-wise from left to right and have a unique
> suffix appended with a warning that this occured due to duplicate column
> names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however
> there is the complication that you then need to ensure you are not creating
> a new duplicate from an existing column name). This precludes you from
> referring to a specific column name in the j function though (but this could
> be part of the warning forcing a user to give a column a unique name if
> they want to refer to it directly)

This'll be a problem to evaluate expressions in `j`.
Suppose you've:

DT <- data.table(x=1:5, x=6:10, ID=1:5)

And you do: DT[, list(x=x*2), by=ID], then, while creating the data.table DT, the names are not changed (or so far, the consensus is not to). So, if during an operation, we were to change the dup names to unique names, we'll have trouble in mapping expressions in `j` accordingly. Note that even if we didn't, this expression is ill-posed. Also think about the `setkey` function.

> 4) Most work/most flexible(?); On instantiation all columns in a data.table
> have an hidden attribute created that is a unique column name, which may be
> referred to in the j with an accessor function, for example "a" and "a"
> could be differentiated as .(a.1) and .(a.2) but return results under "a"
> and "a". There would also need to be a function to view the mapping of
> printed names to the unique attribute names, e.g. colnames( dt ,
> include.hidden = TRUE ) then returns a list of the column names and the
> underlying unique names allowing a 'power-user' to refer to duplicate column
> names with a unique identifier using the accessor function. IMHO opinion
> this is a huge amount of work, probably unsafe and prone to many bugs. Not
> sure I'd even attempt it, but thought it worth bringing up.

I agree with your conclusion. This is not even feasible, as the mapping is ill-posed. If the expression in `j` contains only one of the duplicate columns, which one would you map it to: .(a.1) or .(a.2)?

> In conclusion my vote would be for current behaviour but with a warning
> about needing to set unique column names for calculations, or using numeric
> indices, in which case the handling of numeric indices should probably be
> "fixed" (I use that loosely because one might argue that it is not broken it
> just doesn't do what one might intuitively expect!).

In my opinion, the dup-names should be allowed *only* during creation of data.table, and setting names (using `setnames`, `setattr` or the bad form `names(dt) <- `).
Other than that, *ALL* operations should fail (end up in error), and that includes subsetting operation. The `setnames` gives the option for the user to set the names back before writing to a file, should he choose to keep it at the end.

I think it's much better this way (strict, but avoids confusion). For example, in data.frames, doing DF$x (when x occurs twice) implicitly prints only the first (no warning/error). Also, split(DF$x, DF$x) uses the first column and so does split(DF, DF$x).

Arun
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lianoglou.steve at gene.com Fri Nov 8 21:02:07 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:02:07 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com> <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
Message-ID: 

Hi,

I wanted to point out that I'm in Arun's camp on this one:

On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan wrote:

> In my opinion, the dup-names should be allowed *only* during creation of
> data.table, and setting names (using `setnames`, `setattr` or the bad form
> `names(dt) <- `). Other than that, *ALL* operations should fail (end up in
> error), and that includes subsetting operation. The `setnames` gives the
> option for the user to set the names back before writing to a file, should
> he choose to keep it at the end.
>
> I think it's much better this way (strict, but avoids confusion). For
> example, in data.frames, doing DF$x (when x occurs twice) implicitly prints
> only the first (no warning/error). Also, split(DF$x, DF$x) uses the first
> column and so does split(DF, DF$x).
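
The data.frame behaviour described in the quoted paragraph can be checked in base R. A small sketch, reusing the `DF` from earlier in the thread; nothing here involves data.table, and the variable names other than `DF` are only for the example:

```r
# Two columns share the name "x"; check.names = FALSE keeps the duplicate.
DF <- data.frame(x = 1:5, x = 6:10, check.names = FALSE)
print(names(DF))

# `$` silently returns the first matching column, with no warning.
first_x <- DF$x
print(first_x)

# Selecting by name twice picks the first "x" both times, and the
# names of the result are made unique.
by_name <- DF[, c("x", "x")]
print(names(by_name))

# Selecting by position does reach the second column.
second_x <- DF[[2]]
print(second_x)
```

So in a data.frame the second "x" is reachable only by position, which is the ambiguity the strict approach is meant to surface as an error.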

As an opinionated footnote: I can acquiesce that since data.frames allow duplicated column names, I *guess* data.table should *allow* them, however as is clear (to me) from this long chain of "possibilities" that one can do, I strongly feel that computing over a data.table w/ duplicated columns is a fundamentally broken idea as it is ambiguous as to what the right behavior should be ... forget about even the (surely fun) book-keeping code required to make it happen.

You want to import a table with duplicate names? Fine (we should warn on import if it was `fread` or `as.data.table`d).

You want to set some names to duplicates? Fine -- warn there too.

Want to do any computation inside the data.table via `j` or as a column in `by`? Throw an error and punt the problem to the user to figure out how they would like to disambiguate the first column named "a" from the 10th one -- I don't think we need another FAQ explaining what "the right" way that this should be done is, and why we picked it.

Or if you really want to compute over a data.table with duplicate names, you might be better served by having the table in "long" format -- perhaps that's why there are duplicate column names to begin with (I'm guessing -- I still don't think I would ever want to have duped names on purpose)

My two cents,

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com Fri Nov 8 21:08:05 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 8 Nov 2013 14:08:05 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: 
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com> <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
Message-ID: 

Ditto - having dups, but spitting out an error on all ambiguous operations seems like a robust strategy.
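
The strategy discussed here (let duplicate names exist, but refuse to compute when a column reference is ambiguous) can be sketched in base R. `check_unambiguous()` is a hypothetical helper written only for this illustration, not a data.table function, and a plain data.frame stands in for a data.table:

```r
# Hypothetical helper: error if any requested column name matches more
# than one column of `x`; otherwise return the names invisibly.
check_unambiguous <- function(x, cols) {
  counts <- vapply(cols, function(nm) sum(names(x) == nm), integer(1))
  ambiguous <- cols[counts > 1L]
  if (length(ambiguous) > 0L) {
    stop("ambiguous column name(s): ",
         paste(unique(ambiguous), collapse = ", "))
  }
  invisible(cols)
}

# Duplicate names are allowed at creation time ...
DF <- data.frame(a = 1, b = 2, a = 3, check.names = FALSE)

# ... an unambiguous reference passes ...
ok <- check_unambiguous(DF, "b")

# ... and an ambiguous one is turned into an error, captured here
# as a message so the script keeps running.
res <- tryCatch(check_unambiguous(DF, "a"),
                error = function(e) conditionMessage(e))
print(res)
```

A check of this kind would have to run before grouping, subsetting by name, or evaluating `j`; where exactly it belongs inside data.table is not something this sketch tries to answer.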
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lianoglou.steve at gene.com Fri Nov 8 21:16:05 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:16:05 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: 
References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com> <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
Message-ID: 

Wow ... did we just reach a consensus? :-)

-steve

On Fri, Nov 8, 2013 at 12:08 PM, Eduard Antonyan wrote:
> Ditto - having dups, but spitting out an error on all ambiguous operations
> seems like a robust strategy.
>
>
> On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou
> wrote:
>>
>> Hi,
>>
>> I wanted to point out that I'm in Arun's camp on this one:
>>
>> On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
>> wrote:
>>
>> > In my opinion, the dup-names should be allowed *only* during creation of
>> > data.table, and setting names (using `setnames`, `setattr` or the bad
>> > form
>> > `names(dt) <- `). Other than that, *ALL* operations should fail (end up
>> > in
>> > error), and that includes subsetting operation.
The `setnames` gives the >> > option for the user to set the names back before writing to a file, >> > should >> > he choose to keep it at the end. >> > >> > I think it's much better this way (strict, but avoids confusion). For >> > example, in data.frames, doing DF$x (when x occurs twice) implicitly >> > prints >> > only the first (no warning/error). Also, split(DF$x, DF$x) uses the >> > first >> > column and so does split(DF, DF$x). >> >> As an opinionated footnote: I can acquiesce that since data.frames >> allow duplicated column names, I *guess* data.table should *allow* >> them, however as is clear (to me) from this long chain of >> "possibilities" that one can do, I strongly feel that computing over a >> data.table w/ duplicated columns is a fundamentally broken idea as it >> is ambiguous as to what the right behavior should be ... forget about >> even the (surely fun) book-keeping code required to make it happen. >> >> You want to import a table with duplicate names? Fine (we should warn >> on import if it was `fread` or `as.data.table`d). >> >> You want to set some names to duplicates? Fine -- warn there too. >> >> Want to do any computation inside the data.table via `j` or as a >> column in `by`? Throw an error and punt the problem to the user to >> figure out how they would like to disambiguate the first column named >> "a" from the 10th one -- I don't think we need another FAQ explaining >> what "the right" way that this should be done is, and why we picked >> it. 
>> >> Or if you really want to compute over a data.table with duplicate >> names, you might be better served by having the table in "long" format >> -- perhaps that's why there are duplicate column names to begin with >> (I'm guessing -- I still don't think I would ever want to have duped >> names on purpose) >> >> My two cents, >> >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Bioinformatics and Computational Biology >> Genentech >> _______________________________________________ >> datatable-help mailing list >> datatable-help at lists.r-forge.r-project.org >> >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help -- Steve Lianoglou Computational Biologist Bioinformatics and Computational Biology Genentech From aragorn168b at gmail.com Fri Nov 8 21:19:38 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Fri, 8 Nov 2013 21:19:38 +0100 Subject: [datatable-help] Unexpected behavior in setnames() In-Reply-To: References: <957B1243714142278898647650EBF386@gmail.com> <5E98018F047943DE89849EC57A7CF72A@gmail.com> <94F078AB544B4757A58049C7DB7433AB@gmail.com> <9F7DC50A9B2C470C952973F162105BC4@gmail.com> <9996C9517A244BFF81E353BEC9CD130C@gmail.com> Message-ID: <41982B8ACC9E424786C3E5816C4D3D80@gmail.com> Steve, Maybe, but it's just getting started :) - we now have to decide what's ambiguous! Ex: Is subsetting by column number considered ambiguous (By definition of ambiguous, probably not)? But then it'd be inconsistent with subsetting when column names are provided.. So, should we prioritise consistency over function in this scenario? Arun On Friday, November 8, 2013 at 9:16 PM, Steve Lianoglou wrote: > Wow ... did we just reach a consensus? 
:-) > > -steve > > On Fri, Nov 8, 2013 at 12:08 PM, Eduard Antonyan > wrote: > > Ditto - having dups, but spitting out an error on all ambiguous operations > > seems like a robust strategy. > > > > > > On Fri, Nov 8, 2013 at 2:02 PM, Steve Lianoglou > > wrote: > > > > > > Hi, > > > > > > I wanted to point out that I'm in Arun's camp on this one: > > > > > > On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan > > > wrote: > > > > > > > In my opinion, the dup-names should be allowed *only* during creation of > > > > data.table, and setting names (using `setnames`, `setattr` or the bad > > > > form > > > > `names(dt) <- `). Other than that, *ALL* operations should fail (end up > > > > in > > > > error), and that includes subsetting operation. The `setnames` gives the > > > > option for the user to set the names back before writing to a file, > > > > should > > > > he choose to keep it at the end. > > > > > > > > I think it's much better this way (strict, but avoids confusion). For > > > > example, in data.frames, doing DF$x (when x occurs twice) implicitly > > > > prints > > > > only the first (no warning/error). Also, split(DF$x, DF$x) uses the > > > > first > > > > column and so does split(DF, DF$x). > > > > > > > > > > > > > As an opinionated footnote: I can acquiesce that since data.frames > > > allow duplicated column names, I *guess* data.table should *allow* > > > them, however as is clear (to me) from this long chain of > > > "possibilities" that one can do, I strongly feel that computing over a > > > data.table w/ duplicated columns is a fundamentally broken idea as it > > > is ambiguous as to what the right behavior should be ... forget about > > > even the (surely fun) book-keeping code required to make it happen. > > > > > > You want to import a table with duplicate names? Fine (we should warn > > > on import if it was `fread` or `as.data.table`d). > > > > > > You want to set some names to duplicates? Fine -- warn there too. 
> > > > > > Want to do any computation inside the data.table via `j` or as a > > > column in `by`? Throw an error and punt the problem to the user to > > > figure out how they would like to disambiguate the first column named > > > "a" from the 10th one -- I don't think we need another FAQ explaining > > > what "the right" way that this should be done is, and why we picked > > > it. > > > > > > Or if you really want to compute over a data.table with duplicate > > > names, you might be better served by having the table in "long" format > > > -- perhaps that's why there are duplicate column names to begin with > > > (I'm guessing -- I still don't think I would ever want to have duped > > > names on purpose) > > > > > > My two cents, > > > > > > -steve > > > > > > -- > > > Steve Lianoglou > > > Computational Biologist > > > Bioinformatics and Computational Biology > > > Genentech > > > _______________________________________________ > > > datatable-help mailing list > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > -- > Steve Lianoglou > Computational Biologist > Bioinformatics and Computational Biology > Genentech > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... 
From lianoglou.steve at gene.com Fri Nov 8 21:29:44 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:29:44 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
Message-ID:

> Steve,
> Maybe, but it's just getting started :) - we now have to decide what's
> ambiguous!

So close, yet so far ...

> Ex: Is subsetting by column number considered ambiguous (By definition of
> ambiguous, probably not)? But then it'd be inconsistent with subsetting when
> column names are provided.. So, should we prioritise consistency over
> function in this scenario?

Sorry, can you provide examples of each?

I'd imagine doing anything by column number is unambiguous, but I'm
not sure how you can subset by column index and by column name in a
"similar" fashion.

I mean dt[[1]] should work no matter what, dt[['a']] would work only
if there is only one column named 'a' ... but I don't think this is
what you are talking about?
Sorry if I'm being obtuse,

-steve

From aragorn168b at gmail.com Fri Nov 8 21:37:26 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:37:26 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
Message-ID: <2A4352C9C46747AAB3951B71B4942E47@gmail.com>

Sure, here's an example of what I was trying to explain:

Suppose:
DT <- data.table(x=1:5, y=1:5, x=6:10)

Then,

DT[, c(1,3), with=FALSE] # gives correct subset
DT[, c("x", "x"), with=FALSE] # gives column 1 twice - wrong
DT[, .SD, .SDcols=c("x", "x")] # gives column 1 twice - wrong result
DT[, .SD, .SDcols=c(1,3)] # gives column 1 twice - wrong result - but *should provide right result if "ambiguity" is the only concern.

Arun

From lianoglou.steve at gene.com Fri Nov 8 21:41:19 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:41:19 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
Message-ID:

My gut reaction is:

On Fri, Nov 8, 2013 at 12:37 PM, Arunkumar Srinivasan wrote:
> Sure, here's an example of what I was trying to explain:
>
> Suppose:
> DT <- data.table(x=1:5, y=1:5, x=6:10)
>
> Then,
>
> DT[, c(1,3), with=FALSE] # gives correct subset

This is "OK", we just do what the user asks, here, as they are being
very specific.

> DT[, c("x", "x"), with=FALSE] # gives column 1 twice - wrong

stop() -- we don't try to disambiguate (even if it "seems" specific)

> DT[, .SD, .SDcols=c("x", "x")] # gives column 1 twice - wrong result

stop()

Also stop() on DT[, ..., .SDcols="x"]

> DT[, .SD, .SDcols=c(1,3)] # gives column 1 twice - wrong result - but

Do what the user asks for.

No?
-steve

From aragorn168b at gmail.com Fri Nov 8 21:45:26 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:45:26 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
Message-ID: <40662613E7BB48CB9C57365D9C6522D2@gmail.com>

Oh I can certainly agree with that. I guess we'll have to make some changes
to the code to use index based subsetting when .SDcols or j-value is number
then.

Arun
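One way a caller can stay on the unambiguous side of this rule is to resolve a duplicated name to column positions first and then extract by position. A small sketch, assuming only that `data.table()` keeps duplicated names as in the DT example from this thread and that `[[` with a number selects by position (as noted for dt[[1]]); the variable names are illustrative:

```r
library(data.table)

DT <- data.table(x = 1:5, y = 1:5, x = 6:10)

# Every position whose name is "x"; with duplicates this has length > 1.
pos <- which(names(DT) == "x")

# Positional extraction is unambiguous even though the name is not.
first_x  <- DT[[pos[1]]]
second_x <- DT[[pos[2]]]
```

The design point is that the ambiguity is resolved once, explicitly, by the caller, rather than silently by whichever column happens to match first.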
From lianoglou.steve at gene.com Fri Nov 8 21:47:57 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:47:57 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
Message-ID:

On Fri, Nov 8, 2013 at 12:45 PM, Arunkumar Srinivasan wrote:
> Oh I can certainly agree with that. I guess we'll have to make some changes
> to the code to use index based subsetting when .SDcols or j-value is number
> then.

Not sure what you mean by j-value -- the examples you gave didn't
compute on the .SD, it just returned it.

I think if there is a `j` expression that computes on an .SD that has
duplicated colnames, I think we just stop().

Or did you mean something else?

-steve

From aragorn168b at gmail.com Fri Nov 8 21:53:18 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:53:18 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
Message-ID: <4CB7CC37475440B68A19220EF98C75EF@gmail.com>

Sorry, forget the j-value. For `.SDcols`, even when we provide integers
(column numbers), internally, we compute the column name and subset to get
`.SD`. And this'll have to change.

Arun
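The mechanism described here (positions converted to names, then names looked up again) is easy to reproduce outside data.table. The sketch below uses base R's `match()` to stand in for the name lookup; it illustrates why column 1 comes back twice and is not a copy of the actual data.table internals:

```r
# Column names with a duplicate, as in the DT example from the thread.
nm <- c("x", "y", "x")

wanted   <- c(1L, 3L)            # the user asks for columns 1 and 3
as_names <- nm[wanted]           # positions converted to names: "x" "x"
resolved <- match(as_names, nm)  # names looked up again: first match only
```

Here `resolved` no longer equals `wanted`, which is the information loss that keeping the integer positions all the way through would avoid.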
From lianoglou.steve at gene.com Fri Nov 8 21:56:33 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:56:33 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <4CB7CC37475440B68A19220EF98C75EF@gmail.com>
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
 <4CB7CC37475440B68A19220EF98C75EF@gmail.com>
Message-ID:

Right, agreed.

From eduard.antonyan at gmail.com Fri Nov 8 22:24:11 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 8 Nov 2013 15:24:11 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To:
References: <957B1243714142278898647650EBF386@gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
 <4CB7CC37475440B68A19220EF98C75EF@gmail.com>
Message-ID:

> I think if there is a `j` expression that computes on an .SD that has
> duplicated colnames, I think we just stop().

I'm not entirely sure what you mean by this.
The following *should* work imo:

dt = data.table(x = 1:10, x = 10:1)
dt[, lapply(.SD, sum)]

From aragorn168b at gmail.com Sat Nov 9 12:32:59 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 12:32:59 +0100
Subject: [datatable-help] FR #748 discussion
Message-ID: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>

Hey everybody,

I've been wanting to implement this for a while. I didn't know there was
a FR lying around. All the more good!
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=748&group_id=240&atid=978

It's about filling unavailable values during "join" with values other than
default NA. Ex:

require(data.table)
DT <- data.table(x=c(1,2,3,6), y="A", key="x")
DT[J(1:6)] # at the moment, all "y" with no match to key entry will be NA_character_

The FR: DT[J(1:6), fill = "bla"]

What I wanted to discuss is the handling of the "nomatch" parameter:
At the moment we've a "nomatch" parameter that takes values NA or 0.
NA being default and when it's 0, the no matches are *removed*.

So how do we allow the "fill" argument? I think "nomatch" should become
logical with TRUE and FALSE mimicking the old functionality of filling with
something or removing unavailable entries (that is, "nomatch=FALSE" = old
"nomatch=0"). And if "nomatch=TRUE", then the value of "fill" (default = NA)
will be used.

For backwards compatibility, "nomatch" will be TRUE (keep no matches) and
"fill=NA" (and assign them NA). Basically, "nomatch" has more priority than
"fill". If "nomatch=FALSE", "fill" is ignored.

Hm, do you find "nomatch=TRUE" as "keeping no matches" confusing? Maybe then
we'll have to change this to "keep.nomatch". I'm all ears for better ideas!
So, please weigh in.

Arun

From gsee000 at gmail.com Sat Nov 9 17:50:48 2013
From: gsee000 at gmail.com (G See)
Date: Sat, 9 Nov 2013 10:50:48 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
Message-ID:

Hi,

Please note the inconsistency between the behavior of rbind() and
rbindlist() below.

m1 <- as.data.table(mtcars)
m2 <- copy(m1)
rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT bind by name

What do you think about making them have the same behavior and/or
warning? Personally, I prefer the behavior of rbind(), and would
prefer to see a warning if column names are ignored like they are with
rbindlist().

Thanks,
Garrett

From eduard.antonyan at gmail.com Sat Nov 9 18:06:33 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 11:06:33 -0600
Subject: [datatable-help] FR #748 discussion
In-Reply-To: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
References: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
Message-ID:

Great, this is smth I was thinking about recently as well.
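For reference, the effect the FR asks for can be approximated with the existing `nomatch` argument plus an update of the unmatched rows. This is only a sketch of that workaround: `fill=` inside `[.data.table` is the proposal under discussion, not something assumed to exist here, and `as.numeric(1:6)` is used just so the lookup values have the same type as the key column.

```r
library(data.table)

DT <- data.table(x = c(1, 2, 3, 6), y = "A", key = "x")

# Default nomatch = NA: unmatched keys (4 and 5) are kept with y = NA.
res <- DT[J(as.numeric(1:6))]

# Manual "fill": replace the NA left by the join with a chosen value.
res[is.na(y), y := "bla"]

# nomatch = 0: unmatched keys are dropped instead.
inner <- DT[J(as.numeric(1:6)), nomatch = 0]
```

Note that this manual fill cannot tell an NA produced by a failed match from an NA that was already stored in `y`, which is one argument for handling the fill inside the join itself.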
I do find nomatch=TRUE/FALSE confusing and keep.nomatch is better in that
respect, but that's a lot more characters to type, which downgrades it for
me a lot.
I think in a world where I was designing it from scratch I would have
nomatch=NA do what it does now, nomatch=NULL do what 0 does now, and then
the rest of the values would fill. Not sure if this can be done smoothly
though in the world we live in.

From aragorn168b at gmail.com Sat Nov 9 18:08:47 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 18:08:47 +0100
Subject: [datatable-help] FR #748 discussion
In-Reply-To:
References: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
Message-ID:

Eddi,
Gabor's suggestions here:
http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001738.html
are quite nice! We did agree with those changes at the time :). I'll have to
write to Matthew to get his feedback.

Arun

From eduard.antonyan at gmail.com Sat Nov 9 18:29:39 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 11:29:39 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To:
References:
Message-ID:

Fyi, it's not well documented, but setting use.names=FALSE in rbind would
replicate rbindlist behavior.
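To make the by-name versus by-position difference concrete, here is a small sketch. It assumes a data.table release in which `rbindlist()` itself accepts a `use.names` argument, which is not the version being discussed in this thread; the tables `a` and `b` are made up for illustration.

```r
library(data.table)

a <- data.table(x = 1L, y = 10L)
b <- data.table(y = 20L, x = 2L)   # same columns, different order

# Bind by name: b's x goes under a's x, b's y under a's y.
by_name <- rbindlist(list(a, b), use.names = TRUE)

# Bind by position: b's first column (y) is stacked under a's first (x).
by_pos <- rbindlist(list(a, b), use.names = FALSE)
```

Both results carry the column names of the first table, so a by-position bind of reordered inputs gives no visible hint that values ended up under the wrong name.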
I think it's a reasonable FR - if/when all of rbind code goes into C, it would be trivial to add.

From aragorn168b at gmail.com Sat Nov 9 18:33:49 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 18:33:49 +0100
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To:
References:
Message-ID:

GSee,
I find this a bit confusing at the moment as well - the convergence of "rbind" and "rbindlist" and therefore the future of "rbindlist".

`rbindlist` gained speed (to some extent) by assuming things like this and skipping checks in the first place. So, should we include checks like this? Also, if "rbind" and/or "rbindlist" are made to do the exact same thing, then, what's the purpose of "rbindlist"?

Any thoughts?

Arun
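To make the by-name versus by-position difference in this thread concrete, here is a minimal sketch. It assumes a data.table release in which rbindlist() accepts a use.names argument (it did not at the time of these messages), and the tables a and b are made up for illustration.

```r
library(data.table)

# Two tables with the same columns in a different order (illustrative data).
a <- data.table(x = 1:2, y = c("a", "b"))
b <- data.table(y = c("c", "d"), x = 3:4)

# Bind by column name: x is matched with x, y with y.
by_name <- rbindlist(list(a, b), use.names = TRUE)

# Bind by position: column 1 of a is stacked on column 1 of b, so the
# integer and character values get mixed and the result is coerced.
by_pos <- rbindlist(list(a, b), use.names = FALSE)

print(by_name)
print(by_pos)
```

Passing use.names explicitly avoids depending on whatever the default happens to be in the installed version.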
From gsee000 at gmail.com Sat Nov 9 18:38:12 2013
From: gsee000 at gmail.com (G See)
Date: Sat, 9 Nov 2013 11:38:12 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To:
References:
Message-ID:

Isn't rbindlist(myList) faster than do.call(rbind, myList)?

Garrett
From aragorn168b at gmail.com Sat Nov 9 18:44:19 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 18:44:19 +0100
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To:
References:
Message-ID: <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>

I am not aware of the status now after eddi's recent edits. "rbindlist" initially only checked the type of the first data.table's columns. But now I guess with eddi's changes, it does look down the list and decide based on class hierarchy.
That is, if column 1 of dt1 is integer, but of dt2 is numeric, it's now "numeric", but before it was "integer". I guess this'll affect the speed. I've not done any benchmarking yet. But I'm guessing it'll be slower than at least the previous version.

Eddi, any thoughts on this?

Arun
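The integer/numeric case described above can be checked with a small sketch. It assumes a data.table release that includes the type-hierarchy change mentioned in this thread; dt1 and dt2 are illustrative names.

```r
library(data.table)

dt1 <- data.table(a = 1:2)        # integer column
dt2 <- data.table(a = c(2.5, 3))  # double ("numeric") column

# The result column takes the higher of the two types, so the
# fractional value 2.5 from dt2 is preserved rather than truncated.
out <- rbindlist(list(dt1, dt2))

print(class(out$a))
print(out)
```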
From gsee000 at gmail.com Sat Nov 9 18:49:13 2013
From: gsee000 at gmail.com (G See)
Date: Sat, 9 Nov 2013 11:49:13 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>
References: <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>
Message-ID:

I really meant that I thought that do.call(rbind, list(a, b)) would be slower than rbindlist(list(a, b)), e.g. when you don't know the length of the list of data.tables.
From eduard.antonyan at gmail.com Sat Nov 9 19:25:00 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 12:25:00 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To:
References: <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>
Message-ID:

Re speed: last I checked, the new rbindlist was about 5% slower in no-coercion cases and was quite a bit faster in cases where there was coercion.

do.call(rbind, ...) is indeed much slower than rbindlist, and even if .rbind.data.table took no time at all, it would still be much slower than rbindlist because of all the dispatching before it gets to .rbind.data.table.

That said, I'm pretty sure the current rbind is faster than the rbind in 1.8.10 in all cases.
From eduard.antonyan at gmail.com Sat Nov 9 19:55:36 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 12:55:36 -0600
Subject: [datatable-help] FR #748 discussion
In-Reply-To:
References: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
Message-ID:

Good point - I had forgotten about that, and I still do like that proposal! :)

On Sat, Nov 9, 2013 at 11:08 AM, Arunkumar Srinivasan wrote:
> Gabor's suggestions here:
> http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001738.html
> are quite nice! We did agree with those changes at the time :).
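For readers following the FR #748 discussion, the two existing "nomatch" behaviours it refers to can be sketched as follows. The table X and its columns are made up for illustration, and the proposed "fill" argument is deliberately not shown because it was only a proposal at this point.

```r
library(data.table)

# A keyed table; the key is what J() joins on.
X <- data.table(id = c(1L, 2L, 3L), v = c("a", "b", "c"), key = "id")

# Look up ids 2 and 4; id 4 has no match in X.
keep <- X[J(c(2L, 4L)), nomatch = NA]  # unmatched row is kept, v filled with NA
drop <- X[J(c(2L, 4L)), nomatch = 0L]  # unmatched row is removed

print(keep)
print(drop)
```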
From aragorn168b at gmail.com Sun Nov 10 12:43:55 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 10 Nov 2013 12:43:55 +0100
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving Gabor's post)
Message-ID:

Hi everyone,

To revive the discussion Gabor started here: http://r.789695.n4.nabble.com/Problem-with-FAQ-2-8-tt4668878.html and the (related, but subtly different) FR mnel filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978

Suppose you have:

require(data.table)
d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")

Then as Gabor points out: `d1[d2, id1]` should *not* result in an error, because FAQ 2.8 states (copied from Gabor's post linked above):

1. The scope of X's subset; i.e., X's column names.
2. The scope of each row of Y; i.e., Y's column names (join inherited scope)
...

In this case, the desired output for `d1[d2, id1]` should then be:

   id1 id1
1:   1   1
2:   2   2
3:   2   2
4:   4  NA

That's what I at least understand from what the documentation intends.

However, this recommends a subtle change to the current method of referring to columns, if we were to keep this idea. That is, consider the data.table "d3" as follows:

d3 <- copy(d2)
setnames(d3, names(d1))

Now, what should `d1[d3, id1]` give? The answer, I believe, is the same as `d1[d2, id1]`. Why? Because X's (here d1's) column names should be looked up first (as per FAQ 2.8). Therefore, corresponding to d2=c(1,2,4), the values for "id1" are c(1, (2,2), NA). Now, if the old behaviour is intended - here comes the "subtle change" - then one should do:

d1[d3, i.id1] # referring to i's variables with the "i." notation.

I've managed to implement the first part where X's columns are looked up so that `d1[d2, id1]` doesn't result in error.
However, I'd like to ensure that my understanding of the FAQ is right (and that the FAQ makes sense - it does to me).

Please let me know what you all think so that I can implement the second part and commit. This, I believe, will let us get a step closer to consistency in DT syntax.

Arun

From gsee000 at gmail.com Sun Nov 10 20:39:15 2013
From: gsee000 at gmail.com (G See)
Date: Sun, 10 Nov 2013 13:39:15 -0600
Subject: [datatable-help] lapply without anonymous function
Message-ID:

Hi,

I have a list of data.tables and I am trying to extract a subset from each of them. I can achieve what I want with this:

> L <- list(data.table(BOD), data.table(BOD))
> lapply(L, function(x) x[Time==3L])
[[1]]
   Time demand
1:    3     19

[[2]]
   Time demand
1:    3     19

However, I'd rather not have to create an anonymous function. I tried the below, but `[.data.frame` is being dispatched.

> lapply(L, "[", Time==3L)
Error in `[.data.frame`(x, i) : object 'Time' not found

Even if I am explicit, `[.data.table` does not get dispatched:

> lapply(L, data.table:::`[.data.table`, Time==3L)
Error in `[.data.frame`(x, i) : object 'Time' not found

I'm guessing this is due to where evaluation takes place. Is there an alternate syntax I should use?

Thanks,
Garrett

From eduard.antonyan at gmail.com Mon Nov 11 14:40:03 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 07:40:03 -0600
Subject: [datatable-help] lapply without anonymous function
In-Reply-To:
References:
Message-ID:

I think your last attempt's failure is a bug of the internal cedta function, but note that if it did work, it'd be more symbols to type than the anonymous function option :)
From eduard.antonyan at gmail.com Mon Nov 11 14:41:38 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 07:41:38 -0600
Subject: [datatable-help] lapply without anonymous function
In-Reply-To:
References:
Message-ID:

But I guess your second attempt should have worked - submit a bug report imo.
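One middle ground for the lapply question in this thread, offered only as a sketch: define the subsetting step once as a named helper and pass the value through lapply's extra arguments. The helper name pick_time and its argument time_value are hypothetical and not part of data.table; BOD is the dataset that ships with base R.

```r
library(data.table)

L <- list(data.table(BOD), data.table(BOD))

# Hypothetical helper: the `[` call happens inside a user-defined function,
# so Time is resolved as a column of dt by data.table's own method.
pick_time <- function(dt, time_value) dt[Time == time_value]

# No inline anonymous function at the call site.
res <- lapply(L, pick_time, time_value = 3)

print(res)
```

This does not address why the bare "[" form fails; it only keeps the call site short.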
From eduard.antonyan at gmail.com Mon Nov 11 14:45:38 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 07:45:38 -0600
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving Gabor's post)
In-Reply-To:
References:
Message-ID:

I haven't checked yet what it does currently but what you wrote makes perfect sense.
From aragorn168b at gmail.com Mon Nov 11 14:55:27 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 11 Nov 2013 14:55:27 +0100
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving Gabor's post)
In-Reply-To:
References:
Message-ID:

Eddi,

Thank you. However, I've realised something and made a slight change to the concept (at least I think that's the way to go).
Basically, if you've:

require(data.table)
d1 <- data.table(id1=c(1L, 2L, 2L, 3L), val=1:4, key="id1")

and you do:

d1[, print(id1), by=id1]
[1] 1
[1] 2
[1] 3

That is, while grouping, the grouping variable's length for every group remains 1 (when grouping using "by"). For id1=2, we don't get "2" two times. Going by the same logic, if we were to do:

d1[J(2), id1]
   id1 id1
1:   2   2

Here, the first "id1" is the grouping "id1" (from J(2)). Following FR #2693 from mnel, I've changed the names of J(.) when it has no names to resemble those of the key columns of "d1". The second "id1" is the value of d1's "id1" for "id1=2". And even though it's present 2 times, we print it only once. That is, it'll be identical to d1[, id1, by=id1], even though d1's "id1" is *not really* the grouping variable.

If we've to refer to i's columns, then we've to explicitly state "i.id1". That is, here, it would be:

d1[J(2), i.id1] # identical results, but i.id1 corresponds to data.table from J(2) with column name = id1

To illustrate the difference nicely, let's consider these data.tables:

d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
d3 <- copy(d2)
setnames(d3, names(d1))

d1[d2, list(id1)] # what Gabor's post highlighted should work (but it doesn't give 1,2,2,NA as pointed out in the earlier post)
   id1 id1
1:   1   1
2:   2   2
3:   4  NA

d1[d3, list(id1, i.id1)] # id1 refers to values from d1 and i.id1 to d3.
   id1 id1 i.id1
1:   1   1     1
2:   2   2     2
3:   4  NA     4

Note that for every (implicit) grouping value from d3, the only possible values for d1's grouping would be 1) identical to that of d3 or 2) NA.

Let me know what you guys think.

Arun
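As a point of comparison only, here is a sketch of how the two sides of a join can be referred to unambiguously. It assumes a data.table release that supports both the "x." and "i." prefixes in j and the on= argument, neither of which is confirmed by this thread; it does not reproduce the grouped ("by-without-by") output proposed above, since joins in later releases return one row per match.

```r
library(data.table)

d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
d3 <- copy(d2)
setnames(d3, names(d1))  # d3 now has columns id1 and val, like d1

# x.id1 is d1's value (NA where d3's row has no match in d1);
# i.id1 is d3's value for the same result row.
res <- d1[d3, list(x_id = x.id1, i_id = i.id1), on = "id1"]

print(res)
```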
From ggrothendieck at gmail.com Mon Nov 11 15:06:48 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 11 Nov 2013 09:06:48 -0500
Subject: [datatable-help] lapply without anonymous function
In-Reply-To:
References:
Message-ID:

On Sun, Nov 10, 2013 at 2:39 PM, G See wrote:
> I'm guessing this is due to where evaluation takes place. Is there an
> alternate syntax I should use?

subset works:

> lapply(L, subset, Time == 3L)
[[1]]
   Time demand
1:    3     19

[[2]]
   Time demand
1:    3     19

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From gsee000 at gmail.com Mon Nov 11 15:40:34 2013
From: gsee000 at gmail.com (G See)
Date: Mon, 11 Nov 2013 08:40:34 -0600
Subject: [datatable-help] lapply without anonymous function
In-Reply-To:
References:
Message-ID:

heh, after all my efforts to avoid subset(), it can be useful after all. :)

Bug report filed, per Eduard's suggestion.

From eduard.antonyan at gmail.com Mon Nov 11 16:53:05 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 09:53:05 -0600
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving Gabor's post)
In-Reply-To:
References:
Message-ID:

Everything looks good to me.
Note that there is also .BY[[1]], which one may also want to use in those examples (it is basically the same as i.id1).

From aragorn168b at gmail.com Mon Nov 11 16:55:04 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 11 Nov 2013 16:55:04 +0100
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving Gabor's post)

Great! I'll commit then and see how it goes! Yes, you're right about .BY[[1]]. But `i.id1` was already there - in the SDenv$.iSD part of the code.

Arun

From aragorn168b at gmail.com Wed Nov 13 22:24:41 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 13 Nov 2013 22:24:41 +0100
Subject: [datatable-help] FR #5072 reg.

Hi everybody,

Regarding FR #5072 here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5072&group_id=240&atid=975

Let's take two data.tables X and Y with key set to one column, "V1". data.table currently deals with Y[X] differently when Y is a factor and 1) X is a factor and 2) X is not a factor. Let me illustrate this:

case 1:
# X and Y are factors
require(data.table)
X <- data.table(V1=factor(c("A", "B", "C")))
Y <- data.table(V1=factor(c("B", "D", "E")), key="V1")

> Y[X] # X is a factor
   V1
1:  A
2:  B
3:  C

> Y[X]$V1
[1] A B C
Levels: A B C

** Note that when both X and Y are factors, only the levels of X are in the join'd result (no D/E).

case 2:
# X is **not** a factor
require(data.table)
X <- data.table(V1=c("A", "B", "C"))
Y <- data.table(V1=factor(c("B", "D", "E")), key="V1")

> Y[X] # X is not a factor
   V1
1: NA
2:  B
3: NA

> Y[X]$V1
[1] <NA> B    <NA>
Levels: B D E

** Note that the results have "NA" in them as the join is concerned with retaining levels from "Y".

The first question is: Why this difference? Should there be a difference between when X is or is not a factor? What do you guys think should be the intended result?

The side-effect comes during "merge" as it internally uses this principle (and hence FR #5072). For example:

> merge(X, Y, by="V1", all=TRUE)
   V1
1: NA
2: NA
3:  B
4:  D
5:  E

> merge(X, Y, by="V1", all=TRUE)$V1
[1] <NA> <NA> B    D    E
Levels: B D E

The second question is: Is this the intended result?

Arun

From eduard.antonyan at gmail.com Wed Nov 13 22:55:52 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 13 Nov 2013 15:55:52 -0600
Subject: [datatable-help] FR #5072 reg.

I think case 1 and case 2 should have the same output, and I think that the merge should combine factor levels similar to how rbind does.

Btw another issue about factors exists in rbind'ing the j-expression:

dt = data.table(a = 1:2)
dt[, factor('a', levels = letters[1:.I]), by = a]$V1
#[1] a a
#Levels: a

but if you print out the j-expression it's evident that factor information gets lost:

dt[, print(factor('a', levels = letters[1:.I])), by = a]
#[1] a
#Levels: a
#[1] a
#Levels: a b

From aragorn168b at gmail.com Thu Nov 14 13:09:01 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 13:09:01 +0100
Subject: [datatable-help] Bug report #5100 reg.

Hi everybody,

It'd be nice if you could weigh in on the bug report filed by Bill here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975

The gist of it is:

require(data.table)
DT <- data.table(x=1:5, y=6:10, z=11:15)
DT[, y] # returns a vector
DT[, "y", with=FALSE] # returns a data.table

The question from the bug report basically is: "why is it that in the first case, 'j' has only one column and we get a vector, but in the second case, we get a data.table?"

My question is: Is this behaviour okay or do you prefer that the first one returns a data.table as well or the second one (with "with=FALSE") returns a vector?

Thank you,
Arun

From eduard.antonyan at gmail.com Thu Nov 14 17:25:04 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Thu, 14 Nov 2013 10:25:04 -0600
Subject: [datatable-help] Bug report #5100 reg.

DT[, y] returning a vector is I think the only correct behavior, given the understanding of j-expression as something evaluated in the DT environment. If they want a data.table they should simply use DT[, list(y)] or DT[, data.table(y)].

I haven't thought about DT[, "y", with = FALSE] before as I pretty much never use that form, but I see an argument for it staying as is, because "y" and c("y") are the same and since we all presumably agree that DT[, c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with = FALSE] returned a different type that would mean inconsistent return types which makes life much harder for users (as evidenced by the periodic drop=FALSE questions that come up on SO).

Going back to DT[, y], note that y and list(y) actually produce *different* results (in e.g. base_env), so there is no type consistency issue there between DT[, y] and DT[, list(y, z)].
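
For anyone following along, the return types being discussed can be checked directly. This is only a minimal sketch using the DT from the bug report; the variable names v, d1, d2 and d3 are made up for illustration and are not from the thread.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

# j is evaluated as an expression inside DT, so a bare column name
# gives back the column itself, i.e. a plain vector
v <- DT[, y]

# wrapping the column in list() asks for a data.table instead
d1 <- DT[, list(y)]

# with = FALSE selects columns by name and returns a data.table,
# whether one name or several are given
d2 <- DT[, "y", with = FALSE]
d3 <- DT[, c("y", "z"), with = FALSE]

print(class(v))
print(class(d1))
print(class(d2))
print(names(d3))
```

The point of the sketch is just that the with = FALSE form has the same return type for one column as for two, which is the consistency argument made above.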

From aragorn168b at gmail.com Thu Nov 14 17:33:19 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 17:33:19 +0100
Subject: [datatable-help] Bug report #5100 reg.
Message-ID: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>

Eddi,

At the least, I think the documentation needs to be clearer on the use of "with=FALSE". It does feel inconsistent with the fact that "j" with a single column should return a vector. In data.frames, when "j" is given as column names and there is just one column name, a vector is returned unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[, c("x", "y")] will return a data.frame. So, it is inconsistent with data.frame here, I think.
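
The data.frame side of that comparison, as a minimal base R sketch. DF mirrors the DT from the bug report; the names one, two and kept are only for illustration.

```r
DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# a single column name is dropped to a vector by default (drop = TRUE)
one <- DF[, "y"]

# two column names stay a data.frame
two <- DF[, c("x", "y")]

# drop = FALSE keeps the single column as a data.frame
kept <- DF[, "y", drop = FALSE]

print(class(one))
print(class(two))
print(class(kept))
```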

Arun

From eduard.antonyan at gmail.com Thu Nov 14 17:39:25 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Thu, 14 Nov 2013 10:39:25 -0600
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
References: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>

I agree that it's inconsistent with data.frame, and imo that's a good thing. We don't replicate the drop argument, so it wouldn't be possible to return a data.table when with=FALSE and either way drop=TRUE by default is a bad design choice in data.frame and matrix (that is unlikely to change given R-core's attitude towards that type of a thing).

I'm always pro more and better documentation :)

From aragorn168b at gmail.com Thu Nov 14 17:46:50 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 17:46:50 +0100
Subject: [datatable-help] Bug report #5100 reg.
References: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
Message-ID: <862AB4459A55499EB1DA0AB24D04A890@gmail.com>

Glad that we agree on improving the documentation. However, I don't find it a sound argument that we deviate from data.frame because the design is bad, *when we inherit from data.frame*. The choice is already made! Too many such trivial inconsistencies pile up pretty quickly and could potentially result in a steep learning curve, as there is a different set of rules to be memorised.

The point that we "inherit from data.frame", *but* this, this, this.. and many other things are different, if it can't be avoided, should be *very clearly* documented (in the beginning, maybe as a cheat sheet) so that people aren't confused.

Arun

From aragorn168b at gmail.com Thu Nov 14 17:47:51 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 17:47:51 +0100
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <862AB4459A55499EB1DA0AB24D04A890@gmail.com>
References: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com> <862AB4459A55499EB1DA0AB24D04A890@gmail.com>
Message-ID: <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>

I'll try to make a list of places where data.table != data.frame operation.

Arun

From eduard.antonyan at gmail.com Thu Nov 14 17:59:09 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Thu, 14 Nov 2013 10:59:09 -0600
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>
References: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com> <862AB4459A55499EB1DA0AB24D04A890@gmail.com> <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>

Perhaps a simple sentence along the lines of "drop argument is absent and should be considered as FALSE when comparing with data.frame in with=FALSE mode" would suffice.
The fact that i-expression is a full-on data.table i-expression in with=FALSE mode will probably also cause inconsistencies. On Thu, Nov 14, 2013 at 10:47 AM, Arunkumar Srinivasan < aragorn168b at gmail.com> wrote: > I'll try to make a list of places where data.table != data.frame > operation. > > Arun > > On Thursday, November 14, 2013 at 5:46 PM, Arunkumar Srinivasan wrote: > > Glad that we agree on better-ing the documentation. However, I don't > find it a sound argument that we deviate from data.frame because the design > is bad, *when we inherit from data.frame*. The choice is already made! Too > many such trivial inconsistencies piles up pretty quickly and could > potentially result in a steep learning curve - as there are different set > of rules to be memorised. > > Tackling the point of "inheriting from data.frame", *but* this, this, > this.. and many other things are different, if can't be avoided, should be > *very clearly* documented (in the beginning, maybe as a cheat sheet) so > that people aren't confused. > > > Arun > > On Thursday, November 14, 2013 at 5:39 PM, Eduard Antonyan wrote: > > I agree that it's inconsistent with data.frame, and imo that's a good > thing. We don't replicate the drop argument, so it wouldn't be possible to > return a data.table when with=FALSE and either way drop=TRUE by default is > a bad design choice in data.frame and matrix (that is unlikely to change > given R-core's attitude towards that type of a thing). > > I'm always pro more and better documentation :) > > > On Thu, Nov 14, 2013 at 10:33 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Eddi, At the least, I think the documentation needs to be clearer on the > use of "with=FALSE". It does feel inconsistent with the fact that "j" with > a single column should return a vector. In data.frames, the type in "j" > being column names, if it's just one column name, would return a vector, > unless drop = FALSE. 
That is, DF[, "y"] will return a vector while DF[, > c("x", "y")] will return a data.frame. So, it is inconsistent with > data.frame here, I think. > > > Arun > > On Thursday, November 14, 2013 at 5:25 PM, Eduard Antonyan wrote: > > DT[, y] returning a vector is I think the only correct behavior, given the > understanding of j-expression as something evaluated in the DT environment. > If they want a data.table they should simply use DT[, list(y)] or DT[, > data.table(y)]. > > I haven't thought about DT[, "y", with = FALSE] before as I pretty much > never use that form, but I see an argument for it staying as is, because > "y" and c("y") are the same and since we all presumably agree that DT[, > c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with > = FALSE] returned a different type that would mean inconsistent return > types which makes life much harder for users (as evidenced by the periodic > drop=FALSE questions that come up on SO). > > Going back to DT[, y], note that y and list(y) actually produce > *different* results (in e.g. base_env), so there is no type consistency > issue there between DT[, y] and DT[, list(y, z)]. > > > On Thu, Nov 14, 2013 at 6:09 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > > Hi everybody, > > It'd be nice if you could weigh-in on the bug report filed by Bill here: > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975 > > The gist of it is: > > require(data.table) > DT <- data.table(x=1:5, y=6:10, z=11:15) > DT[, y] # returns a vector > DT[, "y", with=FALSE] # returns a data.table > > The question from the bug report basically is: "why is that in the first > case, 'j' has only one column and we get a vector, but in the second case, > we get a data.table?" > > My question is: Is this behaviour okay or do you prefer that the first one > returns a data.table as well or the second one (with "with=FALSE") returns > a vector? 
> > Thank you, > Arun > > > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From FErickson at psu.edu Thu Nov 14 19:45:52 2013 From: FErickson at psu.edu (Frank Erickson) Date: Thu, 14 Nov 2013 13:45:52 -0500 Subject: [datatable-help] Bug report #5100 reg. In-Reply-To: References: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com> <862AB4459A55499EB1DA0AB24D04A890@gmail.com> <1D2952C1F9244FF5A473FCAED0B03920@gmail.com> Message-ID: For what it's worth, I use the with=FALSE version frequently without knowing how many columns I have selected, so I like the implicit wrapping of the columns in a list() (or implicit drop=FALSE). An example (almost) from something I did yesterday: mycols <- grep("^Vbar",names(DT),value=TRUE) DT1 <- DT[,mycols,with=FALSE] -- Frank On Thu, Nov 14, 2013 at 11:59 AM, Eduard Antonyan wrote: > Perhaps a simple sentence along the lines of "drop argument is absent and > should be considered as FALSE when comparing with data.frame in with=FALSE > mode" would suffice. The fact that i-expression is a full-on data.table > i-expression in with=FALSE mode will probably also cause inconsistencies. > > > On Thu, Nov 14, 2013 at 10:47 AM, Arunkumar Srinivasan < > aragorn168b at gmail.com> wrote: > >> I'll try to make a list of places where data.table != data.frame >> operation. >> >> Arun >> >> On Thursday, November 14, 2013 at 5:46 PM, Arunkumar Srinivasan wrote: >> >> Glad that we agree on better-ing the documentation. However, I don't >> find it a sound argument that we deviate from data.frame because the design >> is bad, *when we inherit from data.frame*. The choice is already made! 
Too >> many such trivial inconsistencies piles up pretty quickly and could >> potentially result in a steep learning curve - as there are different set >> of rules to be memorised. >> >> Tackling the point of "inheriting from data.frame", *but* this, this, >> this.. and many other things are different, if can't be avoided, should be >> *very clearly* documented (in the beginning, maybe as a cheat sheet) so >> that people aren't confused. >> >> >> Arun >> >> On Thursday, November 14, 2013 at 5:39 PM, Eduard Antonyan wrote: >> >> I agree that it's inconsistent with data.frame, and imo that's a good >> thing. We don't replicate the drop argument, so it wouldn't be possible to >> return a data.table when with=FALSE and either way drop=TRUE by default is >> a bad design choice in data.frame and matrix (that is unlikely to change >> given R-core's attitude towards that type of a thing). >> >> I'm always pro more and better documentation :) >> >> >> On Thu, Nov 14, 2013 at 10:33 AM, Arunkumar Srinivasan < >> aragorn168b at gmail.com> wrote: >> >> Eddi, At the least, I think the documentation needs to be clearer on the >> use of "with=FALSE". It does feel inconsistent with the fact that "j" with >> a single column should return a vector. In data.frames, the type in "j" >> being column names, if it's just one column name, would return a vector, >> unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[, >> c("x", "y")] will return a data.frame. So, it is inconsistent with >> data.frame here, I think. >> >> >> Arun >> >> On Thursday, November 14, 2013 at 5:25 PM, Eduard Antonyan wrote: >> >> DT[, y] returning a vector is I think the only correct behavior, given >> the understanding of j-expression as something evaluated in the DT >> environment. If they want a data.table they should simply use DT[, list(y)] >> or DT[, data.table(y)]. 
>> >> I haven't thought about DT[, "y", with = FALSE] before as I pretty much >> never use that form, but I see an argument for it staying as is, because >> "y" and c("y") are the same and since we all presumably agree that DT[, >> c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with >> = FALSE] returned a different type that would mean inconsistent return >> types which makes life much harder for users (as evidenced by the periodic >> drop=FALSE questions that come up on SO). >> >> Going back to DT[, y], note that y and list(y) actually produce >> *different* results (in e.g. base_env), so there is no type consistency >> issue there between DT[, y] and DT[, list(y, z)]. >> >> >> On Thu, Nov 14, 2013 at 6:09 AM, Arunkumar Srinivasan < >> aragorn168b at gmail.com> wrote: >> >> Hi everybody, >> >> It'd be nice if you could weigh-in on the bug report filed by Bill here: >> >> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975 >> >> The gist of it is: >> >> require(data.table) >> DT <- data.table(x=1:5, y=6:10, z=11:15) >> DT[, y] # returns a vector >> DT[, "y", with=FALSE] # returns a data.table >> >> The question from the bug report basically is: "why is that in the first >> case, 'j' has only one column and we get a vector, but in the second case, >> we get a data.table?" >> >> My question is: Is this behaviour okay or do you prefer that the first >> one returns a data.table as well or the second one (with "with=FALSE") >> returns a vector? 
>>
>> Thank you,
>> Arun
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bogaso.christofer at gmail.com Sat Nov 16 23:53:40 2013
From: bogaso.christofer at gmail.com (Christofer Bogaso)
Date: Sun, 17 Nov 2013 04:38:40 +0545
Subject: [datatable-help] 'OR' operation with data.table
Message-ID:

Hello all,

I am a new user of data.table and have really started to like it :)

I am looking for suggestions on how to implement 'OR'/'AND' operators to fetch a subset of a data.table.

Below is my example data.table (my actual data.table is quite big):

DT = data.table(x = 1:20, y1 = rep(letters[1:4], 5), y2 = rep(LETTERS[1:4], each = 5))
setkey(DT, y1, y2)

> DT
     x y1 y2
 1:  1  a  A
 2:  5  a  A
 3:  9  a  B
 4: 13  a  C
 5: 17  a  D
 6:  2  b  A
 7:  6  b  B
 8: 10  b  B
 9: 14  b  C
10: 18  b  D
11:  3  c  A
12:  7  c  B
13: 11  c  C
14: 15  c  C
15: 19  c  D
16:  4  d  A
17:  8  d  B
18: 12  d  C
19: 16  d  D
20: 20  d  D

Now I want to fetch those rows for which "y1 = a OR b AND y2 = B OR D".

With an ordinary data.frame this is straightforward to achieve; however, I am wondering what the data.table way would be for fast computation.

I would really appreciate your help or pointers.

Thanks and regards,
-------------- next part --------------
An HTML attachment was scrubbed...
From aragorn168b at gmail.com Sat Nov 16 23:56:43 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 16 Nov 2013 23:56:43 +0100
Subject: [datatable-help] 'OR' operation with data.table
In-Reply-To:
References:
Message-ID: <118DC425D96346DEAFCCEF77379D6687@gmail.com>

How about this?

DT[CJ(c("a", "b"), c("B", "D"))]

Arun

On Saturday, November 16, 2013 at 11:53 PM, Christofer Bogaso wrote:
> Hello all,
>
> I am a new user of data.table and have really started to like it :)
>
> I am looking for suggestions on how to implement 'OR'/'AND' operators to fetch a subset of a data.table.
>
> Below is my example data.table (my actual data.table is quite big):
>
> DT = data.table(x = 1:20, y1 = rep(letters[1:4], 5), y2 = rep(LETTERS[1:4], each = 5))
> setkey(DT, y1, y2)
>
> > DT
>      x y1 y2
>  1:  1  a  A
>  2:  5  a  A
>  3:  9  a  B
>  4: 13  a  C
>  5: 17  a  D
>  6:  2  b  A
>  7:  6  b  B
>  8: 10  b  B
>  9: 14  b  C
> 10: 18  b  D
> 11:  3  c  A
> 12:  7  c  B
> 13: 11  c  C
> 14: 15  c  C
> 15: 19  c  D
> 16:  4  d  A
> 17:  8  d  B
> 18: 12  d  C
> 19: 16  d  D
> 20: 20  d  D
>
> Now I want to fetch those rows for which "y1 = a OR b AND y2 = B OR D".
>
> With an ordinary data.frame this is straightforward to achieve; however, I am wondering what the data.table way would be for fast computation.
>
> I would really appreciate your help or pointers.
>
> Thanks and regards,
>
-------------- next part --------------
An HTML attachment was scrubbed...
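For anyone following along, the keyed join suggested above can be checked against a plain vector scan. This is only a minimal sketch, assuming the example DT from the question and reading the condition as (y1 is a or b) AND (y2 is B or D). The names `by_join` and `by_scan` are illustrative and do not come from the thread, and `nomatch = 0L` is added here only so that key combinations missing from DT would be dropped rather than returned as NA rows.

```r
library(data.table)

# Example table from the question, keyed on (y1, y2)
DT <- data.table(x  = 1:20,
                 y1 = rep(letters[1:4], 5),
                 y2 = rep(LETTERS[1:4], each = 5))
setkey(DT, y1, y2)

# Keyed join: CJ() builds every (y1, y2) combination to look up
by_join <- DT[CJ(c("a", "b"), c("B", "D")), nomatch = 0L]

# Vector scan with %in%, closer to the data.frame idiom
by_scan <- DT[y1 %in% c("a", "b") & y2 %in% c("B", "D")]

# Both approaches should select the same rows
sort(by_join$x)
sort(by_scan$x)
```

The join form relies on the key set by `setkey(DT, y1, y2)`, while the `%in%` form does not need a key at all, so it may be the easier starting point when the table is not keyed.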
From bogaso.christofer at gmail.com Sun Nov 17 00:00:33 2013
From: bogaso.christofer at gmail.com (Christofer Bogaso)
Date: Sun, 17 Nov 2013 04:45:33 +0545
Subject: [datatable-help] 'OR' operation with data.table
In-Reply-To: <118DC425D96346DEAFCCEF77379D6687@gmail.com>
References: <118DC425D96346DEAFCCEF77379D6687@gmail.com>
Message-ID:

Thanks a lot. This is working for me.

Thanks and regards,

On Sun, Nov 17, 2013 at 4:41 AM, Arunkumar Srinivasan wrote:
> How about this?
> DT[CJ(c("a", "b"), c("B", "D"))]
>
> Arun
>
> On Saturday, November 16, 2013 at 11:53 PM, Christofer Bogaso wrote:
>
> Hello all,
>
> I am a new user of data.table and have really started to like it :)
>
> I am looking for suggestions on how to implement 'OR'/'AND' operators to
> fetch a subset of a data.table.
>
> Below is my example data.table (my actual data.table is quite big):
>
> DT = data.table(x = 1:20, y1 = rep(letters[1:4], 5), y2 = rep(LETTERS[1:4], each = 5))
> setkey(DT, y1, y2)
>
> > DT
>      x y1 y2
>  1:  1  a  A
>  2:  5  a  A
>  3:  9  a  B
>  4: 13  a  C
>  5: 17  a  D
>  6:  2  b  A
>  7:  6  b  B
>  8: 10  b  B
>  9: 14  b  C
> 10: 18  b  D
> 11:  3  c  A
> 12:  7  c  B
> 13: 11  c  C
> 14: 15  c  C
> 15: 19  c  D
> 16:  4  d  A
> 17:  8  d  B
> 18: 12  d  C
> 19: 16  d  D
> 20: 20  d  D
>
> Now I want to fetch those rows for which "y1 = a OR b AND y2 = B OR D".
>
> With an ordinary data.frame this is straightforward to achieve; however, I am
> wondering what the data.table way would be for fast computation.
>
> I would really appreciate your help or pointers.
>
> Thanks and regards,
>
-------------- next part --------------
An HTML attachment was scrubbed...
From gsee000 at gmail.com Sun Nov 17 23:32:24 2013
From: gsee000 at gmail.com (G See)
Date: Sun, 17 Nov 2013 16:32:24 -0600
Subject: [datatable-help] .SD is locked
Message-ID:

Hi,

Is the following error expected?

> library(data.table)
data.table 1.8.11 For help type: help("data.table")
> x <- as.data.table(BOD)
> xx <- x[, .SD, .SDcols="Time"]
> xx[, Time:=as.numeric(Time)]
Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) :
  .SD is locked. Using := in .SD's j is reserved for possible future
use; a tortuously flexible way to modify by group. Use := in j
directly to modify by group by reference.
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.8.11

loaded via a namespace (and not attached):
[1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2

Thanks,
Garrett

From michael.nelson at sydney.edu.au Mon Nov 18 00:11:30 2013
From: michael.nelson at sydney.edu.au (Michael Nelson)
Date: Sun, 17 Nov 2013 23:11:30 +0000
Subject: [datatable-help] .SD is locked
In-Reply-To:
References:
Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>

I don't believe this is to be expected.
A bug report should be filed (it is present in 1.8.10 on CRAN as well) .SD is locked so you can't "mess" with it within a call to `[.data.table`, but this "locked" status should not be retained following the completion of that call ________________________________________ From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of G See [gsee000 at gmail.com] Sent: Monday, 18 November 2013 9:32 AM To: datatable-help at lists.r-forge.r-project.org Subject: [datatable-help] .SD is locked Hi, Is the following error expected? > library(data.table) data.table 1.8.11 For help type: help("data.table") > x <- as.data.table(BOD) > xx <- x[, .SD, .SDcols="Time"] > xx[, Time:=as.numeric(Time)] Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) : .SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference. > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.8.11 loaded via a namespace (and not attached): [1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2 Thanks, Garrett _______________________________________________ datatable-help mailing list datatable-help at lists.r-forge.r-project.org https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help From aragorn168b at gmail.com Mon Nov 18 00:29:14 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 18 Nov 2013 00:29:14 +0100 Subject: [datatable-help] .SD is locked In-Reply-To: 
<6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>
References: <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>
Message-ID: <5E6D77DD478449CCA64B6544F607A161@gmail.com>

Hm, nice catch! In this special case, the value returned is from this code:

jval = eval(jsub, SDenv, parent.frame())

Since `jsub = .SD`, this evaluates to .SD ('s value). However, since `jval` remains untouched, a copy is not made (I think). This can be seen with a `tracemem` statement:

x <- as.data.table(BOD)
xx <- x[, {print(tracemem(.SD)); .SD}, .SDcols="Time"]
[1] "<0x7fa4e9a518f0>"
tracemem(xx)
[1] "<0x7fa4e9a518f0>"

Basically `xx` is `.SD` and therefore is 'locked'. I guess a fix would be to check this and make a copy on return. Not sure.

Arun

On Monday, November 18, 2013 at 12:11 AM, Michael Nelson wrote:
> I don't believe this is to be expected.
>
> A bug report should be filed (it is present in 1.8.10 on CRAN as well)
>
> .SD is locked so you can't "mess" with it within a call to `[.data.table`, but this "locked" status should not be retained following the completion of that call
>
> ________________________________________
> From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of G See [gsee000 at gmail.com]
> Sent: Monday, 18 November 2013 9:32 AM
> To: datatable-help at lists.r-forge.r-project.org
> Subject: [datatable-help] .SD is locked
>
> Hi,
>
> Is the following error expected?
>
> > library(data.table)
> data.table 1.8.11 For help type: help("data.table")
> > x <- as.data.table(BOD)
> > xx <- x[, .SD, .SDcols="Time"]
> > xx[, Time:=as.numeric(Time)]
>
> Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) :
> .SD is locked.
Using := in .SD's j is reserved for possible future > use; a tortuously flexible way to modify by group. Use := in j > directly to modify by group by reference. > > sessionInfo() > > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] data.table_1.8.11 > > loaded via a namespace (and not attached): > [1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2 > > > Thanks, > Garrett > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > _______________________________________________ > datatable-help mailing list > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aragorn168b at gmail.com Mon Nov 18 00:43:19 2013 From: aragorn168b at gmail.com (Arunkumar Srinivasan) Date: Mon, 18 Nov 2013 00:43:19 +0100 Subject: [datatable-help] .SD is locked In-Reply-To: <5E6D77DD478449CCA64B6544F607A161@gmail.com> References: <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05> <5E6D77DD478449CCA64B6544F607A161@gmail.com> Message-ID: <5EEF6BC312F64D0C9E121AA8DDEA1DAC@gmail.com> Gsee, just adding the line: if (identical(jval, SDenv$.SD)) jval = copy(jval) before `return(jval)` seems to fix this (and all tests also complete without any issues). 
If you're in a hurry for fix, you could just add it for now. I'll test it again later and commit with other changes I've staged locally. It'd still be nice to file this as a bug so that it could be tracked. Best, Arun On Monday, November 18, 2013 at 12:29 AM, Arunkumar Srinivasan wrote: > Hm, nice catch! In this special case, the value returned is from this code: > > jval = eval(jsub, SDenv, parent.frame()) > > Since `jsub = .SD`, this evaluates to .SD ('s value). However, since `jval` remains untouched, a copy is not made (I think). This can be seen with a `tracemem` statement: > > x <- as.data.table(BOD) > xx <- x[, {print(tracemem(.SD)); .SD}, .SDcols="Time"] > [1] "<0x7fa4e9a518f0>" > tracemem(xx) > [1] "<0x7fa4e9a518f0>" > > > Basically `xx` is `.SD` and therefore is 'locked'. I guess a fix would be to check this and make a copy on return. Not sure. > > Arun > > > On Monday, November 18, 2013 at 12:11 AM, Michael Nelson wrote: > > > I don't believe this is to be expected. > > > > A bug report should be filed (it is present in 1.8.10 on CRAN as well) > > > > .SD is locked so you can't "mess" with it within a call to `[.data.table`, but this "locked" status should not be retained following the completion of that call > > > > > > ________________________________________ > > From: datatable-help-bounces at lists.r-forge.r-project.org (mailto:datatable-help-bounces at lists.r-forge.r-project.org) [datatable-help-bounces at lists.r-forge.r-project.org (mailto:datatable-help-bounces at lists.r-forge.r-project.org)] on behalf of G See [gsee000 at gmail.com (mailto:gsee000 at gmail.com)] > > Sent: Monday, 18 November 2013 9:32 AM > > To: datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > Subject: [datatable-help] .SD is locked > > > > Hi, > > > > Is the following error expected? 
> > > > > library(data.table) > > data.table 1.8.11 For help type: help("data.table") > > > x <- as.data.table(BOD) > > > xx <- x[, .SD, .SDcols="Time"] > > > xx[, Time:=as.numeric(Time)] > > > > > > > Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) : > > .SD is locked. Using := in .SD's j is reserved for possible future > > use; a tortuously flexible way to modify by group. Use := in j > > directly to modify by group by reference. > > > sessionInfo() > > > > R version 3.0.2 (2013-09-25) > > Platform: x86_64-pc-linux-gnu (64-bit) > > > > locale: > > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > > [9] LC_ADDRESS=C LC_TELEPHONE=C > > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > > > attached base packages: > > [1] stats graphics grDevices utils datasets methods base > > > > other attached packages: > > [1] data.table_1.8.11 > > > > loaded via a namespace (and not attached): > > [1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2 > > > > > > Thanks, > > Garrett > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > _______________________________________________ > > datatable-help mailing list > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org) > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
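For readers hitting the same error on an affected version, one way to sidestep it is to avoid returning `.SD` itself from `j`. The following is only a minimal sketch under that assumption, using the same `BOD` example. Selecting the column by name with `with = FALSE` is a substitute technique, not the patch proposed in this thread, and it is assumed here to return an ordinary, unlocked data.table.

```r
library(data.table)

x <- as.data.table(BOD)

# Select the column by name instead of returning .SD from j
xx <- x[, "Time", with = FALSE]

# := should now modify xx by reference without tripping the .SD lock
xx[, Time := as.numeric(Time)]

str(xx)
```

Writing `x[, list(Time)]` would be another way to build the one-column table without going through `.SD`.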
From gsee000 at gmail.com Mon Nov 18 00:48:50 2013
From: gsee000 at gmail.com (G See)
Date: Sun, 17 Nov 2013 17:48:50 -0600
Subject: [datatable-help] .SD is locked
In-Reply-To: <5EEF6BC312F64D0C9E121AA8DDEA1DAC@gmail.com>
References: <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05> <5E6D77DD478449CCA64B6544F607A161@gmail.com> <5EEF6BC312F64D0C9E121AA8DDEA1DAC@gmail.com>
Message-ID:

Thanks guys. Bug report filed.

On Sun, Nov 17, 2013 at 5:43 PM, Arunkumar Srinivasan wrote:
> Gsee, just adding the line:
>
> if (identical(jval, SDenv$.SD)) jval = copy(jval)
>
> before `return(jval)` seems to fix this (and all tests also complete without
> any issues). If you're in a hurry for fix, you could just add it for now.
>
> I'll test it again later and commit with other changes I've staged locally.
> It'd still be nice to file this as a bug so that it could be tracked.
>
> Best,
> Arun
>
> On Monday, November 18, 2013 at 12:29 AM, Arunkumar Srinivasan wrote:
>
> Hm, nice catch! In this special case, the value returned is from this code:
>
> jval = eval(jsub, SDenv, parent.frame())
>
> Since `jsub = .SD`, this evaluates to .SD ('s value). However, since `jval`
> remains untouched, a copy is not made (I think). This can be seen with a
> `tracemem` statement:
>
> x <- as.data.table(BOD)
> xx <- x[, {print(tracemem(.SD)); .SD}, .SDcols="Time"]
> [1] "<0x7fa4e9a518f0>"
> tracemem(xx)
> [1] "<0x7fa4e9a518f0>"
>
> Basically `xx` is `.SD` and therefore is 'locked'. I guess a fix would be to
> check this and make a copy on return. Not sure.
>
> Arun
>
> On Monday, November 18, 2013 at 12:11 AM, Michael Nelson wrote:
>
> I don't believe this is to be expected.
> > A bug report should be filed (it is present in 1.8.10 on CRAN as well) > > .SD is locked so you can't "mess" with it within a call to `[.data.table`, > but this "locked" status should not be retained following the completion of > that call > > > ________________________________________ > From: datatable-help-bounces at lists.r-forge.r-project.org > [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of G See > [gsee000 at gmail.com] > Sent: Monday, 18 November 2013 9:32 AM > To: datatable-help at lists.r-forge.r-project.org > Subject: [datatable-help] .SD is locked > > Hi, > > Is the following error expected? > > library(data.table) > > data.table 1.8.11 For help type: help("data.table") > > x <- as.data.table(BOD) > xx <- x[, .SD, .SDcols="Time"] > xx[, Time:=as.numeric(Time)] > > Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) : > .SD is locked. Using := in .SD's j is reserved for possible future > use; a tortuously flexible way to modify by group. Use := in j > directly to modify by group by reference. 
> > sessionInfo()
>
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] data.table_1.8.11
>
> loaded via a namespace (and not attached):
> [1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2
>
> Thanks,
> Garrett

From danielrlabar at gmail.com Tue Nov 19 17:39:08 2013
From: danielrlabar at gmail.com (dnlbrky)
Date: Tue, 19 Nov 2013 08:39:08 -0800 (PST)
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To:
References:
Message-ID: <1384879148805-4680743.post@n4.nabble.com>

Arunkumar Srinivasan wrote
> `rbindlist` gained speed (to some extent) by assuming things like this and
> skipping checks in the first place. So, should we include checks like
> this? Also, if "rbind" and/or "rbindlist" are made to do the exact same
> thing, then, what's the purpose of "rbindlist"?

My vote for the purpose of rbindlist is to continue to be a fast version of rbind for data.tables, while providing as much functionality as possible. Could the functionality be optional? In "bare bones mode" it would be super fast, and in "full featured mode" it would probably be faster than rbind but slower than "bare bones".
Like Garrett, I would like to have the option of binding by column names in rbindlist. In addition, it would be great if rbindlist could handle missing columns. The smartbind function in gtools does both of these.

--
View this message in context: http://r.789695.n4.nabble.com/rbind-vs-rbindlist-behavior-warning-tp4680116p4680743.html
Sent from the datatable-help mailing list archive at Nabble.com.

From saporta at scarletmail.rutgers.edu Tue Nov 19 23:29:33 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Tue, 19 Nov 2013 17:29:33 -0500
Subject: [datatable-help] Help with code efficieny
Message-ID:

Hey guys,

I am working with some code that is taking several hours to run. I posted a question on SO about it; if anyone has thoughts, I am open to suggestions:

http://stackoverflow.com/questions/20083432/increase-efficiency-in-finding-first-occurrence-of-events

Thanks,
Rick
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From daniel.krizian at gmail.com Fri Nov 22 15:15:37 2013
From: daniel.krizian at gmail.com (daniel.krizian)
Date: Fri, 22 Nov 2013 06:15:37 -0800 (PST)
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
Message-ID: <1385129737870-4680965.post@n4.nabble.com>

Hello, I have:

DT <- data.table(a=1:10, b=1:10,c=1:10, key=c("a","b"))
key(DT) # [1] "a" "b"
key(DT[,list(a,b)]) # NULL

Note that DT loses its key when I select a subset of columns as above.

Is this a (known) bug or the expected result? Maybe it is just me, but I would expect the data.table to retain its key in a SELECT-like operation. Otherwise I have to call the (expensive) setkey() repeatedly, when in fact I am not changing the structure of the rows/indices significantly.

Thanks,
Daniel

--
View this message in context: http://r.789695.n4.nabble.com/Key-dropped-when-DT-list-a-b-tp4680965.html
Sent from the datatable-help mailing list archive at Nabble.com.
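Whether the key survives this kind of column selection may depend on the data.table version in use. The following is a defensive sketch that assumes nothing beyond `key()` and `setkeyv()`: check the key after selecting columns and restore it only if it was actually dropped. The name `sub` is illustrative and not from the thread.

```r
library(data.table)

DT <- data.table(a = 1:10, b = 1:10, c = 1:10, key = c("a", "b"))

# Select the key columns; the result may or may not keep the key,
# depending on the data.table version
sub <- DT[, list(a, b)]

# Restore the key only when it was actually dropped
if (!identical(key(sub), key(DT))) setkeyv(sub, key(DT))

key(sub)
```

The guard means `setkeyv()` is only called when needed, so the re-keying cost is paid at most once per derived table.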
From lianoglou.steve at gene.com Fri Nov 22 15:45:20 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 22 Nov 2013 06:45:20 -0800
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
In-Reply-To: <1385129737870-4680965.post@n4.nabble.com>
References: <1385129737870-4680965.post@n4.nabble.com>
Message-ID:

Hi Daniel,

On Fri, Nov 22, 2013 at 6:15 AM, daniel.krizian wrote:
> Hello, I have:
>
> DT <- data.table(a=1:10, b=1:10,c=1:10, key=c("a","b"))
> key(DT) # [1] "a" "b"
> key(DT[,list(a,b)]) # NULL
>
> Note that DT loses its key when I select a subset of columns like above.
>
> Is this a (known) bug/ expected result?

The key is retained for me when I run your code:

R> key(DT[,list(a,b)])
[1] "a" "b"

What version of data.table are you using?

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

From daniel.krizian at gmail.com Fri Nov 22 17:32:26 2013
From: daniel.krizian at gmail.com (Daniel Krizian)
Date: Fri, 22 Nov 2013 16:32:26 +0000
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
Message-ID:

Hello Steve and thanks for your reply.
I am running data.table_1.8.11

Full details below:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] timeDate_3010.98 data.table_1.8.11 quantstrat_0.7.8
[4] foreach_1.4.1 blotter_0.8.15 PerformanceAnalytics_1.1.1
[7] FinancialInstrument_1.1 quantmod_0.4-0 Defaults_1.1-1
[10] TTR_0.22-0 xts_0.9-5 zoo_1.7-10

loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_3.0.2 iterators_1.0.6 lattice_0.20-23 plyr_1.8
[6] reshape2_1.2.2 stringr_0.6.2 tools_3.0.2

On Fri, Nov 22, 2013 at 2:45 PM, Steve Lianoglou wrote:
> Hi Daniel,
>
> On Fri, Nov 22, 2013 at 6:15 AM, daniel.krizian wrote:
> > Hello, I have:
> >
> > DT <- data.table(a=1:10, b=1:10,c=1:10, key=c("a","b"))
> > key(DT) # [1] "a" "b"
> > key(DT[,list(a,b)]) # NULL
> >
> > Note that DT loses its key when I select a subset of columns like above.
> >
> > Is this a (known) bug/ expected result?
>
> The key is retained for me when I run your code:
>
> R> key(DT[,list(a,b)])
> [1] "a" "b"
>
> What version of data.table are you using?
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Genentech

--
*____________________________*
*Daniel Krizian, CFA, CAIA*
T: +44 74 5372 1101
M: daniel.krizian at gmail.com
uk.linkedin.com/in/danielkrizian
B: quantology.wordpress.com
From eduard.antonyan at gmail.com Fri Nov 22 18:01:27 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 22 Nov 2013 11:01:27 -0600
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
In-Reply-To:
References: <1385129737870-4680965.post@n4.nabble.com>
Message-ID:

This was fixed relatively recently in revision 999, so try updating your build.