From alexandre.sieira at gmail.com  Fri Nov  1 22:49:07 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Fri, 1 Nov 2013 19:49:07 -0200
Subject: [datatable-help] Unexpected behavior in setnames()
Message-ID: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>

I found this behavior during a debugging session:

> d = data.table(a=1, b=2, c=3)
> setnames(d, "a", "b")
> d
   b b c
1: 1 2 3

Shouldn't setnames() check if the new column names already exist before renaming, and signal an error or at least a warning if they do?

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I

From eduard.antonyan at gmail.com  Fri Nov  1 22:59:57 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 1 Nov 2013 16:59:57 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
Message-ID: <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>

Having duplicate names is allowed and not that unusual in the data.table
framework, so there is no need to signal anything here.

A different question is whether there should be a warning here:

  dt = data.table(a = 1, a = 2)
  dt[, a]

and I think that'd be a pretty good FR to have.


From aragorn168b at gmail.com  Fri Nov  1 23:51:18 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 1 Nov 2013 23:51:18 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
Message-ID: <957B1243714142278898647650EBF386@gmail.com>

Ricardo added a bug report here on this topic: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975
But I don't think having duplicate names is an easy-to-implement concept. For ex:

dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))
dt[, print(.SD), by=y]
   x
1: 1
2: 2
   x
1: 3


.SD loses the second "x". Also, some other questions become difficult to handle. Ex:  

dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))
dt[, list(x=x/x[1], y=y), by=x]


Which "x" should be chosen for which operation?

Arun


From eduard.antonyan at gmail.com  Fri Nov  1 23:57:51 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 1 Nov 2013 17:57:51 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <957B1243714142278898647650EBF386@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
Message-ID: <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>

I think currently it chooses the first "x", but it's definitely a good idea
to add a warning there.


From aragorn168b at gmail.com  Sat Nov  2 00:02:38 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 2 Nov 2013 00:02:38 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
Message-ID: <5E98018F047943DE89849EC57A7CF72A@gmail.com>

Yes, it chooses the first. But we won't be able to perform any operation as intended. So why allow duplicate names (ex: in `setnames` as Alexandre asks)?  

Arun


From eduard.antonyan at gmail.com  Sat Nov  2 00:05:46 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 1 Nov 2013 18:05:46 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <5E98018F047943DE89849EC57A7CF72A@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
Message-ID: <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>

Because it's very useful for e.g. data presentation purposes.


From aragorn168b at gmail.com  Sat Nov  2 00:10:41 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 2 Nov 2013 00:10:41 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
Message-ID: <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>

Hm, I've not encountered that use myself, so I can't comment there. Then it should probably be allowed everywhere except where deciding which column to use could be an issue? For example, subsetting/aggregating/grouping/by-without-by etc. should result in an error (if one has the time, one could do this by checking whether the duplicate column is actually in use, and then issuing an error/warning).

At the moment, I'm not convinced that it's worth that much trouble to help data presentation.  

Arun


From alexandre.sieira at gmail.com  Sat Nov  2 13:00:17 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Sat, 2 Nov 2013 10:00:17 -0200
Subject: [datatable-help] datatable-help Digest, Vol 45, Issue 2
In-Reply-To: <mailman.1019.1383347451.27466.datatable-help@lists.r-forge.r-project.org>
References: <mailman.1019.1383347451.27466.datatable-help@lists.r-forge.r-project.org>
Message-ID: <etPan.5274e951.3804823e.16a@MacBook-Pro-de-Alexandre-Sieira.local>

My 2 cents here.

There are several reasons why I don't think allowing multiple columns with the same name is a good idea, IMHO:

- It will force code to use column numbers to access all the data in a predictable fashion (since, depending on your code, you might not know which of the two columns with the same name will be the first), so we'll lose all the delicious syntactic sugar painstakingly added to data.table.

- For people learning data.table and having data.frame or even the concept of a relational table as a reference, this is a definite WTF and will cause confusion and complicate troubleshooting. I speak from experience on this matter. :)

Even though there might be some situations where this might be a plus, I imagine they are few and far between and could be worked around. I could be wrong, it's been known to happen :) - but I have never seen and can't even imagine a situation where multiple columns with the same name would be essential. So on balance I consider keeping this behavior a bad trade-off for most users.

Having said that, this is a design decision and it's up to the data.table demigods to decide. :)

BTW, is there any part of the data.table documentation that covers this? If you choose to maintain this property, I strongly suggest it be documented somewhere that most beginners would read.

In my personal example, I ran into this problem after a rather long troubleshooting session for a very esoteric problem in my code. I was renaming a column to a name that already existed, and this broke things in a completely different part of my code. If setnames() had at least warned me that a duplicate column name was created, I would have been able to detect the root cause much faster.
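
A minimal sketch of the kind of check asked for here, written in base R so it runs without any package. The helper name warn_duplicate_names() is hypothetical and is not part of data.table; it only relies on names(), so in principle it could be called on a data.table right after setnames(). A data.frame built with check.names = FALSE stands in for the table from this thread.

```r
# Hypothetical helper (an assumption, not a data.table function):
# warn if any column name occurs more than once, then return x invisibly.
warn_duplicate_names <- function(x) {
  nm <- names(x)
  dups <- unique(nm[duplicated(nm)])
  if (length(dups) > 0L) {
    warning("duplicate column names: ", paste(dups, collapse = ", "))
  }
  invisible(x)
}

# Base R stand-in for the table after setnames(d, "a", "b");
# check.names = FALSE keeps both "b" columns instead of renaming one.
d <- data.frame(b = 1, b = 2, c = 3, check.names = FALSE)
warn_duplicate_names(d)
```

Returning the input invisibly keeps the helper usable at the end of a chain of calls; whether such a check should warn or stop is exactly the design question discussed in this thread.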

--
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I

On 1 de novembro de 2013 at 21:10:54, datatable-help-request at lists.r-forge.r-project.org (datatable-help-request at lists.r-forge.r-project.org) wrote:

Send datatable-help mailing list submissions to  
datatable-help at lists.r-forge.r-project.org  

To subscribe or unsubscribe via the World Wide Web, visit  
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  

or, via email, send a message with subject or body 'help' to  
datatable-help-request at lists.r-forge.r-project.org  

You can reach the person managing the list at  
datatable-help-owner at lists.r-forge.r-project.org  

When replying, please edit your Subject line so it is more specific  
than "Re: Contents of datatable-help digest..."  


Today's Topics:  

1. Re: Unexpected behavior in setnames() (Arunkumar Srinivasan)  
2. Re: Unexpected behavior in setnames() (Eduard Antonyan)  
3. Re: Unexpected behavior in setnames() (Arunkumar Srinivasan)  


----------------------------------------------------------------------  

Message: 1  
Date: Sat, 2 Nov 2013 00:02:38 +0100  
From: Arunkumar Srinivasan <aragorn168b at gmail.com>  
To: Eduard Antonyan <eduard.antonyan at gmail.com>  
Cc: "=?utf-8?Q?datatable-help=40lists.r-forge.r-project.org?="  
<datatable-help at lists.r-forge.r-project.org>, Alexandre Sieira  
<alexandre.sieira at gmail.com>  
Subject: Re: [datatable-help] Unexpected behavior in setnames()  
Message-ID: <5E98018F047943DE89849EC57A7CF72A at gmail.com>  
Content-Type: text/plain; charset="utf-8"  

Yes, it chooses the first. But we won't be able to perform any operation as intended. So why allow duplicate names (ex: in `setnames` as Alexandre asks)?  

Arun  


On Friday, November 1, 2013 at 11:57 PM, Eduard Antonyan wrote:  

> I think currently it chooses the first "x", but it's definitely a good idea to add a warning there.  
>  
>  
> On Fri, Nov 1, 2013 at 5:51 PM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:  
> > Ricardo added a bug report here on this topic: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975  
> > But I don't think having duplicate names is an easy-to-implement concept. For ex:  
> >  
> > dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))  
> > dt[, print(.SD), by=y]  
> > x  
> > 1: 1  
> > 2: 2  
> > x  
> > 1: 3  
> >  
> >  
> > .SD loses the second "x". Also, some other questions become difficult to handle. Ex:  
> >  
> > dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))  
> > dt[, list(x=x/x[1], y=y), by=x]  
> >  
> >  
> > Which "x" should be choose for which operation?  
> >  
> > Arun  
> >  
> >  
> > On Friday, November 1, 2013 at 10:59 PM, Eduard Antonyan wrote:  
> >  
> > > Having duplicate names is allowed and not that unusual in data.table framework, so there is no need to signal anything here.  
> > >  
> > > A different question is whether there should be a warning here:  
> > >  
> > > dt = data.table(a = 1, a = 2)  
> > > dt[, a]  
> > >  
> > > and I think that'd be a pretty good FR to have.  
> > >  
> > >  
> > > On Fri, Nov 1, 2013 at 4:49 PM, Alexandre Sieira <alexandre.sieira at gmail.com (mailto:alexandre.sieira at gmail.com)> wrote:  
> > > > I found this behavior during a debugging session:  
> > > >  
> > > > > d = data.table(a=1, b=2, c=3)  
> > > > > setnames(d, "a", "b")  
> > > > > d  
> > > > b b c  
> > > > 1: 1 2 3  
> > > >  
> > > > Shouldn?t setnames() check if the new column names already exist before renaming, and signal an error or at least a warning if they do?  
> > > > --  
> > > > Alexandre Sieira  
> > > > CISA, CISSP, ISO 27001 Lead Auditor  
> > > >  
> > > > "The truth is rarely pure and never simple."  
> > > > Oscar Wilde, The Importance of Being Earnest, 1895, Act I  
> > > > _______________________________________________  
> > > > datatable-help mailing list  
> > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)  
> > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  
> > >  
> > > _______________________________________________  
> > > datatable-help mailing list  
> > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)  
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  
> > >  
> > >  
> > >  
> >  
> >  
>  

-------------- next part --------------  
An HTML attachment was scrubbed...  
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20131102/9e1310b6/attachment-0001.html>  

------------------------------  

Message: 2  
Date: Fri, 1 Nov 2013 18:05:46 -0500  
From: Eduard Antonyan <eduard.antonyan at gmail.com>  
To: Arunkumar Srinivasan <aragorn168b at gmail.com>  
Cc: "datatable-help at lists.r-forge.r-project.org"  
<datatable-help at lists.r-forge.r-project.org>, Alexandre Sieira  
<alexandre.sieira at gmail.com>  
Subject: Re: [datatable-help] Unexpected behavior in setnames()  
Message-ID:  
<CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g at mail.gmail.com>  
Content-Type: text/plain; charset="windows-1252"  

Because it's very useful for e.g. data presentation purposes.  


On Fri, Nov 1, 2013 at 6:02 PM, Arunkumar Srinivasan  
<aragorn168b at gmail.com>wrote:  

> Yes, it chooses the first. But we won't be able to perform any operation  
> as intended. So why allow duplicate names (ex: in `setnames` as Alexandre  
> asks)?  
>  
> Arun  
>  
> On Friday, November 1, 2013 at 11:57 PM, Eduard Antonyan wrote:  
>  
> I think currently it chooses the first "x", but it's definitely a good  
> idea to add a warning there.  
>  
>  
> On Fri, Nov 1, 2013 at 5:51 PM, Arunkumar Srinivasan <  
> aragorn168b at gmail.com> wrote:  
>  
> Ricardo added a bug report here on this topic:  
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5008&group_id=240&atid=975  
> But I don't think having duplicate names is an easy-to-implement concept.  
> For ex:  
>  
> dt <- data.table(x=1:3, x=4:6, y=c(1,1,2))  
> dt[, print(.SD), by=y]  
> x  
> 1: 1  
> 2: 2  
> x  
> 1: 3  
>  
> .SD loses the second "x". Also, some other questions become difficult to  
> handle. Ex:  
>  
> dt <- data.table(x=c(1,1,2,2), y=c(1,2,3,4), x=c(2,2,1,1))  
> dt[, list(x=x/x[1], y=y), by=x]  
>  
> Which "x" should be choose for which operation?  
>  
> Arun  
>  
> On Friday, November 1, 2013 at 10:59 PM, Eduard Antonyan wrote:  
>  
> Having duplicate names is allowed and not that unusual in data.table  
> framework, so there is no need to signal anything here.  
>  
> A different question is whether there should be a warning here:  
>  
> dt = data.table(a = 1, a = 2)  
> dt[, a]  
>  
> and I think that'd be a pretty good FR to have.  
>  
>  
> On Fri, Nov 1, 2013 at 4:49 PM, Alexandre Sieira <  
> alexandre.sieira at gmail.com> wrote:  
>  
> I found this behavior during a debugging session:  
>  
> > d = data.table(a=1, b=2, c=3)  
> > setnames(d, "a", "b")  
> > d  
> b b c  
> 1: 1 2 3  
>  
> Shouldn?t setnames() check if the new column names already exist before  
> renaming, and signal an error or at least a warning if they do?  
>  
> --  
> Alexandre Sieira  
> CISA, CISSP, ISO 27001 Lead Auditor  
>  
> "The truth is rarely pure and never simple."  
> Oscar Wilde, The Importance of Being Earnest, 1895, Act I  
>  
> _______________________________________________  
> datatable-help mailing list  
> datatable-help at lists.r-forge.r-project.org  
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  
>  
>  
> _______________________________________________  
> datatable-help mailing list  
> datatable-help at lists.r-forge.r-project.org  
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help  
>  
>  
>  
>  
>  
-------------- next part --------------  
An HTML attachment was scrubbed...  
URL: <http://lists.r-forge.r-project.org/pipermail/datatable-help/attachments/20131101/664bdb65/attachment-0001.html>  

------------------------------  

Message: 3  
Date: Sat, 2 Nov 2013 00:10:41 +0100  
From: Arunkumar Srinivasan <aragorn168b at gmail.com>  
To: Eduard Antonyan <eduard.antonyan at gmail.com>  
Cc: "=?utf-8?Q?datatable-help=40lists.r-forge.r-project.org?="  
<datatable-help at lists.r-forge.r-project.org>, Alexandre Sieira  
<alexandre.sieira at gmail.com>  
Subject: Re: [datatable-help] Unexpected behavior in setnames()  
Message-ID: <D70F31E4E83842EF95F46C9565E7AEEA at gmail.com>  
Content-Type: text/plain; charset="utf-8"  

Hm, I've not encountered that use myself, can't comment there. Probably then it should be allowed everywhere except where deciding which column could be an issue? Ex: subsetting/aggregating/grouping/by-without-by etc.. should result in error (if one has the time, one could do this by checking if the duplicate column is in use actually or not and then issue an error/warning).  

At the moment, I'm not convinced that it's worth that much trouble to help data presentation.  

Arun  




From alexandre.sieira at gmail.com  Sat Nov  2 13:10:13 2013
From: alexandre.sieira at gmail.com (Alexandre Sieira)
Date: Sat, 2 Nov 2013 10:10:13 -0200
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
Message-ID: <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>

My 2 cents here.

There are several reasons why, IMHO, allowing multiple columns with the same name is not a good idea:

- It will force the code to use column numbers to access all the data in a predictable fashion (since, depending on your code, you might not know which of the two columns with the same name will be the first), so we'll lose all the delicious syntactic sugar painstakingly added to data.table.

- For people learning data.table and having data.frame or even the concept of a relational table as a reference, this is a definite WTF and will cause confusion and complicate troubleshooting. I speak from experience on this matter. :)

Even though there might be some situations where this is a plus, I imagine they are few and far between and could be worked around. I could be wrong, it's been known to happen :) - but I have never seen and can't even imagine a situation where multiple columns with the same name would be essential. So on balance I consider keeping this behavior a bad trade-off for most users.

Having said that, this is a design decision and it's up to the data.table demigods to decide. :)

BTW, is there any part of the data.table documentation that covers this? If you choose to maintain this property, I strongly suggest it be documented somewhere that most beginners would read.

In my own case, I ran into this after a rather long troubleshooting session for a very esoteric problem in my code. I was renaming a column to a name that already existed, and this broke things in a completely different part of my code. If 'setnames()' had at least warned me that a duplicate column name was created, I would have been able to find the root cause much faster.
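
A minimal sketch of the kind of check described here, written in base R so it runs without data.table installed. The helper name `rename_checked` is hypothetical and is not part of data.table; it only computes the renamed vector of column names and warns when the result contains duplicates. One could call something like it before `setnames()`, assuming that is the behavior wanted.

```r
# Hypothetical helper (not part of data.table): compute the renamed
# column names and warn if the result contains duplicates.
rename_checked <- function(current, old, new) {
  stopifnot(length(old) == length(new), all(old %in% current))
  renamed <- current
  # match() returns the position of the first occurrence of each old name
  renamed[match(old, current)] <- new
  dups <- unique(renamed[duplicated(renamed)])
  if (length(dups) > 0L) {
    warning("renaming creates duplicate column name(s): ",
            paste(dups, collapse = ", "))
  }
  renamed
}

# Same rename as in the example that started this thread: "a" -> "b".
cols <- c("a", "b", "c")
result <- withCallingHandlers(
  rename_checked(cols, "a", "b"),
  warning = function(w) {
    message("caught: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(result)
```

Whether such a check should warn, stop, or stay silent is exactly the design question under discussion; the sketch only shows that detecting the collision at rename time is cheap.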

-- 
Alexandre Sieira
CISA, CISSP, ISO 27001 Lead Auditor

"The truth is rarely pure and never simple."
Oscar Wilde, The Importance of Being Earnest, 1895, Act I


From eduard.antonyan at gmail.com  Sat Nov  2 16:30:17 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 10:30:17 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
Message-ID: <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>

Thanks Alexandre. I added (a non-committal) FR about this -
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5037&group_id=240&atid=978,
which will likely go in the direction this thread goes.

To address your points:

    1. If a user decides to have columns with duplicate names, yes, their job
will become harder, but that's the user's decision, and everyone else who
doesn't use duplicate names loses no flexibility and doesn't need to
use column numbers or whatnot.
    2. I agree that this should be documented better and appropriate
warnings should be added.

One of the cool things about data.table that's very different from
data.frame is that you can have arbitrary column names. Whether they
include spaces, crazy symbols or are duplicate - it'll all be valid. This
is very useful for reading and writing/presenting arbitrary data.

This does mean, though, that if (and *only* if) you choose to use
non-standard names you'll need to do more work.

Now the issue you ran into is that you didn't realize that you were using
non-standard naming (or even wanted to, but we can't guess what you want
:)). And a warning in the right place can help you out and also let
non-standard users proceed.

Once you understand that there is nothing wrong with duplicate names, it
should be clear that the appropriate warning spot is when you use them
potentially incorrectly, and not when you set them.

For reference there are a *lot* of different ways to get duplicate names,
to name a few besides setnames and creating one straight up - cbinding
similarly named data.tables, merging, having default named columns and
grouping (e.g. dt[, sum(smth), by = V1]), freading, etc.
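
For illustration, the use-time warning argued for above could look roughly like the sketch below. It is base R on a plain data.frame (built with `check.names = FALSE` so the duplicate name survives), and `col_by_name` is a hypothetical accessor, not a data.table function. It assumes the "first match wins" behavior mentioned earlier in the thread.

```r
# Hypothetical accessor: return the first column called `name`,
# warning when the name matches more than one column.
col_by_name <- function(x, name) {
  hits <- which(names(x) == name)
  if (length(hits) == 0L) {
    stop("no column named '", name, "'")
  }
  if (length(hits) > 1L) {
    warning("column name '", name, "' matches ", length(hits),
            " columns; using the first")
  }
  x[[hits[1L]]]
}

# check.names = FALSE lets a base data.frame keep the duplicate name.
df <- data.frame(a = 1, a = 2, b = 3, check.names = FALSE)
print(names(df))
print(suppressWarnings(col_by_name(df, "a")))  # ambiguous: warns, picks first
print(col_by_name(df, "b"))                    # unambiguous: no warning
```

With this design, tables carrying duplicate names stay legal, and only an ambiguous lookup is flagged.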

From aragorn168b at gmail.com  Sat Nov  2 16:41:53 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 2 Nov 2013 16:41:53 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
Message-ID: <94F078AB544B4757A58049C7DB7433AB@gmail.com>

> Now the issue you ran into is that you didn't realize that you were using non-standard naming (or even wanted to, but we can't guess what you want :)).  

I have to disagree here. This is one of the things `data.table` is *extremely* good at. The error/warning messages are precise and, under most circumstances, provide the solution (as to what you ought to do) by spotting the mistake exactly.

> Once you understand that there is nothing wrong with duplicate names, it should be clear that the appropriate warning spot is when you use them potentially incorrectly, and not when you set them.


This, I believe, is also not entirely true, at least in this scenario. For example, an error happens when assigning to a duplicate column using `:=`.


> DT <- data.table(x=1:5, y=6:10)
> DT[, c("y", "y") := 1L]
Error in `[.data.table`(DT, , `:=`(c("y", "y"), 1L)) :  
  Can't assign to the same column twice in the same query (duplicates detected).


So, it's only natural to expect a warning/error in other cases as well. In general, prevention is better: it's nicer to catch the problem early and emit a warning/error than to let it through, only to catch it later.

Overall, I agree keeping duplicate names may help some users. But then, the potential side-effects should be marked distinctly with warnings/errors in all cases (and preferably documented). E.g., grouping/aggregating is one such scenario (Ricardo's bug report) where we cannot possibly know which column to use.

Arun



From lianoglou.steve at gene.com  Sun Nov  3 01:10:07 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Sat, 2 Nov 2013 17:10:07 -0700
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <94F078AB544B4757A58049C7DB7433AB@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
Message-ID: <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>

Hi,

On Sat, Nov 2, 2013 at 8:41 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
[snip]
> Overall, I agree keeping duplicate names may help some users. But then, the
> potential side-effects should be marked with warnings/errors distinctly, in
> all cases (and preferably documented).
[/snip]

I guess I must have missed it, but has anyone anywhere (in this
thread, a FR or something) actually presented a (concrete) compelling
situation where allowing duplicate column names was actually useful?

I'm hard pressed to come up with any situation where (purposefully)
keeping duplicate column names in a data.table has more benefit than
downside. Seems to me that if this ever happens, it most certainly
would be by mistake.

Can someone help me out here?

In the case of cbinding two data.tables together that end up having
two duplicate names, I'd imagine unique-ing the names of the
data.tables and firing a warning that this was done would be most
useful (uniqueness priority would be from left to right as the
data.tables are passed into the cbind call)
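
A rough sketch of that idea, using base R's `make.unique()` on the combined name vector. The helper name `unique_names_warn` is hypothetical; this is not what data.table's cbind does, only an illustration of left-to-right de-duplication with a warning.

```r
# Hypothetical helper: de-duplicate names left to right and warn if
# any name had to be changed. make.unique() is base R and keeps the
# first occurrence of each name untouched.
unique_names_warn <- function(nms) {
  fixed <- make.unique(nms)
  changed <- fixed != nms
  if (any(changed)) {
    warning("duplicate column names made unique: ",
            paste(nms[changed], "->", fixed[changed], collapse = ", "))
  }
  fixed
}

left  <- c("id", "x")   # names of the first table passed to cbind
right <- c("x", "y")    # names of the second table
print(suppressWarnings(unique_names_warn(c(left, right))))
```

Because the first occurrence keeps its name, the leftmost table has priority, as proposed.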

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From aragorn168b at gmail.com  Sun Nov  3 01:31:28 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 3 Nov 2013 01:31:28 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
Message-ID: <CC4697C1513245D0962946F00231B89F@gmail.com>

> I guess I must have missed it, but has anyone anywhere (in this
> thread, a FR or something) actually present a (concrete) compelling
> situation where allowing duplicate column names was actually useful?

True, no really compelling situations so far. The only example I've seen (in this thread) concerns data presentation purposes (from eddi), though I still don't know exactly in what way. I can understand, though, that the data itself may sometimes be available in such a format. But one can always make the names unique while loading.


> I'm hard pressed to come up with any situation where (purposefully)
> keeping duplicate column names in a data.table has more benefit than
> downside. Seems to me that if this ever happens, it most certainly
> would be by mistake.


I agree.


> In the case of cbinding two data.tables together that end up having
> two duplicate names, I'd imagine unique-ing the names of the
> data.tables and firing a warning that this was done would be most
> useful (uniqueness priority would be from left to right as the
> data.tables are passed into the cbind call)


Unless there's a good argument for why unique-ing the names would be bad, or a case where keeping duplicate names would be good, I agree with you on this point as well.
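
A minimal sketch of the cbind behaviour described above, written against base R data.frames so that it stays self-contained. `cbind_unique_names` is a made-up helper name for illustration, not an existing data.table function; `make.unique` is base R and leaves the leftmost occurrence of a name untouched while suffixing later ones.

```r
# Hypothetical helper (not part of data.table): bind two tables column-wise,
# then make duplicated names unique and warn that this was done.
cbind_unique_names <- function(x, y) {
  out <- cbind(x, y)          # cbind() on data.frames keeps duplicate names
  old <- names(out)
  new <- make.unique(old)     # first occurrence keeps its name; later ones get ".1", ".2", ...
  if (!identical(old, new)) {
    warning("duplicate column names made unique: ",
            paste(unique(old[old != new]), collapse = ", "))
    names(out) <- new
  }
  out
}

left  <- data.frame(id = 1:2, value = c(10, 20))
right <- data.frame(value = c(0.1, 0.2), flag = c(TRUE, FALSE))

res <- suppressWarnings(cbind_unique_names(left, right))
names(res)   # "id" "value" "value.1" "flag"
```

Whether a warning or an error is the better signal is exactly the open question in this thread; the sketch only shows that left-to-right priority falls out of `make.unique` for free.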


Arun



From eduard.antonyan at gmail.com  Sun Nov  3 01:31:52 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 19:31:52 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
Message-ID: <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>

The main use case I've personally encountered is data presentation (for
either myself or others), where I would sometimes organize data like so:

category1 name,colname1,colname2,category2 name,colname1,colname2
....numbersandstuff....

Also, in general there are many cases I brought up above that generate
duplicate names, and I definitely don't want either lost columns or renamed
columns as a result - both are data loss that I don't appreciate.



From aragorn168b at gmail.com  Sun Nov  3 01:36:35 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 3 Nov 2013 01:36:35 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
Message-ID: <F9719F33B38C4C75A97B476179984868@gmail.com>

Eddi, 
If it is essential to keep names intact while loading the data, we could probably add an argument, "asis=TRUE" or something like that. But I don't see a reason for `data.table` to support duplicate names anywhere else and then try to catch errors, when nothing meaningful can be done with them. Besides data presentation, can you name any other use for them?
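
Base R's `read.csv` already has a switch of this kind, `check.names`, which may be a useful reference point for what an "asis"-style argument could do. The sketch below uses only base R; "asis" itself is just the name floated above, not an existing argument.

```r
csv <- "x,y,x\n1,2,3\n"

# check.names = FALSE keeps the header exactly as it appears in the file
as_is <- read.csv(text = csv, check.names = FALSE)
names(as_is)    # "x" "y" "x"

# the default (check.names = TRUE) makes the names syntactically valid and unique
renamed <- read.csv(text = csv)
names(renamed)  # "x" "y" "x.1"
```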

Arun



From eduard.antonyan at gmail.com  Sun Nov  3 01:43:56 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 19:43:56 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <F9719F33B38C4C75A97B476179984868@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
Message-ID: <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>

Tbh I don't see why data presentation and preservation (i.e. if you're
reading in data with duplicated columns) is not enough of a use case -
that's the only reason we allow arbitrary symbols in column names.

So, instead of giving you another use case, how about you tell me what you
propose should happen here (instead of what happens now):

> dt = data.table(1, 2)
> dt
   V1 V2
1:  1  2
> dt[, sum(V2), by = V1]
   V1 V1
1:  1  2



From aragorn168b at gmail.com  Sun Nov  3 01:47:42 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 3 Nov 2013 01:47:42 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
Message-ID: <D01E0FA8318948DFABBE75BAEF61BC6E@gmail.com>

> > dt[, sum(V2), by = V1]
>    V1 V1
> 1:  1  2
Eddi, the simplest explanation is that, since we generate auto-names, we should check whether a V-series name already exists and, if so, generate the next one automatically. In this case, my thinking is: "V1" is the grouping column and is going to be retained. "V2" is used in "j", but the result has no name. So we should be able to see that "V1" is already taken and assign "V2" automatically. At least, that's what I think *should* happen. I think we can check the names of the "list(...)" argument in "j" to do this, but I'm not sure.
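
The "next free V-name" idea can be sketched in a few lines of base R. `next_auto_names` is a hypothetical helper written for illustration only; it is not how data.table actually assigns names.

```r
# Hypothetical helper (not data.table's actual code): return `n` auto-generated
# "V<i>" names, skipping any that are already taken.
next_auto_names <- function(taken, n = 1L) {
  out <- character(0)
  i <- 1L
  while (length(out) < n) {
    candidate <- paste0("V", i)
    if (!(candidate %in% taken)) out <- c(out, candidate)
    i <- i + 1L
  }
  out
}

next_auto_names("V1")               # "V2"
next_auto_names(c("V1", "V3"), 2L)  # "V2" "V4"
```

In the grouping example, `taken` would be the names of the `by` columns, so the unnamed `sum(V2)` result would come out as "V2" rather than a second "V1".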


Arun



From lianoglou.steve at gene.com  Sun Nov  3 02:15:38 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Sat, 2 Nov 2013 18:15:38 -0700
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
Message-ID: <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>

On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
<eduard.antonyan at gmail.com> wrote:
> Tbh I don't see why data presentation and preservation (i.e. if you're
> reading in data with duplicated columns) is not enough of a use case -
> that's the only reason we allow arbitrary symbols in column names.
>
> So, instead of giving you another use case, how about you tell me instead
> what do you propose should happen here (instead of what happens now):
>
>> dt = data.table(1, 2)
>> dt
>    V1 V2
> 1:  1  2
>> dt[, sum(V2), by = V1]
>    V1 V1
> 1:  1  2

Only Matthew could say for sure, but if I were a gambling man I'd bet
that this was likely something that slipped through the cracks and
sleeping dogs were left to lie. I'd be curious to see what his
opinions on this are.

IMHO the "data presentation" argument doesn't really hold much water.

As for "data preservation," I rather see it as imposing structure on
the data to enable efficient -- and sane/unambiguous -- computation over it.
Further, I don't think this is a preservation issue at all -- no data is
lost. The original data is still there in the file that was loaded
into R. The name of a column is changed when imported (with adequate
warning) into a data.table so that the user can slice and dice it. I'd
also guess the user being warned about the duplicate names would most
likely be happy to receive the warning, but the fact that you disagree
suggests that this isn't an obvious conclusion ;-)

I'm curious if you would argue for an SQL table to allow duplicate
column names for the same reasons? I do know you can torture SQL to
get two colnames to be the same by aliasing, but this also seems to
have slipped through as an accident:

http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf

(which I found from here):
http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table

Perhaps we should email this guy Hugh to see what he thinks about this one :-)

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com  Sun Nov  3 02:43:02 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 2 Nov 2013 20:43:02 -0500
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
Message-ID: <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>

@Arun: Ok. Thinking about it a bit - I don't like the continuing
enumeration solution because it makes the results too unpredictable, but I
could live with adding a ".1" etc., which I assume is the idea anyway for
resolving duplicates elsewhere.

@Steve: Not sure why you think it doesn't hold much water - I think I can
draw a parallel argument that replicates all of the duplicated-names
concerns with a column that is called e.g. `dt$V1` (imagine forgetting the
backticks there and the world of hurt that potentially awaits once you do
that). I am also curious what Matthew would think about this. This is something
I've encountered and dealt with a lot, so I'm certainly not an unbiased
party here.
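
To make the `dt$V1` parallel concrete, here is a small base R sketch (a data.frame is assumed instead of a data.table to keep it self-contained). It shows that such a column name is legal, and that the backticks are the only thing separating the column from a `$` extraction.

```r
# A column whose name is literally "dt$V1"; check.names = FALSE stops
# data.frame() from rewriting it into a syntactically valid name.
df <- data.frame(V1 = 1:2, `dt$V1` = 3:4, check.names = FALSE)
names(df)            # "V1" "dt$V1"

# With backticks (or a string) the odd name refers to the column itself.
with(df, `dt$V1`)    # 3 4
df[["dt$V1"]]        # 3 4

# Without backticks, dt$V1 parses as `$` applied to an object named `dt`,
# which is a different expression entirely.
identical(quote(dt$V1), quote(`dt$V1`))   # FALSE
```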



From chinmay.patil at gmail.com  Mon Nov  4 10:54:27 2013
From: chinmay.patil at gmail.com (Chinmay Patil)
Date: Mon, 4 Nov 2013 17:54:27 +0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
Message-ID: <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>

FWIW, data.frame allows duplicate names as well. Given that
data.table inherits from data.frame, I would expect it to follow the same
convention as data.frame.
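
For reference, a short base R sketch of what "allows" means in practice. Only base R is used; note that the `data.frame()` constructor renames duplicates by default, and they are kept only when `check.names = FALSE` is passed (or when names are assigned afterwards).

```r
# data.frame() de-duplicates names by default ...
names(data.frame(a = 1, a = 2))    # "a" "a.1"

# ... but duplicates are permitted when check.names = FALSE
df <- data.frame(a = 1, a = 2, check.names = FALSE)
names(df)                          # "a" "a"

# Lookup by name silently returns the first matching column
df$a                               # 1
df[["a"]]                          # 1
```

The silent first-match lookup is the kind of side effect the earlier messages suggest flagging with a warning.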



From eduard.antonyan at gmail.com  Wed Nov  6 17:05:04 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 6 Nov 2013 10:05:04 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
Message-ID: <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>

Last comment here has an example of using duplicated names -
http://stackoverflow.com/a/19809942/817778 - it's very similar to the one I
mentioned earlier.


On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil <chinmay.patil at gmail.com> wrote:

> FWIW, data.frame allows duplicate names as well. Given that data.table
> inherits from data.frame, I would expect it to follow the same
> convention as data.frame.
>
>
> On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan <eduard.antonyan at gmail.com
> > wrote:
>
>> @Arun: Ok. Thinking about it a bit - I don't like the continuing
>> enumeration solution because it makes the results too unpredictable, but
>> could live with adding a ".1" etc. Which I assume is the idea anyway for
>> resolving duplicates elsewhere.
>>
>> @Steve: Not sure why you think it doesn't hold much water - I think I can
>> draw a parallel argument that replicates all of the duplicated names
>> concerns with a column that is called e.g. `dt$V1` (imagine forgetting the
>> backticks there and the world of hurt that potentially awaits once you do
>> that). I am also curious what Matthew would think about this. This is
>> something I've encountered and dealt with a lot, so I'm certainly not an
>> unbiased party here.
>>
>>
>> On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou <lianoglou.steve at gene.com
>> > wrote:
>>
>>> On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
>>> <eduard.antonyan at gmail.com> wrote:
>>> > Tbh I don't see why data presentation and preservation (i.e. if you're
>>> > reading in data with duplicated columns) is not enough of a use case -
>>> > that's the only reason we allow arbitrary symbols in column names.
>>> >
>>> > So, instead of giving you another use case, how about you tell me
>>> instead
>>> > what do you propose should happen here (instead of what happens now):
>>> >
>>> >> dt = data.table(1, 2)
>>> >> dt
>>> >    V1 V2
>>> > 1:  1  2
>>> >> dt[, sum(V2), by = V1]
>>> >    V1 V1
>>> > 1:  1  2
>>>
>>> Only Matthew could say for sure, but if I were a gambling man I'd bet
>>> that this was likely something that slipped through the cracks and
>>> sleeping dogs were left to lie. I'd be curious to see what his
>>> opinions on this are.
>>>
>>> IMHO the "data presentation" argument doesn't really hold much water.
>>>
>>> As for "data preservation," I rather see it as imposing structure on
>>> it to enable efficient -- and sane/unambiguous -- computation over it.
>>> Further, I don't think this is a preservation issue at all -- no data is
>>> lost. The original data is still there in the file that was loaded
>>> into R. The name of a column is changed when imported (with adequate
>>> warning) into a data.table so that the user can slice and dice it. I'd
>>> also guess the user being warned by the duplicate names would most
>>> likely be happy to receive the warning, but the fact that you disagree
>>> suggests that this isn't an obvious conclusion ;-)
>>>
>>> I'm curious if you would argue for an SQL table to allow duplicate
>>> column names for the same reasons? I do know you can torture SQL to
>>> get two colnames to be the same by aliasing, but this also seems to
>>> have slipped through as an accident:
>>>
>>> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
>>>
>>> (which I found from here):
>>>
>>> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
>>>
>>> Perhaps we should email this guy Hugh to see what he thinks about this
>>> one :-)
>>>
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Computational Biologist
>>> Bioinformatics and Computational Biology
>>> Genentech
>>>
>>
>
>

From aragorn168b at gmail.com  Wed Nov  6 17:10:26 2013
From: aragorn168b at gmail.com (aragorn168b at gmail.com)
Date: Wed, 6 Nov 2013 17:10:26 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
Message-ID: <9F7DC50A9B2C470C952973F162105BC4@gmail.com>

Eddi,  
Nice! But what exactly will happen to that data, if we were to automatically set unique names while loading it (using "fread") (and issue a warning)?

Arun


On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote:

> Last comment here has an example of using duplicated names - http://stackoverflow.com/a/19809942/817778 - it's very similar to the one I mentioned earlier.



From eduard.antonyan at gmail.com  Wed Nov  6 17:34:18 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 6 Nov 2013 10:34:18 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
Message-ID: <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>

You mean what would be the problem?

Well, if the user fread's that data, then modifies e.g. non-duplicate
columns and then tries to write.csv it back - how would the user recover
the original names for correctly writing the data back if we renamed the
columns?
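
To make the round-trip concern concrete, here is a minimal sketch in base R.
The column names are made up for illustration, and make.unique() only stands
in for whatever renaming scheme might be applied on load; it is an
assumption, not fread's behaviour.

```r
# Header as it might appear in a file with duplicated column names
# (hypothetical names, for illustration only).
orig_names <- c("id", "value", "value")

# One possible automatic de-duplication: base R's make.unique() appends
# ".1", ".2", ... to repeated names.
renamed <- make.unique(orig_names)
renamed   # "id" "value" "value.1"

# Without a saved copy of the original header, it has to be guessed back.
# Stripping a trailing ".<digits>" is only a heuristic.
recovered <- sub("\\.[0-9]+$", "", renamed)
identical(recovered, orig_names)   # TRUE for this header

# The heuristic fails when a genuine name already ends in ".<digits>".
tricky <- c("x.1", "x.1")
tricky_recovered <- sub("\\.[0-9]+$", "", make.unique(tricky))
identical(tricky_recovered, tricky)   # FALSE
```

So the original names can only be restored reliably if the caller keeps them
around explicitly before any renaming happens.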


On Wed, Nov 6, 2013 at 10:10 AM, <aragorn168b at gmail.com> wrote:

>  Eddi,
> Nice! But what exactly will happen to that data, if we were to
> automatically set unique names while loading it (using "fread") (and issue
> a warning)?
>
> Arun

From aragorn168b at gmail.com  Wed Nov  6 23:50:39 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 6 Nov 2013 23:50:39 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
Message-ID: <F032BA02EC23428D91D17F83936DBA57@gmail.com>

Eddi,  

1) We can still allow duplicate names in "fread" and during creation of a data.table with the data.table() command.
2) There's no *real* loss of data, as we can allow "setnames" to set duplicate names or de-duplicate them (and users still have the original data in the file they loaded into R using fread).
3) The point is to decide where duplicate names are allowed and where they should give an error.

As I said before, I think it's essential to allow duplicate names while loading a file (and therefore, for consistency, during creation of a data.table as well). However, all grouping/aggregating/subsetting etc. where ambiguity can arise should end in an error. At least this is my stance so far. Do we agree on this?
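
A rough sketch of the kind of check this stance implies, in base R. The
helper check_unambiguous() is hypothetical and not part of data.table; it
only illustrates erroring when a column referred to by name occurs more than
once, while leaving tables with duplicate names otherwise untouched.

```r
# Hypothetical helper (not a data.table function): stop if any column the
# caller refers to by name is duplicated in the table.
check_unambiguous <- function(x, cols) {
  nm <- names(x)
  dup <- intersect(cols, nm[duplicated(nm)])
  if (length(dup) > 0L) {
    stop("ambiguous column name(s): ", paste(dup, collapse = ", "))
  }
  invisible(TRUE)
}

# check.names = FALSE lets data.frame keep the duplicated name "a".
df <- data.frame(a = 1, a = 2, b = 3, check.names = FALSE)

ok <- check_unambiguous(df, "b")   # "b" is unique, so no error

# Referring to "a" is ambiguous; capture the error message instead of failing.
msg <- tryCatch(check_unambiguous(df, "a"),
                error = function(e) conditionMessage(e))
msg   # "ambiguous column name(s): a"
```

Creating and loading the table succeeds either way; only an operation that
names the duplicated column is rejected.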

Arun


On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote:

> You mean what would be the problem?
>  
> Well, if the user fread's that data, then modifies e.g. non-duplicate columns and then tries to write.csv it back - how would the user recover the original names for correctly writing the data back if we renamed the columns?  


From lianoglou.steve at gene.com  Thu Nov  7 00:01:05 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Wed, 6 Nov 2013 15:01:05 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <F032BA02EC23428D91D17F83936DBA57@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
Message-ID: <CAHA9McNCE0cxYrqxwKOpeLZ+qyW-0qfoM7kthA_JkpYHse3DMw@mail.gmail.com>

On Wed, Nov 6, 2013 at 2:50 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Eddi,
>
> 1) We can still allow duplicate names in "fread" and during creation of
> data.table with the data.table() command.
> 2) There's really no loss of data as we can allow "setnames" to set
> duplicate names/unduplicate them (and they anyways have the data as they
> load that into R using fread). There's therefore no *real* loss of data.
> 3) The point is to decide upon where duplicate names are allowed and where
> it should give an error?
>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc.. where ambiguity
> can arise should end in error. At least this is my stance so far. Are we
> agreeing on this?

Add "evaluation in `j`" to the things you want to throw an error, and
I guess I'm ok w/ Arun's stance, too, since I guess we should stay as
close to data.frame as possible (even though I think it's still
"wrong" to have duplicate column names in principle).

I guess a more clever handling of setnames needs to happen too, as it
fails if the target data.table has any duplicate names (I'm assuming
this has come up already, but I'm only half-tuned-in to this
discussion)

I also think that the output of the aggregation example Eddi used
earlier should be changed, ie:

R> x <- data.table(V1=sample(letters[1:3], 10, replace=TRUE), B=rnorm(10))
R> x[, sum(B), by=V1]
   V1         V1
1:  b -0.8581098
2:  a  0.8762710
3:  c  1.3274762

Just feels wrong for the `sum`ed column to also be V1, but maybe this
is an FR for another day.
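
Until the default changes, naming the aggregate explicitly in `j` sidesteps
the duplicate. A small sketch; the column name "total" and the seed are
arbitrary choices for illustration:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the example is repeatable
x <- data.table(V1 = sample(letters[1:3], 10, replace = TRUE), B = rnorm(10))

# Giving the aggregate a name in j avoids a second column called V1.
res <- x[, list(total = sum(B)), by = V1]
names(res)   # "V1" "total"
```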

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com  Thu Nov  7 00:04:51 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 6 Nov 2013 17:04:51 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <F032BA02EC23428D91D17F83936DBA57@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
Message-ID: <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>

>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc.. where
> ambiguity can arise should end in error. At least this is my stance so far.
> Are we agreeing on this?


Sounds good to me.


On Wed, Nov 6, 2013 at 4:50 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:

>  Eddi,
>
> 1) We can still allow duplicate names in "fread" and during creation of
> data.table with the data.table() command.
> 2) There's really no loss of data as we can allow "setnames" to set
> duplicate names/unduplicate them (and they anyways have the data as they
> load that into R using fread). There's therefore no *real* loss of data.
> 3) The point is to decide upon where duplicate names are allowed and where
> it should give an error?
>
> As I said before, I think it's essential to allow duplicate names while
> loading a file (and therefore for consistency during creation of data.table
> as well). However, all grouping/aggregating/subsetting etc.. where
> ambiguity can arise should end in error. At least this is my stance so far.
> Are we agreeing on this?
>
> Arun
>
> On Wednesday, November 6, 2013 at 5:34 PM, Eduard Antonyan wrote:
>
> You mean what would be the problem?
>
> Well, if the user fread's that data, then modifies e.g. non-duplicate
> columns and then tries to write.csv it back - how would the user recover
> the original names for correctly writing the data back if we renamed the
> columns?
>
>
> On Wed, Nov 6, 2013 at 10:10 AM, <aragorn168b at gmail.com> wrote:
>
>  Eddi,
> Nice! But what exactly will happen to that data, if we were to
> automatically set unique names while loading it (using "fread") (and issue
> a warning)?
>
> Arun
>
> On Wednesday 6 November 2013 at 17:05, Eduard Antonyan wrote:
>
> Last comment here has an example of using duplicated names -
> http://stackoverflow.com/a/19809942/817778 - it's very similar to the one
> I mentioned earlier.
>
>
> On Mon, Nov 4, 2013 at 3:54 AM, Chinmay Patil <chinmay.patil at gmail.com> wrote:
>
>  FWIW, data.frame does allow duplicate names as well. In the light that
> data.table inherits from data.frame, I would expect that it follows same
> convention as data.frame.
>
>
> On Sun, Nov 3, 2013 at 9:43 AM, Eduard Antonyan <eduard.antonyan at gmail.com
> > wrote:
>
> @Arun: Ok. Thinking about it a bit - I don't like the continuing
> enumeration solution because it makes the results too unpredictable, but
> could live with adding a ".1" etc. Which I assume is the idea anyway for
> resolving duplicates elsewhere.
>
> @Steve: Not sure why you think it doesn't hold much water - I think I can
> draw a parallel argument that replicates all of the duplicated names
> concerns with a column that is called e.g. `dt$V1` (imagine forgetting the
> backticks there and the world of hurt that potentially awaits once you do
> that). I am also curious what Matthew would think about this. This is smth
> I've encountered and dealt with a lot, so I'm certainly not an unbiased
> party here.
>
>
> On Sat, Nov 2, 2013 at 8:15 PM, Steve Lianoglou <lianoglou.steve at gene.com> wrote:
>
>  On Sat, Nov 2, 2013 at 5:43 PM, Eduard Antonyan
> <eduard.antonyan at gmail.com> wrote:
> > Tbh I don't see why data presentation and preservation (i.e. if you're
> > reading in data with duplicated columns) is not enough of a use case -
> > that's the only reason we allow arbitrary symbols in column names.
> >
> > So, instead of giving you another use case, how about you tell me instead
> > what do you propose should happen here (instead of what happens now):
> >
> >> dt = data.table(1, 2)
> >> dt
> >    V1 V2
> > 1:  1  2
> >> dt[, sum(V2), by = V1]
> >    V1 V1
> > 1:  1  2
>
> Only Matthew could say for sure, but if I were a gambling man I'd bet
> that this was likely something that slipped through the cracks and
> sleeping dogs were left to lie. I'd be curious to see what his
> opinions on this are.
>
> IMHO the "data presentation" argument doesn't really hold much water.
>
> As for "data preservation," I rather see it as imposing structure on
> it to enable efficient -- and sane/unambiguous -- computation over it.
> Further, I don't think this is a preservation issue at all -- no data is
> lost. The original data is still there in the file that was loaded
> into R. The name of a column is changed when imported (with adequate
> warning) into a data.table so that the user can slice and dice it. I'd
> also guess the user being warned by the duplicate names would most
> likely be happy to receive the warning, but the fact that you disagree
> suggests that this isn't an obvious conclusion ;-)
>
> I'm curious if you would argue for an SQL table to allow duplicate
> column names for the same reasons? I do know you can torture SQL to
> get two colnames to be the same by aliasing, but this also seems to
> have slipped through as an accident:
>
> http://www.dcs.warwick.ac.uk/~hugh/TTM/Importance-of-Column-Names.pdf
>
> (which I found from here):
>
> http://stackoverflow.com/questions/8797593/is-there-any-use-to-duplicate-column-names-in-a-table
>
> Perhaps we should email this guy Hugh to see what he thinks about this one
> :-)
>
> -steve
>
> --
> Steve Lianoglou
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>

From simon.ohanlon at imperial.ac.uk  Fri Nov  8 14:30:55 2013
From: simon.ohanlon at imperial.ac.uk (Simon O'Hanlon)
Date: Fri, 8 Nov 2013 13:30:55 +0000 (UTC)
Subject: [datatable-help] Unexpected behavior in setnames()
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
Message-ID: <loom.20131108T141659-567@post.gmane.org>

Eduard Antonyan <eduard.antonyan <at> gmail.com> writes:

> > As I said before, I think it's essential to allow duplicate names while
> > loading a file (and therefore for consistency during creation of data.table
> > as well). However, all grouping/aggregating/subsetting etc.. where
> > ambiguity can arise should end in error. At least this is my stance so far.
> > Are we agreeing on this?
>
> Sounds good to me.
>

I am not particularly opposed to duplicate column names, or otherwise,
although I do see the issues they create.

I think that whatever you, as custodians of data.table, decide with respect
to column names, the behaviour of numeric indices used to indicate columns
included in .SD needs to be fixed when duplicate column names are present.
As a user I'd expect the following to return two columns with the values 2
and 6 respectively:

Example:

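library(data.table)

dt <- data.table( 1,2,3,4 )
setnames(dt , rep( c("a", "b") , 2 ) )
dt
#    a b a b
# 1: 1 2 3 4

# Observed (not expected) result: the first "a" column is used twice.
dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
#    a a
# 1: 2 2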

I hope that contributes in some small way to your decision making process.
This is lifted from a question I asked on Stack Overflow here:

http://stackoverflow.com/questions/19811644/can-data-table-handle-identical-column-names-when-using-sdcols
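
For readers following along, the positional behaviour asked for above can be
sketched in base R, independent of how .SDcols resolves names. This is only an
illustration on a plain data.frame with the same duplicate names, not
data.table's implementation; `idx` and `res` are names made up for the example.

```r
# Same shape as the example above, but as a base-R data.frame.
df <- data.frame(1, 2, 3, 4)
names(df) <- rep(c("a", "b"), 2)   # names are now a, b, a, b

# Pick columns strictly by position; `[[` with an integer never consults names.
idx <- c(1L, 3L)
res <- lapply(idx, function(k) df[[k]] * 2)
names(res) <- names(df)[idx]       # keep the (duplicate) display names

str(res)   # a list of two elements, both named "a", holding 2 and 6
```

Indexing by position sidesteps the name lookup entirely, which is why the
duplicate names cause no ambiguity here.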


Thanks,


Simon


From lianoglou.steve at gene.com  Fri Nov  8 15:03:12 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 06:03:12 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <loom.20131108T141659-567@post.gmane.org>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
Message-ID: <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>

Hi Simon,

On Fri, Nov 8, 2013 at 5:30 AM, Simon O'Hanlon
<simon.ohanlon at imperial.ac.uk> wrote:

> I am not particularly opposed or otherwise, to duplicate column names,
> although I do see the issues that creates.
>
> I think that whatever you, as custodians of data.table decide with respect
> to column names, the behaviour of numeric indices to indicate columns
> included in .SD needs to be fixed when duplicate column names are present.
> As a user I'd expect the following to return two columns with the values 2
> and 6 respectively:
>
> Example:
>
> dt <- data.table( 1,2,3,4 )
> setnames(dt , rep( c("a", "b") , 2 ) )
>    a b a b
> 1: 1 2 3 4
>
> dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
>    a a
> 1: 2 2
>
> I hope that contributes in some small way to your decision making process.
> This is lifted from a question I asked on Stack Overflow here;
>
> http://stackoverflow.com/questions/19811644/can-data-table-handle-identical-column-names-when-using-sdcols

I agree -- when using numeric column indices, this is clearly wrong and I
would expect an answer of 2 and 6.

I'm curious what you think, however, when you use the names of the
columns in .SDcols.

If you ask for .SDcols="a", would you expect the first "a" column to be
used, or all of them? To use all of them, would you expect to use
.SDcols=c('a', 'a')?

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From simon.ohanlon at imperial.ac.uk  Fri Nov  8 15:47:13 2013
From: simon.ohanlon at imperial.ac.uk (Simon O'Hanlon)
Date: Fri, 8 Nov 2013 14:47:13 +0000 (UTC)
Subject: [datatable-help] Unexpected behavior in setnames()
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
Message-ID: <loom.20131108T151301-250@post.gmane.org>

Steve Lianoglou <lianoglou.steve <at> gene.com> writes:

> > As a user I'd expect the following to return two columns with the values
> > 2 and 6 respectively:
> >
> > Example:
> >
> > dt <- data.table( 1,2,3,4 )
> > setnames(dt , rep( c("a", "b") , 2 ) )
> >    a b a b
> > 1: 1 2 3 4
> >
> > dt[ , lapply( .SD ,function(x) x*2 ) , .SDcols = c(1,3) ]
> >    a a
> > 1: 2 2
 
> I agree -- when using numeric columns, this is clearly wrong and I
> would expect an answer of 2 and 6.
> 
> I'm curious what you think, however, when you use the names of the
> columns in .SDcols
> 
> If you ask .SDcols="a" would you expect the first "a" column to be
> used, or all of them? To use all of them, would you expect to use
> .SDcols=c('a', 'a')?
> 
> -steve

Hi Steve,
That, I guess, is the big question. Approaching it from the point of view
that duplicate column names are allowed: if, in the above example, I use
.SDcols = "a", there are a number of things that *could* happen:

1) data.table ignores dupe names and uses the first matching column, up
to the number of times that name appears, and gives no warning (as I
understand it, this is the current behaviour and probably the least
desirable, IMHO).

2) As above, but with a warning - least work from a developer standpoint, I
guess!

3) Both columns are used piece-wise from left to right and have a unique
suffix appended, with a warning that this occurred due to duplicate column
names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however,
there is the complication that you then need to ensure you are not creating
a new duplicate of an existing column name). This precludes you from
referring to a specific column name in the j function, though (but this
could be part of the warning, forcing a user to give a column a unique name
if they want to refer to it directly).

4) Most work/most flexible(?): on instantiation, all columns in a data.table
have a hidden attribute created that is a unique column name, which may be
referred to in j with an accessor function. For example, "a" and "a"
could be differentiated as .(a.1) and .(a.2) but return results under "a"
and "a". There would also need to be a function to view the mapping of
printed names to the unique attribute names, e.g. colnames( dt ,
include.hidden = TRUE ) would return a list of the column names and the
underlying unique names, allowing a 'power-user' to refer to duplicate
column names with a unique identifier using the accessor function. IMHO
this is a huge amount of work, probably unsafe and prone to many bugs. Not
sure I'd even attempt it, but thought it worth bringing up.

In conclusion, my vote would be for the current behaviour but with a warning
about needing to set unique column names for calculations, or to use numeric
indices, in which case the handling of numeric indices should probably be
"fixed" (I use that loosely because one might argue that it is not broken, it
just doesn't do what one might intuitively expect!).
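
Option 3 above can be sketched with base R's make.unique(), which also deals
with the complication of colliding with an existing name. The helper name
`suffix_dupes` is hypothetical, and note that make.unique() leaves the first
occurrence unsuffixed, so its scheme differs from the "a.1" "a.2" numbering
suggested above.

```r
# Hypothetical helper: warn about duplicate names and return unique ones.
suffix_dupes <- function(nms) {
  if (anyDuplicated(nms)) {
    warning("duplicate column names were made unique: ",
            paste(unique(nms[duplicated(nms)]), collapse = ", "))
    nms <- make.unique(nms)
  }
  nms
}

# Warnings are suppressed here only to keep the example output short.
new_names <- suppressWarnings(suffix_dupes(c("a", "b", "a", "b")))
print(new_names)

# A name that already looks like a suffix ("a.1") must not be produced twice.
tricky <- suppressWarnings(suffix_dupes(c("a", "a.1", "a")))
print(tricky)
```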


From aragorn168b at gmail.com  Fri Nov  8 16:09:12 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 16:09:12 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <loom.20131108T151301-250@post.gmane.org>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
Message-ID: <9996C9517A244BFF81E353BEC9CD130C@gmail.com>

Simon, 
I've replied to your last post inline:

> 1) data.table ignores dupe names and uses the first such matching column up
> to the number of times that name appears and gives no warning (as I
> understand it, current behaviour and probably least desirable IMHO).


FYI, this is what data.frame does:

DF <- data.frame(x=1:5, x=6:10, check.names=FALSE)
DF[, c("x")]
DF[, c("x", "x")]

In fact, while doing this subsetting, it automatically makes the column names unique.

*Admittedly, DF[, 1:2] gives the right columns, but still the names are made unique.*
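
The data.frame behaviour described above can be checked directly in base R.
This is a small sketch; the variable names `by_name` and `by_pos` are only for
illustration.

```r
DF <- data.frame(x = 1:5, x = 6:10, check.names = FALSE)

by_name <- DF[, c("x", "x")]   # both requests match the first "x" column
by_pos  <- DF[, 1:2]           # positions pick the two distinct columns

names(by_name)                          # names are made unique in the result
identical(by_name[[1]], by_name[[2]])   # same underlying column twice
identical(by_pos[[2]], DF[[2]])         # second column really is the second
```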

> 3) both columns are used piece-wise from left to right and have a unique
> suffix appended with a warning that this occurred due to duplicate column
> names, so e.g. "a.1" "a.2", in a similar fashion to data.frames (however
> there is the complication that you then need to ensure you are not creating
> a new duplicate from an existing column name). This precludes you from
> referring to a specific column name in the j function though (but this could
> be part of the warning forcing a user to give a column a unique name if
> they want to refer to it directly)


This will be a problem when evaluating expressions in `j`. Suppose you have:


DT <- data.table(x=1:5, x=6:10, ID=1:5)

And you do: DT[, list(x=x*2), by=ID]. While creating the data.table DT, the names are not changed (or so far, the consensus is not to change them). So if, during an operation, we were to change the dup names to unique names, we'd have trouble mapping expressions in `j` accordingly. Note that even if we didn't, this expression is ill-posed.

Also think about `setkey` function.

> 4) Most work/most flexible(?); On instantiation all columns in a data.table
> have a hidden attribute created that is a unique column name, which may be
> referred to in the j with an accessor function, for example "a" and "a"
> could be differentiated as .(a.1) and .(a.2) but return results under "a"
> and "a". There would also need to be a function to view the mapping of
> printed names to the unique attribute names, e.g. colnames( dt ,
> include.hidden = TRUE ) then returns a list of the column names and the
> underlying unique names allowing a 'power-user' to refer to duplicate column
> names with a unique identifier using the accessor function. IMHO
> this is a huge amount of work, probably unsafe and prone to many bugs. Not
> sure I'd even attempt it, but thought it worth bringing up.


I agree with your conclusion. This is not even feasible, as the mapping is ill-posed. If the expression in `j` contains only one of the duplicate columns, which one would you map it to: .(a.1) or .(a.2)?

> In conclusion my vote would be for current behaviour but with a warning
> about needing to set unique column names for calculations, or using numeric
> indices, in which case the handling of numeric indices should probably be
> "fixed" (I use that loosely because one might argue that it is not broken it
> just doesn't do what one might intuitively expect!).


In my opinion, dup-names should be allowed *only* during creation of a data.table and when setting names (using `setnames`, `setattr` or the bad form `names(dt) <- `). Other than that, *ALL* operations should fail (end in error), and that includes subsetting operations. `setnames` gives the user the option to set the names back before writing to a file, should he choose to keep them at the end.


I think it's much better this way (strict, but avoids confusion). For example, in data.frames, doing DF$x (when x occurs twice) silently returns only the first (no warning/error). Also, split(DF$x, DF$x) uses the first column and so does split(DF, DF$x).
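
The strict policy proposed here could look roughly like the following check, run before any operation that refers to columns by name. `assert_unambiguous` is a hypothetical helper written for this sketch, not a data.table function, and the error wording is invented.

```r
# Hypothetical guard: error if any requested column name is ambiguous.
assert_unambiguous <- function(col_names, requested) {
  dupes <- unique(col_names[duplicated(col_names)])
  bad <- intersect(requested, dupes)
  if (length(bad)) {
    stop("ambiguous column name(s): ", paste(bad, collapse = ", "),
         "; rename with setnames() first", call. = FALSE)
  }
  invisible(TRUE)
}

nms <- c("x", "x", "ID")
ok  <- assert_unambiguous(nms, "ID")   # a unique name passes
err <- tryCatch(assert_unambiguous(nms, "x"), error = conditionMessage)
print(err)                             # the duplicated name is rejected
```

Only names that are actually used in the operation are checked, so a table may still carry duplicate names as long as they are not referenced.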


Arun



From lianoglou.steve at gene.com  Fri Nov  8 21:02:07 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:02:07 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
Message-ID: <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>

Hi,

I wanted to point out that I'm in Arun's camp on this one:

On Fri, Nov 8, 2013 at 7:09 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:

> In my opinion, the dup-names should be allowed *only* during creation of
> data.table, and setting names (using `setnames`, `setattr` or the bad form
> `names(dt) <- `). Other than that, *ALL* operations should fail (end up in
> error), and that includes subsetting operation. The `setnames` gives the
> option for the user to set the names back before writing to a file, should
> he choose to keep it at the end.
>
> I think it's much better this way (strict, but avoids confusion). For
> example, in data.frames, doing DF$x (when x occurs twice) implicitly prints
> only the first (no warning/error). Also, split(DF$x, DF$x) uses the first
> column and so does split(DF, DF$x).

As an opinionated footnote: I can concede that since data.frames
allow duplicated column names, I *guess* data.table should *allow*
them. However, as is clear (to me) from this long chain of
"possibilities", I strongly feel that computing over a data.table
w/ duplicated column names is a fundamentally broken idea, as it is
ambiguous what the right behavior should be ... never mind the
(surely fun) book-keeping code required to make it happen.

You want to import a table with duplicate names? Fine (we should warn
on import if it was `fread` or `as.data.table`d).

You want to set some names to duplicates? Fine -- warn there too.
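
A minimal sketch of the "allow it, but warn" idea for setting names, in
base R. The helper `set_names_warn()` is hypothetical and is not part of
data.table; it only illustrates the check, using a data.frame as a
stand-in for a data.table.

```r
# Hypothetical helper (not a data.table function): assign new names to a
# list-like object and warn if the result contains duplicated names.
set_names_warn <- function(x, new_names) {
  stopifnot(length(new_names) == length(x))
  dups <- unique(new_names[duplicated(new_names)])
  if (length(dups) > 0L) {
    warning("duplicated column names: ", paste(dups, collapse = ", "))
  }
  names(x) <- new_names
  x
}

d <- data.frame(a = 1, b = 2, c = 3)

# Renaming "a" to "b" is still allowed, but it is no longer silent.
# The handler reports the warning as a message and then muffles it.
d2 <- withCallingHandlers(
  set_names_warn(d, c("b", "b", "c")),
  warning = function(w) {
    message("warned: ", conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
names(d2)
```

The rename itself goes through unchanged; only the notification is new,
so existing code that relies on duplicate names would keep working.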

Want to do any computation inside the data.table via `j` or as a
column in `by`? Throw an error and punt the problem to the user to
figure out how they would like to disambiguate the first column named
"a" from the 10th one -- I don't think we need another FAQ explaining
what "the right" way to do this is, and why we picked it.

Or if you really want to compute over a data.table with duplicate
names, you might be better served by having the table in "long" format
-- perhaps that's why there are duplicate column names to begin with
(I'm guessing -- I still don't think I would ever want to have duped
names on purpose)
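
A rough sketch of the "long" format idea, in base R with a data.frame
standing in for a data.table. The column names `row`, `variable` and
`value` are arbitrary choices for illustration, not anything data.table
prescribes.

```r
# A wide table with a duplicated column name (check.names = FALSE keeps it).
wide <- data.frame(x = 1:2, y = 3:4, x = 5:6, check.names = FALSE)

# Stack the columns by position, so the duplicated name becomes ordinary
# data in a `variable` column instead of an ambiguous column label.
long <- data.frame(
  row      = rep(seq_len(nrow(wide)), times = ncol(wide)),
  variable = rep(names(wide), each = nrow(wide)),
  value    = unlist(wide, use.names = FALSE)
)

# Computing per name is now unambiguous: both "x" columns contribute.
tapply(long$value, long$variable, sum)
```

Because the columns are stacked by position, no name lookup happens
during the reshape, so the duplicate names never need disambiguating.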

My two cents,

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com  Fri Nov  8 21:08:05 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 8 Nov 2013 14:08:05 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
Message-ID: <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>

Ditto - having dups, but spitting out an error on all ambiguous operations
seems like a robust strategy.



From lianoglou.steve at gene.com  Fri Nov  8 21:16:05 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:16:05 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
Message-ID: <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>

Wow ... did we just reach a consensus? :-)

-steve



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From aragorn168b at gmail.com  Fri Nov  8 21:19:38 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:19:38 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
Message-ID: <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>

Steve, 
Maybe, but it's just getting started :) - we now have to decide what's ambiguous!
Ex: Is subsetting by column number considered ambiguous (by the definition of "ambiguous", probably not)? But then it'd be inconsistent with subsetting by column name. So, should we prioritise consistency over function in this scenario?


Arun



From lianoglou.steve at gene.com  Fri Nov  8 21:29:44 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:29:44 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
Message-ID: <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>

> Steve,
> Maybe, but it's just getting started :) - we now have to decide what's
> ambiguous!

So close, yet so far ...

> Ex: Is subsetting by column number considered ambiguous (By definition of
> ambiguous, probably not)? But then it'd be inconsistent with subsetting when
> column names are provided. So, should we prioritise consistency over
> function in this scenario?

Sorry, can you provide examples of each?

I'd imagine doing anything by column number is unambiguous, but I'm
not sure how you can subset by column index and by column name in a
"similar" fashion.

I mean dt[[1]] should work no matter what, dt[['a']] would work only
if there is only one column named 'a' ... but I don't think this is
what you are talking about?

Sorry if I'm being obtuse,

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From aragorn168b at gmail.com  Fri Nov  8 21:37:26 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:37:26 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
Message-ID: <2A4352C9C46747AAB3951B71B4942E47@gmail.com>

Sure, here's an example of what I was trying to explain: 

Suppose: 
DT <- data.table(x=1:5, y=1:5, x=6:10)

Then, 

DT[, c(1,3), with=FALSE] # gives correct subset
DT[, c("x", "x"), with=FALSE] # gives column 1 twice - wrong

DT[, .SD, .SDcols=c("x", "x")] # gives column 1 twice - wrong result
DT[, .SD, .SDcols=c(1,3)] # gives column 1 twice - wrong result, but it *should* give the right result if "ambiguity" is the only concern.
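
One plausible reason the name-based calls return column 1 twice is that
name lookup resolves to the first match only, which is how base R's
match() behaves. The sketch below shows this on a plain vector of names;
whether data.table's internals go through match() in these code paths is
an assumption, not something established in this thread.

```r
col_names <- c("x", "y", "x")   # names of DT in the example above

# Name lookup: match() returns the position of the FIRST match only,
# so both requests for "x" resolve to column 1.
match(c("x", "x"), col_names)

# All positions that carry the name "x"; a name alone cannot say
# which of these the user meant.
which(col_names == "x")
```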


Arun



From lianoglou.steve at gene.com  Fri Nov  8 21:41:19 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:41:19 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
Message-ID: <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>

My gut reaction is:

On Fri, Nov 8, 2013 at 12:37 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Sure, here's an example of what I was trying to explain:
>
> Suppose:
> DT <- data.table(x=1:5, y=1:5, x=6:10)
>
> Then,
>
> DT[, c(1,3), with=FALSE] # gives correct subset

This is "OK", we just do what the user asks, here, as they are being
very specific.

> DT[, c("x", "x"), with=FALSE] # gives column 1 twice - wrong

stop() -- we don't try to disambiguate (even if it "seems" specific)

> DT[, .SD, .SDcols=c("x", "x")] # gives column 1 twice - wrong result

stop()

Also stop() on DT[, ..., .SDcols="x"]

> DT[, .SD, .SDcols=c(1,3)] # gives column 1 twice - wrong result - but

Do what the user asks for.

No?
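
The rule above could be sketched roughly as follows, in base R. The
function `resolve_cols()` is hypothetical and is not data.table's
implementation; it only turns a column specification into integer
positions, which could then be used for index-based subsetting.

```r
# Hypothetical resolver (not data.table's implementation): turn a column
# specification into integer positions, refusing names that are ambiguous.
resolve_cols <- function(col_names, cols) {
  if (is.numeric(cols)) {
    stopifnot(all(cols >= 1), all(cols <= length(col_names)))
    return(as.integer(cols))             # positions are taken at face value
  }
  hits <- lapply(cols, function(nm) which(col_names == nm))
  n_hits <- lengths(hits)
  if (any(n_hits == 0L)) {
    stop("unknown column name(s): ",
         paste(unique(cols[n_hits == 0L]), collapse = ", "))
  }
  if (any(n_hits > 1L)) {
    stop("ambiguous column name(s): ",
         paste(unique(cols[n_hits > 1L]), collapse = ", "))
  }
  unlist(hits)
}

col_names <- c("x", "y", "x")              # names of DT in the example above

resolve_cols(col_names, c(1, 3))           # positions pass through unchanged
resolve_cols(col_names, "y")               # a unique name resolves to one position
try(resolve_cols(col_names, c("x", "x")))  # error: "x" is ambiguous
```

Note that a single "x" is rejected as well, matching the suggestion to
stop() on .SDcols="x": the check is on how many columns carry the name,
not on how many times the name was requested.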

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From aragorn168b at gmail.com  Fri Nov  8 21:45:26 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:45:26 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>
Message-ID: <40662613E7BB48CB9C57365D9C6522D2@gmail.com>

Oh, I can certainly agree with that. I guess we'll have to change the code to use index-based subsetting when .SDcols or the j-value is numeric, then.

Arun





From lianoglou.steve at gene.com  Fri Nov  8 21:47:57 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:47:57 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
Message-ID: <CAHA9McO96tEUU75xFu8U1Ei6keNPA96UFAnNr3_ADyj3gS50tA@mail.gmail.com>

On Fri, Nov 8, 2013 at 12:45 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Oh I can certainly agree with that. I guess we'll have to make some changes
> to the code to use index based subsetting when .SDcols or j-value is number
> then.

Not sure what you mean by j-value -- the examples you gave didn't
compute on the .SD, it just returned it.

If there is a `j` expression that computes on an .SD that has
duplicated colnames, I think we just stop().

Or did you mean something else?

-steve



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From aragorn168b at gmail.com  Fri Nov  8 21:53:18 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Fri, 8 Nov 2013 21:53:18 +0100
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McO96tEUU75xFu8U1Ei6keNPA96UFAnNr3_ADyj3gS50tA@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
 <CAHA9McO96tEUU75xFu8U1Ei6keNPA96UFAnNr3_ADyj3gS50tA@mail.gmail.com>
Message-ID: <4CB7CC37475440B68A19220EF98C75EF@gmail.com>

Sorry, forget the j-value. For `.SDcols`, even when we provide integers (column numbers), internally, we compute the column name and subset to get `.SD`. And this'll have to change. 

Arun





From lianoglou.steve at gene.com  Fri Nov  8 21:56:33 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 8 Nov 2013 12:56:33 -0800
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <4CB7CC37475440B68A19220EF98C75EF@gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
 <CAHA9McO96tEUU75xFu8U1Ei6keNPA96UFAnNr3_ADyj3gS50tA@mail.gmail.com>
 <4CB7CC37475440B68A19220EF98C75EF@gmail.com>
Message-ID: <CAHA9McPCayvfOM7zE47VdiKj13rmdOkFDOuAqfu6xEaHp5gr=A@mail.gmail.com>

Right, agreed.

On Fri, Nov 8, 2013 at 12:53 PM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> Sorry, forget the j-value. For `.SDcols`, even when we provide integers
> (column numbers), internally, we compute the column name and subset to get
> `.SD`. And this'll have to change.


-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech

From eduard.antonyan at gmail.com  Fri Nov  8 22:24:11 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 8 Nov 2013 15:24:11 -0600
Subject: [datatable-help] Unexpected behavior in setnames()
In-Reply-To: <CAHA9McPCayvfOM7zE47VdiKj13rmdOkFDOuAqfu6xEaHp5gr=A@mail.gmail.com>
References: <etPan.527421d3.3006c83e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOrUNks0akp_Aqk7cEb6U3yTuy7iz1be8UudgdJDaSM0pQ@mail.gmail.com>
 <957B1243714142278898647650EBF386@gmail.com>
 <CAHZcBOoiNQ3RQMw3bFEq4Zy+AEJxczN8mCt0H8LhmCorgVmAvQ@mail.gmail.com>
 <5E98018F047943DE89849EC57A7CF72A@gmail.com>
 <CAHZcBOp_kBZtrUGBXH0Op9zwWjnOyXGim-e+5d+uw1eyTnoz1g@mail.gmail.com>
 <D70F31E4E83842EF95F46C9565E7AEEA@gmail.com>
 <etPan.5274eba5.7724c67e.16a@MacBook-Pro-de-Alexandre-Sieira.local>
 <CAHZcBOqRjTrmr0THFD3Go6_fh4XU8rQKGMoHnJT4rA_d5AKVXg@mail.gmail.com>
 <94F078AB544B4757A58049C7DB7433AB@gmail.com>
 <CAHA9McMwvVZNkq=XrE0ALxtAPFztoKUGXB8+UUkBD8TgovkjGA@mail.gmail.com>
 <CAHZcBOoQHS-=hhd2xrp1R0dm_UvRU5NvCe4UXmpdSK5U2t5xYA@mail.gmail.com>
 <F9719F33B38C4C75A97B476179984868@gmail.com>
 <CAHZcBOrU8a24M+VRc_d_tYaCXDjOPbMBuYV3-Dhy8ik54ny1Cg@mail.gmail.com>
 <CAHA9McMf5qRRLhdY6g3scfdbDe_U3fMgqzdz1Obqzmgxi8jcyg@mail.gmail.com>
 <CAHZcBOq-3d0GvYQe=rPAxk4eF4EW-P1Tztmk9Fr0Fbv_RbqOzQ@mail.gmail.com>
 <CA+kDFFVrsD9cOHg+ozS+JQys3b=3hqQCZyTg1b6BccA+uiSQRg@mail.gmail.com>
 <CAHZcBOrvQ=w5ihevjGKbsv46uyF1AJ+LrNMYzbsJusLOeirJgg@mail.gmail.com>
 <9F7DC50A9B2C470C952973F162105BC4@gmail.com>
 <CAHZcBOrJCr5cwH1XK39jp0_BD3CkMUkn7Noh874WnkEB57jqUA@mail.gmail.com>
 <F032BA02EC23428D91D17F83936DBA57@gmail.com>
 <CAHZcBOqHzzsQykEbtzLyw1wzX4eD0AV_ODP1qnY6iRwMn=b-_g@mail.gmail.com>
 <loom.20131108T141659-567@post.gmane.org>
 <CAHA9McMnwL9W65-FokG7yyWqG=o5eQJqKzc4+4NBH6+sLeiicg@mail.gmail.com>
 <loom.20131108T151301-250@post.gmane.org>
 <9996C9517A244BFF81E353BEC9CD130C@gmail.com>
 <CAHA9McMv3ZOU+qFLu2VVKFApPGsvZs_PCLwFfeJpCOBdRc0JWw@mail.gmail.com>
 <CAHZcBOrw7UBxfhUpvKvHz9SFyr8LvQYXWf=u=Y7AMxRkEY_nGQ@mail.gmail.com>
 <CAHA9McPzmRzLQGViWm7_EFL2MhUmHM93u=xV1Fe=46pGt5DD-Q@mail.gmail.com>
 <41982B8ACC9E424786C3E5816C4D3D80@gmail.com>
 <CAHA9McMYHN7JeoKFJ8TNJG3THg+nxd1t+wW59urXHMXQdtc0BQ@mail.gmail.com>
 <2A4352C9C46747AAB3951B71B4942E47@gmail.com>
 <CAHA9McP5qymyQwd4E+CuuX2atOFUR=DALj-b0pzWq9dAw5dm7Q@mail.gmail.com>
 <40662613E7BB48CB9C57365D9C6522D2@gmail.com>
 <CAHA9McO96tEUU75xFu8U1Ei6keNPA96UFAnNr3_ADyj3gS50tA@mail.gmail.com>
 <4CB7CC37475440B68A19220EF98C75EF@gmail.com>
 <CAHA9McPCayvfOM7zE47VdiKj13rmdOkFDOuAqfu6xEaHp5gr=A@mail.gmail.com>
Message-ID: <CAHZcBOowbroM2H_gY-DLrasafHcyOTCCf88BKdgSeTeCaXghxg@mail.gmail.com>

>
> I think if there is a `j` expression that computes on an .SD that has
> duplicated colnames, I think we just stop().


I'm not entirely sure what you mean by this. The following *should* work
imo:

    dt = data.table(x = 1:10, x = 10:1)

    dt[, lapply(.SD, sum)]



From aragorn168b at gmail.com  Sat Nov  9 12:32:59 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 12:32:59 +0100
Subject: [datatable-help] FR #748 discussion
Message-ID: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>

Hey everybody, 

I've been wanting to implement this for a while. I didn't know there was an FR lying around. All the better!
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=748&group_id=240&atid=978

It's about filling unmatched values during a join with something other than the default NA.

Ex:

require(data.table)
DT <- data.table(x=c(1,2,3,6), y="A", key="x")
DT[J(1:6)] # at the moment, all "y" with no match to key entry will be NA_character_

The FR:

DT[J(1:6), fill = "bla"]

What I wanted to discuss is how to handle the "nomatch" parameter. At the moment "nomatch" takes the values NA or 0: NA is the default, and with 0 the unmatched rows are *removed*. So how do we fit in the "fill" argument?

I think "nomatch" should become logical, with TRUE and FALSE mirroring the old behaviour of keeping or removing unmatched entries (that is, "nomatch=FALSE" is the old "nomatch=0"). If "nomatch=TRUE", the value of "fill" (default NA) is used. For backwards compatibility, the defaults would be "nomatch=TRUE" (keep unmatched rows) and "fill=NA" (assign them NA). In short, "nomatch" takes priority over "fill": if "nomatch=FALSE", "fill" is ignored.

Hm, do you find "nomatch=TRUE" meaning "keep unmatched rows" confusing? If so, maybe we should rename it to "keep.nomatch".
I'm all ears for better ideas, so please weigh in.
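
For reference, a minimal sketch of the two existing "nomatch" behaviours next to what "fill" would produce. The "fill" result is only emulated here by patching the NAs after the join; no `fill=` argument is used, since it is just a proposal. The integer key is an assumption made so that both sides of the join have the same type.

```r
library(data.table)

# Same shape as the example above, with an integer key.
DT <- data.table(x = c(1L, 2L, 3L, 6L), y = "A", key = "x")

# Default (nomatch = NA): all six rows of i are kept, unmatched y is NA.
kept <- DT[J(1:6)]

# nomatch = 0: rows of i with no match in x are dropped.
dropped <- DT[J(1:6), nomatch = 0]

# Emulating the proposed fill = "bla": overwrite the NAs after the join.
filled <- copy(kept)
filled[is.na(y), y := "bla"]
```

One limit of this emulation is that it cannot tell an NA produced by a missing match from an NA that was already stored in `y`, which a real "fill" argument could.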

Arun


From gsee000 at gmail.com  Sat Nov  9 17:50:48 2013
From: gsee000 at gmail.com (G See)
Date: Sat, 9 Nov 2013 10:50:48 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
Message-ID: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>

Hi,

Please note the inconsistency between the behavior of rbind() and
rbindlist() below.

m1 <- as.data.table(mtcars)
m2 <- copy(m1)
rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT bind by name

What do you think about making them have the same behavior and/or
warning?  Personally, I prefer the behavior of rbind(), and would
prefer to see a warning if column names are ignored like they are with
rbindlist().
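
A smaller sketch of the difference between binding by position and binding by name. It assumes a data.table release in which rbindlist() has a `use.names` argument, which is not the behaviour described above; the tables `a` and `b` are made up for illustration.

```r
library(data.table)

# Two tables with the same columns in a different order.
a <- data.table(id = 1L, val = 10L)
b <- data.table(val = 20L, id = 2L)

# Binding by position: b's "val" lands under "id" and vice versa.
by_pos <- rbindlist(list(a, b), use.names = FALSE)

# Binding by name: columns of b are matched to those of a before stacking.
by_name <- rbindlist(list(a, b), use.names = TRUE)
```

With column types that happen to agree, as here, binding by position gives no hint that the values ended up under the wrong names.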

Thanks,
Garrett

From eduard.antonyan at gmail.com  Sat Nov  9 18:06:33 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 11:06:33 -0600
Subject: [datatable-help] FR #748 discussion
In-Reply-To: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
References: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
Message-ID: <CAHZcBOowHT2tQ=9eLEUfOhDr16QsZwuU=yRJPux7o7ir7KRg-g@mail.gmail.com>

Great, this is something I was thinking about recently as well.

I do find nomatch=TRUE/FALSE confusing and keep.nomatch is better in that
respect, but that's a lot more characters to type, which downgrades it for
me a lot.

I think in a world where I was designing it from scratch I would have
nomatch=NA do what it does now, nomatch=NULL do what 0 does now, and then
the rest of the values would fill. Not sure if this can be done smoothly
though in the world we live in.

From aragorn168b at gmail.com  Sat Nov  9 18:08:47 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 18:08:47 +0100
Subject: [datatable-help] FR #748 discussion
In-Reply-To: <CAHZcBOowHT2tQ=9eLEUfOhDr16QsZwuU=yRJPux7o7ir7KRg-g@mail.gmail.com>
References: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
 <CAHZcBOowHT2tQ=9eLEUfOhDr16QsZwuU=yRJPux7o7ir7KRg-g@mail.gmail.com>
Message-ID: <DC620B9E1CA44D62B034D3AFC0A8F856@gmail.com>

Eddi, 
Gabor's suggestions here: http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001738.html are quite nice! We did agree with those changes at the time :). I'll have to write to Matthew to get his feedback.

Arun


On Saturday, November 9, 2013 at 6:06 PM, Eduard Antonyan wrote:

> Great, this is smth I was thinking about recently as well.
> I do find nomatch=TRUE/FALSE confusing and keep.nomatch is better in that respect, but that's a lot more characters to type, which downgrades it for me a lot.
> I think in a world where I was designing it from scratch I would have nomatch=NA do what it does now, nomatch=NULL do what 0 does now, and then the rest of the values would fill. Not sure if this can be done smoothly though in the world we live in.


From eduard.antonyan at gmail.com  Sat Nov  9 18:29:39 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 11:29:39 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
Message-ID: <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>

Fyi, it's not well documented, but setting use.names=FALSE in rbind would
replicate rbindlist behavior.

I think it's a reasonable FR - if/when all of rbind code goes into C, it
would be trivial to add.
 On Nov 9, 2013 10:51 AM, "G See" <gsee000 at gmail.com> wrote:

> Hi,
>
> Please note the inconsistency between the behavior of rbind() and
> rbindlist() below.
>
> m1 <- as.data.table(mtcars)
> m2 <- copy(m1)
> rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
> rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT bind by name
>
> What do you think about making them have the same behavior and/or
> warning?  Personally, I prefer the behavior of rbind(), and would
> prefer to see a warning if column names are ignored like they are with
> rbindlist().
>
> Thanks,
> Garrett
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

From aragorn168b at gmail.com  Sat Nov  9 18:33:49 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 18:33:49 +0100
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
 <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
Message-ID: <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>

GSee, I find this a bit confusing at the moment as well - the convergence of "rbind" and "rbindlist" and therefore the future of "rbindlist". 

`rbindlist` gained speed (to some extent) by assuming things like this and skipping checks in the first place. So, should we include checks like this? Also, if "rbind" and/or "rbindlist" are made to do the exact same thing, then, what's the purpose of "rbindlist"?

Any thoughts? 

Arun


On Saturday, November 9, 2013 at 6:29 PM, Eduard Antonyan wrote:

> Fyi, it's not well documented, but setting use.names=FALSE in rbind would replicate rbindlist behavior.
> I think it's a reasonable FR - if/when all of rbind code goes into C, it would be trivial to add. 
> On Nov 9, 2013 10:51 AM, "G See" <gsee000 at gmail.com (mailto:gsee000 at gmail.com)> wrote:
> > Hi,
> > 
> > Please note the inconsistency between the behavior of rbind() and
> > rbindlist() below.
> > 
> > m1 <- as.data.table(mtcars)
> > m2 <- copy(m1)
> > rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
> > rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT bind by name
> > 
> > What do you think about making them have the same behavior and/or
> > warning?  Personally, I prefer the behavior of rbind(), and would
> > prefer to see a warning if column names are ignored like they are with
> > rbindlist().
> > 
> > Thanks,
> > Garrett

From gsee000 at gmail.com  Sat Nov  9 18:38:12 2013
From: gsee000 at gmail.com (G See)
Date: Sat, 9 Nov 2013 11:38:12 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
 <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
 <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
Message-ID: <CA+xi=qZ4o2k47Vo-znYWfmmQMUDiavEUXqfYzKhxSG795ZFT3A@mail.gmail.com>

Isn't rbindlist(myList) faster than do.call(rbind, myList)?

Garrett

On Sat, Nov 9, 2013 at 11:33 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> GSee, I find this a bit confusing at the moment as well - the convergence of
> "rbind" and "rbindlist" and therefore the future of "rbindlist".
>
> `rbindlist` gained speed (to some extent) by assuming things like this and
> skipping checks in the first place. So, should we include checks like this?
> Also, if "rbind" and/or "rbindlist" are made to do the exact same thing,
> then, what's the purpose of "rbindlist"?
>
> Any thoughts?
>
> Arun
>
> On Saturday, November 9, 2013 at 6:29 PM, Eduard Antonyan wrote:
>
> Fyi, it's not well documented, but setting use.names=FALSE in rbind would
> replicate rbindlist behavior.
>
> I think it's a reasonable FR - if/when all of rbind code goes into C, it
> would be trivial to add.
>
> On Nov 9, 2013 10:51 AM, "G See" <gsee000 at gmail.com> wrote:
>
> Hi,
>
> Please note the inconsistency between the behavior of rbind() and
> rbindlist() below.
>
> m1 <- as.data.table(mtcars)
> m2 <- copy(m1)
> rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
> rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT
> bind by name
>
> What do you think about making them have the same behavior and/or
> warning?  Personally, I prefer the behavior of rbind(), and would
> prefer to see a warning if column names are ignored like they are with
> rbindlist().
>
> Thanks,
> Garrett

From aragorn168b at gmail.com  Sat Nov  9 18:44:19 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 9 Nov 2013 18:44:19 +0100
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <CA+xi=qZ4o2k47Vo-znYWfmmQMUDiavEUXqfYzKhxSG795ZFT3A@mail.gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
 <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
 <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
 <CA+xi=qZ4o2k47Vo-znYWfmmQMUDiavEUXqfYzKhxSG795ZFT3A@mail.gmail.com>
Message-ID: <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>

I am not sure of the current status after eddi's recent edits. "rbindlist" initially checked only the types of the first data.table's columns. With eddi's changes, I believe it now looks down through all the data.tables and decides each column's type based on the class hierarchy. That is, if column 1 of dt1 is integer but column 1 of dt2 is numeric, the result is now "numeric", whereas before it was "integer". I guess this will affect the speed. I've not done any benchmarking yet, but I'm guessing it'll be slower than the previous version, at least.

Eddi, any thoughts on this? 

Arun
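
A small sketch of the coercion case described above, assuming a recent data.table release in which rbindlist() picks the highest column type across all inputs; dt1 and dt2 are illustrative names, not objects from the thread.

```r
library(data.table)

dt1 <- data.table(a = 1L)    # integer column
dt2 <- data.table(a = 2.5)   # numeric (double) column

# With type promotion, the combined column is double, so 2.5 is preserved.
res <- rbindlist(list(dt1, dt2))
print(class(res$a))
print(res)
```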


On Saturday, November 9, 2013 at 6:38 PM, G See wrote:

> Isn't rbindlist(myList) faster than do.call(rbind, myList)?
> 
> Garrett
> 
> On Sat, Nov 9, 2013 at 11:33 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > GSee, I find this a bit confusing at the moment as well - the convergence of
> > "rbind" and "rbindlist" and therefore the future of "rbindlist".
> > 
> > `rbindlist` gained speed (to some extent) by assuming things like this and
> > skipping checks in the first place. So, should we include checks like this?
> > Also, if "rbind" and/or "rbindlist" are made to do the exact same thing,
> > then, what's the purpose of "rbindlist"?
> > 
> > Any thoughts?
> > 
> > Arun
> > 
> > On Saturday, November 9, 2013 at 6:29 PM, Eduard Antonyan wrote:
> > 
> > Fyi, it's not well documented, but setting use.names=FALSE in rbind would
> > replicate rbindlist behavior.
> > 
> > I think it's a reasonable FR - if/when all of rbind code goes into C, it
> > would be trivial to add.
> > 
> > On Nov 9, 2013 10:51 AM, "G See" <gsee000 at gmail.com (mailto:gsee000 at gmail.com)> wrote:
> > 
> > Hi,
> > 
> > Please note the inconsistency between the behavior of rbind() and
> > rbindlist() below.
> > 
> > m1 <- as.data.table(mtcars)
> > m2 <- copy(m1)
> > rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
> > rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT
> > bind by name
> > 
> > What do you think about making them have the same behavior and/or
> > warning? Personally, I prefer the behavior of rbind(), and would
> > prefer to see a warning if column names are ignored like they are with
> > rbindlist().
> > 
> > Thanks,
> > Garrett

From gsee000 at gmail.com  Sat Nov  9 18:49:13 2013
From: gsee000 at gmail.com (G See)
Date: Sat, 9 Nov 2013 11:49:13 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
 <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
 <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
 <CA+xi=qZ4o2k47Vo-znYWfmmQMUDiavEUXqfYzKhxSG795ZFT3A@mail.gmail.com>
 <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>
Message-ID: <CA+xi=qZANB5vBCE+EN0UfUT-1MBtv5rbTABCPaMmvFr=LLJpoQ@mail.gmail.com>

I really meant that I thought that do.call(rbind, list(a, b)) would be
slower than rbindlist(list(a, b)).  e.g. when you don't know the
length of the list of data.tables

On Sat, Nov 9, 2013 at 11:44 AM, Arunkumar Srinivasan
<aragorn168b at gmail.com> wrote:
> I am not aware of the status now after eddi's recent edits. "rbindlist"
> initially only checked the type of the first data.table's columns. But now I
> guess with eddi's changes, it does look-down and decide based on class
> hierarchy. That is, if column 1 of dt1 is integer, but of dt2 is numeric,
> it's now "numeric", but before it was "integer". I guess this'll affect the
> speed. I've not done any benchmarking yet. But I'm guessing it'll be slower
> than at least the previous version.
>
> Eddi, any thoughts on this?
>
> Arun
>
> On Saturday, November 9, 2013 at 6:38 PM, G See wrote:
>
> Isn't rbindlist(myList) faster than do.call(rbind, myList)?
>
> Garrett
>
> On Sat, Nov 9, 2013 at 11:33 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
>
> GSee, I find this a bit confusing at the moment as well - the convergence of
> "rbind" and "rbindlist" and therefore the future of "rbindlist".
>
> `rbindlist` gained speed (to some extent) by assuming things like this and
> skipping checks in the first place. So, should we include checks like this?
> Also, if "rbind" and/or "rbindlist" are made to do the exact same thing,
> then, what's the purpose of "rbindlist"?
>
> Any thoughts?
>
> Arun
>
> On Saturday, November 9, 2013 at 6:29 PM, Eduard Antonyan wrote:
>
> Fyi, it's not well documented, but setting use.names=FALSE in rbind would
> replicate rbindlist behavior.
>
> I think it's a reasonable FR - if/when all of rbind code goes into C, it
> would be trivial to add.
>
> On Nov 9, 2013 10:51 AM, "G See" <gsee000 at gmail.com> wrote:
>
> Hi,
>
> Please note the inconsistency between the behavior of rbind() and
> rbindlist() below.
>
> m1 <- as.data.table(mtcars)
> m2 <- copy(m1)
> rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
> rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT
> bind by name
>
> What do you think about making them have the same behavior and/or
> warning? Personally, I prefer the behavior of rbind(), and would
> prefer to see a warning if column names are ignored like they are with
> rbindlist().
>
> Thanks,
> Garrett

From eduard.antonyan at gmail.com  Sat Nov  9 19:25:00 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 12:25:00 -0600
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <CA+xi=qZANB5vBCE+EN0UfUT-1MBtv5rbTABCPaMmvFr=LLJpoQ@mail.gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
 <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
 <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
 <CA+xi=qZ4o2k47Vo-znYWfmmQMUDiavEUXqfYzKhxSG795ZFT3A@mail.gmail.com>
 <6CE1BD90684E4874B3BDBAAA9073CFAD@gmail.com>
 <CA+xi=qZANB5vBCE+EN0UfUT-1MBtv5rbTABCPaMmvFr=LLJpoQ@mail.gmail.com>
Message-ID: <CAHZcBOpY1eK=8H0dU=KQAF146CbKzKe=dks5HKQaC5wsAsucgw@mail.gmail.com>

Re speed: last I checked, the new rbindlist was about 5% slower in no-coercion
cases and quite a bit faster in cases where there was coercion.

do.call(rbind, ...) is indeed much slower than rbindlist, and even if
.rbind.data.table took no time at all, it would still be much slower than
rbindlist because of all the dispatching before it gets to
.rbind.data.table. That said, I'm pretty sure the current rbind is faster than
the rbind in 1.8.10 in all cases.
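
A rough way to compare the two calls on your own machine. This is only a sketch: myList is hypothetical test data, and the timings will depend on the data.table version and hardware, so no figures are quoted here.

```r
library(data.table)

# Hypothetical test data: 1000 one-row tables with identical columns.
myList <- lapply(1:1000, function(i) data.table(id = i, value = i * 2))

# Time both approaches and keep the results for comparison.
t_rbind     <- system.time(r1 <- do.call(rbind, myList))
t_rbindlist <- system.time(r2 <- rbindlist(myList))

# Both approaches should produce the same table.
stopifnot(isTRUE(all.equal(r1, r2)))

cat("do.call(rbind, ...):", t_rbind[["elapsed"]], "s\n")
cat("rbindlist(...):     ", t_rbindlist[["elapsed"]], "s\n")
```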


On Sat, Nov 9, 2013 at 11:49 AM, G See <gsee000 at gmail.com> wrote:

> I really meant that I thought that do.call(rbind, list(a, b)) would be
> slower than rbindlist(list(a, b)).  e.g. when you don't know the
> length of the list of data.tables
>
> On Sat, Nov 9, 2013 at 11:44 AM, Arunkumar Srinivasan
> <aragorn168b at gmail.com> wrote:
> > I am not aware of the status now after eddi's recent edits. "rbindlist"
> > initially only checked the type of the first data.table's columns. But
> now I
> > guess with eddi's changes, it does look-down and decide based on class
> > hierarchy. That is, if column 1 of dt1 is integer, but of dt2 is numeric,
> > it's now "numeric", but before it was "integer". I guess this'll affect
> the
> > speed. I've not done any benchmarking yet. But I'm guessing it'll be
> slower
> > than at least the previous version.
> >
> > Eddi, any thoughts on this?
> >
> > Arun
> >
> > On Saturday, November 9, 2013 at 6:38 PM, G See wrote:
> >
> > Isn't rbindlist(myList) faster than do.call(rbind, myList)?
> >
> > Garrett
> >
> > On Sat, Nov 9, 2013 at 11:33 AM, Arunkumar Srinivasan
> > <aragorn168b at gmail.com> wrote:
> >
> > GSee, I find this a bit confusing at the moment as well - the
> convergence of
> > "rbind" and "rbindlist" and therefore the future of "rbindlist".
> >
> > `rbindlist` gained speed (to some extent) by assuming things like this
> and
> > skipping checks in the first place. So, should we include checks like
> this?
> > Also, if "rbind" and/or "rbindlist" are made to do the exact same thing,
> > then, what's the purpose of "rbindlist"?
> >
> > Any thoughts?
> >
> > Arun
> >
> > On Saturday, November 9, 2013 at 6:29 PM, Eduard Antonyan wrote:
> >
> > Fyi, it's not well documented, but setting use.names=FALSE in rbind would
> > replicate rbindlist behavior.
> >
> > I think it's a reasonable FR - if/when all of rbind code goes into C, it
> > would be trivial to add.
> >
> > On Nov 9, 2013 10:51 AM, "G See" <gsee000 at gmail.com> wrote:
> >
> > Hi,
> >
> > Please note the inconsistency between the behavior of rbind() and
> > rbindlist() below.
> >
> > m1 <- as.data.table(mtcars)
> > m2 <- copy(m1)
> > rbind(m1[, .SD[1], by=cyl], m2) # Gives warning and binds by name
> > rbindlist(list(m1[, .SD[1], by=cyl], m2)) # no warning, and does NOT
> > bind by name
> >
> > What do you think about making them have the same behavior and/or
> > warning? Personally, I prefer the behavior of rbind(), and would
> > prefer to see a warning if column names are ignored like they are with
> > rbindlist().
> >
> > Thanks,
> > Garrett

From eduard.antonyan at gmail.com  Sat Nov  9 19:55:36 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Sat, 9 Nov 2013 12:55:36 -0600
Subject: [datatable-help] FR #748 discussion
In-Reply-To: <DC620B9E1CA44D62B034D3AFC0A8F856@gmail.com>
References: <437054B2A27B4A49A3FE44301A60FE6A@gmail.com>
 <CAHZcBOowHT2tQ=9eLEUfOhDr16QsZwuU=yRJPux7o7ir7KRg-g@mail.gmail.com>
 <DC620B9E1CA44D62B034D3AFC0A8F856@gmail.com>
Message-ID: <CAHZcBOqwauKBTFHUtcKyNq3xhhvSadJsKd=_W9-5mR3D9gJSjg@mail.gmail.com>

Good point - I had forgotten about that, and I still do like that proposal!
:)


On Sat, Nov 9, 2013 at 11:08 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

>  Eddi,
> Gabor's suggestions here:
> http://lists.r-forge.r-project.org/pipermail/datatable-help/2013-May/001738.html are
> quite nice! We did agree with those changes at the time :). I'll have to
> write to Matthew to get his feedback.
>
> Arun
>
> On Saturday, November 9, 2013 at 6:06 PM, Eduard Antonyan wrote:
>
> Great, this is smth I was thinking about recently as well.
>
> I do find nomatch=TRUE/FALSE confusing and keep.nomatch is better in that
> respect, but that's a lot more characters to type, which downgrades it for
> me a lot.
>
> I think in a world where I was designing it from scratch I would have
> nomatch=NA do what it does now, nomatch=NULL do what 0 does now, and then
> the rest of the values would fill. Not sure if this can be done smoothly
> though in the world we live in.
>
>
>

From aragorn168b at gmail.com  Sun Nov 10 12:43:55 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sun, 10 Nov 2013 12:43:55 +0100
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving Gabor's
	post)
Message-ID: <C054237C08F7492E8C113BDDA5109706@gmail.com>

Hi everyone,  

To revive the discussion Gabor started here: http://r.789695.n4.nabble.com/Problem-with-FAQ-2-8-tt4668878.html and the (related, but subtly different) FR mnel filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978

Suppose you have:

require(data.table)  
d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")  
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14),key = "id2")

Then as Gabor points out: `d1[d2, id1]`  should *not* result in an error, because FAQ 2.8 states (copied from Gabor's post linked above):

1. The scope of X's subset; i.e., X's column names.  
2. The scope of each row of Y; i.e., Y's column names (join inherited scope)  
...

In this case, the desired output for `d1[d2, id1]` should then be:
   id1 id1
1:   1   1
2:   2   2
3:   2   2
4:   4  NA


At least, that's what I understand the documentation to intend.

However, keeping this idea calls for a subtle change to the current way of referring to columns. Consider the data.table "d3" as follows:

d3 <- copy(d2)
setnames(d3, names(d1))

Now, what should `d1[d3, id1]` give? The answer, I believe, is the same as for `d1[d2, id1]`. Why? Because X's (here d1's) column names should be looked up first (as per FAQ 2.8). Therefore, corresponding to d3's key values c(1, 2, 4), the values for "id1" are c(1, (2,2), NA). Now, if the old behaviour is what's wanted (here comes the "subtle change"), then one should do:

d1[d3, i.id1] # referring to i's variables with the "i." notation.

I've managed to implement the first part where X's columns are looked up so that `d1[d2, id1]` doesn't result in error. However, I'd like to ensure that my understanding of the FAQ is right (and that the FAQ makes sense - it does to me).

Please let me know what you all think so that I can implement the second part and commit. This, I believe, will get us a step closer to consistency in DT syntax.

Arun
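
To make the lookup order concrete, here is a sketch that uses the non-key column "val", which exists in both d1 and d3 after the rename. It assumes a recent data.table release; how the join column itself is reported has varied between versions, so only "val" is shown. The output names x_val and i_val are made up for illustration.

```r
library(data.table)

d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
d3 <- copy(d2)
setnames(d3, names(d1))   # d3 now has columns id1 and val, like d1

# "val" exists in both tables: the bare name resolves to d1 (X) first,
# while the "i." prefix reaches d3 (the table passed as i).
res <- d1[d3, list(x_val = val, i_val = i.val)]
print(res)
```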


From gsee000 at gmail.com  Sun Nov 10 20:39:15 2013
From: gsee000 at gmail.com (G See)
Date: Sun, 10 Nov 2013 13:39:15 -0600
Subject: [datatable-help] lapply without anonymous function
Message-ID: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>

Hi,

I have a list of data.tables and I am trying to extract a subset from
each of them.  I can achieve what I want with this:

> L <- list(data.table(BOD), data.table(BOD))
> lapply(L, function(x) x[Time==3L])
[[1]]
   Time demand
1:    3     19

[[2]]
   Time demand
1:    3     19

However, I'd rather not have to type an anonymous function.  I
tried the below, but `[.data.frame` is being dispatched.

> lapply(L, "[", Time==3L)
Error in `[.data.frame`(x, i) : object 'Time' not found

Even if I am explicit, `[.data.table` does not get dispatched:

> lapply(L, data.table:::`[.data.table`, Time==3L)
Error in `[.data.frame`(x, i) : object 'Time' not found

I'm guessing this is due to where evaluation takes place.  Is there an
alternate syntax I should use?

Thanks,
Garrett

From eduard.antonyan at gmail.com  Mon Nov 11 14:40:03 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 07:40:03 -0600
Subject: [datatable-help] lapply without anonymous function
In-Reply-To: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>
References: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>
Message-ID: <CAHZcBOrnURibzD9iDhJeWvs4Lz906dMW9W-VTeASLpWxG=b5_A@mail.gmail.com>

I think your last attempt's failure is a bug of the internal cedta
function, but note that if it did work, it'd be more symbols to type than
the anonymous function option :)
 On Nov 10, 2013 1:39 PM, "G See" <gsee000 at gmail.com> wrote:

> Hi,
>
> I have a list of data.tables and I am trying to extract a subset from
> each of them.  I can achieve what I want with this:
>
> > L <- list(data.table(BOD), data.table(BOD))
> > lapply(L, function(x) x[Time==3L])
> [[1]]
>    Time demand
> 1:    3     19
>
> [[2]]
>    Time demand
> 1:    3     19
>
> However, I'd rather not have to type an anonymous function.  I
> tried the below, but `[.data.frame` is being dispatched.
>
> > lapply(L, "[", Time==3L)
> Error in `[.data.frame`(x, i) : object 'Time' not found
>
> Even if I am explicit, `[.data.table` does not get dispatched:
>
> > lapply(L, data.table:::`[.data.table`, Time==3L)
> Error in `[.data.frame`(x, i) : object 'Time' not found
>
> I'm guessing this is due to where evaluation takes place.  Is there an
> alternate syntax I should use?
>
> Thanks,
> Garrett

From eduard.antonyan at gmail.com  Mon Nov 11 14:41:38 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 07:41:38 -0600
Subject: [datatable-help] lapply without anonymous function
In-Reply-To: <CAHZcBOrnURibzD9iDhJeWvs4Lz906dMW9W-VTeASLpWxG=b5_A@mail.gmail.com>
References: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>
 <CAHZcBOrnURibzD9iDhJeWvs4Lz906dMW9W-VTeASLpWxG=b5_A@mail.gmail.com>
Message-ID: <CAHZcBOroTi-bxyFxmCXBzgxz5Yin=ermMzFj8rGfLKP5s3bfKw@mail.gmail.com>

But I guess your second attempt should have worked - submit a bug report
imo.
 On Nov 11, 2013 7:40 AM, "Eduard Antonyan" <eduard.antonyan at gmail.com>
wrote:

> I think your last attempt's failure is a bug of the internal cedta
> function, but note that if it did work, it'd be more symbols to type than
> the anonymous function option :)
>  On Nov 10, 2013 1:39 PM, "G See" <gsee000 at gmail.com> wrote:
>
>> Hi,
>>
>> I have a list of data.tables and I am trying to extract a subset from
>> each of them.  I can achieve what I want with this:
>>
>> > L <- list(data.table(BOD), data.table(BOD))
>> > lapply(L, function(x) x[Time==3L])
>> [[1]]
>>    Time demand
>> 1:    3     19
>>
>> [[2]]
>>    Time demand
>> 1:    3     19
>>
>> However, I'd rather not have to type an anonymous function.  I
>> tried the below, but `[.data.frame` is being dispatched.
>>
>> > lapply(L, "[", Time==3L)
>> Error in `[.data.frame`(x, i) : object 'Time' not found
>>
>> Even if I am explicit, `[.data.table` does not get dispatched:
>>
>> > lapply(L, data.table:::`[.data.table`, Time==3L)
>> Error in `[.data.frame`(x, i) : object 'Time' not found
>>
>> I'm guessing this is due to where evaluation takes place.  Is there an
>> alternate syntax I should use?
>>
>> Thanks,
>> Garrett

From eduard.antonyan at gmail.com  Mon Nov 11 14:45:38 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 07:45:38 -0600
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving
 Gabor's post)
In-Reply-To: <C054237C08F7492E8C113BDDA5109706@gmail.com>
References: <C054237C08F7492E8C113BDDA5109706@gmail.com>
Message-ID: <CAHZcBOoVVyj8RN-5Ost5b_9Ce=xAFUX5_OMo_2-4VVPxzTpseg@mail.gmail.com>

I haven't checked yet what it does currently but what you wrote makes
perfect sense.
 On Nov 10, 2013 5:44 AM, "Arunkumar Srinivasan" <aragorn168b at gmail.com>
wrote:

>  Hi everyone,
>
> To revive the discussion Gabor started here:
> http://r.789695.n4.nabble.com/Problem-with-FAQ-2-8-tt4668878.html and the
> (related, but subtly different) FR mnel filed here:
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978
>
> Suppose you have:
>
> require(data.table)
> d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
> d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14),key = "id2")
>
> Then as Gabor points out: `d1[d2, id1]`  should *not* result in an error,
> because FAQ 2.8 states (copied from Gabor's post linked above):
>
> 1. The scope of X's subset; i.e., X's column names.
> 2. The scope of each row of Y; i.e., Y's column names (join inherited
> scope)
> ...
>
> In this case, the desired output for `d1[d2, id1]` should then be:
>    id1 id1
> 1:   1   1
> 2:   2   2
> 3:   2   2
> 4:   4  NA
>
> That's what I at least understand from what the documentation intends.
>
> However, this recommends a subtle change to the current method of
> referring to columns, if we were to keep this idea. That is, consider the
> data.table "d3" as follows:
>
> d3 <- copy(d2)
> setnames(d3, names(d1))
>
> Now, what should `d1[d3, id1]` give? The answer, I believe, is same as
> `d1[d2, id1]`. Why? Because, X's (here d1's) column names should be looked
> up first (as per FAQ 2.8). Therefore, corresponding to d2=c(1,2,4), the
> values for "id1" are c(1, (2,2), NA). Now, if the old behaviour is to be
> intended - here comes the "subtle change", then one should do:
>
> d1[d3, i.id1] # referring to i's variables with the "i." notation.
>
> I've managed to implement the first part where X's columns are looked up
> so that `d1[d2, id1]` doesn't result in error. However, I'd like to ensure
> that my understanding of the FAQ is right (and that the FAQ makes sense -
> it does to me).
>
> Please let me know what you all think so that I can implement the second
> part and commit. This, I believe will let us get a step closer to the
> consistency in DT syntax.
>
> Arun
>
>

From aragorn168b at gmail.com  Mon Nov 11 14:55:27 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 11 Nov 2013 14:55:27 +0100
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving
 Gabor's post)
In-Reply-To: <CAHZcBOoVVyj8RN-5Ost5b_9Ce=xAFUX5_OMo_2-4VVPxzTpseg@mail.gmail.com>
References: <C054237C08F7492E8C113BDDA5109706@gmail.com>
 <CAHZcBOoVVyj8RN-5Ost5b_9Ce=xAFUX5_OMo_2-4VVPxzTpseg@mail.gmail.com>
Message-ID: <D1DF67EF64494AFFA72587DBEA9F696E@gmail.com>

Eddi,  

Thank you. However, I've realised something and made a slight change to the concept (at least I think that's the way to go).

Basically, if you have:

require(data.table)
d1 <- data.table(id1=c(1L, 2L, 2L, 3L), val=1:4, key="id1")

and you do:

d1[, print(id1), by=id1]
[1] 1
[1] 2
[1] 3


That is, while grouping, the length of the grouping variable within every group remains 1 (when grouping using "by"). For id1=2, we don't get "2" two times. Going by the same logic, if we were to do:

d1[J(2), id1]
   id1 id1
1:   2   2


Here, the first "id1" is the grouping "id1" (from J(2)). Following FR #2693 from mnel, I've changed the names of J(.), when it has no names, to match those of the key columns of "d1". The second "id1" is d1's value of "id1" for the group "id1=2". Even though it's present 2 times, we print it only once. That is, it behaves just like d1[, id1, by=id1], even though d1's "id1" is *not really* the grouping variable.

If we have to refer to i's columns, then we have to explicitly state "i.id1". That is, here, it would be:

d1[J(2), i.id1] # identical results, but i.id1 corresponds to data.table from J(2) with column name = id1

To illustrate the difference nicely, let's consider these data.tables:
d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")  
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14),key = "id2")  
d3 <- copy(d2)
setnames(d3, names(d1))

d1[d2, list(id1)] # what Gabor's post highlighted should work (but it doesn't give 1,2,2,NA as pointed out in the earlier post)
   id1 id1
1:   1   1
2:   2   2
3:   4  NA


d1[d3, list(id1, i.id1)] # id1 refers to values from d1 and i.id1 to d3.
   id1 id1 i.id1
1:   1   1     1
2:   2   2     2
3:   4  NA     4


Note that for every (implicit) grouping value from d3, the only possible values for d1's grouping would be 1) identical to that of d3 or 2) NA.

Let me know what you guys think.  

Arun
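
The point about the grouping variable having length 1 inside each group can be checked directly. A minimal sketch; the output column names rows and id_len are made up for illustration.

```r
library(data.table)

d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")

# .N counts the rows in each group, while the grouping column id1 is
# seen in j as a single value per group, whatever the group size.
res <- d1[, list(rows = .N, id_len = length(id1)), by = id1]
print(res)
```

The group id1 = 2 has two rows, yet length(id1) is still 1 there, which is the behaviour the proposal extends to joins.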


On Monday, November 11, 2013 at 2:45 PM, Eduard Antonyan wrote:

> I haven't checked yet what it does currently but what you wrote makes perfect sense.  
> On Nov 10, 2013 5:44 AM, "Arunkumar Srinivasan" <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > Hi everyone,  
> >  
> > To revive the discussion Gabor started here: http://r.789695.n4.nabble.com/Problem-with-FAQ-2-8-tt4668878.html and the (related, but subtly different) FR mnel filed here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978  
> >  
> > Suppose you have:
> >  
> > require(data.table)  
> > d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")  
> > d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14),key = "id2")
> >  
> > Then as Gabor points out: `d1[d2, id1]`  should *not* result in an error, because FAQ 2.8 states (copied from Gabor's post linked above):
> >  
> > 1. The scope of X's subset; i.e., X's column names.  
> > 2. The scope of each row of Y; i.e., Y's column names (join inherited scope)  
> > ...
> >  
> > In this case, the desired output for `d1[d2, id1]` should then be:
> >    id1 id1
> > 1:   1   1
> > 2:   2   2
> > 3:   2   2
> > 4:   4  NA
> >  
> >  
> > That's what I at least understand from what the documentation intends.  
> >  
> > However, this recommends a subtle change to the current method of referring to columns, if we were to keep this idea. That is, consider the data.table "d3" as follows:  
> >  
> > d3 <- copy(d2)
> > setnames(d3, names(d1))
> >  
> > Now, what should `d1[d3, id1]` give? The answer, I believe, is same as `d1[d2, id1]`. Why? Because, X's (here d1's) column names should be looked up first (as per FAQ 2.8). Therefore, corresponding to d2=c(1,2,4), the values for "id1" are c(1, (2,2), NA). Now, if the old behaviour is to be intended - here comes the "subtle change", then one should do:  
> >  
> > d1[d3, i.id1] # referring to i's variables with the "i." notation.
> >  
> > I've managed to implement the first part where X's columns are looked up so that `d1[d2, id1]` doesn't result in error. However, I'd like to ensure that my understanding of the FAQ is right (and that the FAQ makes sense - it does to me).  
> >  
> > Please let me know what you all think so that I can implement the second part and commit. This, I believe will let us get a step closer to the consistency in DT syntax.
> >  
> > Arun  
> >  
> >  

From ggrothendieck at gmail.com  Mon Nov 11 15:06:48 2013
From: ggrothendieck at gmail.com (Gabor Grothendieck)
Date: Mon, 11 Nov 2013 09:06:48 -0500
Subject: [datatable-help] lapply without anonymous function
In-Reply-To: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>
References: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>
Message-ID: <CAP01uRn9qkmyaAJSEwOKRmOxEgrSJ1vhLiA1Uznqp0RxAwWdbA@mail.gmail.com>

On Sun, Nov 10, 2013 at 2:39 PM, G See <gsee000 at gmail.com> wrote:
> Hi,
>
> I have a list of data.tables and I am trying to extract a subset from
> each of them.  I can achieve what I want with this:
>
>> L <- list(data.table(BOD), data.table(BOD))
>> lapply(L, function(x) x[Time==3L])
> [[1]]
>    Time demand
> 1:    3     19
>
> [[2]]
>    Time demand
> 1:    3     19
>
> However, I'd rather not have to create an anonymous function.  I
> tried the below, but `[.data.frame` is being dispatched.
>
>> lapply(L, "[", Time==3L)
> Error in `[.data.frame`(x, i) : object 'Time' not found
>
> Even if I am explicit, `[.data.table` does not get dispatched:
>
>> lapply(L, data.table:::`[.data.table`, Time==3L)
> Error in `[.data.frame`(x, i) : object 'Time' not found
>
> I'm guessing this is due to where evaluation takes place.  Is there an
> alternate syntax I should use?
>

subset works:

> lapply(L, subset, Time == 3L)
[[1]]
   Time demand
1:    3     19

[[2]]
   Time demand
1:    3     19
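
For anyone reading along, the two spellings that work can be put side
by side. This is only a sketch: it assumes data.table is installed, and
`time3` is a made-up helper name, not something from the package.

```r
library(data.table)

# Two data.tables built from the BOD data set that ships with R
L <- list(data.table(BOD), data.table(BOD))

# A named helper (hypothetical name) instead of an anonymous function;
# the condition is written inside x[...], where Time is a column of x
time3 <- function(x) x[Time == 3L]
res_helper <- lapply(L, time3)

# subset() captures its condition unevaluated, so the condition can be
# handed straight through lapply()
res_subset <- lapply(L, subset, Time == 3L)
```

Both calls return a list with one single-row data.table per element.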


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

From gsee000 at gmail.com  Mon Nov 11 15:40:34 2013
From: gsee000 at gmail.com (G See)
Date: Mon, 11 Nov 2013 08:40:34 -0600
Subject: [datatable-help] lapply without anonymous function
In-Reply-To: <CAP01uRn9qkmyaAJSEwOKRmOxEgrSJ1vhLiA1Uznqp0RxAwWdbA@mail.gmail.com>
References: <CA+xi=qamZrLA2pv_MJLbqL1x2=3KV4uxwpdDBkvfFdFOR7jGNQ@mail.gmail.com>
 <CAP01uRn9qkmyaAJSEwOKRmOxEgrSJ1vhLiA1Uznqp0RxAwWdbA@mail.gmail.com>
Message-ID: <CA+xi=qaDAJXEfhQo36m7Ot8Y0M8B9v3Df91QMcTcDtY=1+4evA@mail.gmail.com>

Heh, for all my efforts to avoid subset(), it turns out to be useful. :)

Bug report filed, per Eduard's suggestion.


From eduard.antonyan at gmail.com  Mon Nov 11 16:53:05 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Mon, 11 Nov 2013 09:53:05 -0600
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving
 Gabor's post)
In-Reply-To: <D1DF67EF64494AFFA72587DBEA9F696E@gmail.com>
References: <C054237C08F7492E8C113BDDA5109706@gmail.com>
 <CAHZcBOoVVyj8RN-5Ost5b_9Ce=xAFUX5_OMo_2-4VVPxzTpseg@mail.gmail.com>
 <D1DF67EF64494AFFA72587DBEA9F696E@gmail.com>
Message-ID: <CAHZcBOoy0utw3tM5qdKO47zty-wcKUx=AwYFDDNQX=Tdo4_Q_g@mail.gmail.com>

Everything looks good to me. Note that there is also .BY[[1]], which one
might want to use in those examples (it is basically the same as
i.id1).


On Mon, Nov 11, 2013 at 7:55 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

>  Eddi,
>
> Thank you. However, I've realised something and made a slight change to
> the concept (at least I think that's the way to go).
>
> Basically, if you've:
>
> require(data.table)
> d1 <- data.table(id1=c(1L, 2L, 2L, 3L), val=1:4, key="id1")
>
> and you do:
>
> d1[, print(id1), by=id1]
> [1] 1
> [1] 2
> [1] 3
>
> That is, while grouping, the grouping variable's length for every group
> remains 1 (when grouping using "by"). For id1=2, we don't get "2" two times.
> Going by the same logic, if we were to do:
>
> d1[J(2), id1]
>    id1 id1
> 1:   2   2
>
> Here, the first "id1" is the grouping "id1" (from J(2)). Following FR
> #2693 from mnel, I've changed the names of J(.), when it has no names, to
> match those of the key columns of "d1". The second "id1" is the value of
> d1's "id1" for "id1=2". Even though it's present 2 times, we print it
> only once. That is, it'll be identical to d1[, id1, by=id1], even though
> d1's "id1" is *not really* the grouping variable.
>
> If we have to refer to i's columns, then we have to explicitly state
> "i.id1". That is, here, it would be:
>
> d1[J(2), i.id1] # identical results, but i.id1 corresponds to data.table
> from J(2) with column name = id1
>
> To illustrate the difference nicely, let's consider these data.tables:
> d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
> d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
> d3 <- copy(d2)
> setnames(d3, names(d1))
>
> d1[d2, list(id1)] # what Gabor's post highlighted should work (but it
> doesn't give 1,2,2,NA as pointed out in the earlier post)
>    id1 id1
> 1:   1   1
> 2:   2   2
> 3:   4  NA
>
> d1[d3, list(id1, i.id1)] # id1 refers to values from d1 and i.id1 to d3.
>    id1 id1 i.id1
> 1:   1   1     1
> 2:   2   2     2
> 3:   4  NA     4
>
> Note that for every (implicit) grouping value from d3, the only possible
> values for d1's grouping would be 1) identical to that of d3 or 2) NA.
>
> Let me know what you guys think.
>
> Arun
>
> On Monday, November 11, 2013 at 2:45 PM, Eduard Antonyan wrote:
>
> I haven't checked yet what it does currently but what you wrote makes
> perfect sense.
>  On Nov 10, 2013 5:44 AM, "Arunkumar Srinivasan" <aragorn168b at gmail.com>
> wrote:
>
>  Hi everyone,
>
> To revive the discussion Gabor started here:
> http://r.789695.n4.nabble.com/Problem-with-FAQ-2-8-tt4668878.html and the
> (related, but subtly different) FR mnel filed here:
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978
>
> Suppose you have:
>
> require(data.table)
> d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
> d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
>
> Then as Gabor points out: `d1[d2, id1]`  should *not* result in an error,
> because FAQ 2.8 states (copied from Gabor's post linked above):
>
> 1. The scope of X's subset; i.e., X's column names.
> 2. The scope of each row of Y; i.e., Y's column names (join inherited
> scope)
> ...
>
> In this case, the desired output for `d1[d2, id1]` should then be:
>    id1 id1
> 1:   1   1
> 2:   2   2
> 3:   2   2
> 4:   4  NA
>
> That's what I at least understand from what the documentation intends.
>
> However, this recommends a subtle change to the current method of
> referring to columns, if we were to keep this idea. That is, consider the
> data.table "d3" as follows:
>
> d3 <- copy(d2)
> setnames(d3, names(d1))
>
> Now, what should `d1[d3, id1]` give? The answer, I believe, is the same as
> `d1[d2, id1]`. Why? Because X's (here d1's) column names should be looked
> up first (as per FAQ 2.8). Therefore, corresponding to d2=c(1,2,4), the
> values for "id1" are c(1, (2,2), NA). Now, if the old behaviour is
> intended (here comes the "subtle change"), then one should do:
>
> d1[d3, i.id1] # referring to i's variables with the "i." notation.
>
> I've managed to implement the first part where X's columns are looked up
> so that `d1[d2, id1]` doesn't result in error. However, I'd like to ensure
> that my understanding of the FAQ is right (and that the FAQ makes sense -
> it does to me).
>
> Please let me know what you all think so that I can implement the second
> part and commit. This, I believe, will get us a step closer to
> consistency in DT syntax.
>
> Arun
>

From aragorn168b at gmail.com  Mon Nov 11 16:55:04 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 11 Nov 2013 16:55:04 +0100
Subject: [datatable-help] Revisiting scoping rules in "j" (reviving
 Gabor's post)
In-Reply-To: <CAHZcBOoy0utw3tM5qdKO47zty-wcKUx=AwYFDDNQX=Tdo4_Q_g@mail.gmail.com>
References: <C054237C08F7492E8C113BDDA5109706@gmail.com>
 <CAHZcBOoVVyj8RN-5Ost5b_9Ce=xAFUX5_OMo_2-4VVPxzTpseg@mail.gmail.com>
 <D1DF67EF64494AFFA72587DBEA9F696E@gmail.com>
 <CAHZcBOoy0utw3tM5qdKO47zty-wcKUx=AwYFDDNQX=Tdo4_Q_g@mail.gmail.com>
Message-ID: <AC71EA48AEE4431FBF17B7EB082F9F1C@gmail.com>

Great! I'll commit then and see how it goes!
Yes, you're right about .BY[[1]]. But `i.id1` was already there - in SDenv$.iSD part of the code.  
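
A minimal sketch of the `i.` prefix on the tables from this thread. It assumes a recent data.table release, where a join evaluates j once over all matched rows rather than once per row of i; the output names `x_val` and `i_val` are made up for the illustration.

```r
library(data.table)

d1 <- data.table(id1 = c(1L, 2L, 2L, 3L), val = 1:4, key = "id1")
d2 <- data.table(id2 = c(1L, 2L, 4L), val2 = c(11, 12, 14), key = "id2")
d3 <- copy(d2)
setnames(d3, names(d1))   # d3 now has columns id1 and val, like d1

# Inside j of a join, a bare name is looked up in x (here d1) first;
# the i. prefix reaches the column of the same name in i (here d3)
res <- d1[d3, list(x_val = val, i_val = i.val)]
print(res)
```

Here `val` is d1's column and `i.val` is d3's, even though both tables use the same column names.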

Arun




From aragorn168b at gmail.com  Wed Nov 13 22:24:41 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Wed, 13 Nov 2013 22:24:41 +0100
Subject: [datatable-help] FR #5072 reg.
Message-ID: <C7C2536DCFAD426AAC51F55C2337E9F5@gmail.com>

Hi everybody,  
Regarding FR #5072 here: https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5072&group_id=240&atid=975

Let's take two data.tables X and Y with key set to one column, "V1". data.table currently handles Y[X] differently when Y's V1 is a factor, depending on whether 1) X's V1 is also a factor or 2) X's V1 is not a factor. Let me illustrate this:

case 1:
# X and Y are factors
require(data.table)
X <- data.table(V1=factor(c("A", "B", "C")))
Y <- data.table(V1=factor(c("B", "D", "E")), key="V1")

> Y[X] # X is a factor
   V1
1:  A
2:  B
3:  C

> Y[X]$V1
[1] A B C
Levels: A B C


** Note that when V1 is a factor in both X and Y, only the levels of X are in the joined result (no D/E).

case 2:
# X is **not** a factor
require(data.table)
X <- data.table(V1=c("A", "B", "C"))
Y <- data.table(V1=factor(c("B", "D", "E")), key="V1")

> Y[X] # X is not a factor
   V1
1: NA
2:  B
3: NA


> Y[X]$V1
[1] <NA> B    <NA>
Levels: B D E


** Note that the results have "NA" in them as the join is concerned with retaining levels from "Y".

The first question is: Why this difference? Should there be a difference between when X is or is not a factor? What do you guys think should be the intended result?

The side-effect comes during "merge" as it internally uses this principle (and hence FR #5072). For example:

merge(X, Y, by="V1", all=TRUE)
   V1
1: NA
2: NA
3:  B
4:  D
5:  E


> merge(X, Y, by="V1", all=TRUE)$V1
[1] <NA> <NA> B    D    E
Levels: B D E


The second question is: Is this the intended result?

Arun


From eduard.antonyan at gmail.com  Wed Nov 13 22:55:52 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Wed, 13 Nov 2013 15:55:52 -0600
Subject: [datatable-help] FR #5072 reg.
In-Reply-To: <C7C2536DCFAD426AAC51F55C2337E9F5@gmail.com>
References: <C7C2536DCFAD426AAC51F55C2337E9F5@gmail.com>
Message-ID: <CAHZcBOpY=NhF95LBMpdv5OVgFiqeNf3bqWVQ1tGscMSozzPyEg@mail.gmail.com>

I think case 1 and case 2 should have the same output, and I think that
merge should combine factor levels the way rbind does.
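
One way to read "combine factor levels" is to give both key columns the
union of their levels before matching. A base-R sketch of that reading
follows; it is an illustration only, not what data.table does
internally, and all_levels, x2 and y2 are made-up names.

```r
x <- factor(c("A", "B", "C"))
y <- factor(c("B", "D", "E"))

# Union of the two level sets, x's levels first
all_levels <- union(levels(x), levels(y))

# Re-level both factors onto the shared set; the values are unchanged
x2 <- factor(x, levels = all_levels)
y2 <- factor(y, levels = all_levels)

levels(x2)
levels(y2)
```

After this step both factors carry the same level set, so neither side's
values would have to turn into NA just because the other side lacks the
level.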

Btw another issue about factors exists in rbind'ing the j-expression:

dt = data.table(a = 1:2)

dt[, factor('a', levels = letters[1:.I]), by = a]$V1
#[1] a a
#Levels: a

but if you print out the j-expression it's evident that factor information
gets lost:

dt[, print(factor('a', levels = letters[1:.I])), by = a]
#[1] a
#Levels: a
#[1] a
#Levels: a b



From aragorn168b at gmail.com  Thu Nov 14 13:09:01 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 13:09:01 +0100
Subject: [datatable-help] Bug report #5100 reg.
Message-ID: <C2A623F497574A86AA84A00E5191036B@gmail.com>

Hi everybody, 

It'd be nice if you could weigh in on the bug report filed by Bill here: 
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975

The gist of it is:

require(data.table)
DT <- data.table(x=1:5, y=6:10, z=11:15)
DT[, y] # returns a vector
DT[, "y", with=FALSE] # returns a data.table

The question from the bug report basically is: "why is it that in the first case, 'j' has only one column and we get a vector, but in the second case, we get a data.table?"

My question is: Is this behaviour okay or do you prefer that the first one returns a data.table as well or the second one (with "with=FALSE") returns a vector? 

Thank you,
Arun


From eduard.antonyan at gmail.com  Thu Nov 14 17:25:04 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Thu, 14 Nov 2013 10:25:04 -0600
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <C2A623F497574A86AA84A00E5191036B@gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
Message-ID: <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>

DT[, y] returning a vector is I think the only correct behavior, given the
understanding of j-expression as something evaluated in the DT environment.
If they want a data.table they should simply use DT[, list(y)] or DT[,
data.table(y)].
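
The three spellings under discussion, side by side. This is a small
sketch that assumes data.table is installed; v, dt1 and dt2 are made-up
names.

```r
library(data.table)

DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

v   <- DT[, y]                   # j is the expression y: the column itself
dt1 <- DT[, list(y)]             # j is a list: a one-column data.table
dt2 <- DT[, "y", with = FALSE]   # j is a column name: a one-column data.table

class(v)
class(dt1)
class(dt2)
```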

I haven't thought about DT[, "y", with = FALSE] before as I pretty much
never use that form, but I see an argument for it staying as is: "y" and
c("y") are the same, and we all presumably agree that DT[, c("y", "z"),
with = FALSE] should return a data.table. If DT[, c("y"), with = FALSE]
returned a different type, that would mean inconsistent return types,
which makes life much harder for users (as evidenced by the periodic
drop=FALSE questions that come up on SO).

Going back to DT[, y], note that y and list(y) actually produce *different*
results (in e.g. base_env), so there is no type consistency issue there
between DT[, y] and DT[, list(y, z)].
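
The same distinction can be seen without data.table at all. In this
base-R sketch, with() stands in for "evaluate the expression with the
columns in scope"; the analogy is an assumption made for illustration,
and a, b and b2 are made-up names.

```r
DF <- data.frame(y = 6:10, z = 11:15)

a  <- with(DF, y)            # the expression y: an integer vector
b  <- with(DF, list(y))      # the expression list(y): a list of length 1
b2 <- with(DF, list(y, z))   # same type as b, with one more element

identical(a, b)              # FALSE: a vector is not a one-element list
```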



From aragorn168b at gmail.com  Thu Nov 14 17:33:19 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 17:33:19 +0100
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
 <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
Message-ID: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>

Eddi, At the least, I think the documentation needs to be clearer on the use of "with=FALSE". It does feel inconsistent with the fact that "j" with a single column should return a vector. With data.frames, when "j" is given as column names, a single column name returns a vector unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[, c("x", "y")] will return a data.frame. So, it is inconsistent with data.frame here, I think. 
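
The data.frame behaviour described above, as a short base-R sketch (one_col, two_cols and kept are made-up names):

```r
DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)

one_col  <- DF[, "y"]                 # single column: dropped to a vector
two_cols <- DF[, c("x", "y")]         # several columns: stays a data.frame
kept     <- DF[, "y", drop = FALSE]   # drop = FALSE keeps the data.frame

class(one_col)
class(two_cols)
class(kept)
```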


Arun




From eduard.antonyan at gmail.com  Thu Nov 14 17:39:25 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Thu, 14 Nov 2013 10:39:25 -0600
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
 <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
 <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
Message-ID: <CAHZcBOp7CUEJoMzRuxG-A59_stKoSED+--bsX92PpvZsYNJ_6A@mail.gmail.com>

I agree that it's inconsistent with data.frame, and imo that's a good
thing. We don't replicate the drop argument, so it wouldn't be possible to
return a data.table when with=FALSE. Either way, drop=TRUE by default is
a bad design choice in data.frame and matrix (one that is unlikely to
change given R-core's attitude towards that type of thing).
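
A small base-R sketch of why the drop=TRUE default bites in
programmatic use: the return type depends on how many columns happen to
be selected. n_selected is a made-up helper, not part of any package.

```r
DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# Hypothetical helper: count the selected columns
n_selected <- function(df, cols) ncol(df[, cols])

two <- n_selected(DF, c("x", "y"))   # a data.frame comes back, ncol() is defined
one <- n_selected(DF, "y")           # a bare vector comes back, ncol() is NULL

two
is.null(one)
```

Passing drop = FALSE inside the helper would make both calls return a
column count.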

I'm always pro more and better documentation :)



From aragorn168b at gmail.com  Thu Nov 14 17:46:50 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 17:46:50 +0100
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <CAHZcBOp7CUEJoMzRuxG-A59_stKoSED+--bsX92PpvZsYNJ_6A@mail.gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
 <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
 <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
 <CAHZcBOp7CUEJoMzRuxG-A59_stKoSED+--bsX92PpvZsYNJ_6A@mail.gmail.com>
Message-ID: <862AB4459A55499EB1DA0AB24D04A890@gmail.com>

Glad that we agree on improving the documentation. However, I don't find it a sound argument that we deviate from data.frame because the design is bad, *when we inherit from data.frame*. The choice is already made! Too many such trivial inconsistencies pile up pretty quickly and could potentially result in a steep learning curve, as there is a different set of rules to be memorised.  

If it can't be avoided, the fact that we "inherit from data.frame" *but* this, this, this... and many other things are different should be *very clearly* documented (at the beginning, maybe as a cheat sheet) so that people aren't confused.


Arun


On Thursday, November 14, 2013 at 5:39 PM, Eduard Antonyan wrote:

> I agree that it's inconsistent with data.frame, and imo that's a good thing. We don't replicate the drop argument, so it wouldn't be possible to return a data.table when with=FALSE and either way drop=TRUE by default is a bad design choice in data.frame and matrix (that is unlikely to change given R-core's attitude towards that type of a thing).
> 
> I'm always pro more and better documentation :)
> 
> 
> On Thu, Nov 14, 2013 at 10:33 AM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > Eddi, At the least, I think the documentation needs to be clearer on the use of "with=FALSE". It does feel inconsistent with the fact that "j" with a single column should return a vector. In data.frames, the type in "j" being column names, if it's just one column name, would return a vector, unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[, c("x", "y")] will return a data.frame. So, it is inconsistent with data.frame here, I think. 
> > 
> > 
> > Arun
> > 
> > 
> > On Thursday, November 14, 2013 at 5:25 PM, Eduard Antonyan wrote:
> > 
> > > DT[, y] returning a vector is I think the only correct behavior, given the understanding of j-expression as something evaluated in the DT environment. If they want a data.table they should simply use DT[, list(y)] or DT[, data.table(y)].
> > > 
> > > I haven't thought about DT[, "y", with = FALSE] before as I pretty much never use that form, but I see an argument for it staying as is, because "y" and c("y") are the same and since we all presumably agree that DT[, c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with = FALSE] returned a different type that would mean inconsistent return types which makes life much harder for users (as evidenced by the periodic drop=FALSE questions that come up on SO). 
> > > 
> > > Going back to DT[, y], note that y and list(y) actually produce *different* results (in e.g. base_env), so there is no type consistency issue there between DT[, y] and DT[, list(y, z)].
> > > 
> > > 
> > > On Thu, Nov 14, 2013 at 6:09 AM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > > > Hi everybody, 
> > > > 
> > > > It'd be nice if you could weigh-in on the bug report filed by Bill here: 
> > > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975
> > > > 
> > > > The gist of it is:
> > > > 
> > > > require(data.table)
> > > > DT <- data.table(x=1:5, y=6:10, z=11:15)
> > > > DT[, y] # returns a vector
> > > > DT[, "y", with=FALSE] # returns a data.table
> > > > 
> > > > The question from the bug report basically is: "why is that in the first case, 'j' has only one column and we get a vector, but in the second case, we get a data.table?"
> > > > 
> > > > My question is: Is this behaviour okay or do you prefer that the first one returns a data.table as well or the second one (with "with=FALSE") returns a vector? 
> > > > 
> > > > Thank you,
> > > > Arun
> > > > 
> > > > 
> > > > _______________________________________________
> > > > datatable-help mailing list
> > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > 
> > 
> 


From aragorn168b at gmail.com  Thu Nov 14 17:47:51 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Thu, 14 Nov 2013 17:47:51 +0100
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <862AB4459A55499EB1DA0AB24D04A890@gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
 <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
 <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
 <CAHZcBOp7CUEJoMzRuxG-A59_stKoSED+--bsX92PpvZsYNJ_6A@mail.gmail.com>
 <862AB4459A55499EB1DA0AB24D04A890@gmail.com>
Message-ID: <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>

I'll try to make a list of places where data.table operations differ from data.frame.

Arun


On Thursday, November 14, 2013 at 5:46 PM, Arunkumar Srinivasan wrote:

> Glad that we agree on better-ing the documentation. However, I don't find it a sound argument that we deviate from data.frame because the design is bad, *when we inherit from data.frame*. The choice is already made! Too many such trivial inconsistencies piles up pretty quickly and could potentially result in a steep learning curve - as there are different set of rules to be memorised.  
> 
> Tackling the point of "inheriting from data.frame", *but* this, this, this.. and many other things are different, if can't be avoided, should be *very clearly* documented (in the beginning, maybe as a cheat sheet) so that people aren't confused.
> 
> 
> Arun
> 
> 
> On Thursday, November 14, 2013 at 5:39 PM, Eduard Antonyan wrote:
> 
> > I agree that it's inconsistent with data.frame, and imo that's a good thing. We don't replicate the drop argument, so it wouldn't be possible to return a data.table when with=FALSE and either way drop=TRUE by default is a bad design choice in data.frame and matrix (that is unlikely to change given R-core's attitude towards that type of a thing).
> > 
> > I'm always pro more and better documentation :)
> > 
> > 
> > On Thu, Nov 14, 2013 at 10:33 AM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > > Eddi, At the least, I think the documentation needs to be clearer on the use of "with=FALSE". It does feel inconsistent with the fact that "j" with a single column should return a vector. In data.frames, the type in "j" being column names, if it's just one column name, would return a vector, unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[, c("x", "y")] will return a data.frame. So, it is inconsistent with data.frame here, I think. 
> > > 
> > > 
> > > Arun
> > > 
> > > 
> > > On Thursday, November 14, 2013 at 5:25 PM, Eduard Antonyan wrote:
> > > 
> > > > DT[, y] returning a vector is I think the only correct behavior, given the understanding of j-expression as something evaluated in the DT environment. If they want a data.table they should simply use DT[, list(y)] or DT[, data.table(y)].
> > > > 
> > > > I haven't thought about DT[, "y", with = FALSE] before as I pretty much never use that form, but I see an argument for it staying as is, because "y" and c("y") are the same and since we all presumably agree that DT[, c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with = FALSE] returned a different type that would mean inconsistent return types which makes life much harder for users (as evidenced by the periodic drop=FALSE questions that come up on SO). 
> > > > 
> > > > Going back to DT[, y], note that y and list(y) actually produce *different* results (in e.g. base_env), so there is no type consistency issue there between DT[, y] and DT[, list(y, z)].
> > > > 
> > > > 
> > > > On Thu, Nov 14, 2013 at 6:09 AM, Arunkumar Srinivasan <aragorn168b at gmail.com (mailto:aragorn168b at gmail.com)> wrote:
> > > > > Hi everybody, 
> > > > > 
> > > > > It'd be nice if you could weigh-in on the bug report filed by Bill here: 
> > > > > https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975
> > > > > 
> > > > > The gist of it is:
> > > > > 
> > > > > require(data.table)
> > > > > DT <- data.table(x=1:5, y=6:10, z=11:15)
> > > > > DT[, y] # returns a vector
> > > > > DT[, "y", with=FALSE] # returns a data.table
> > > > > 
> > > > > The question from the bug report basically is: "why is that in the first case, 'j' has only one column and we get a vector, but in the second case, we get a data.table?"
> > > > > 
> > > > > My question is: Is this behaviour okay or do you prefer that the first one returns a data.table as well or the second one (with "with=FALSE") returns a vector? 
> > > > > 
> > > > > Thank you,
> > > > > Arun
> > > > > 
> > > > > 
> > > > > _______________________________________________
> > > > > datatable-help mailing list
> > > > > datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > 
> > > 
> > 
> 


From eduard.antonyan at gmail.com  Thu Nov 14 17:59:09 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Thu, 14 Nov 2013 10:59:09 -0600
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
 <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
 <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
 <CAHZcBOp7CUEJoMzRuxG-A59_stKoSED+--bsX92PpvZsYNJ_6A@mail.gmail.com>
 <862AB4459A55499EB1DA0AB24D04A890@gmail.com>
 <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>
Message-ID: <CAHZcBOrL7_fmXO7PCXK91JjPq9XPU3Oqk9GiJa+1GnyJM0vhbA@mail.gmail.com>

Perhaps a simple sentence along the lines of "drop argument is absent and
should be considered as FALSE when comparing with data.frame in with=FALSE
mode" would suffice. The fact that i-expression is a full-on data.table
i-expression in with=FALSE mode will probably also cause inconsistencies.
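
For readers comparing the two classes, the difference discussed in this thread can be sketched as follows. This is only an illustration, assuming a reasonably recent data.table; the object names (DF, v1, d1, ...) are made up for the example and do not come from the bug report.

```r
library(data.table)

DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DT <- as.data.table(DF)

# data.frame: a single column name drops to a vector unless drop = FALSE
v1 <- DF[, "y"]                 # integer vector
d1 <- DF[, "y", drop = FALSE]   # one-column data.frame

# data.table: with = FALSE returns a data.table for one column or several
d2 <- DT[, "y", with = FALSE]
d3 <- DT[, c("y", "z"), with = FALSE]

# j evaluated as an expression: a bare column gives a vector,
# wrapping it in list() gives a data.table
v2 <- DT[, y]
d4 <- DT[, list(y)]
```

In other words, with=FALSE behaves like data.frame subsetting with drop=FALSE, which is what the proposed documentation sentence says.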


On Thu, Nov 14, 2013 at 10:47 AM, Arunkumar Srinivasan <
aragorn168b at gmail.com> wrote:

>  I'll try to make a list of places where data.table != data.frame
> operation.
>
> Arun
>
> On Thursday, November 14, 2013 at 5:46 PM, Arunkumar Srinivasan wrote:
>
>  Glad that we agree on better-ing the documentation. However, I don't
> find it a sound argument that we deviate from data.frame because the design
> is bad, *when we inherit from data.frame*. The choice is already made! Too
> many such trivial inconsistencies piles up pretty quickly and could
> potentially result in a steep learning curve - as there are different set
> of rules to be memorised.
>
> Tackling the point of "inheriting from data.frame", *but* this, this,
> this.. and many other things are different, if can't be avoided, should be
> *very clearly* documented (in the beginning, maybe as a cheat sheet) so
> that people aren't confused.
>
>
> Arun
>
> On Thursday, November 14, 2013 at 5:39 PM, Eduard Antonyan wrote:
>
> I agree that it's inconsistent with data.frame, and imo that's a good
> thing. We don't replicate the drop argument, so it wouldn't be possible to
> return a data.table when with=FALSE and either way drop=TRUE by default is
> a bad design choice in data.frame and matrix (that is unlikely to change
> given R-core's attitude towards that type of a thing).
>
> I'm always pro more and better documentation :)
>
>
> On Thu, Nov 14, 2013 at 10:33 AM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>  Eddi, At the least, I think the documentation needs to be clearer on the
> use of "with=FALSE". It does feel inconsistent with the fact that "j" with
> a single column should return a vector. In data.frames, the type in "j"
> being column names, if it's just one column name, would return a vector,
> unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[,
> c("x", "y")] will return a data.frame. So, it is inconsistent with
> data.frame here, I think.
>
>
> Arun
>
> On Thursday, November 14, 2013 at 5:25 PM, Eduard Antonyan wrote:
>
> DT[, y] returning a vector is I think the only correct behavior, given the
> understanding of j-expression as something evaluated in the DT environment.
> If they want a data.table they should simply use DT[, list(y)] or DT[,
> data.table(y)].
>
> I haven't thought about DT[, "y", with = FALSE] before as I pretty much
> never use that form, but I see an argument for it staying as is, because
> "y" and c("y") are the same and since we all presumably agree that DT[,
> c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with
> = FALSE] returned a different type that would mean inconsistent return
> types which makes life much harder for users (as evidenced by the periodic
> drop=FALSE questions that come up on SO).
>
> Going back to DT[, y], note that y and list(y) actually produce
> *different* results (in e.g. base_env), so there is no type consistency
> issue there between DT[, y] and DT[, list(y, z)].
>
>
> On Thu, Nov 14, 2013 at 6:09 AM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>  Hi everybody,
>
> It'd be nice if you could weigh-in on the bug report filed by Bill here:
>
> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975
>
> The gist of it is:
>
> require(data.table)
> DT <- data.table(x=1:5, y=6:10, z=11:15)
> DT[, y] # returns a vector
> DT[, "y", with=FALSE] # returns a data.table
>
> The question from the bug report basically is: "why is that in the first
> case, 'j' has only one column and we get a vector, but in the second case,
> we get a data.table?"
>
> My question is: Is this behaviour okay or do you prefer that the first one
> returns a data.table as well or the second one (with "with=FALSE") returns
> a vector?
>
> Thank you,
> Arun
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>
>
>
>
>

From FErickson at psu.edu  Thu Nov 14 19:45:52 2013
From: FErickson at psu.edu (Frank Erickson)
Date: Thu, 14 Nov 2013 13:45:52 -0500
Subject: [datatable-help] Bug report #5100 reg.
In-Reply-To: <CAHZcBOrL7_fmXO7PCXK91JjPq9XPU3Oqk9GiJa+1GnyJM0vhbA@mail.gmail.com>
References: <C2A623F497574A86AA84A00E5191036B@gmail.com>
 <CAHZcBOqT+J03WafYsLsn6JQLv2exu38-0K9SmEByx0A2mskNfw@mail.gmail.com>
 <3DA08D85331E45209E9E1C2D7202FD00@gmail.com>
 <CAHZcBOp7CUEJoMzRuxG-A59_stKoSED+--bsX92PpvZsYNJ_6A@mail.gmail.com>
 <862AB4459A55499EB1DA0AB24D04A890@gmail.com>
 <1D2952C1F9244FF5A473FCAED0B03920@gmail.com>
 <CAHZcBOrL7_fmXO7PCXK91JjPq9XPU3Oqk9GiJa+1GnyJM0vhbA@mail.gmail.com>
Message-ID: <CAJd-hdktoE9raD4rH2DdEMvQWsUidZXWT3uUs6BjpD5c2B_ihA@mail.gmail.com>

For what it's worth, I use the with=FALSE version frequently without
knowing how many columns I have selected, so I like the implicit wrapping
of the columns in a list() (or implicit drop=FALSE). An example (almost)
from something I did yesterday:

mycols <- grep("^Vbar",names(DT),value=TRUE)
DT1 <- DT[,mycols,with=FALSE]
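
A self-contained version of the same pattern, as a rough sketch: the table contents and the helper name select_cols are hypothetical, chosen only to show that the return type does not depend on how many columns the pattern matches.

```r
library(data.table)

# Hypothetical data; only the "Vbar" prefix mirrors the example above
DT <- data.table(id = 1:3,
                 Vbar1 = c(0.1, 0.2, 0.3),
                 Vbar2 = c(1, 2, 3),
                 other = letters[1:3])

# Select columns by regular expression; the result is always a data.table
select_cols <- function(dt, pattern) {
  cols <- grep(pattern, names(dt), value = TRUE)
  dt[, cols, with = FALSE]
}

two <- select_cols(DT, "^Vbar")    # two matching columns
one <- select_cols(DT, "^Vbar1$")  # one matching column, still a data.table
```

If a single match dropped to a vector, code downstream of select_cols would need a special case for it.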

-- Frank


On Thu, Nov 14, 2013 at 11:59 AM, Eduard Antonyan <eduard.antonyan at gmail.com
> wrote:

> Perhaps a simple sentence along the lines of "drop argument is absent and
> should be considered as FALSE when comparing with data.frame in with=FALSE
> mode" would suffice. The fact that i-expression is a full-on data.table
> i-expression in with=FALSE mode will probably also cause inconsistencies.
>
>
> On Thu, Nov 14, 2013 at 10:47 AM, Arunkumar Srinivasan <
> aragorn168b at gmail.com> wrote:
>
>>  I'll try to make a list of places where data.table != data.frame
>> operation.
>>
>> Arun
>>
>> On Thursday, November 14, 2013 at 5:46 PM, Arunkumar Srinivasan wrote:
>>
>>  Glad that we agree on better-ing the documentation. However, I don't
>> find it a sound argument that we deviate from data.frame because the design
>> is bad, *when we inherit from data.frame*. The choice is already made! Too
>> many such trivial inconsistencies piles up pretty quickly and could
>> potentially result in a steep learning curve - as there are different set
>> of rules to be memorised.
>>
>> Tackling the point of "inheriting from data.frame", *but* this, this,
>> this.. and many other things are different, if can't be avoided, should be
>> *very clearly* documented (in the beginning, maybe as a cheat sheet) so
>> that people aren't confused.
>>
>>
>> Arun
>>
>> On Thursday, November 14, 2013 at 5:39 PM, Eduard Antonyan wrote:
>>
>> I agree that it's inconsistent with data.frame, and imo that's a good
>> thing. We don't replicate the drop argument, so it wouldn't be possible to
>> return a data.table when with=FALSE and either way drop=TRUE by default is
>> a bad design choice in data.frame and matrix (that is unlikely to change
>> given R-core's attitude towards that type of a thing).
>>
>> I'm always pro more and better documentation :)
>>
>>
>> On Thu, Nov 14, 2013 at 10:33 AM, Arunkumar Srinivasan <
>> aragorn168b at gmail.com> wrote:
>>
>>  Eddi, At the least, I think the documentation needs to be clearer on the
>> use of "with=FALSE". It does feel inconsistent with the fact that "j" with
>> a single column should return a vector. In data.frames, the type in "j"
>> being column names, if it's just one column name, would return a vector,
>> unless drop = FALSE. That is, DF[, "y"] will return a vector while DF[,
>> c("x", "y")] will return a data.frame. So, it is inconsistent with
>> data.frame here, I think.
>>
>>
>> Arun
>>
>> On Thursday, November 14, 2013 at 5:25 PM, Eduard Antonyan wrote:
>>
>> DT[, y] returning a vector is I think the only correct behavior, given
>> the understanding of j-expression as something evaluated in the DT
>> environment. If they want a data.table they should simply use DT[, list(y)]
>> or DT[, data.table(y)].
>>
>> I haven't thought about DT[, "y", with = FALSE] before as I pretty much
>> never use that form, but I see an argument for it staying as is, because
>> "y" and c("y") are the same and since we all presumably agree that DT[,
>> c("y", "z"), with = FALSE] should return a data.table. If DT[, c("y"), with
>> = FALSE] returned a different type that would mean inconsistent return
>> types which makes life much harder for users (as evidenced by the periodic
>> drop=FALSE questions that come up on SO).
>>
>> Going back to DT[, y], note that y and list(y) actually produce
>> *different* results (in e.g. base_env), so there is no type consistency
>> issue there between DT[, y] and DT[, list(y, z)].
>>
>>
>> On Thu, Nov 14, 2013 at 6:09 AM, Arunkumar Srinivasan <
>> aragorn168b at gmail.com> wrote:
>>
>>  Hi everybody,
>>
>> It'd be nice if you could weigh-in on the bug report filed by Bill here:
>>
>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=5100&group_id=240&atid=975
>>
>> The gist of it is:
>>
>> require(data.table)
>> DT <- data.table(x=1:5, y=6:10, z=11:15)
>> DT[, y] # returns a vector
>> DT[, "y", with=FALSE] # returns a data.table
>>
>> The question from the bug report basically is: "why is that in the first
>> case, 'j' has only one column and we get a vector, but in the second case,
>> we get a data.table?"
>>
>> My question is: Is this behaviour okay or do you prefer that the first
>> one returns a data.table as well or the second one (with "with=FALSE")
>> returns a vector?
>>
>> Thank you,
>> Arun
>>
>>
>> _______________________________________________
>> datatable-help mailing list
>> datatable-help at lists.r-forge.r-project.org
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>>
>>
>>
>>
>>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>

From bogaso.christofer at gmail.com  Sat Nov 16 23:53:40 2013
From: bogaso.christofer at gmail.com (Christofer Bogaso)
Date: Sun, 17 Nov 2013 04:38:40 +0545
Subject: [datatable-help] 'OR' operation with data.table
Message-ID: <CA+dpOJmxs6dMU43BGWXJHYGAnz+vuGBANTM9uSsnhN8GE9Cvvg@mail.gmail.com>

Hello all,

I am a new user of data.table and have really started to like it :)

I am looking for suggestions on how to implement 'OR'/'AND' operators to
fetch a subset of a data.table.

Below is my example data.table (my actual data.table is quite big):

DT = data.table(x = 1:20, y1 = rep(letters[1:4], 5), y2 = rep(LETTERS[1:4],
each = 5))
setkey(DT, y1, y2)

> DT
     x y1 y2
 1:  1  a  A
 2:  5  a  A
 3:  9  a  B
 4: 13  a  C
 5: 17  a  D
 6:  2  b  A
 7:  6  b  B
 8: 10  b  B
 9: 14  b  C
10: 18  b  D
11:  3  c  A
12:  7  c  B
13: 11  c  C
14: 15  c  C
15: 19  c  D
16:  4  d  A
17:  8  d  B
18: 12  d  C
19: 16  d  D
20: 20  d  D


Now I want to fetch those rows for which "(y1 = a OR b) AND (y2 = B OR D)"

With an ordinary data.frame this is straightforward to achieve, but I am
wondering what the data.table way would be for fast computation.

I would really appreciate your help or any pointers.

Thanks and regards,

From aragorn168b at gmail.com  Sat Nov 16 23:56:43 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Sat, 16 Nov 2013 23:56:43 +0100
Subject: [datatable-help] 'OR' operation with data.table
In-Reply-To: <CA+dpOJmxs6dMU43BGWXJHYGAnz+vuGBANTM9uSsnhN8GE9Cvvg@mail.gmail.com>
References: <CA+dpOJmxs6dMU43BGWXJHYGAnz+vuGBANTM9uSsnhN8GE9Cvvg@mail.gmail.com>
Message-ID: <118DC425D96346DEAFCCEF77379D6687@gmail.com>

How about this? 
DT[CJ(c("a", "b"), c("B", "D"))]
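
To spell out why this works: CJ() builds every combination of the two value sets, and the keyed join then looks each combination up in (y1, y2). A minimal sketch that checks it against a plain logical subset, using the data from the question (the names keyed and scanned are only for illustration):

```r
library(data.table)

DT <- data.table(x = 1:20,
                 y1 = rep(letters[1:4], 5),
                 y2 = rep(LETTERS[1:4], each = 5))
setkey(DT, y1, y2)

# Keyed join: one lookup per (y1, y2) combination produced by CJ()
keyed <- DT[CJ(c("a", "b"), c("B", "D"))]

# Equivalent logical subset, written the way one would for a data.frame
scanned <- DT[y1 %in% c("a", "b") & y2 %in% c("B", "D")]
```

Both should select the rows with x equal to 6, 9, 10, 17 and 18. A combination with no matching row would show up in the join as a row of NA; passing nomatch = 0 should drop those.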

Arun


On Saturday, November 16, 2013 at 11:53 PM, Christofer Bogaso wrote:

> Hello all,
> 
> I am a new user of data.table and really started to liking it :)
> 
> I am seeking some suggestion on how I can implement 'OR'/AND' operator to fetch a subset of a data.table. 
> 
> Below is my example data.table (my actual data.table is quite big):
> 
> DT = data.table(x = 1:20, y1 = rep(letters[1:4], 5), y2 = rep(LETTERS[1:4], each = 5))
> setkey(DT, y1, y2)
> 
> 
> > DT
>      x y1 y2
>  1:  1  a  A
>  2:  5  a  A
>  3:  9  a  B
>  4: 13  a  C
>  5: 17  a  D
>  6:  2  b  A
>  7:  6  b  B
>  8: 10  b  B
>  9: 14  b  C
> 10: 18  b  D
> 11:  3  c  A
> 12:  7  c  B
> 13: 11  c  C
> 14: 15  c  C
> 15: 19  c  D
> 16:  4  d  A
> 17:  8  d  B
> 18: 12  d  C
> 19: 16  d  D
> 20: 20  d  D
> 
> 
> 
> Now I want to fetch those rows for which "y1 = a OR b  AND y2 = B OR D"
> 
> with ordinary data.frame, this is straightforward to achieve, however I am wondering what could be the data.table way for fast computation. 
> 
> I would really appreciate for your help/pointer.
> 
> Thanks and regards,
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 



From bogaso.christofer at gmail.com  Sun Nov 17 00:00:33 2013
From: bogaso.christofer at gmail.com (Christofer Bogaso)
Date: Sun, 17 Nov 2013 04:45:33 +0545
Subject: [datatable-help] 'OR' operation with data.table
In-Reply-To: <118DC425D96346DEAFCCEF77379D6687@gmail.com>
References: <CA+dpOJmxs6dMU43BGWXJHYGAnz+vuGBANTM9uSsnhN8GE9Cvvg@mail.gmail.com>
 <118DC425D96346DEAFCCEF77379D6687@gmail.com>
Message-ID: <CA+dpOJm7X-R59mJF9W-dhfJv0GB_sB2yLVrMV=wzkBRYgWdmvA@mail.gmail.com>

Thanks a lot. This is working for me.

Thanks and regards,


On Sun, Nov 17, 2013 at 4:41 AM, Arunkumar Srinivasan <aragorn168b at gmail.com
> wrote:

>  How about this?
> DT[CJ(c("a", "b"), c("B", "D"))]
>
> Arun
>
> On Saturday, November 16, 2013 at 11:53 PM, Christofer Bogaso wrote:
>
> Hello all,
>
> I am a new user of data.table and really started to liking it :)
>
> I am seeking some suggestion on how I can implement 'OR'/AND' operator to
> fetch a subset of a data.table.
>
> Below is my example data.table (my actual data.table is quite big):
>
> DT = data.table(x = 1:20, y1 = rep(letters[1:4], 5), y2 =
> rep(LETTERS[1:4], each = 5))
> setkey(DT, y1, y2)
>
> > DT
>      x y1 y2
>  1:  1  a  A
>  2:  5  a  A
>  3:  9  a  B
>  4: 13  a  C
>  5: 17  a  D
>  6:  2  b  A
>  7:  6  b  B
>  8: 10  b  B
>  9: 14  b  C
> 10: 18  b  D
> 11:  3  c  A
> 12:  7  c  B
> 13: 11  c  C
> 14: 15  c  C
> 15: 19  c  D
> 16:  4  d  A
> 17:  8  d  B
> 18: 12  d  C
> 19: 16  d  D
> 20: 20  d  D
>
>
> Now I want to fetch those rows for which "y1 = a OR b  AND y2 = B OR D"
>
> with ordinary data.frame, this is straightforward to achieve, however I am
> wondering what could be the data.table way for fast computation.
>
> I would really appreciate for your help/pointer.
>
> Thanks and regards,
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
>

From gsee000 at gmail.com  Sun Nov 17 23:32:24 2013
From: gsee000 at gmail.com (G See)
Date: Sun, 17 Nov 2013 16:32:24 -0600
Subject: [datatable-help] .SD is locked
Message-ID: <CA+xi=qZUC6WOtiWjmW9bV+w39XBvfq5k8J9VNidDDQd6BjGHow@mail.gmail.com>

Hi,

Is the following error expected?

> library(data.table)
data.table 1.8.11  For help type: help("data.table")
> x <- as.data.table(BOD)
> xx <- x[, .SD, .SDcols="Time"]
> xx[, Time:=as.numeric(Time)]
Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) :
  .SD is locked. Using := in .SD's j is reserved for possible future
use; a tortuously flexible way to modify by group. Use := in j
directly to modify by group by reference.
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.11

loaded via a namespace (and not attached):
[1] plyr_1.8       reshape2_1.2.2 stringr_0.6.2


Thanks,
Garrett

From michael.nelson at sydney.edu.au  Mon Nov 18 00:11:30 2013
From: michael.nelson at sydney.edu.au (Michael Nelson)
Date: Sun, 17 Nov 2013 23:11:30 +0000
Subject: [datatable-help] .SD is locked
In-Reply-To: <CA+xi=qZUC6WOtiWjmW9bV+w39XBvfq5k8J9VNidDDQd6BjGHow@mail.gmail.com>
References: <CA+xi=qZUC6WOtiWjmW9bV+w39XBvfq5k8J9VNidDDQd6BjGHow@mail.gmail.com>
Message-ID: <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>

I don't believe this is to be expected.

A bug report should be filed (it is present in 1.8.10 on CRAN as well)

.SD is locked so you can't "mess" with it within a call to `[.data.table`, but this "locked" status should not be retained following the completion of that call
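
Until this is fixed, two possible workarounds are sketched below. They are assumptions rather than verified fixes for 1.8.10 or 1.8.11: the first relies on copy() returning an unlocked table, the second simply avoids .SD by selecting the column by name.

```r
library(data.table)

x <- as.data.table(BOD)

# Workaround 1 (assumed): take an explicit copy of the .SD result
xx <- copy(x[, .SD, .SDcols = "Time"])
xx[, Time := as.numeric(Time)]

# Workaround 2 (assumed): select by name so that .SD is never returned
yy <- x[, "Time", with = FALSE]
yy[, Time := as.numeric(Time)]
```

Either way, the original table x should be left unchanged.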


________________________________________
From: datatable-help-bounces at lists.r-forge.r-project.org [datatable-help-bounces at lists.r-forge.r-project.org] on behalf of G See [gsee000 at gmail.com]
Sent: Monday, 18 November 2013 9:32 AM
To: datatable-help at lists.r-forge.r-project.org
Subject: [datatable-help] .SD is locked

Hi,

Is the following error expected?

> library(data.table)
data.table 1.8.11  For help type: help("data.table")
> x <- as.data.table(BOD)
> xx <- x[, .SD, .SDcols="Time"]
> xx[, Time:=as.numeric(Time)]
Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) :
  .SD is locked. Using := in .SD's j is reserved for possible future
use; a tortuously flexible way to modify by group. Use := in j
directly to modify by group by reference.
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.8.11

loaded via a namespace (and not attached):
[1] plyr_1.8       reshape2_1.2.2 stringr_0.6.2


Thanks,
Garrett
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

From aragorn168b at gmail.com  Mon Nov 18 00:29:14 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 18 Nov 2013 00:29:14 +0100
Subject: [datatable-help] .SD is locked
In-Reply-To: <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>
References: <CA+xi=qZUC6WOtiWjmW9bV+w39XBvfq5k8J9VNidDDQd6BjGHow@mail.gmail.com>
 <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>
Message-ID: <5E6D77DD478449CCA64B6544F607A161@gmail.com>

Hm, nice catch! In this special case, the value returned is from this code: 

jval = eval(jsub, SDenv, parent.frame())

Since `jsub = .SD`, this evaluates to .SD ('s value). However, since `jval` remains untouched, a copy is not made (I think). This can be seen with a `tracemem` statement:

x <- as.data.table(BOD)
xx <- x[, {print(tracemem(.SD)); .SD}, .SDcols="Time"]
[1] "<0x7fa4e9a518f0>"
tracemem(xx)
[1] "<0x7fa4e9a518f0>"


Basically `xx` is `.SD` and therefore is 'locked'. I guess a fix would be to check this and make a copy on return. Not sure.

Arun


On Monday, November 18, 2013 at 12:11 AM, Michael Nelson wrote:

> I don't believe this is to be expected.
> 
> A bug report should be filed (it is present in 1.8.10 on CRAN as well)
> 
> .SD is locked so you can't "mess" with it within a call to `[.data.table`, but this "locked" status should not be retained following the completion of that call
> 
> 
> ________________________________________
> From: datatable-help-bounces at lists.r-forge.r-project.org (mailto:datatable-help-bounces at lists.r-forge.r-project.org) [datatable-help-bounces at lists.r-forge.r-project.org (mailto:datatable-help-bounces at lists.r-forge.r-project.org)] on behalf of G See [gsee000 at gmail.com (mailto:gsee000 at gmail.com)]
> Sent: Monday, 18 November 2013 9:32 AM
> To: datatable-help at lists.r-forge.r-project.org (mailto:datatable-help at lists.r-forge.r-project.org)
> Subject: [datatable-help] .SD is locked
> 
> Hi,
> 
> Is the following error expected?
> 
> > library(data.table)
> data.table 1.8.11 For help type: help("data.table")
> > x <- as.data.table(BOD)
> > xx <- x[, .SD, .SDcols="Time"]
> > xx[, Time:=as.numeric(Time)]
> > 
> 
> Error in `[.data.table`(xx, , `:=`(Time, as.numeric(Time))) :
> .SD is locked. Using := in .SD's j is reserved for possible future
> use; a tortuously flexible way to modify by group. Use := in j
> directly to modify by group by reference.
> > sessionInfo()
> 
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> 
> other attached packages:
> [1] data.table_1.8.11
> 
> loaded via a namespace (and not attached):
> [1] plyr_1.8 reshape2_1.2.2 stringr_0.6.2
> 
> 
> Thanks,
> Garrett
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 



From aragorn168b at gmail.com  Mon Nov 18 00:43:19 2013
From: aragorn168b at gmail.com (Arunkumar Srinivasan)
Date: Mon, 18 Nov 2013 00:43:19 +0100
Subject: [datatable-help] .SD is locked
In-Reply-To: <5E6D77DD478449CCA64B6544F607A161@gmail.com>
References: <CA+xi=qZUC6WOtiWjmW9bV+w39XBvfq5k8J9VNidDDQd6BjGHow@mail.gmail.com>
 <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>
 <5E6D77DD478449CCA64B6544F607A161@gmail.com>
Message-ID: <5EEF6BC312F64D0C9E121AA8DDEA1DAC@gmail.com>

Gsee, just adding the line: 

if (identical(jval, SDenv$.SD)) jval = copy(jval)

before `return(jval)` seems to fix this (all tests also complete without any issues). If you're in a hurry for a fix, you could just add it for now.

I'll test it again later and commit with other changes I've staged locally. It'd still be nice to file this as a bug so that it could be tracked. 
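
In the meantime, a user-level workaround in the same spirit is to take an explicit copy of the result before modifying it by reference. This is only a sketch: it assumes, as the one-line patch above does, that `copy()` hands back an unlocked table, and the `Time * 2` update is just an illustration that is not part of the original report.

```r
library(data.table)

x <- as.data.table(BOD)

# Take an explicit deep copy so the result is a table of its own rather than
# the (possibly locked) .SD object returned from j.
xx <- copy(x[, .SD, .SDcols = "Time"])

# Modifying the copy by reference should now be allowed ...
xx[, Time := Time * 2]

# ... and leaves the original table untouched.
x$Time
```

On a build that already contains the fix, the extra `copy()` costs one more copy of the selected columns but should otherwise be harmless.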

Best,
Arun



From gsee000 at gmail.com  Mon Nov 18 00:48:50 2013
From: gsee000 at gmail.com (G See)
Date: Sun, 17 Nov 2013 17:48:50 -0600
Subject: [datatable-help] .SD is locked
In-Reply-To: <5EEF6BC312F64D0C9E121AA8DDEA1DAC@gmail.com>
References: <CA+xi=qZUC6WOtiWjmW9bV+w39XBvfq5k8J9VNidDDQd6BjGHow@mail.gmail.com>
 <6FB5193A6CDCDF499486A833B7AFBDCDA31EE017@ex-mbx-pro-05>
 <5E6D77DD478449CCA64B6544F607A161@gmail.com>
 <5EEF6BC312F64D0C9E121AA8DDEA1DAC@gmail.com>
Message-ID: <CA+xi=qY-gjeFSwDgUMJLGbjWXF=QVOrYzBX02rbvAtB5Rc08_w@mail.gmail.com>

Thanks guys.  Bug report filed.


From danielrlabar at gmail.com  Tue Nov 19 17:39:08 2013
From: danielrlabar at gmail.com (dnlbrky)
Date: Tue, 19 Nov 2013 08:39:08 -0800 (PST)
Subject: [datatable-help] rbind() vs. rbindlist() behavior/warning
In-Reply-To: <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
References: <CA+xi=qZEMYdvs0zHZNm7PCbfX+_kf_AAUEaCUDWiraR4YCxpMw@mail.gmail.com>
 <CAHZcBOoghTOB3kuTS=yDXcWvZy_kUNp9Usxb_6gkCDnMFE1nGg@mail.gmail.com>
 <EA07D5CEEC31426FB0F4A4ACDE3E8458@gmail.com>
Message-ID: <1384879148805-4680743.post@n4.nabble.com>

Arunkumar Srinivasan wrote
> `rbindlist` gained speed (to some extent) by assuming things like this and
> skipping checks in the first place. So, should we include checks like
> this? Also, if "rbind" and/or "rbindlist" are made to do the exact same
> thing, then, what's the purpose of "rbindlist"?

My vote for the purpose of rbindlist is to continue to be a fast version of
rbind for data.tables, while providing as much functionality as possible. 
Could the functionality be optional?  In "bare bones mode" it would be super
fast, and in "full featured mode" it would probably be faster than rbind but
slower than "bare bones".

Like Garrett, I would like to have the option of binding by column names in
rbindlist.  In addition, it would be great if rbindlist could handle missing
columns.  The smartbind function in gtools
<http://cran.stat.ucla.edu/web/packages/gtools/index.html> does both of
these.
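
For illustration, here is roughly what binding by name with missing columns looks like. This sketch relies on the `use.names` and `fill` arguments that `rbindlist` gained in data.table releases later than the 1.8.x builds discussed in this thread, so it will not run on those; the tables `a` and `b` are made up for the example.

```r
library(data.table)

# Two made-up tables: columns in a different order, and each one
# is missing a column that the other has.
a <- data.table(id = 1:2, x = c("p", "q"))
b <- data.table(y = c(TRUE, FALSE), id = 3:4)

# use.names = TRUE matches columns by name rather than by position;
# fill = TRUE pads columns absent from one input with NA.
out <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)
out
```

Leaving both arguments at their defaults keeps the positional, check-free behaviour, which is close to the "bare bones" versus "full featured" split suggested above.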


--
View this message in context: http://r.789695.n4.nabble.com/rbind-vs-rbindlist-behavior-warning-tp4680116p4680743.html
Sent from the datatable-help mailing list archive at Nabble.com.

From saporta at scarletmail.rutgers.edu  Tue Nov 19 23:29:33 2013
From: saporta at scarletmail.rutgers.edu (Ricardo Saporta)
Date: Tue, 19 Nov 2013 17:29:33 -0500
Subject: [datatable-help] Help with code efficieny
Message-ID: <CAE7Aa4TwWi7Od4PKXP1nJA0gzKkJTy6RHTX=yZnTyWp7SosTnA@mail.gmail.com>

Hey guys,

I am working with some code that is taking several hours to run.  I posted
a question on SO about it; if anyone has some thoughts, I am open to
suggestions:

http://stackoverflow.com/questions/20083432/increase-efficiency-in-finding-first-occurrence-of-events

Thanks
Rick

From daniel.krizian at gmail.com  Fri Nov 22 15:15:37 2013
From: daniel.krizian at gmail.com (daniel.krizian)
Date: Fri, 22 Nov 2013 06:15:37 -0800 (PST)
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
Message-ID: <1385129737870-4680965.post@n4.nabble.com>

Hello, I have:

DT <- data.table(a=1:10, b=1:10,c=1:10, key=c("a","b"))
key(DT) # [1] "a" "b"
key(DT[,list(a,b)]) # NULL

Note that DT loses its key when I select a subset of columns like above.

Is this a (known) bug or the expected result?

Maybe it is just me, but I would expect the data.table to retain its key in
the SELECT-like operation, otherwise it causes me to repeatedly call
(expensive) setkey(), when in fact I am not changing the structure of
rows/indices significantly.

Thanks,
Daniel


--
View this message in context: http://r.789695.n4.nabble.com/Key-dropped-when-DT-list-a-b-tp4680965.html
Sent from the datatable-help mailing list archive at Nabble.com.

From lianoglou.steve at gene.com  Fri Nov 22 15:45:20 2013
From: lianoglou.steve at gene.com (Steve Lianoglou)
Date: Fri, 22 Nov 2013 06:45:20 -0800
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
In-Reply-To: <1385129737870-4680965.post@n4.nabble.com>
References: <1385129737870-4680965.post@n4.nabble.com>
Message-ID: <CAHA9McNHfzGu3kSCiG3Faf-DOa=unOEpSQgSfzDe607R=h0U5Q@mail.gmail.com>

Hi Daniel,

On Fri, Nov 22, 2013 at 6:15 AM, daniel.krizian
<daniel.krizian at gmail.com> wrote:
> Hello, I have:
>
> DT <- data.table(a=1:10, b=1:10,c=1:10, key=c("a","b"))
> key(DT) # [1] "a" "b"
> key(DT[,list(a,b)]) # NULL
>
> Note that DT loses its key when I select a subset of columns like above.
>
> Is this a (known) bug/ expected result?

The key is retained for me when I run your code:

R> key(DT[,list(a,b)])
[1] "a" "b"

What version of data.table are you using?

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech

From daniel.krizian at gmail.com  Fri Nov 22 17:32:26 2013
From: daniel.krizian at gmail.com (Daniel Krizian)
Date: Fri, 22 Nov 2013 16:32:26 +0000
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
Message-ID: <CALqbvp4XQ3=9mYzu1LqKLUQ6A7HxkDvt3=iULObparEnN-Caaw@mail.gmail.com>

Hello Steve and thanks for your reply. I am running data.table_1.8.11

Full details below:

> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] timeDate_3010.98           data.table_1.8.11          quantstrat_0.7.8
 [4] foreach_1.4.1              blotter_0.8.15             PerformanceAnalytics_1.1.1
 [7] FinancialInstrument_1.1    quantmod_0.4-0             Defaults_1.1-1
[10] TTR_0.22-0                 xts_0.9-5                  zoo_1.7-10

loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_3.0.2      iterators_1.0.6 lattice_0.20-23 plyr_1.8
[6] reshape2_1.2.2  stringr_0.6.2   tools_3.0.2



-- 
*____________________________*
*Daniel Krizian, CFA, CAIA*
T: +44 74 5372 1101
M: daniel.krizian at gmail.com
uk.linkedin.com/in/danielkrizian
B: quantology.wordpress.com

From daniel.krizian at gmail.com  Fri Nov 22 17:33:26 2013
From: daniel.krizian at gmail.com (daniel.krizian)
Date: Fri, 22 Nov 2013 08:33:26 -0800 (PST)
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
In-Reply-To: <CAHA9McNHfzGu3kSCiG3Faf-DOa=unOEpSQgSfzDe607R=h0U5Q@mail.gmail.com>
References: <1385129737870-4680965.post@n4.nabble.com>
 <CAHA9McNHfzGu3kSCiG3Faf-DOa=unOEpSQgSfzDe607R=h0U5Q@mail.gmail.com>
Message-ID: <CALqbvp4eCeN-=tQ3SnzN_w=fkj=QHaEOWWrYvFusPzzeFa65Zg@mail.gmail.com>

Hello Steve and thanks for your reply. I am running data.table_1.8.11


-- 
*____________________________*
*Daniel Krizian, CFA, CAIA*
T: +44 74 5372 1101
M: daniel.krizian at gmail.com
uk.linkedin.com/in/danielkrizian
B: quantology.wordpress.com


--
View this message in context: http://r.789695.n4.nabble.com/Key-dropped-when-DT-list-a-b-tp4680965p4680970.html
Sent from the datatable-help mailing list archive at Nabble.com.

From eduard.antonyan at gmail.com  Fri Nov 22 18:01:27 2013
From: eduard.antonyan at gmail.com (Eduard Antonyan)
Date: Fri, 22 Nov 2013 11:01:27 -0600
Subject: [datatable-help] Key dropped when DT[, list(a, b)]
In-Reply-To: <CALqbvp4eCeN-=tQ3SnzN_w=fkj=QHaEOWWrYvFusPzzeFa65Zg@mail.gmail.com>
References: <1385129737870-4680965.post@n4.nabble.com>
 <CAHA9McNHfzGu3kSCiG3Faf-DOa=unOEpSQgSfzDe607R=h0U5Q@mail.gmail.com>
 <CALqbvp4eCeN-=tQ3SnzN_w=fkj=QHaEOWWrYvFusPzzeFa65Zg@mail.gmail.com>
Message-ID: <CAHZcBOqwhNrrJepkYysdb0zb0rd=NMmfmYmR5AcAKpMOUGChuA@mail.gmail.com>

This was fixed relatively recently in revision 999, so try updating your
build.
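
If updating is not possible right away, one defensive pattern is to check the key after the column selection and restore it only when it has actually been dropped. A minimal sketch, reusing the table from the original post (the variable name `sub` is just for illustration):

```r
library(data.table)

DT <- data.table(a = 1:10, b = 1:10, c = 1:10, key = c("a", "b"))
sub <- DT[, list(a, b)]

# On builds that drop the key, restore it; on builds that keep it,
# this branch is skipped and no re-keying is done.
if (!identical(key(sub), key(DT))) setkeyv(sub, key(DT))

key(sub)
```

This way the call to `setkeyv()` only happens on affected builds, and the check can simply be deleted once every machine runs a fixed version.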

