[datatable-help] changing data.table by-without-by syntax to require a "by"

Sadao Milberg s_milberg at hotmail.com
Mon Apr 29 22:21:11 CEST 2013


Also, the issue isn't that data.table has different behavior given different types of inputs.  I don't think there is anything wrong with doing that.  After all, I think everyone here is okay with a data.table as `i` vs. a vector or a variable name producing different outcomes.

The concern here is about which other behavior gets triggered.  The default behavior when using a data.table for `i` and nothing for `by` is a somewhat advanced outcome that cannot easily be predicted, or even understood, by people who know only the basic operation of data.table (i.e. `i` is for join/indexing, `j` is for evaluating expressions in the context of DT, `by` is for split-apply-combine).  As a result, usage and documentation become less accessible than they could be.
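
To make the concern concrete, here is a minimal sketch with two made-up tables, `X` and `Y` (hypothetical names and data, not taken from any earlier post). The first form spells out the join and the grouping as separate steps; the second is the form whose default is being debated. What the second call returns depends on whether by-without-by is the default in the installed data.table version, so I show no output for it.

```r
library(data.table)

# Hypothetical example data: X is keyed on `a`, Y holds the lookup values.
X <- data.table(a = c("p", "p", "q", "q"), v = 1:4, key = "a")
Y <- data.table(a = c("p", "q"))

# Explicit form: join first, then split-apply-combine with a visible `by`.
explicit <- X[Y][, list(total = sum(v)), by = a]

# Implicit form under discussion: same join, `j` supplied, no `by`.
# Where by-without-by is the default, `j` is evaluated once per row of Y.
implicit <- X[Y, list(total = sum(v))]

print(explicit)
print(implicit)
```

The explicit form uses only the three basic ideas (`i` joins, `j` evaluates, `by` groups), which is why I would expect a newcomer to be able to predict its result.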

S.

Date: Mon, 29 Apr 2013 08:43:19 -0500
From: eduard.antonyan at gmail.com
To: aragorn168b at gmail.com
CC: datatable-help at lists.r-forge.r-project.org
Subject: Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

It might help to think of this as an improvement proposal rather than a problem fix proposal.

On Mon, Apr 29, 2013 at 8:40 AM, Eduard Antonyan <eduard.antonyan at gmail.com> wrote:

Thanks Arun, the examples you give are probably interesting in their own right, but your post doesn't address advantages/disadvantages of either current or proposed syntaxes and simply points out the (obvious) fact that current (and other, similar in some ways to current) behavior is possible to implement in R.



On Sat, Apr 27, 2013 at 10:49 AM, Arunkumar Srinivasan <aragorn168b at gmail.com> wrote:


Hello,
I thought I'd also chip in with my thoughts on eddi's feature request.

Short answer: I don't think this feature is necessary. I basically agree with mnel's reply.
Long answer: My argument goes along these lines (in addition to the S3/S4 methods mnel mentions). If, for example, you type `[.data.frame` in your R session, you'll see this snippet:




        if (is.matrix(i)) 
            return(as.matrix(x)[i])

That is, if you do: 

    df <- data.frame(x=1:5, y=1:5, z=1:5)
    mm <- matrix(1:12, ncol=3)
    df[mm] # gives
    # [1] 1 2 3 4 5 1 2 3 4 5 1 2

    df <- data.frame(x=1:2, y=1:2, z=1:2)
    df[mm] # gives
    # [1]  1  2  1  2  1  2 NA NA NA NA NA NA

Here, the index is a matrix. It's obvious. Now, should this behaviour be changed because people would be confused that subsetting a data.frame returned a vector? Or because it's not user-friendly? Even better, try out `df[mm, ]`. If `i` is a matrix, this is what the code does. I am not convinced this is "bad" design. Functions take arguments of different types ALL the time, and they return outputs *depending on the type of input*. This is why I am not sold on the point of "bad design". It's essential to know the types of object `i` can take and to *understand* them.
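
A short sketch of the same point in base R, reusing the two-row `df` and `mm` from above: the kind of object that comes back depends on what `i` is and where it appears in the call. The names `by_cols`, `by_matrix` and `by_rows` are only for illustration.

```r
df <- data.frame(x = 1:2, y = 1:2, z = 1:2)
mm <- matrix(1:12, ncol = 3)

by_cols   <- df[1:2]   # vector as the only index: selects columns
by_matrix <- df[mm]    # matrix as the only index: as.matrix(df)[mm]
by_rows   <- df[mm, ]  # matrix in the row position: used as row numbers

is.data.frame(by_cols)    # a data.frame with columns x and y
is.data.frame(by_matrix)  # not a data.frame: an atomic vector of length 12
dim(by_rows)              # a data.frame again, one row per element of mm
```

All three calls are legal and documented; they simply dispatch on the type and position of `i`.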




If a function is designed to take several types of object for `i`, and the behaviour for each type is documented and consistent, then I can't accept there's a problem.

I agree there are people who don't read the manual and "try" things out. But they are going to have problems with every other function in R. 




For example, "unstack" is a function for which the same input type gives different output types. That is, it returns a data.frame if the columns have equal length after unstacking, and a list if they do not. Compare the outputs of:




    df <- data.frame(x=rep(1:3, each=3), y=1:9)
    unstack(df, y ~ x)

with

    df <- data.frame(x=c(rep(1:3, each=3), 3), y=1:10)
    unstack(df, y ~ x)

But if people don't read the documentation, they won't know about this difference until they run into errors. Now, making it user-friendly would mean that it "always" returns a list.
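
As a rough sketch of that difference (the names `balanced` and `unbalanced` are mine, the data are the same as above), the two results can be compared directly. For contrast, `split()` is shown as one base function that returns a list in both cases; this is only an illustration of what an "always a list" interface looks like, not a proposal to change `unstack`.

```r
balanced   <- data.frame(x = rep(1:3, each = 3), y = 1:9)
unbalanced <- data.frame(x = c(rep(1:3, each = 3), 3), y = 1:10)

u1 <- unstack(balanced, y ~ x)    # equal group sizes
u2 <- unstack(unbalanced, y ~ x)  # unequal group sizes

is.data.frame(u1)  # TRUE
is.data.frame(u2)  # FALSE: a plain list

# split() returns a plain list whether or not the groups are equal in size
s1 <- split(balanced$y, balanced$x)
s2 <- split(unbalanced$y, unbalanced$x)
```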




Now, is this "bad" design because it gives two object types for the same input type? Does it require a change? I personally don't think so.

To sum up, what eddi points out as "not being user-friendly" (or arguably "bad design") is everywhere inside R if you look closely. My view is that some effort should clearly go into understanding a function before using it. Not all functions are plain and simple. Some functions have exceptions and some packages have a steep learning curve.




Best,
Arun.


On Sat, Apr 27, 2013 at 12:00 PM, <datatable-help-request at lists.r-forge.r-project.org> wrote:


> Today's Topics:
>
>    1. Re: changing data.table by-without-by syntax to require a
>       "by" (Frank Erickson)
>    2. Re: variable column names (Sam Steingold)
>    3. Re: variable column names (Matthew Dowle)
>    4. Re: changing data.table by-without-by syntax to require a
>       "by" (Matthew Dowle)
>    5. Re: variable column names (Victor Kryukov)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 26 Apr 2013 15:34:39 -0500
> From: Frank Erickson <FErickson at psu.edu>
> To: "data.table source forge"
>         <datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] changing data.table by-without-by syntax
>         to require a "by"
> Message-ID:
>         <CAJd-hdkv1oiSjfA625oBxmXwr5YuVUzz==3GLaWJTakAtzMJVw at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I disagree with the criticism of data.table's complexity (in the OP).
> There's nothing wrong with overloading the syntax (that is what CS people
> call it, right?). As long as Matthew's in control of it, it's likely to
> have some internal consistency (which, of course, he could explain).
> However, I like the suggestion to add options (defaulting to something
> globally adjustable) to disable some of the overloading. Along similar
> lines (I think), I find unique.data.table very unintuitive. I can see how
> it could be useful, but strongly prefer base::unique for my current
> applications.
>
> Anyway, I have nothing particular to say about the piece of syntax you all
> are currently discussing. I just registered with this list to chime in
> here, instead of further cluttering SO (where eddi answered one of my
> questions yesterday). These emails sure are wide; must be like 1500px!
> Interesting to try out this ancient mailing-list form of communication.
> Please let me know if I should be using "Reply All" or actually quoting
> that massive thread (as everyone else seems to be doing with each post).
>
> Frank
>
> ------------------------------
>
> Message: 2
> Date: Fri, 26 Apr 2013 18:02:31 -0400
> From: Sam Steingold <sds at gnu.org>
> To: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] variable column names
> Message-ID: <87wqrpj6h4.fsf at gnu.org>
> Content-Type: text/plain
>
> > * Sam Steingold <fqf at tah.bet> [2013-04-26 13:05:39 -0400]:
> >
> >> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-26 17:45:53 +0100]:
> >>
> >> S.O. is probably better for this kind of question then.
> >> But if you don't get an answer there, then come back to datatable-help.
> >
> > http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns
>
> downvoted, unlikely to be answered.
>
> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.10 (quantal) X 11.0.11300000
> http://www.childpsy.net/ http://iris.org.il http://think-israel.org
> http://americancensorship.org http://pmw.org.il http://mideasttruth.com
> We have preferences. You have biases. They have prejudices.

>
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 26 Apr 2013 23:47:55 +0100
> From: Matthew Dowle <mdowle at mdowle.plus.com>
> To: <sds at gnu.org>
> Cc: datatable-help at lists.r-forge.r-project.org
> Subject: Re: [datatable-help] variable column names
> Message-ID: <30d6ae8f1a0d6974ebbd54da0d86f3b2 at imap.plus.net>
> Content-Type: text/plain; charset=UTF-8; format=flowed
>
> On 26.04.2013 23:02, Sam Steingold wrote:
> >> * Sam Steingold <fqf at tah.bet> [2013-04-26 13:05:39 -0400]:
> >>
> >>> * Matthew Dowle <zqbjyr at zqbjyr.cyhf.pbz> [2013-04-26 17:45:53
> >>> +0100]:
> >>>
> >>> S.O. is probably better for this kind of question then.
> >>> But if you don't get an answer there, then come back to
> >>> datatable-help.
> >>
> >>
> >> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns
> >
> > downvoted, unlikely to be answered.
>
> I've read it through.
>
> Perhaps sleep on it, don't look for 24hrs and look again as if you were
> trying to answer it yourself. Are there any small changes you can make
> to make it easier to answer?  It wasn't me that downvoted but I suspect
> it's been done to encourage you to improve the question. Downvotes can
> (and often are) reversed.  I've had many more downvotes than you once,
> but then I improved it and it went to +10.
>
> And, it's Friday and we've all had a long week!
>
> Matthew
>
>
>
>
> ------------------------------
>



> Message: 4
> Date: Sat, 27 Apr 2013 00:35:17 +0100
> From: Matthew Dowle <mdowle at mdowle.plus.com>
> To: Frank Erickson <FErickson at psu.edu>
> Cc: "data.table source forge"
>         <datatable-help at lists.r-forge.r-project.org>
> Subject: Re: [datatable-help] changing data.table by-without-by syntax
>         to require a "by"
> Message-ID: <be967ecd9c927ade15c15eb9985d919e at imap.plus.net>
> Content-Type: text/plain; charset="utf-8"
>
> Thanks for your comments Frank.
>
> Ha, yes it's ancient but still has a place. Yes "reply all": Back To:
> sender (if it's to someone in particular) and cc the list. But on general
> topics where lots of people are on the thread, just To: datatable-help
> alone is fine. Personally I prefer "top posting". Like I'm doing now. I
> only scroll down if I need to. I didn't notice the history was building
> up. If you comment inline later, then say "scroll down for comments
> inline" or something at the top. Note that Nabble collapses the history
> for you so threads are much easier to read there. Or I tend to read via
> RSS (gmane) in Outlook, so it feels like an email inbox which turns bold
> on new posts. You only need to subscribe to post (spam control). Most
> people turn off mail delivery pretty quickly I imagine (or setup an auto
> rule to move into a folder, but then you might as well subscribe to RSS
> I guess).
>
> S.O. is quite strict: must be clear questions with a clear answer, only
> one of which can be accepted. No opinion, voting, discussing or notices
> (enter mailing lists). Chat room is good but for quick chat when people
> are in the room at the same time. Many companies (sensibly) block chat
> access, though. Mailing lists allows all timezones a chance at a slower
> pace. Anonymity is just as acceptable and as easy in both places.
>
> Matthew
>
>



>
> ------------------------------
>
> Message: 5
> Date: Fri, 26 Apr 2013 16:42:04 -0700
> From: Victor Kryukov <victor.kryukov at gmail.com>
> To: Matthew Dowle <mdowle at mdowle.plus.com>
> Cc: datatable-help at lists.r-forge.r-project.org, sds at gnu.org
> Subject: Re: [datatable-help] variable column names
> Message-ID:
>         <CANJmMqTz5+6djLEwpZxsub6LB=3L37=JB3xt5AhG1XgWG=nJgw at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Fri, Apr 26, 2013 at 3:47 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:
> > On 26.04.2013 23:02, Sam Steingold wrote:
> >>> http://stackoverflow.com/questions/16241687/summarize-a-data-table-across-multiple-columns
> >>
> >> downvoted, unlikely to be answered.
> >
> > I've read it through.
> >
> > Perhaps sleep on it, don't look for 24hrs and look again as if you were
> > trying to answer it yourself. Are there any small changes you can make to
> > make it easier to answer?  It wasn't me that downvoted but I suspect it's
> > been done to encourage you to improve the question. Downvotes can (and often
> > are) reversed.  I've had many more downvotes than you once, but then I
> > improved it and it went to +10.
> >
> > And, it's Friday and we've all had a long week!
>
> Beautiful advice, Matthew!
>
> Sam - I've provided my answer (and even used Reduce since you seem to
> be coming from Lisp land), but I also think some of the down
> votes/comments have their merit.



>
>
> ------------------------------
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
> End of datatable-help Digest, Vol 38, Issue 26
> **********************************************


