Joe,<div><br></div><div>We may be coming from different experiences or backgrounds, so at some point it may become necessary to agree to disagree and move on, because we are repeating ourselves. I'll give it one more try. I made a mistake and entered R code that was accepted by data.table, and a result was returned without any warning. Since I was grouping, this result seemed obviously wrong considering my intentions. I expected to see fewer rows. However, the program did NOT understand my intent or expectations. That IS the point.</div>
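<div><br></div><div>To make the mistake concrete, here is a minimal sketch in base R, assuming toy data modeled on Steve Lianoglou's example quoted below. It uses tapply rather than data.table so that it runs without any packages; the variable names are only for illustration.</div>

```r
# Toy data modeled on the example quoted later in this thread.
coursecode <- c(NA, NA, NA, 101, 102, 101, 102, 103)
student_id <- c(1, 1, 1, 1, 1, 2, 2, 2)

# What I typed: paste() without collapse returns one string per input element,
# so nothing is aggregated and the result is as long as the input.
per_element <- paste(coursecode)
length(per_element)

# What I meant: collapse folds each student's values into a single string,
# giving one result per student.
per_group <- tapply(coursecode, student_id, paste, collapse = ",")
length(per_group)
```

<div>Both calls are valid R and neither produces a warning, which is why the mistake was easy to miss.</div>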
<div><br></div><div>This exact point is about "usability". There is a whole field of study on this topic that you can read about. I am not talking about functional correctness. A program might be 100% correct in the way it makes calculations and performs its functions. However, at the same time, it may still lead users to make mistakes (such as deleting files) and/or draw incorrect conclusions, or it may not prevent their mistakes (ignorance), or it may not recover from their mistakes. Sometimes, users can think that the program is not functioning even when it functions correctly (or vice versa). To make programs more usable, the programmer has to go the extra mile; i.e., the programmer must understand users' intent and think about possible "user errors". These are different from program errors or bugs that can result in functional failures. Here "user" refers to all of us because we continuously use R and its libraries.</div>
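<div><br></div><div>As an illustration of the kind of message I have in mind, here is a rough sketch in base R. The helper summarise_by is hypothetical (it is not part of data.table or base R), and the wording of the warning is only an assumption about what might be useful.</div>

```r
# Hypothetical helper, not part of data.table or base R: apply `fun` to each
# group of `x` and warn when any group yields more than one value, since that
# usually means the function did not aggregate.
summarise_by <- function(x, group, fun, ...) {
  pieces <- lapply(split(x, group), fun, ...)
  sizes <- lengths(pieces)
  if (any(sizes > 1L)) {
    warning("result has more than one value for ", sum(sizes > 1L),
            " group(s); did you mean to aggregate?", call. = FALSE)
  }
  pieces
}

codes <- c(NA, 101, 102, 101, 103)
ids   <- c(1, 1, 1, 2, 2)

summarise_by(codes, ids, paste)                  # warns: nothing was collapsed
summarise_by(codes, ids, paste, collapse = ",")  # silent: one string per group
```

<div>Since expanding a group into several rows is sometimes intentional, as Steve Lianoglou notes below, a check like this would probably need to be optional.</div>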
<div><br></div><div>By no means am I suggesting that usability is a trivial problem. And usability problems with open source software arise naturally, I think. A lot of open source software comes from communities who want to solve their own problems, so making the software appealing to others may not always be an important concern. A second problem is that technical people take a certain pride in solving problems in clever ways, and they tend to stick with those solutions and their way of thinking, even if those clever ways are not easily understandable or intuitive to others.</div>
<div><br></div><div>I think open source and statistical communities are full of wonderful and technically strong people but it is time to start paying attention to usability issues too. If a program is not usable, eventually it will not realize its full potential. </div>
<div><br></div><div>Thanks & good luck,</div><div><br></div><div>Steve</div><div><br><div class="gmail_quote">On Fri, May 6, 2011 at 11:52 PM, Joseph Voelkel <span dir="ltr"><<a href="mailto:jgvcqa@rit.edu" target="_blank">jgvcqa@rit.edu</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div lang="EN-US" link="blue" vlink="purple"><div><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Steve,</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Afraid I don’t see what you are saying. You entered a perfectly reasonable set of code. Why would you expect the program to know that is not what you intended?</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Joe</span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in"><p class="MsoNormal"><b><span style="font-size:10.0pt">From:</span></b><span style="font-size:10.0pt"> Steve Harman [mailto:<a href="mailto:stvharman@gmail.com" target="_blank">stvharman@gmail.com</a>] <br>
<b>Sent:</b> Friday, May 06, 2011 4:50 PM<br><b>To:</b> Joseph Voelkel</span></p><div><div></div><div><br><b>Subject:</b> Re: [datatable-help] using paste function while grouping gives strange results</div></div>
<p></p></div><div><div></div><div><p class="MsoNormal"> </p><p class="MsoNormal">Joseph H.,</p><div><p class="MsoNormal"> </p></div><div><p class="MsoNormal">Humans will always make mistakes. That's why one of the most important non-functional attributes of a program, "usability", also includes "recovery from errors". Consider it, and you will make your good software great; ignore it, and you will suffer or inflict suffering on users. </p>
</div><div><p class="MsoNormal"> </p></div><div><p class="MsoNormal">This is almost like a natural law; you can't break it, you can only break yourself against it.</p></div><div><p class="MsoNormal"> </p></div><div><p class="MsoNormal">
Steve</p></div><div><p class="MsoNormal"> </p></div><div><div><p class="MsoNormal">On Fri, May 6, 2011 at 3:26 PM, Joseph Voelkel <<a href="mailto:jgvcqa@rit.edu" target="_blank">jgvcqa@rit.edu</a>> wrote:</p><div>
<div>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Steve H, </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">As an R user, I sometimes make fundamental mistakes (like forgetting to use collapse with the paste function when I want to collapse).</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">However, R is a powerful language. It assumes the user knows what he or she is doing unless something is almost certainly wrong (Steve L provided some examples. This seems like the 80-90% you mentioned, but it’s probably more in the 95%-99% range.) In my opinion, it is unrealistic for you to make what are really programming mistakes on your part (for what you INTENDED—if you INTENDED something else it would not be a mistake) and then expect the software to be able to read your INTENT. </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">I am not a great programmer, but having worked with software that prints out too many warnings—or worse, that will not let you do some things because the programmers decided a user would be unlikely to want to do this—I prefer R’s approach.</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Regarding the recycling note recently posted—yes, that may be a nice option. (But will you need to have a third option: “don’t print out recycling warnings for vectors of length 1”? That’s usually done intentionally.)</span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Regards,</span></p><p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D"> </span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;color:#1F497D">Joe V.</span></p><div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in"><p class="MsoNormal"><b><span style="font-size:10.0pt">From:</span></b><span style="font-size:10.0pt"> <a href="mailto:datatable-help-bounces@r-forge.wu-wien.ac.at" target="_blank">datatable-help-bounces@r-forge.wu-wien.ac.at</a> [mailto:<a href="mailto:datatable-help-bounces@r-forge.wu-wien.ac.at" target="_blank">datatable-help-bounces@r-forge.wu-wien.ac.at</a>] <b>On Behalf Of </b>Steve Harman<br>
<b>Sent:</b> Friday, May 06, 2011 2:05 PM<br><b>To:</b> Steve Lianoglou</span></p><div><p class="MsoNormal"><br><b>Cc:</b> <a href="mailto:datatable-help@r-forge.wu-wien.ac.at" target="_blank">datatable-help@r-forge.wu-wien.ac.at</a><br>
<b>Subject:</b> Re: [datatable-help] using paste function while grouping gives strange results</p></div></div><p class="MsoNormal"> </p><div><p class="MsoNormal">Steve,</p></div><div><div><div><p class="MsoNormal"> </p></div>
<p class="MsoNormal">These are good examples of confusing statements. </p><div><p class="MsoNormal">In some cases, people might prefer to use them intentionally for certain purposes, </p></div><div><p class="MsoNormal">(even in that case, it would detract from the readability or maintainability of programs).</p>
</div><div><p class="MsoNormal">On the other side of the coin, they are masking program errors.</p></div><div><p class="MsoNormal">It is a mistake that R overlooked such usability issues (i.e., programmer usability).</p>
</div>
<div><p class="MsoNormal">And two wrongs do not make a right.</p></div><div><p class="MsoNormal"> </p><div><p class="MsoNormal">I wouldn't go as far as saying that R should have been</p></div><div><p class="MsoNormal">
a typed language, but I do strongly believe that R libraries can be made</p></div><div><p class="MsoNormal">more user or developer friendly (still using the command line).</p></div><div><p class="MsoNormal">In the places where you suspect that, with 80-90%</p>
</div><div><p class="MsoNormal">probability, the user or programmer might be doing something unexpected,</p></div><div><p class="MsoNormal">just issue a warning.</p></div><div><p class="MsoNormal"> </p></div><div><p class="MsoNormal">
</p><div><p class="MsoNormal">On Fri, May 6, 2011 at 10:48 AM, Steve Lianoglou <<a href="mailto:mailinglist.honeypot@gmail.com" target="_blank">mailinglist.honeypot@gmail.com</a>> wrote:</p><p class="MsoNormal">Hi Steve,<br>
<br>As (another :-) aside -- make sure you use "reply-all" when replying<br>to messages from this (and pretty much all other R-related) mailing<br>lists, otherwise your mail goes straight to the person, and not back<br>
to the list.<br><br>Other comments in line:<br><br>On Fri, May 6, 2011 at 10:29 AM, Steve Harman <<a href="mailto:stvharman@gmail.com" target="_blank">stvharman@gmail.com</a>> wrote:<br>> Steve, this works.<br><br>
Great! Glad to hear it.</p><div><p class="MsoNormal" style="margin-bottom:12.0pt"><br>> However, this discussion shows that we need some error or<br>> at least warning messages in this case.</p></div><p class="MsoNormal">
For this particular case, I'd respectfully have to disagree.</p><div><p class="MsoNormal" style="margin-bottom:12.0pt"><br>> It is important to pay attention to user (in this case programmer)<br>> experience and facilitate recovery from<br>
> mistakes by providing the user with meaningful and timely messages.<br>> thanks for all your help,</p></div><p class="MsoNormal">I would argue that what happened to you is actually "expected behavior."<br>
<br>You'll find that in many contexts, if "R" thinks it can figure out<br>what you intended to do with two vectors that aren't the same length,<br>it will try to be smart and do it.<br><br>For instance, this is similar to what happened to you -- notice how<br>
TRUE is recycled to be as long as the first column here:<br><br>R> data.frame(id=letters[1:5], huh=TRUE)<br> id huh<br>1 a TRUE<br>2 b TRUE<br>3 c TRUE<br>4 d TRUE<br>5 e TRUE<br><br>Perhaps more strangely, but still "R-correct" (note no warning):<br>
<br>R> 1:3 + 1:6 ## == c(1:3,1:3) + 1:6<br>[1] 2 4 6 5 7 9<br><br>R thinks this is strange, but still does "something" for you (but<br>gives a warning since the 2nd vector's length isn't a multiple of the first's):<br>
<br>R> 1:3 + 1:7<br>[1] 2 4 6 5 7 9 8<br>Warning message:<br>In 1:3 + 1:7 :<br> longer object length is not a multiple of shorter object length<br><br>Often times I actually take advantage of the situation that happened<br>
to you to expand a result into several rows (instead of just into 1)<br>when doing split/summarize/merge stuff with data.table's [,<br>by='something'] mojo.<br><br>My 2 cents,<br><span style="color:#888888"><br>
-steve</span></p><div><div><p class="MsoNormal" style="margin-bottom:12.0pt"><br>> On Fri, May 6, 2011 at 9:44 AM, Steve Harman <<a href="mailto:stvharman@gmail.com" target="_blank">stvharman@gmail.com</a>> wrote:<br>
>><br>>> Thanks, I'll try it today and let you know.<br>>><br>>> On Fri, May 6, 2011 at 12:22 AM, Steve Lianoglou<br>>> <<a href="mailto:mailinglist.honeypot@gmail.com" target="_blank">mailinglist.honeypot@gmail.com</a>> wrote:<br>
>>><br>>>> Hi,<br>>>><br>>>> As an aside -- in the future, please provide some data in a form that<br>>>> we can just copy and paste from your email into an R session so that<br>
>>> we can get a working object up quickly.<br>>>><br>>>> For example:<br>>>><br>>>> R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),<br>>>> student_id=c(1, 1, 1, 1, 1, 2, 2, 2),<br>
>>> key='student_id')<br>>>><br>>>> On Thu, May 5, 2011 at 10:54 PM, Steve Harman <<a href="mailto:stvharman@gmail.com" target="_blank">stvharman@gmail.com</a>><br>>>> wrote:<br>
>>> > Hello<br>>>> ><br>>>> > I have a data table called dt in which each student can have multiple<br>>>> > records (created using data.table)<br>>>> ><br>>>> > coursecode student_id<br>
>>> > ---------------- ----------------<br>>>> > NA 1<br>>>> > NA 1<br>>>> > NA 1<br>>>> > .... 1<br>
>>> > .... 1<br>>>> > NA 2<br>>>> > 101 2<br>>>> > 102 2<br>>>> > NA 2<br>>>> > 103 2<br>
>>> ><br>>>> > I am trying to group by student id and concatenate the coursecode<br>>>> > strings in<br>>>> > student records. This string is mostly NA but it can also be real<br>
>>> > course code<br>>>> > (because of messy real life data coursecode was not always entered)<br>>>> > There are 999999 records.<br>>>> ><br>>>> > So, I thought I would get results like<br>
>>> ><br>>>> > 1 NA NA NA .....<br>>>> > 2 NA 101 102 NA 123 ....<br>>>><br>>>> What type of object are you expecting that result to be?<br>>>><br>>>> > However, as seen below, it brings me a result with 999999 rows<br>
>>> > and it fails to concatenate the coursecode's.<br>>>> ><br>>>> >> codes <- dt[,paste(coursecode),by=student_id]<br>>>> >> codes<br>>>> > student_id V1<br>
>>> > [1,] 1 NA<br>>>> > [2,] 1 NA<br>>>> > [3,] 1 NA<br>>>> > [4,] 1 NA<br>>>> > [5,] 1 NA<br>>>> > [6,] 1 NA<br>
>>> > [7,] 1 NA<br>>>> > [8,] 1 NA<br>>>> > [9,] 1 NA<br>>>> > [10,] 1 NA<br>>>> > First 10 rows of 999999 printed.<br>>>> ><br>
>>> > If I repeat the same example for a numeric attribute and use some math<br>>>> > aggregation functions such as sum, mean, etc., then the number of rows<br>>>> > returned is correct, it is indeed equal to the number of students.<br>
>>> ><br>>>> > I was wondering if the problem is with NA's or with the use of paste<br>>>> > as the aggregation function. I can alternatively use RMySQL with MySQL<br>>>> > to concatenate those strings but I would like to use data.table if<br>
>>> > possible.<br>>>><br>>>> What if you try this (using my `dt` example from above):<br>>>><br>>>> R> dt[, paste(coursecode, collapse=","), by=student_id]<br>
>>> student_id V1<br>>>> [1,] 1 NA,NA,NA,101,102<br>>>> [2,] 2 101,102,103<br>>>><br>>>> Note that each element in the $V1 column is a character vector of<br>
>>> length 1 and not individual course codes.<br>>>><br>>>> Without using the `collapse` argument to your call to paste, you just<br>>>> get a character vector which is the same length as you passed in, eg:<br>
>>><br>>>> R> paste(c('A', 'B', NA, 'C'))<br>>>> [1] "A" "B" "NA" "C"<br>>>><br>>>> vs.<br>>>><br>
>>> R> paste(c('A', 'B', NA, 'C'), collapse=",")<br>>>> [1] "A,B,NA,C"<br>>>><br>>>> HTH,<br>>>><br>>>> -steve<br>>>><br>
>>> --<br>>>> Steve Lianoglou<br>>>> Graduate Student: Computational Systems Biology<br>>>> | Memorial Sloan-Kettering Cancer Center<br>>>> | Weill Medical College of Cornell University<br>
>>> Contact Info: <a href="http://cbio.mskcc.org/~lianos/contact" target="_blank">http://cbio.mskcc.org/~lianos/contact</a><br>>><br>><br>><br><br></p></div></div><p class="MsoNormal">--</p><div><div>
<p class="MsoNormal">Steve Lianoglou<br>Graduate Student: Computational Systems Biology<br> | Memorial Sloan-Kettering Cancer Center<br> | Weill Medical College of Cornell University<br>Contact Info: <a href="http://cbio.mskcc.org/~lianos/contact" target="_blank">http://cbio.mskcc.org/~lianos/contact</a></p>
</div></div></div><p class="MsoNormal"> </p></div></div></div></div></div></div></div><p class="MsoNormal"> </p></div></div></div></div></div></blockquote></div><br></div>