<div>Steve,</div><div><br></div>These are good examples of confusing statements. <div>In some cases, people might use them intentionally for certain purposes </div><div>(even then, they detract from the readability and maintainability of programs).</div>
<div>On the other hand, they mask program errors.</div><div>It is a mistake that R overlooked such usability issues (i.e., programmer usability).</div><div>And two wrongs do not make a right.</div><div>
<br><div>I wouldn't go as far as saying that R should have been</div><div>a typed language, but I do strongly believe that R libraries can be made</div><div>more user- or developer-friendly (still using the command line).</div>
<div>Wherever you suspect, with 80-90% probability,</div><div>that the user or programmer is doing something unintended,</div><div>just issue a warning.</div><div><br></div><div>
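<div>A rough sketch of the kind of check I mean, for the vector-addition case only. The helper name <code>strict_add</code> is made up for illustration; it is not part of base R or data.table, and the rule it applies (stay quiet only for equal lengths or a length-1 argument) is just my assumption about what is "probably intended".</div>

```r
# Hypothetical helper (not in base R or data.table): element-wise addition
# that warns whenever a vector longer than 1 would be silently recycled.
strict_add <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Recycling a single value is almost always intended, so allow it quietly.
  if (nx != ny && nx != 1L && ny != 1L) {
    warning(sprintf("lengths differ (%d vs %d); the shorter argument is recycled",
                    nx, ny),
            call. = FALSE)
  }
  # The usual R recycling rules still produce the result.
  x + y
}

strict_add(1:3, 4:6)  # equal lengths: no warning
strict_add(1:3, 10)   # length-1 argument: no warning
strict_add(1:3, 1:6)  # warns about recycling, then returns the recycled sum
```

<div>The result is unchanged from plain <code>+</code>; only the message is added, so code that relies on recycling keeps working.</div>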
<br><div class="gmail_quote">On Fri, May 6, 2011 at 10:48 AM, Steve Lianoglou <span dir="ltr"><<a href="mailto:mailinglist.honeypot@gmail.com">mailinglist.honeypot@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi Steve,<br>
<br>
As (another :-) aside -- make sure you use "reply-all" when replying<br>
to messages from this (and pretty much all other R-related) mailing<br>
lists, otherwise your mail goes straight to the person, and not back<br>
to the list.<br>
<br>
Other comments in line:<br>
<br>
On Fri, May 6, 2011 at 10:29 AM, Steve Harman <<a href="mailto:stvharman@gmail.com">stvharman@gmail.com</a>> wrote:<br>
> Steve, this works.<br>
<br>
Great! Glad to hear it.<br>
<div class="im"><br>
> However, this discussion shows that we need some error or<br>
> at least warning messages in this case.<br>
<br>
</div>For this particular case, I'd respectfully have to disagree.<br>
<div class="im"><br>
> It is important to pay attention to user (in this case programmer)<br>
> experience and facilitate recovery from<br>
> mistakes by providing the user with meaningful and timely messages.<br>
> thanks for all your help,<br>
<br>
</div>I would argue that what happened to you is actually "expected behavior."<br>
<br>
You'll find that in many contexts, if "R" thinks it can figure out<br>
what you intended to do with two vectors that aren't the same length,<br>
it will try to be smart and do it.<br>
<br>
For instance, this is similar to what happened to you -- notice how<br>
TRUE is recycled to be as long as the first column here:<br>
<br>
R> data.frame(id=letters[1:5], huh=TRUE)<br>
id huh<br>
1 a TRUE<br>
2 b TRUE<br>
3 c TRUE<br>
4 d TRUE<br>
5 e TRUE<br>
<br>
Perhaps more strangely, but still "R-correct" (note no warning):<br>
<br>
R> 1:3 + 1:6 ## == c(1:3,1:3) + 1:6<br>
[1] 2 4 6 5 7 9<br>
<br>
R thinks this is strange, but still does "something" for you (but<br>
gives a warning, since the length of the 2nd vector isn't a multiple<br>
of the length of the first):<br>
<br>
R> 1:3 + 1:7<br>
[1] 2 4 6 5 7 9 8<br>
Warning message:<br>
In 1:3 + 1:7 :<br>
longer object length is not a multiple of shorter object length<br>
<br>
Often times I actually take advantage of the situation that happened<br>
to you to expand a result into several rows (instead of just into 1)<br>
when doing split/summarize/merge stuff with data.table's [,<br>
by='something'] mojo.<br>
<br>
My 2 cents,<br>
<font color="#888888"><br>
-steve<br>
</font><div><div></div><div class="h5"><br>
> On Fri, May 6, 2011 at 9:44 AM, Steve Harman <<a href="mailto:stvharman@gmail.com">stvharman@gmail.com</a>> wrote:<br>
>><br>
>> Thanks, I'll try it today and let you know.<br>
>><br>
>> On Fri, May 6, 2011 at 12:22 AM, Steve Lianoglou<br>
>> <<a href="mailto:mailinglist.honeypot@gmail.com">mailinglist.honeypot@gmail.com</a>> wrote:<br>
>>><br>
>>> Hi,<br>
>>><br>
>>> As an aside -- in the future, please provide some data in a form that<br>
>>> we can just copy and paste from your email into an R session so that<br>
>>> we can get a working object up quickly.<br>
>>><br>
>>> For example:<br>
>>><br>
>>> R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),<br>
>>> student_id=c(1, 1, 1, 1, 1, 2, 2, 2),<br>
>>> key='student_id')<br>
>>><br>
>>> On Thu, May 5, 2011 at 10:54 PM, Steve Harman <<a href="mailto:stvharman@gmail.com">stvharman@gmail.com</a>><br>
>>> wrote:<br>
>>> > Hello<br>
>>> ><br>
>>> > I have a data table called dt in which each student can have multiple<br>
>>> > records (created using data.table)<br>
>>> ><br>
>>> > coursecode student_id<br>
>>> > ---------------- ----------------<br>
>>> > NA 1<br>
>>> > NA 1<br>
>>> > NA 1<br>
>>> > .... 1<br>
>>> > .... 1<br>
>>> > NA 2<br>
>>> > 101 2<br>
>>> > 102 2<br>
>>> > NA 2<br>
>>> > 103 2<br>
>>> ><br>
>>> > I am trying to group by student id and concatenate the coursecode<br>
>>> > strings in<br>
>>> > student records. This string is mostly NA but it can also be real<br>
>>> > course code<br>
>>> > (because of messy real life data coursecode was not always entered)<br>
>>> > There are 999999 records.<br>
>>> ><br>
>>> > So, I thought I would get results like<br>
>>> ><br>
>>> > 1 NA NA NA .....<br>
>>> > 2 NA 101 102 NA 123 ....<br>
>>><br>
>>> What type of object are you expecting that result to be?<br>
>>><br>
>>> > However, as seen below, it brings me a result with 999999 rows<br>
>>> > and it fails to concatenate the coursecode's.<br>
>>> ><br>
>>> >> codes <- dt[,paste(coursecode),by=student_id]<br>
>>> >> codes<br>
>>> > student_id V1<br>
>>> > [1,] 1 NA<br>
>>> > [2,] 1 NA<br>
>>> > [3,] 1 NA<br>
>>> > [4,] 1 NA<br>
>>> > [5,] 1 NA<br>
>>> > [6,] 1 NA<br>
>>> > [7,] 1 NA<br>
>>> > [8,] 1 NA<br>
>>> > [9,] 1 NA<br>
>>> > [10,] 1 NA<br>
>>> > First 10 rows of 999999 printed.<br>
>>> ><br>
>>> > If I repeat the same example for a numeric attribute and use some math<br>
>>> > aggregation functions such as sum, mean, etc., then the number of rows<br>
>>> > returned is correct, it is indeed equal to the number of students.<br>
>>> ><br>
>>> > I was wondering if the problem is with NA's or with the use of paste<br>
>>> > as the aggregation function. I can alternatively use RMySQL with MySQL<br>
>>> > to concatenate those strings but I would like to use data.table if<br>
>>> > possible.<br>
>>><br>
>>> What if you try this (using my `dt` example from above):<br>
>>><br>
>>> R> dt[, paste(coursecode, collapse=","), by=student_id]<br>
>>> student_id V1<br>
>>> [1,] 1 NA,NA,NA,101,102<br>
>>> [2,] 2 101,102,103<br>
>>><br>
>>> Note that each element in the $V1 column is a character vector of<br>
>>> length 1 and not individual course codes.<br>
>>><br>
>>> Without using the `collapse` argument to your call to paste, you just<br>
>>> get a character vector which is the same length as you passed in, eg:<br>
>>><br>
>>> R> paste(c('A', 'B', NA, 'C'))<br>
>>> [1] "A" "B" "NA" "C"<br>
>>><br>
>>> vs.<br>
>>><br>
>>> R> paste(c('A', 'B', NA, 'C'), collapse=",")<br>
>>> [1] "A,B,NA,C"<br>
>>><br>
>>> HTH,<br>
>>><br>
>>> -steve<br>
>>><br>
>>> --<br>
>>> Steve Lianoglou<br>
>>> Graduate Student: Computational Systems Biology<br>
>>> | Memorial Sloan-Kettering Cancer Center<br>
>>> | Weill Medical College of Cornell University<br>
>>> Contact Info: <a href="http://cbio.mskcc.org/~lianos/contact" target="_blank">http://cbio.mskcc.org/~lianos/contact</a><br>
>><br>
><br>
><br>
<br>
<br>
<br>
</div></div>--<br>
<div><div></div><div class="h5">Steve Lianoglou<br>
Graduate Student: Computational Systems Biology<br>
| Memorial Sloan-Kettering Cancer Center<br>
| Weill Medical College of Cornell University<br>
Contact Info: <a href="http://cbio.mskcc.org/~lianos/contact" target="_blank">http://cbio.mskcc.org/~lianos/contact</a><br>
</div></div></blockquote></div><br></div></div>