[datatable-help] using paste function while grouping gives strange results

Bacou, Melanie mel at mbacou.com
Fri May 6 21:09:01 CEST 2011


Hi Steve*2,

Interesting thread. R's smart recycling is something very odd indeed for
folks who come from a background in data management or social
sciences/economics (the SAS, STATA, SPSS, SQL types). Once you get your mind
around it, you wonder how you ever lived without it, but overall I have found
the recycling rules very hard to teach and a major source of potential errors.

I wonder if it would make sense to have a global option to prevent/turn off
that behavior. I agree with Steve that having R be a little more verbose when
it does in fact recycle vectors would be helpful as well.
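
To make the idea a bit more concrete: I am not aware of a built-in switch for
this, so the following is only a rough sketch of a user-level guard.
`strict_lengths` is a hypothetical name made up for illustration; it is not
part of R or data.table.

```r
# Hypothetical guard: refuse to combine vectors of different lengths
# unless one of them is a scalar (length 1).
strict_lengths <- function(x, y) {
  if (length(x) != length(y) && length(x) != 1L && length(y) != 1L) {
    stop("lengths differ: ", length(x), " vs ", length(y))
  }
  invisible(TRUE)
}

strict_lengths(1:6, 1:6)   # passes, same length
strict_lengths(1:6, 2)     # passes, scalar recycling is still allowed

# 1:3 against 1:6 would be silently recycled by `+`; here it is an error.
res <- tryCatch(strict_lengths(1:3, 1:6),
                error = function(e) conditionMessage(e))
res
```

One would call such a check before the operation that might recycle; it does
not change how R itself behaves.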

Apologies if this is not the right list for generic comments, but in case
other data.table users have an opinion here...

--Mel.

 


-----Original Message-----
From: datatable-help-bounces at r-forge.wu-wien.ac.at
[mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Steve
Lianoglou
Sent: Friday, May 06, 2011 10:49 AM
To: Steve Harman
Cc: datatable-help at r-forge.wu-wien.ac.at
Subject: Re: [datatable-help] using paste function while grouping gives
strange results

Hi Steve,

As (another :-) aside -- make sure you use "reply-all" when replying
to messages from this (and pretty much all other R-related) mailing
lists, otherwise your mail goes straight to the person, and not back
to the list.

Other comments in line:

On Fri, May 6, 2011 at 10:29 AM, Steve Harman <stvharman at gmail.com> wrote:
> Steve, this works.

Great! Glad to hear it.

> However, this discussion shows that we need some error or
> at least warning messages in this case.

For this particular case, I'd respectfully have to disagree.

> It is important to pay attention to user (in this case programmer)
> experience and facilitate recovery from
> mistakes by providing the user with meaningful and timely messages.
> thanks for all your help,

I would argue that what happened to you is actually "expected behavior."

You'll find that in many contexts, if "R" thinks it can figure out
what you intended to do with two vectors that aren't the same length,
it will try to be smart and do it.

For instance, this is similar to what happened to you -- notice how
TRUE is recycled to be as long as the first column here:

R> data.frame(id=letters[1:5], huh=TRUE)
  id  huh
1  a TRUE
2  b TRUE
3  c TRUE
4  d TRUE
5  e TRUE

Perhaps more strangely, but still "R-correct" (note no warning):

R> 1:3 + 1:6 ## == c(1:3,1:3) + 1:6
[1] 2 4 6 5 7 9

R thinks this is strange, but still does "something" for you (but gives
a warning, since the length of the longer vector isn't a multiple of the
length of the shorter one):

R> 1:3 + 1:7
[1] 2 4 6 5 7 9 8
Warning message:
In 1:3 + 1:7 :
  longer object length is not a multiple of shorter object length
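
For what it's worth, the recycling in the vector examples above can be
spelled out by hand with base R's rep_len(). This is only a rough sketch of
the idea for these simple cases; `recycled_add` is a name made up for
illustration, not something that exists in R or data.table.

```r
# Hypothetical helper: add two vectors after recycling both to the
# longer length explicitly, roughly what `+` does behind the scenes.
recycled_add <- function(x, y) {
  n <- max(length(x), length(y))
  rep_len(x, n) + rep_len(y, n)
}

# Same values as the built-in operator when the lengths divide evenly.
even_ok <- identical(recycled_add(1:3, 1:6), 1:3 + 1:6)

# Same values in the uneven case too; the difference is that `+` warns
# about it, while this helper stays silent.
uneven_ok <- identical(recycled_add(1:3, 1:7), suppressWarnings(1:3 + 1:7))

c(even_ok, uneven_ok)
```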

Oftentimes I actually take advantage of the behavior you ran into to
expand a result into several rows per group (instead of just 1) when
doing split/summarize/merge stuff with data.table's
[, by='something'] mojo.
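
As a small illustration of that, here is a sketch using the example table
from earlier in the thread. It assumes the data.table package is installed;
the variable names `long` and `wide` are just made up for this example.

```r
library(data.table)

dt <- data.table(coursecode = c(NA, NA, NA, 101, 102, 101, 102, 103),
                 student_id = c(1, 1, 1, 1, 1, 2, 2, 2),
                 key = "student_id")

# paste() without `collapse` returns one string per input element, so
# each group expands into as many rows as it has records.
long <- dt[, paste(coursecode), by = student_id]
nrow(long)

# With `collapse`, paste() returns a single string per group, so the
# result has one row per student.
wide <- dt[, paste(coursecode, collapse = ","), by = student_id]
nrow(wide)
```

Whether the expanded or the collapsed form is the "right" one depends on
what you want to do with the result afterwards.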

My 2 cents,

-steve

> On Fri, May 6, 2011 at 9:44 AM, Steve Harman <stvharman at gmail.com> wrote:
>>
>> Thanks, I'll try it today and let you know.
>>
>> On Fri, May 6, 2011 at 12:22 AM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> As an aside -- in the future, please provide some data in a form that
>>> we can just copy and paste from your email into an R session so that
>>> we can get a working object up quickly.
>>>
>>> For example:
>>>
>>> R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),
>>>  student_id=c(1, 1, 1, 1, 1, 2, 2, 2),
>>>  key='student_id')
>>>
>>> On Thu, May 5, 2011 at 10:54 PM, Steve Harman <stvharman at gmail.com>
>>> wrote:
>>> > Hello
>>> >
>>> > I have a data table called dt in which each student can have multiple
>>> > records (created using data.table)
>>> >
>>> > coursecode    student_id
>>> > ----------------    ----------------
>>> > NA               1
>>> > NA               1
>>> > NA               1
>>> > ....                1
>>> > ....                1
>>> > NA                2
>>> > 101               2
>>> > 102               2
>>> > NA                2
>>> > 103                2
>>> >
>>> > I am trying to group by student id and concatenate the coursecode
>>> > strings in
>>> > student records. This string is mostly NA but it can also be real
>>> > course code
>>> > (because of messy real life data coursecode was not always entered)
>>> > There are 999999 records.
>>> >
>>> > So, I thought I would get results like
>>> >
>>> > 1 NA NA NA .....
>>> > 2 NA 101 102 NA 123 ....
>>>
>>> What type of object are you expecting that result to be?
>>>
>>> > However, as seen below, it  brings me a result with 999999 rows
>>> > and it fails to concatenate the coursecode's.
>>> >
>>> >>  codes <- dt[,paste(coursecode),by=student_id]
>>> >> codes
>>> >      student_id V1
>>> >  [1,]          1 NA
>>> >  [2,]          1 NA
>>> >  [3,]          1 NA
>>> >  [4,]          1 NA
>>> >  [5,]          1 NA
>>> >  [6,]          1 NA
>>> >  [7,]          1 NA
>>> >  [8,]          1 NA
>>> >  [9,]          1 NA
>>> > [10,]          1 NA
>>> > First 10 rows of 999999 printed.
>>> >
>>> > If I repeat the same example for a numeric attribute and use some math
>>> > aggregation functions such as sum, mean, etc., then the number of rows
>>> > returned is correct, it is indeed equal to the number of students.
>>> >
>>> > I was wondering if the problem is with NA's or with the use of paste
>>> > as the aggregation function. I can alternatively use RMySQL with MySQL
>>> > to concatenate those strings but I would like to use data.table if
>>> > possible.
>>>
>>> What if you try this (using my `dt` example from above):
>>>
>>> R> dt[, paste(coursecode, collapse=","), by=student_id]
>>>     student_id               V1
>>> [1,]          1 NA,NA,NA,101,102
>>> [2,]          2      101,102,103
>>>
>>> Note that each element in the $V1 column is a character vector of
>>> length 1 and not individual course codes.
>>>
>>> Without using the `collapse` argument to your call to paste, you just
>>> get a character vector which is the same length as you passed in, eg:
>>>
>>> R> paste(c('A', 'B', NA, 'C'))
>>> [1] "A"  "B"  "NA" "C"
>>>
>>> vs.
>>>
>>> R> paste(c('A', 'B', NA, 'C'), collapse=",")
>>> [1] "A,B,NA,C"
>>>
>>> HTH,
>>>
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
_______________________________________________
datatable-help mailing list
datatable-help at lists.r-forge.r-project.org
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


