[datatable-help] using paste function while grouping gives strange results

Matthew Dowle mdowle at mdowle.plus.com
Sat May 7 00:10:45 CEST 2011


Steve H,

Interesting thread. To add fuel to the fire, have you seen R Inferno?
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Matthew


On Fri, 2011-05-06 at 15:26 -0400, Joseph Voelkel wrote:
> Steve H, 
> 
>  
> 
> As an R user, I sometimes make fundamental mistakes (like forgetting to
> use collapse with the paste function when I want to collapse).
> 
>  
> 
> However, R is a powerful language. It assumes the user knows what he
> or she is doing unless something is almost certainly wrong (Steve L
> provided some examples. This seems like the 80-90% you mentioned, but
> it’s probably more in the 95%-99% range.) In my opinion, it is
> unrealistic for you to make what are really programming mistakes on
> your part (for what you INTENDED—if you INTENDED something else it
> would not be a mistake) and then expect the software to be able to
> read your INTENT. 
> 
>  
> 
> I am not a great programmer, but having worked with software that
> prints out too many warnings—or worse, that will not let you do some
> things because the programmers decided a user would be unlikely to
> want to do this—I prefer R’s approach.
> 
>  
> 
> Regarding the recycling note recently posted: yes, that may be a nice
> option. (But will you need to have a third option: “don’t
> print out recycling warnings for vectors of length 1”? That’s usually
> done intentionally.)
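
A minimal sketch of the "third option" described above, assuming a made-up pair of helpers named suspicious_recycling() and add_checked(). Neither exists in base R or data.table; they only illustrate warning on recycling while staying quiet for length-1 vectors.

```r
# Hypothetical helpers, not part of base R or data.table.

# TRUE when x and y would be recycled against each other in a way that
# is probably unintended: the lengths differ and neither is 1.
suspicious_recycling <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  nx != ny && nx != 1L && ny != 1L
}

# Wrapper around `+` that warns only in the suspicious case.
add_checked <- function(x, y) {
  if (suspicious_recycling(x, y)) {
    warning(sprintf("recycling lengths %d and %d", length(x), length(y)),
            call. = FALSE)
  }
  # Base R's own "not a multiple" warning is muffled to avoid a duplicate.
  suppressWarnings(x + y)
}

add_checked(1:6, 1L)    # length 1 is recycled silently
```

With these definitions, add_checked(1:3, 1:6) would warn even though base R stays silent there.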
> 
>  
> 
> Regards,
> 
>  
> 
> Joe V.
> 
> From: datatable-help-bounces at r-forge.wu-wien.ac.at
> [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of
> Steve Harman
> Sent: Friday, May 06, 2011 2:05 PM
> To: Steve Lianoglou
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] using paste function while grouping
> gives strange results
> 
> 
>  
> 
> Steve,
> 
> 
>  
> 
> 
> These are good examples of confusing statements. 
> 
> In some cases, people might prefer to use them intentionally for
> certain purposes, 
> 
> 
> (even in that case, it would detract from the readability or
> maintainability of programs).
> 
> 
> On the other side of the coin, they are masking program errors.
> 
> 
> It is a mistake that R overlooked such usability issues (i.e.,
> programmer usability).
> 
> 
> And, two wrongs will not make a right.
> 
> 
>  
> 
> I wouldn't go so far as to say that R should have been a typed
> language, but I do strongly believe that R libraries can be made more
> user or developer friendly (still using the command line).
> 
> In places where you suspect that, with 80-90% probability, the user
> or programmer might be doing something unexpected, just issue a
> warning.
> 
> 
>  
> 
> 
>  
> 
> On Fri, May 6, 2011 at 10:48 AM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
> 
> Hi Steve,
> 
> As (another :-) aside -- make sure you use "reply-all" when replying
> to messages from this (and pretty much all other R-related) mailing
> lists, otherwise your mail goes straight to the person, and not back
> to the list.
> 
> Other comments in line:
> 
> On Fri, May 6, 2011 at 10:29 AM, Steve Harman <stvharman at gmail.com>
> wrote:
> > Steve, this works.
> 
> Great! Glad to hear it.
> 
> 
> > However, this discussion shows that we need some error or
> > at least warning messages in this case.
> 
> 
> For this particular case, I'd respectfully have to disagree.
> 
> 
> > It is important to pay attention to user (in this case programmer)
> > experience and facilitate recovery from
> > mistakes by providing the user with meaningful and timely messages.
> > thanks for all your help,
> 
> 
> I would argue that what happened to you is actually "expected
> behavior."
> 
> You'll find that in many contexts, if "R" thinks it can figure out
> what you intended to do with two vectors that aren't the same length,
> it will try to be smart and do it.
> 
> For instance, this is similar to what happened to you -- notice how
> TRUE is recycled to be as long as the first column here:
> 
> R> data.frame(id=letters[1:5], huh=TRUE)
>  id  huh
> 1  a TRUE
> 2  b TRUE
> 3  c TRUE
> 4  d TRUE
> 5  e TRUE
> 
> Perhaps more strangely, but still "R-correct" (note no warning):
> 
> R> 1:3 + 1:6 ## == c(1:3,1:3) + 1:6
> [1] 2 4 6 5 7 9
> 
> R thinks this is strange, but still does "something" for you (but
> gives a warning since the 2nd vector's length isn't a multiple of the
> first's):
> 
> R> 1:3 + 1:7
> [1] 2 4 6 5 7 9 8
> Warning message:
> In 1:3 + 1:7 :
>  longer object length is not a multiple of shorter object length
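
As a rough sketch of the recycling rule shown in the two examples above: the shorter vector behaves as if it had been stretched with rep_len() to the length of the longer one. The variable names here are only for illustration.

```r
x <- 1:3

# Length 6 is a multiple of 3, so recycling is silent.
even_fit <- x + 1:6
same_as_rep <- identical(even_fit, rep_len(x, 6) + 1:6)

# Length 7 is not a multiple of 3: same mechanism, plus a warning.
# The handler records the warning text and then muffles the warning.
warn_text <- NULL
uneven_fit <- withCallingHandlers(
  x + 1:7,
  warning = function(w) {
    warn_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(even_fit)
print(same_as_rep)
print(uneven_fit)
print(warn_text)
```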
> 
> Often times I actually take advantage of the situation that happened
> to you to expand a result into several rows (instead of just into 1)
> when doing split/summarize/merge stuff with data.table's [,
> by='something'] mojo.
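
A small sketch of that difference, reusing the toy `dt` quoted later in this thread and assuming the data.table package is installed. The column name `codes` is made up for the example.

```r
library(data.table)

# Same toy data as the `dt` example quoted later in this thread.
dt <- data.table(
  coursecode = c(NA, NA, NA, 101, 102, 101, 102, 103),
  student_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  key = "student_id"
)

# j returns a vector as long as the group: one output row per input row.
expanded <- dt[, paste(coursecode), by = student_id]

# j returns a single string per group: one output row per student.
collapsed <- dt[, list(codes = paste(coursecode, collapse = ",")),
                by = student_id]

print(nrow(expanded))
print(nrow(collapsed))
print(collapsed$codes)
```

The number of rows in each group's result is decided by the length of what j returns, not by the grouping itself.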
> 
> My 2 cents,
> 
> -steve
> 
> 
> > On Fri, May 6, 2011 at 9:44 AM, Steve Harman <stvharman at gmail.com>
> wrote:
> >>
> >> Thanks, I'll try it today and let you know.
> >>
> >> On Fri, May 6, 2011 at 12:22 AM, Steve Lianoglou
> >> <mailinglist.honeypot at gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> As an aside -- in the future, please provide some data in a form
> >>> that we can just copy and paste from your email into an R session
> >>> so that we can get a working object up quickly.
> >>>
> >>> For example:
> >>>
> >>> R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),
> >>>  student_id=c(1, 1, 1, 1, 1, 2, 2, 2),
> >>>  key='student_id')
> >>>
> >>> On Thu, May 5, 2011 at 10:54 PM, Steve Harman
> <stvharman at gmail.com>
> >>> wrote:
> >>> > Hello
> >>> >
> >>> > I have a data table called dt in which each student can have
> >>> > multiple records (created using data.table)
> >>> >
> >>> > coursecode    student_id
> >>> > ----------------    ----------------
> >>> > NA               1
> >>> > NA               1
> >>> > NA               1
> >>> > ....                1
> >>> > ....                1
> >>> > NA                2
> >>> > 101               2
> >>> > 102               2
> >>> > NA                2
> >>> > 103                2
> >>> >
> >>> > I am trying to group by student id and concatenate the coursecode
> >>> > strings in student records. This string is mostly NA but it can
> >>> > also be a real course code (because of messy real life data,
> >>> > coursecode was not always entered).
> >>> > There are 999999 records.
> >>> >
> >>> > So, I thought I would get results like
> >>> >
> >>> > 1 NA NA NA .....
> >>> > 2 NA 101 102 NA 123 ....
> >>>
> >>> What type of object are you expecting that result to be?
> >>>
> >>> > However, as seen below, it  brings me a result with 999999 rows
> >>> > and it fails to concatenate the coursecode's.
> >>> >
> >>> >>  codes <- dt[,paste(coursecode),by=student_id]
> >>> >> codes
> >>> >      student_id V1
> >>> >  [1,]          1 NA
> >>> >  [2,]          1 NA
> >>> >  [3,]          1 NA
> >>> >  [4,]          1 NA
> >>> >  [5,]          1 NA
> >>> >  [6,]          1 NA
> >>> >  [7,]          1 NA
> >>> >  [8,]          1 NA
> >>> >  [9,]          1 NA
> >>> > [10,]          1 NA
> >>> > First 10 rows of 999999 printed.
> >>> >
> >>> > If I repeat the same example for a numeric attribute and use some
> >>> > math aggregation functions such as sum, mean, etc., then the number
> >>> > of rows returned is correct; it is indeed equal to the number of
> >>> > students.
> >>> >
> >>> > I was wondering if the problem is with NA's or with the use of paste
> >>> > as the aggregation function. I can alternatively use RMySQL with
> >>> > MySQL to concatenate those strings but I would like to use
> >>> > data.table if possible.
> >>>
> >>> What if you try this (using my `dt` example from above):
> >>>
> >>> R> dt[, paste(coursecode, collapse=","), by=student_id]
> >>>     student_id               V1
> >>> [1,]          1 NA,NA,NA,101,102
> >>> [2,]          2      101,102,103
> >>>
> >>> Note that each element in the $V1 column is a character vector of
> >>> length 1 and not individual course codes.
> >>>
> >>> Without using the `collapse` argument in your call to paste, you
> >>> just get a character vector which is the same length as the one you
> >>> passed in, e.g.:
> >>>
> >>> R> paste(c('A', 'B', NA, 'C'))
> >>> [1] "A"  "B"  "NA" "C"
> >>>
> >>> vs.
> >>>
> >>> R> paste(c('A', 'B', NA, 'C'), collapse=",")
> >>> [1] "A,B,NA,C"
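
Both results above still carry the literal string "NA", because paste() converts missing values to text. If that is not wanted, one option (a sketch, not something proposed in the thread) is to drop missing values before collapsing:

```r
codes <- c("A", "B", NA, "C")

# Without collapse: same length as the input; NA becomes the string "NA".
pieces <- paste(codes)

# With collapse: a single string, still containing "NA".
joined <- paste(codes, collapse = ",")

# Dropping missing values first keeps "NA" out of the result.
joined_clean <- paste(codes[!is.na(codes)], collapse = ",")

print(length(pieces))
print(joined)
print(joined_clean)
```

The same expression could be used as the j argument in a grouped data.table call.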
> >>>
> >>> HTH,
> >>>
> >>> -steve
> >>>
> >>> --
> >>> Steve Lianoglou
> >>> Graduate Student: Computational Systems Biology
> >>>  | Memorial Sloan-Kettering Cancer Center
> >>>  | Weill Medical College of Cornell University
> >>> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >>
> >
> >
> 
> 
> 
> 
> 
> --
> 
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
> 
> 
>  
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help



