[datatable-help] using paste function while grouping gives strange results
Matthew Dowle
mdowle at mdowle.plus.com
Sat May 7 00:10:45 CEST 2011
Steve H,
Interesting thread. To add fuel to the fire, have you seen R Inferno?
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
Matthew
On Fri, 2011-05-06 at 15:26 -0400, Joseph Voelkel wrote:
> Steve H,
>
>
>
> As an R user, I sometimes make fundamental mistakes (like forgetting to
> use collapse with the paste function when I want to collapse).
>
>
>
> However, R is a powerful language. It assumes the user knows what he
> or she is doing unless something is almost certainly wrong (Steve L
> provided some examples. This seems like the 80-90% you mentioned, but
> it’s probably more in the 95%-99% range.) In my opinion, it is
> unrealistic for you to make what are really programming mistakes on
> your part (for what you INTENDED—if you INTENDED something else it
> would not be a mistake) and then expect the software to be able to
> read your INTENT.
>
>
>
> I am not a great programmer, but having worked with software that
> prints out too many warnings—or worse, that will not let you do some
> things because the programmers decided a user would be unlikely to
> want to do this—I prefer R’s approach.
>
>
>
> Regarding the recycling note recently posted: yes, that may be a nice
> option. (But will you need a third option: “don’t print out recycling
> warnings for vectors of length 1”? That’s usually done intentionally.)
>
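A rough sketch of the "third option" Joe describes, in case it helps the discussion. The helper name `warn_if_recycled` is hypothetical (it is not part of base R or data.table); it only illustrates warning on mismatched lengths while staying quiet for length-1 vectors:

```r
# Hypothetical helper, not an existing R or data.table function.
# Warns when two vectors of different lengths would be recycled,
# but stays silent when either one has length 1 (usually intentional).
warn_if_recycled <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  if (nx != ny && nx != 1L && ny != 1L) {
    warning("recycling vectors of lengths ", nx, " and ", ny)
  }
  invisible(NULL)
}

# Helper for the demonstration: did evaluating `expr` raise a warning?
raised_warning <- function(expr) {
  tryCatch({ expr; FALSE }, warning = function(w) TRUE)
}

scalar_case   <- raised_warning(warn_if_recycled(letters[1:5], TRUE))
multiple_case <- raised_warning(warn_if_recycled(1:3, 1:6))
```

Here `scalar_case` is FALSE and `multiple_case` is TRUE, which is stricter than base R: base R stays silent for 1:3 + 1:6.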
>
>
> Regards,
>
>
>
> Joe V.
>
> From: datatable-help-bounces at r-forge.wu-wien.ac.at
> [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of
> Steve Harman
> Sent: Friday, May 06, 2011 2:05 PM
> To: Steve Lianoglou
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] using paste function while grouping
> gives strange results
>
>
>
>
> Steve,
>
> These are good examples of confusing statements.
> In some cases, people might use them intentionally for certain
> purposes (though even then they would detract from the readability or
> maintainability of programs).
> On the other side of the coin, they mask program errors.
> It is a mistake that R overlooked such usability issues (i.e.,
> programmer usability).
> And two wrongs do not make a right.
>
> I wouldn't go so far as to say that R should have been a typed
> language, but I do strongly believe that R libraries can be made more
> user- or developer-friendly (still using the command line).
> In places where you suspect, with 80-90% probability, that the user
> or programmer is doing something unintended, just issue a warning.
>
> On Fri, May 6, 2011 at 10:48 AM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>
> Hi Steve,
>
> As (another :-) aside -- make sure you use "reply-all" when replying
> to messages from this (and pretty much all other R-related) mailing
> lists, otherwise your mail goes straight to the person, and not back
> to the list.
>
> Other comments in line:
>
> On Fri, May 6, 2011 at 10:29 AM, Steve Harman <stvharman at gmail.com> wrote:
> > Steve, this works.
>
> Great! Glad to hear it.
>
>
> > However, this discussion shows that we need some error or
> > at least warning messages in this case.
>
>
> For this particular case, I'd respectfully have to disagree.
>
>
> > It is important to pay attention to user (in this case programmer)
> > experience and facilitate recovery from
> > mistakes by providing the user with meaningful and timely messages.
> > thanks for all your help,
>
>
> I would argue that what happened to you is actually "expected
> behavior."
>
> You'll find that in many contexts, if "R" thinks it can figure out
> what you intended to do with two vectors that aren't the same length,
> it will try to be smart and do it.
>
> For instance, this is similar to what happened to you -- notice how
> TRUE is recycled to be as long as the first column here:
>
> R> data.frame(id=letters[1:5], huh=TRUE)
>   id  huh
> 1  a TRUE
> 2  b TRUE
> 3  c TRUE
> 4  d TRUE
> 5  e TRUE
>
> Perhaps more strangely, but still "R-correct" (note no warning):
>
> R> 1:3 + 1:6 ## == c(1:3,1:3) + 1:6
> [1] 2 4 6 5 7 9
>
> R thinks this is strange, but still does "something" for you (but
> gives a warning since the length of the 2nd vector isn't a multiple
> of the length of the first):
>
> R> 1:3 + 1:7
> [1] 2 4 6 5 7 9 8
> Warning message:
> In 1:3 + 1:7 :
> longer object length is not a multiple of shorter object length
>
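The recycling behaviour in the examples above can be checked directly. This is a minimal base-R sketch; the variable names are only illustrative:

```r
short <- 1:3

# Length 6 is a multiple of 3: the short vector is repeated silently.
silent <- short + 1:6
same_as_rep <- identical(silent, rep(short, length.out = 6) + 1:6)

# Length 7 is not a multiple of 3: the same recycling happens,
# but R also signals a warning. We record it instead of printing it.
warned <- FALSE
partial <- withCallingHandlers(
  short + 1:7,
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
partial_same_as_rep <- identical(partial, rep(short, length.out = 7) + 1:7)
```

Both `identical()` checks come out TRUE, and `warned` is TRUE only because of the length-7 case.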
> Often times I actually take advantage of the situation that happened
> to you to expand a result into several rows (instead of just into 1)
> when doing split/summarize/merge stuff with data.table's [,
> by='something'] mojo.
>
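To make the "several rows instead of just 1" point concrete, here is a small sketch using the `dt` example from further down the thread. It assumes the data.table package is installed, and the names `one_row` and `many_rows` are just for illustration:

```r
library(data.table)

dt <- data.table(
  coursecode = c(NA, NA, NA, 101, 102, 101, 102, 103),
  student_id = c(1, 1, 1, 1, 1, 2, 2, 2),
  key = "student_id"
)

# j returns a single string per group: one row per student.
one_row <- dt[, paste(coursecode, collapse = ","), by = student_id]

# j returns a vector as long as the group: one row per input record,
# with the student_id repeated alongside each element.
many_rows <- dt[, paste(coursecode), by = student_id]

nrow(one_row)    # 2
nrow(many_rows)  # 8
```

So the number of rows in the result follows the length of what `j` returns for each group, which is why sum or mean give one row per student while a bare paste does not.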
> My 2 cents,
>
> -steve
>
>
> > On Fri, May 6, 2011 at 9:44 AM, Steve Harman <stvharman at gmail.com> wrote:
> >>
> >> Thanks, I'll try it today and let you know.
> >>
> >> On Fri, May 6, 2011 at 12:22 AM, Steve Lianoglou
> >> <mailinglist.honeypot at gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> As an aside -- in the future, please provide some data in a form that
> >>> we can just copy and paste from your email into an R session so that
> >>> we can get a working object up quickly.
> >>>
> >>> For example:
> >>>
> >>> R> library(data.table)
> >>> R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),
> >>>                     student_id=c(1, 1, 1, 1, 1, 2, 2, 2),
> >>>                     key='student_id')
> >>>
> >>> On Thu, May 5, 2011 at 10:54 PM, Steve Harman <stvharman at gmail.com> wrote:
> >>> > Hello
> >>> >
> >>> > I have a data table called dt in which each student can have multiple
> >>> > records (created using data.table)
> >>> >
> >>> > coursecode student_id
> >>> > ---------------- ----------------
> >>> > NA 1
> >>> > NA 1
> >>> > NA 1
> >>> > .... 1
> >>> > .... 1
> >>> > NA 2
> >>> > 101 2
> >>> > 102 2
> >>> > NA 2
> >>> > 103 2
> >>> >
> >>> > I am trying to group by student id and concatenate the coursecode
> >>> > strings in student records. This string is mostly NA but it can also
> >>> > be a real course code (because of messy real-life data, coursecode
> >>> > was not always entered). There are 999999 records.
> >>> >
> >>> > So, I thought I would get results like
> >>> >
> >>> > 1 NA NA NA .....
> >>> > 2 NA 101 102 NA 123 ....
> >>>
> >>> What type of object are you expecting that result to be?
> >>>
> >>> > However, as seen below, it brings me a result with 999999 rows
> >>> > and it fails to concatenate the coursecode's.
> >>> >
> >>> >> codes <- dt[,paste(coursecode),by=student_id]
> >>> >> codes
> >>> > student_id V1
> >>> > [1,] 1 NA
> >>> > [2,] 1 NA
> >>> > [3,] 1 NA
> >>> > [4,] 1 NA
> >>> > [5,] 1 NA
> >>> > [6,] 1 NA
> >>> > [7,] 1 NA
> >>> > [8,] 1 NA
> >>> > [9,] 1 NA
> >>> > [10,] 1 NA
> >>> > First 10 rows of 999999 printed.
> >>> >
> >>> > If I repeat the same example for a numeric attribute and use some math
> >>> > aggregation functions such as sum, mean, etc., then the number of rows
> >>> > returned is correct; it is indeed equal to the number of students.
> >>> >
> >>> > I was wondering if the problem is with NA's or with the use of paste
> >>> > as the aggregation function. I can alternatively use RMySQL with MySQL
> >>> > to concatenate those strings but I would like to use data.table if
> >>> > possible.
> >>>
> >>> What if you try this (using my `dt` example from above):
> >>>
> >>> R> dt[, paste(coursecode, collapse=","), by=student_id]
> >>> student_id V1
> >>> [1,] 1 NA,NA,NA,101,102
> >>> [2,] 2 101,102,103
> >>>
> >>> Note that each element in the $V1 column is a character vector of
> >>> length 1 and not individual course codes.
> >>>
> >>> Without using the `collapse` argument to your call to paste, you just
> >>> get a character vector which is the same length as you passed in, e.g.:
> >>>
> >>> R> paste(c('A', 'B', NA, 'C'))
> >>> [1] "A" "B" "NA" "C"
> >>>
> >>> vs.
> >>>
> >>> R> paste(c('A', 'B', NA, 'C'), collapse=",")
> >>> [1] "A,B,NA,C"
> >>>
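A small follow-up sketch on the paste examples above, in base R only. Dropping the NA entries before collapsing is an assumption about what one might want; it is not something the original poster asked for:

```r
codes <- c("A", "B", NA, "C")

# Without collapse: one string per element; NA becomes the string "NA".
per_element <- paste(codes)

# With collapse: everything is joined into a single string.
joined <- paste(codes, collapse = ",")

# Assumption: if the NA entries are unwanted, drop them before collapsing.
joined_no_na <- paste(codes[!is.na(codes)], collapse = ",")
```

Here `per_element` has length 4, while `joined` is "A,B,NA,C" and `joined_no_na` is "A,B,C", each of length 1.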
> >>> HTH,
> >>>
> >>> -steve
> >>>
> >>> --
> >>> Steve Lianoglou
> >>> Graduate Student: Computational Systems Biology
> >>> | Memorial Sloan-Kettering Cancer Center
> >>> | Weill Medical College of Cornell University
> >>> Contact Info: http://cbio.mskcc.org/~lianos/contact
> >>
> >
> >
>
>
>
>
>
> --
>
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>
>
>
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help