[datatable-help] using paste function while grouping gives strange results
Steve Lianoglou
mailinglist.honeypot at gmail.com
Fri May 6 06:22:40 CEST 2011
Hi,
As an aside -- in the future, please provide some data in a form that
we can just copy and paste from your email into an R session so that
we can get a working object up quickly.
For example:
R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),
student_id=c(1, 1, 1, 1, 1, 2, 2, 2),
key='student_id')
On Thu, May 5, 2011 at 10:54 PM, Steve Harman <stvharman at gmail.com> wrote:
> Hello
>
> I have a data table called dt in which each student can have multiple
> records (created using data.table)
>
> coursecode student_id
> ---------------- ----------------
> NA 1
> NA 1
> NA 1
> .... 1
> .... 1
> NA 2
> 101 2
> 102 2
> NA 2
> 103 2
>
> I am trying to group by student id and concatenate the coursecode
> strings in
> student records. This string is mostly NA but it can also be real
> course code
> (because of messy real life data coursecode was not always entered)
> There are 999999 records.
>
> So, I thought I would get results like
>
> 1 NA NA NA .....
> 2 NA 101 102 NA 123 ....
What type of object are you expecting that result to be?
> However, as seen below, it brings me a result with 999999 rows
> and it fails to concatenate the coursecode's.
>
>> codes <- dt[,paste(coursecode),by=student_id]
>> codes
> student_id V1
> [1,] 1 NA
> [2,] 1 NA
> [3,] 1 NA
> [4,] 1 NA
> [5,] 1 NA
> [6,] 1 NA
> [7,] 1 NA
> [8,] 1 NA
> [9,] 1 NA
> [10,] 1 NA
> First 10 rows of 999999 printed.
>
> If I repeat the same example for a numeric attribute and use some math
> aggregation functions such as sum, mean, etc., then the number of rows
> returned is correct, it is indeed equal to the number of students.
>
> I was wondering if the problem is with NA's or with the use of paste
> as the aggregation function. I can alternatively use RMySQL with MySQL
> to concatenate those strings but I would like to use data.table if
> possible.
What if you try this (using my `dt` example from above):
R> dt[, paste(coursecode, collapse=","), by=student_id]
student_id V1
[1,] 1 NA,NA,NA,101,102
[2,] 2 101,102,103
Note that each element in the $V1 column is a character vector of
length 1 and not individual course codes.
Without using the `collapse` argument to your call to paste, you just
get a character vector which is the same length as you passed in, eg:
R> paste(c('A', 'B', NA, 'C'))
[1] "A" "B" "NA" "C"
vs.
R> paste(c('A', 'B', NA, 'C'), collapse=",")
[1] "A,B,NA,C"
HTH,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the datatable-help
mailing list