[datatable-help] using paste function while grouping gives strange results

Steve Lianoglou mailinglist.honeypot at gmail.com
Fri May 6 06:22:40 CEST 2011


Hi,

As an aside -- in the future, please provide some data in a form that
we can just copy and paste from your email into an R session so that
we can get a working object up quickly.

For example:

R> dt <- data.table(coursecode=c(NA, NA, NA, 101, 102, 101, 102, 103),
  student_id=c(1, 1, 1, 1, 1, 2, 2, 2),
  key='student_id')

On Thu, May 5, 2011 at 10:54 PM, Steve Harman <stvharman at gmail.com> wrote:
> Hello
>
> I have a data table called dt in which each student can have multiple
> records (created using data.table)
>
> coursecode    student_id
> ----------------    ----------------
> NA               1
> NA               1
> NA               1
> ....                1
> ....                1
> NA                2
> 101               2
> 102               2
> NA                2
> 103                2
>
> I am trying to group by student id and concatenate the coursecode
> strings in
> student records. This string is mostly NA but it can also be real
> course code
> (because of messy real life data coursecode was not always entered)
> There are 999999 records.
>
> So, I thought I would get results like
>
> 1 NA NA NA .....
> 2 NA 101 102 NA 123 ....

What type of object are you expecting that result to be?

> However, as seen below, it  brings me a result with 999999 rows
> and it fails to concatenate the coursecode's.
>
>>  codes <- dt[,paste(coursecode),by=student_id]
>> codes
>      student_id V1
>  [1,]          1 NA
>  [2,]          1 NA
>  [3,]          1 NA
>  [4,]          1 NA
>  [5,]          1 NA
>  [6,]          1 NA
>  [7,]          1 NA
>  [8,]          1 NA
>  [9,]          1 NA
> [10,]          1 NA
> First 10 rows of 999999 printed.
>
> If I repeat the same example for a numeric attribute and use some math
> aggregation functions such as sum, mean, etc., then the number of rows
> returned is correct, it is indeed equal to the number of students.
>
> I was wondering if the problem is with NA's or with the use of paste
> as the aggregation function. I can alternatively use RMySQL with MySQL
> to concatenate those strings but I would like to use data.table if
> possible.

What if you try this (using my `dt` example from above):

R> dt[, paste(coursecode, collapse=","), by=student_id]
     student_id               V1
[1,]          1 NA,NA,NA,101,102
[2,]          2      101,102,103

Note that each element in the $V1 column is a character vector of
length 1 and not individual course codes.

Without using the `collapse` argument to your call to paste, you just
get a character vector which is the same length as you passed in, eg:

R> paste(c('A', 'B', NA, 'C'))
[1] "A"  "B"  "NA" "C"

vs.

R> paste(c('A', 'B', NA, 'C'), collapse=",")
[1] "A,B,NA,C"

HTH,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


More information about the datatable-help mailing list