[datatable-help] Grouping with sort

Matthew Dowle mdowle at mdowle.plus.com
Sat May 7 22:28:47 CEST 2011


Hi Steve H,

Please read "Describe the goal, not the step" here:
http://www.catb.org/~esr/faqs/smart-questions.html

Matthew


On Sat, 2011-05-07 at 01:50 -0400, Steve Harman wrote:
> Thanks.
> 
> 
> Here is the bigger picture.
> There are about 2 million records. They need to be grouped by
> person ID. For each group, we want to obtain a single string in
> which the group's values are sorted and concatenated.
> 
> 
> For example
> 
> 
> ID, V1
> ---   ---
> 1, 2
> 1, 1
> 2, 8
> 2, 3
> 2, 5
> 2, 2
> 
> 
> should become
> 
> 
> ID, Gr_V1
> --- -----
> 1,  1,2
> 2,  2,3,5,8
> 
> 
> The number of people is about 1,007,000.
> 
> 
> I am giving examples because (1) I cannot copy-paste code and (2) the
> data and problem are classified. All of these computations are performed
> on secure machines disconnected from the Internet.
> Using R is not a requirement. Many databases can handle the above
> using SQL. However, these questions came up because I saw data.table
> while browsing the Internet and thought I could give it a try in order
> to avoid using SQL.
> 
> On Fri, May 6, 2011 at 6:37 PM, Matthew Dowle <mdowle at mdowle.plus.com>
> wrote:
>         Steve H,
>         How much is 'much better' and 'much longer' please? And on how
>         many rows/GB? What is the bigger picture, and why are you
>         concatenating strings together and using paste() at all?
>         Guess 1: you can include the x column in your key; e.g.
>         setkey(grp,x), then there would be no need to sort(x) again.
>         Guess 2: sorting character can be slow. Hence we don't allow
>         character columns in keys (yet); data.table converts character
>         to factor.
>         But, ideally, more information at a higher level would help us
>         to help.
>         Matthew
>         
>         
>         
>         On Fri, 2011-05-06 at 12:16 -0700, Steve Harman wrote:
>         > Related to this, RMySQL performs much better
>         > (using GROUP BY and functions such as GROUP_CONCAT, which
>         > allows you to order and use a separator too).
>         >
>         > So, I would recommend using them if you want grouping with
>         > sorting.
>         >
>         > On May 6, 2:36 pm, Steve Harman <stvhar... at gmail.com> wrote:
>         > > Hello !
>         > > When grouping using data.table, mean and sum functions
>         > > applied within groups work well, but if we use the sort(x)
>         > > function it takes much longer.
>         > >
>         > > I would like to first do sort(x) and put it inside paste,
>         > > such as paste(sort(x),collapse=",")
>         > > I was wondering if there is any more efficient or
>         > > effective way of doing this?
>         > >
>         > > thanks in advance,
>         > >
>         > > Steve
>         > > _______________________________________________
>         > > datatable-help mailing list
>         > > datatable-h... at lists.r-forge.r-project.org
>         > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatabl...
>         > _______________________________________________
>         > datatable-help mailing list
>         > datatable-help at lists.r-forge.r-project.org
>         > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>         
>         
>         
> 
> 
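
For readers following the thread: "Guess 1" in the quoted message (put the
value column in the key so that no per-group sort(x) is needed) can be
sketched on the six-row ID/V1 example quoted above. This is only an
illustration under assumptions, not code from either poster: the object
names DT and res are made up here, and it uses the current data.table
syntax, which may differ from the 2011 release discussed in the thread.

```r
library(data.table)

# Toy data copied from the ID / V1 example in the thread
# (DT and res are hypothetical names used only for this sketch)
DT <- data.table(ID = c(1L, 1L, 2L, 2L, 2L, 2L),
                 V1 = c(2L, 1L, 8L, 3L, 5L, 2L))

# Key on both columns: rows are physically sorted by ID, then by V1
setkey(DT, ID, V1)

# V1 is already in order within each ID, so paste() alone is enough;
# no sort(V1) call is made per group
res <- DT[, list(Gr_V1 = paste(V1, collapse = ",")), by = ID]
print(res)
```

Whether this is actually faster than paste(sort(x),collapse=",") on 2
million rows would need to be timed on the real data.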



