[datatable-help] Grouping with sort

Steve Harman stvharman at gmail.com
Sat May 7 07:50:50 CEST 2011


Thanks.

Here is the bigger picture.
There are about 2 million records. They need to be grouped by person ID.
When we group them, we want to obtain a string in which the grouped values
are sorted and concatenated.

For example

ID   V1
---  ---
1    2
1    1
2    8
2    3
2    5
2    2

should become

ID   Gr_V1
---  -------
1    1,2
2    2,3,5,8
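
For anyone following along, the transformation in the example above could be
sketched in data.table roughly as follows. This is only a minimal sketch on
the toy values shown; the names DT, resA and resB are made up for
illustration and do not come from the real (classified) code. Approach A is
the paste(sort(x), collapse=",") idea from the original question; approach B
is one reading of the suggestion to put the value column into the key so that
no per-group sort is needed.

```r
library(data.table)

# Toy data mirroring the example above (hypothetical table name DT).
DT <- data.table(ID = c(1L, 1L, 2L, 2L, 2L, 2L),
                 V1 = c(2L, 1L, 8L, 3L, 5L, 2L))

# Approach A: sort within each group, then concatenate.
resA <- DT[, list(Gr_V1 = paste(sort(V1), collapse = ",")), by = ID]

# Approach B: sort the whole table once by (ID, V1) via the key,
# then only concatenate within each group.
setkey(DT, ID, V1)
resB <- DT[, list(Gr_V1 = paste(V1, collapse = ",")), by = ID]

print(resA)
print(resB)
```

Both approaches should give the same grouped strings on this toy input.
Whether B is actually faster on about 2 million rows is something to measure
(for example with system.time()) rather than assume.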

The number of people is about 1,007 K (roughly one million).

I am giving examples because (1) I cannot copy-paste code and (2) the data
and problem are classified.
All of these computations are performed on secure machines disconnected from
the Internet.
Using R is not a requirement. Many databases can handle the above using SQL.
However, these questions came up because I saw data.table while browsing the
Internet and thought I could give it a try in order to avoid using SQL.

On Fri, May 6, 2011 at 6:37 PM, Matthew Dowle <mdowle at mdowle.plus.com> wrote:

> Steve H,
> How much is 'much better' and 'much longer' please? And on how many
> rows/GB? What is the bigger picture, and why are you concatenating
> strings together and using paste() at all?
> Guess 1: you can include the x column in your key; e.g. setkey(grp,x),
> then there would be no need to sort(x) again.
> Guess 2: sorting character can be slow. Hence we don't allow character
> columns in keys (yet); data.table converts character to factor.
> But, ideally, more information at a higher level would help us to help.
> Matthew
>
>
> On Fri, 2011-05-06 at 12:16 -0700, Steve Harman wrote:
> > Related to this, RMySQL performs much better
> > (using GROUP BY and functions such as GROUP_CONCAT, which allows you
> > to order the values and use a separator too).
> >
> > So, I would recommend using them if you want grouping with sorting.
> >
> > On May 6, 2:36 pm, Steve Harman <stvhar... at gmail.com> wrote:
> > > Hello !
> > > When grouping using data.table, mean and sum functions applied within
> > > groups work well but if we use sort(x) function it takes much longer.
> > >
> > > I would like to do first sort(x) and put it inside paste such as
> > > paste(sort(x),collapse=",")
> > > I was wondering if there is a more efficient or effective way of
> > > doing this?
> > >
> > > thanks in advance,
> > >
> > > Steve
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> >
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help