[datatable-help] best way to set keys, when you don't know in advance which fields you will use

Matthew Dowle mdowle at mdowle.plus.com
Thu Aug 25 20:18:35 CEST 2011


JJ,
Yes, Chris is spot on.
Keyed by should be faster when each group is large; e.g., a
1 billion row data.table with 1,000 groups. See FAQ 3.3 for why.
However, in your example, ad hoc by does seem more appropriate.
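
A minimal sketch of the two approaches, in case it helps. It assumes a
recent data.table release that accepts a character vector in by; the
column named value, the small table size and the helper sum_by are made
up for illustration and are not from your data.

```r
library(data.table)

set.seed(1)
n <- 1e5  # small stand-in for the 10 million rows in the question

# Toy table with a few factor fields and a numeric column to sum
# (`value` is a hypothetical column name).
dt_data <- data.table(
  f2 = factor(sample(letters[1:5], n, replace = TRUE)),
  f4 = factor(sample(LETTERS[1:4], n, replace = TRUE)),
  f5 = factor(sample(c("x", "y", "z"), n, replace = TRUE)),
  value = runif(n)
)

# Hypothetical helper: sum `value` within groups whose column names
# are only known at run time. No key is set, so this is an ad hoc by.
sum_by <- function(dt, fields) {
  dt[, list(total = sum(value)), by = fields]
}

res_f2   <- sum_by(dt_data, "f2")
res_f4f5 <- sum_by(dt_data, c("f4", "f5"))

# Keyed alternative: sort a copy on (f4, f5) first, then group.
keyed <- copy(dt_data)
t_key   <- system.time(setkeyv(keyed, c("f4", "f5")))
t_keyed <- system.time(
  res_keyed <- keyed[, list(total = sum(value)), by = c("f4", "f5")]
)
t_adhoc <- system.time(sum_by(dt_data, c("f4", "f5")))

# Compare the cost of sorting plus keyed by against ad hoc by alone.
print(rbind(setkey = t_key, keyed_by = t_keyed, adhoc_by = t_adhoc))
```

The timings depend on the number of rows and on the group sizes, so
they are worth rerunning on the real table before deciding.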
Matthew

On Thu, 2011-08-25 at 11:17 -0400, Chris Neff wrote:
> You don't necessarily have to use keys at all.  When you aggregate and
> give the by columns, they don't have to be keys of the data table.
> This is called an "ad-hoc by". It is slightly slower than a keyed by,
> but my intuition says that it isn't really any slower once you count
> the time spent setting the key.
> 
> When you add a key you sort by those fields.  You incur a time cost
> for that. If you are consistently doing things with those keys then
> you may make up for that time cost further on.  But for multiple
> different groupings the ad-hoc by is probably faster, since each new
> grouping would need its own sort.  Do some timings to see.  Some
> simple ones I did show that the sort alone is slower than ad-hoc by.
> 
> On 25 August 2011 11:05, Jean Jacques Dureau <jj.dureau at gmail.com> wrote:
> > Hi,
> > I have a data.table (10,000k rows) with 20 (factor) fields and I
> > need to filter data according to some of them.
> > I use this data.table inside a function and I don't know in advance
> > which fields I'll use to filter the data and to sum.
> >
> > So, for example, consider a data.table (named dt_data) with 20 fields,
> > named f1, f2, ..., f20.
> >
> > I use this approach: I set the key on the field I have to use, for
> > example f2. Then I filter the data and use the result to do some
> > computations.
> >
> > Subsequently, from these computations, I discover which fields I have
> > to filter on next, for example f4 and f5. Now I set the key of dt_data
> > to (f4,f5), and so on ...
> >
> > I use this approach because I don't know if it's possible to set the
> > key on all fields f1, f2, ..., f20 in advance and then use only some of
> > them!
> >
> > Is there a better way to use data.table?
> >
> > thanks
> >
> > jj
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> >



