[datatable-help] best way to set keys, when you don't know in advance which fields you will use

Matthew Dowle mdowle at mdowle.plus.com
Thu Aug 25 20:32:50 CEST 2011


Also, it makes a difference if the groups happen to be contiguous in the
table, or not.

Try creating a large table with large groups, where each group's rows are
scattered non-contiguously throughout the table.  Time an ad hoc by.
Then set a key, remove the key, and time the ad hoc by again.  The 2nd
ad hoc by should be much faster.  Then set the key again, and time a
keyed by, it should be faster still.

Does that illustrate what's going on?
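
A minimal sketch of that experiment, in case it helps. The table size, the
number of groups, and the column names grp and val are made up for
illustration, and the timings you get will depend on your data.table version
and hardware, so none are shown here. The point of step 2 is that setting a
key sorts the rows, and removing the key afterwards leaves them in that
sorted order, so the groups are contiguous even though the table is no longer
keyed.

```r
library(data.table)

set.seed(1)
n_rows   <- 1e6L   # hypothetical size; increase it to make differences clearer
n_groups <- 100L   # few groups, so each group is large

# Rows of each group are scattered throughout the table (non-contiguous).
DT <- data.table(grp = sample(n_groups, n_rows, replace = TRUE),
                 val = runif(n_rows))

# 1. Ad hoc by on the scattered table.
t_scattered <- system.time(r1 <- DT[, sum(val), by = grp])

# 2. Set a key (sorts the rows by grp), then remove it again.
#    The rows stay sorted, so each group is now contiguous but unkeyed.
setkey(DT, grp)
setkey(DT, NULL)
t_contiguous <- system.time(r2 <- DT[, sum(val), by = grp])

# 3. Set the key again and time a keyed by.
setkey(DT, grp)
t_keyed <- system.time(r3 <- DT[, sum(val), by = grp])

# Elapsed seconds for the three runs.
print(c(scattered  = t_scattered[["elapsed"]],
        contiguous = t_contiguous[["elapsed"]],
        keyed      = t_keyed[["elapsed"]]))
```

All three calls compute the same per-group sums; only the physical layout of
the rows and the presence of the key differ between them.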
 

On Thu, 2011-08-25 at 19:18 +0100, Matthew Dowle wrote:
> JJ,
> Yes, Chris is spot on.
> keyed by should be faster when the size of each group is large; e.g., a
> 1 billion row data.table of 1,000 groups. See FAQ 3.3 for why.
> However in your example, ad hoc by does seem more appropriate.
> Matthew
> 
> On Thu, 2011-08-25 at 11:17 -0400, Chris Neff wrote:
> > You don't necessarily have to use keys at all.  When you aggregate and
> > give the by columns, they don't necessarily have to be keys of the
> > data table.  This is called an "ad-hoc by". It is slightly slower, but
> > my intuition says that it isn't really any slower than setting the
> > key.
> > 
> > When you add a key you sort by those fields.  You incur a time cost
> > for that. If you are consistently doing things with those keys then
> > you may make up for that time cost further on.  But for multiple
> > different groupings the ad-hoc by is probably faster.  Do some timings
> > to see.  Some simple ones I did show that the act of sorting is slower
> > than ad-hoc by.
> > 
> > On 25 August 2011 11:05, Jean Jacques Dureau <jj.dureau at gmail.com> wrote:
> > > Hi,
> > > I have a data.table (10,000k rows) with 20 (factor) fields and I
> > > need to filter the data according to some of them.
> > > I use this data.table inside a function and I don't know "in advance"
> > > which fields I'll use to filter the data and to sum.
> > >
> > > So, for example, consider a data.table (named dt_data) with 20 fields,
> > > named f1, f2, ..., f20.
> > >
> > > I use this approach: I set the key on the field I have to use, for
> > > example f2. Then I "filter" the data and use it to do some
> > > computations.
> > >
> > > Subsequently, from these computations, I discover which fields I have
> > > to filter on, for example f4 and f5. Now I set the key of dt_data to
> > > (f4,f5), and so on ...
> > >
> > > I use this approach because I don't know if it's possible to set the
> > > key on all fields f1, f2, ..., f20 in advance and then use only some of
> > > them!
> > >
> > > Is there a better way to use data.table?
> > >
> > > thanks
> > >
> > > jj
> > > _______________________________________________
> > > datatable-help mailing list
> > > datatable-help at lists.r-forge.r-project.org
> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > >
> 