[datatable-help] best way to set keys, when you don't know in advance which fields you will use

Jean Jacques Dureau jj.dureau at gmail.com
Fri Aug 26 10:08:26 CEST 2011


Dear Chris and Matthew,
thanks for the fantastic explanation that you gave me!

I am very satisfied with the processing time of the "ad hoc by". I
just wanted to confirm that, working without setting any keys on the
data.table, I was still using the full potential of this library! So, you
confirmed to me that my approach is not wrong.

I noticed, in fact, that with 7,000k rows, 7 factors to group by (f1,
..., f7) and a variable to sum (f8), I get:

 processing time of DT[, sum (f8), by = ("f1, f2, f3, f4, f5, f6, f7")]
 <
 processing time of setkey (DT, f1, f2, f3, f4, f5, f6, f7) and DT[,
sum (f8), by = key (DT)]
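
A minimal sketch of how this comparison might be timed, assuming a much
smaller synthetic table than the 7,000k rows above so that it runs
quickly. The column contents (three random letters per factor), the row
count n and the names cols, DT2, res_adhoc and res_keyed are illustrative
assumptions, not the real data; the timings it prints will depend on the
machine and on the data.table version.

```r
library(data.table)

set.seed(1)
n <- 1e5                      # assumption: small size, for a quick run
cols <- paste0("f", 1:7)      # the seven grouping columns f1..f7

# Synthetic table: seven factor columns plus a numeric column f8 to sum
DT <- as.data.table(lapply(
  setNames(cols, cols),
  function(x) factor(sample(letters[1:3], n, replace = TRUE))
))
DT[, f8 := runif(n)]

# Ad hoc by: no key is set on DT
t_adhoc <- system.time(
  res_adhoc <- DT[, sum(f8), by = cols]
)

# Keyed by: the cost of sorting (setkeyv) is included in the timing
DT2 <- copy(DT)
t_keyed <- system.time({
  setkeyv(DT2, cols)
  res_keyed <- DT2[, sum(f8), by = key(DT2)]
})

# Put both results in the same row order and check that the sums agree
setkeyv(res_adhoc, cols)
setkeyv(res_keyed, cols)
stopifnot(isTRUE(all.equal(res_adhoc$V1, res_keyed$V1)))

# Elapsed seconds for each approach
print(c(adhoc = t_adhoc[["elapsed"]], keyed = t_keyed[["elapsed"]]))
```

Timing the setkeyv() call separately from the keyed by would show how
much of the second figure is the one-off sort.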

Thank you very much, and I congratulate the developers! Considering
that statisticians are increasingly working with large data, I
think it's one of the most interesting R libraries!

jj

2011/8/25 Matthew Dowle <mdowle at mdowle.plus.com>:
> Also, it makes a difference if the groups happen to be contiguous in the
> table, or not.
>
> Try creating a large table with large sized groups, where each group is
> scattered throughout the table non-contiguously.  Time an ad hoc by.
> Then set a key, remove the key, and time the ad hoc by again.  The 2nd
> ad hoc by should be much faster.  Then set the key again, and time a
> keyed by, it should be faster still.
>
> Does that illustrate what's going on?
>
>
> On Thu, 2011-08-25 at 19:18 +0100, Matthew Dowle wrote:
>> JJ,
>> Yes, Chris is spot on.
>> keyed by should be faster when the size of each group is large; e.g., a
>> 1 billion row data.table of 1,000 groups. See FAQ 3.3 for why.
>> However in your example, ad hoc by does seem more appropriate.
>> Matthew
>>
>> On Thu, 2011-08-25 at 11:17 -0400, Chris Neff wrote:
>> > You don't necessarily have to use keys at all.  When you aggregate and
>> > give the by columns, they don't necessarily have to be keys of the
>> > data table.  This is called an "ad-hoc by". It is slightly slower, but
>> > my intuition says that it isn't really any slower than setting the
>> > key.
>> >
>> > When you add a key you sort by those fields.  You incur a time cost
>> > for that. If you are consistently doing things with those keys then
>> > you may make up for that time cost further on.  But for multiple
>> > different groupings the ad-hoc by is probably faster.  Do some timings
>> > to see.  Some simple ones I did show that the act of sorting is slower
>> > than ad-hoc by.
>> >
>> > On 25 August 2011 11:05, Jean Jacques Dureau <jj.dureau at gmail.com> wrote:
>> > > Hi,
>> > > I have a data.table (10,000k rows) with 20 (factor) fields and I
>> > > need to filter the data according to some of them.
>> > > I use this data.table inside a function and I don't know "in advance"
>> > > which fields I'll use to filter the data and to sum.
>> > >
>> > > So, for example, consider a data.table (named dt_data) with 20 fields,
>> > > named f1, f2, ... ,f20.
>> > >
>> > > I use this approach: I set the key on the field I have to use, for
>> > > example f2. Then I "filter" the data and use it to do some
>> > > computations.
>> > >
>> > > Subsequently, from these computations, I discover which fields I have
>> > > to filter on, for example f4 and f5. Now I set the key of dt_data to
>> > > (f4,f5), and so on ...
>> > >
>> > > I use this approach because I don't know if it's possible to set the
>> > > key on all fields f1, f2, .., f20 in advance and then use only some of
>> > > them!
>> > >
>> > > Is there a better way to use data.table?
>> > >
>> > > thanks
>> > >
>> > > jj
>> > > _______________________________________________
>> > > datatable-help mailing list
>> > > datatable-help at lists.r-forge.r-project.org
>> > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>> > >