[datatable-help] using J() to select for a value that is in something other than the first key

Joseph Voelkel jgvcqa at rit.edu
Wed Feb 8 18:52:55 CET 2012


I like that idea.

My two cents:
1. Instead of DT[J(5),usekey=c("c","a")], can DT[J(5),usekey=list(c,a)] be used? Same syntax as for by= (or at least the syntax I use). Or both, as the character vector one would be easier to use with arguments in a function call.
2. I can imagine that the user will most often either have only a primary key, or a primary key and one secondary key. For simplicity (but perhaps offending some others at the same time...), can 
DT[J(5),usekey=2] refer to the first secondary key, usekey=3 to the second secondary key, and so on.
3. How about deleting secondary keys? If someone is trying a variety or ideas (or accidentally enters 
DT[J(5),usekey=c("b","c")] and has a new secondary key created on the fly), these might pile up.

Joe

-----Original Message-----
From: Matthew Dowle [mailto:mdowle at mdowle.plus.com] 
Sent: Wednesday, February 08, 2012 5:41 AM
To: Joseph Voelkel
Cc: Farrel Buchinsky; Steve Lianoglou; datatable-help at r-forge.wu-wien.ac.at
Subject: Re: [datatable-help] using J() to select for a value that is in something other than the first key

> Well, I don't know what's faster, but I'll have to agree with Farrel's
> philosophy. In addition, I KNOW that if I were to look at the code for $i
> 'trick' a few weeks later, I would have uneasy feelings about it. I would
> prefer that I can easily read my code-the less thought, the better.

Fully agree. We'd like proper secondary keys, but just haven't got to it yet.

>
> Here are the two ways I would think about doing it. No idea of speed
> issues.
>
> dt <- data.table(a=c('a','a','a','a','b','b','b','b'),
>   b=c('a','b','a','b','a','b','b','a'),c=1:8,key=c('a','b'))
>
> # method 1. Just include all possible values of the first key in J. To me,
> this is conceptually the simplest
> dt[J(unique(a),"b")]

That's nice, didn't think of that.

>
> # method 2. Swap the keys, twice
> setkey(dt,b,a)
> dt<-dt[J("b")]
> setkey(dt,a,b)

This should be a lot faster than it used to be, now that setkey is working
fully by reference even when coercing.  But it's ugly and inefficient
relative to what secondary keys could be. Especially when dt has many
columns.

Ok, let's do secondary keys.  The sticking point is what should the syntax
be?  How to write it internally without copying is fairly easy.

So, on the syntax I'm thinking new function set2key().  Same arguments as
setkey.  That would find the ordering of the table by those columns and
merely store that ordering vector as an attribute of the table. Perhaps
the name should be setkey2() but that suggests a 2nd version of setkey,
and rather than setsecondarykey, arrived at set2key().

A data.table could have a secondary key without a primary key. If a table
has secondary keys, and then a setkey() is done (to add or change the
primary key - the actual table ordering), then the secondary keys need to
be updated.

So far so good and I'm clear.    But, what should the syntax be to tell
data.table which key of x that i should join to?  I'm thinking it could
use the column names of i to know.  Currently, the column names of i
aren't used and don't need to match.

   DT = data.table(a=1:2,b=3:4,c=5:6,v=10:11)
   setkey(DT,a,b)  # primary key. data can only be sorted one way.
   set2key(DT,c,a) # 1st secondary key
   set2key(DT,c,b) # 2nd secondary key
   DT[J(2)]        # join to 'a' column of primary key, as now
   DT[J(c=5)]      # still joins to 'a' column of primary key, currently.

This could be changed so that if DT has a secondary key matching names(i),
then join to that secondary key. Otherwise, join to the primary key.

   DT[J(c=5)]      # join to secondary key

But, which of the two secondary keys that start with column c in this
example should it join to?  The first secondary key I guess. It does make
a difference to the ordering returned, though, and maybe joining to the
2nd secondary key is desired.

So, perhaps it should be a new 'usekey' argument. Just a character vector
subset of names(x).

   DT[J(5),usekey="c"]         # join to the first key starting with column c
   DT[J(5),usekey=c("c","a")]  # join to 1st secondary key explicitly.
names(i), being length 1 here wouldn't be sufficient to control which
of the 2 secondary keys to join to, but 'usekey' argument handle it.

   DT[J(5),usekey=c("c","b")]  # join to 2nd secondary key

The default for 'usekey' would be key(DT), as now.  key() would return the
primary key.   keys() would return the primary key (if any) followed by
(s)econdary key(s).  keys(DT)[[1]] = key(DT).   keys(DT)[[2]] = first
secondary key.  key(DT)[[3]] = 2nd secondary key.

Further,

   DT[J(5),usekey=c("b","c")]  # no key exists for (b,c), so create 3rd
secondary key on the fly and store it; i.e., implicit set2key().
   DT[J(5),usekey=c("b","c")]  # join to 3rd secondary key as it's now
cached by previous query.

   DT[J(5),usekey=keys(DT)[[2]]]  # join to 1st secondary key

What do you think?

Thanks.


>
>
> From: datatable-help-bounces at r-forge.wu-wien.ac.at
> [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Farrel
> Buchinsky
> Sent: Thursday, January 19, 2012 8:59 PM
> To: Steve Lianoglou
> Cc: datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] using J() to select for a value that is in
> something other than the first key
>
> I do not know if that is how all indexes work. I am not really a
> card-carrying database manager or programmer. I just play one in my spare
> time. The price I pay is not remembering how to write syntax when I need
> to do something. That to  me, is a higher price than slow subsetting. If
> the syntax is not easy I would rather just use the traditional vector scan
> methods that one sees in conventional data.frame subset commands.
>
> Notwithstanding my idiosyncratic needs, I thank you very much for your
> explanation.
> Farrel
>
>
>
> On Thu, Jan 19, 2012 at 18:26, Steve Lianoglou
> <mailinglist.honeypot at gmail.com<mailto:mailinglist.honeypot at gmail.com>>
> wrote:
> On Thu, Jan 19, 2012 at 6:09 PM, Farrel Buchinsky
> <fjbuch at gmail.com<mailto:fjbuch at gmail.com>> wrote:
>> Oy gevalt!.Am I correct to believe that the technique is rearranging the
>> data.table so that J can accept the input as pertaining to a secondary
>> key?
>> That seems as if it is too much work for me and my computer. I will
>> rather
>> stick to the vector scan methods for now.
> Not the entire data.table, just the key columns.
>
> Depending on how many queries you're going to make against the 2nd key
> only, the pay off for your troubles could be anywhere from zero to
> mucho. Of course if you simply don't have the RAM to make the idx
> data.table in the first place, then that's that.
>
> That's how all indexes work though, no? In a database for instance, if
> you have a compound key/index over two or more columns, the index will
> only help queries that work any prefix (or whole) part of the key, and
> not just any subset elements of it (as you want to do here), right?
>
> HTH,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list