[datatable-help] using J() to select for a value that is in something other than the first key

Matthew Dowle mdowle at mdowle.plus.com
Thu Feb 9 01:15:05 CET 2012


Agreed on all that.  For deleting keys, how about:

    rmkey(DT,b,c)   # Remove the (b,c) key whether that be primary or
secondary.
    rmkey(DT)       # Remove all keys
    setkey(DT,NULL)  # Remove primary key, leave secondary keys
    rmkey(DT,2)     # Remove 1st secondary key
    rmkey(DT,c("b","c"))   # Remove (b,c) key.
    rm2key(DT)      # Remove all secondary keys

Back to 'usekey', ok so we're fine for choosing which key of x to join
to using 'usekey', but what about which columns of i?

    DT = data.table(a=1:2,b=3:4,val=10:11)
    setkey(DT,a)
    set2key(DT,b)
    X=data.table(a=2,b=4,key="a")
    DT[X]              # key(X) joined to key(DT)
    DT[X,usekey="b"]   # key(X) joined to secondary key
    DT[X[,list(b)],usekey="b"]   # X.b joined to DT.b,  but X.b is
copied here

Maybe new 'useicols' argument, or just 'icols',  but it feels like there
should be a neater way.

Or, when j is a subset of DT's columns DT[,j] could be changed to return
a reference to DT's columns in memory, to be copied-on-write if/when
needed.

Hm.



On Wed, 2012-02-08 at 12:52 -0500, Joseph Voelkel wrote:
> I like that idea.
> 
> My two cents:
> 1. Instead of DT[J(5),usekey=c("c","a")], can DT[J(5),usekey=list(c,a)] be used? Same syntax as for by= (or at least the syntax I use). Or both, as the character vector one would be easier to use with arguments in a function call.
> 2. I can imagine that the user will most often either have only a primary key, or a primary key and one secondary key. For simplicity (but perhaps offending some others at the same time...), can 
> DT[J(5),usekey=2] refer to the first secondary key, usekey=3 to the second secondary key, and so on.
> 3. How about deleting secondary keys? If someone is trying a variety or ideas (or accidentally enters 
> DT[J(5),usekey=c("b","c")] and has a new secondary key created on the fly), these might pile up.
> 
> Joe
> 
> -----Original Message-----
> From: Matthew Dowle [mailto:mdowle at mdowle.plus.com] 
> Sent: Wednesday, February 08, 2012 5:41 AM
> To: Joseph Voelkel
> Cc: Farrel Buchinsky; Steve Lianoglou; datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] using J() to select for a value that is in something other than the first key
> 
> > Well, I don't know what's faster, but I'll have to agree with Farrel's
> > philosophy. In addition, I KNOW that if I were to look at the code for $i
> > 'trick' a few weeks later, I would have uneasy feelings about it. I would
> > prefer that I can easily read my code-the less thought, the better.
> 
> Fully agree. We'd like proper secondary keys, but just haven't got to it yet.
> 
> >
> > Here are the two ways I would think about doing it. No idea of speed
> > issues.
> >
> > dt <- data.table(a=c('a','a','a','a','b','b','b','b'),
> >   b=c('a','b','a','b','a','b','b','a'),c=1:8,key=c('a','b'))
> >
> > # method 1. Just include all possible values of the first key in J. To me,
> > this is conceptually the simplest
> > dt[J(unique(a),"b")]
> 
> That's nice, didn't think of that.
> 
> >
> > # method 2. Swap the keys, twice
> > setkey(dt,b,a)
> > dt<-dt[J("b")]
> > setkey(dt,a,b)
> 
> This should be a lot faster than it used to be, now that setkey is working
> fully by reference even when coercing.  But it's ugly and inefficient
> relative to what secondary keys could be. Especially when dt has many
> columns.
> 
> Ok, let's do secondary keys.  The sticking point is what should the syntax
> be?  How to write it internally without copying is fairly easy.
> 
> So, on the syntax I'm thinking new function set2key().  Same arguments as
> setkey.  That would find the ordering of the table by those columns and
> merely store that ordering vector as an attribute of the table. Perhaps
> the name should be setkey2() but that suggests a 2nd version of setkey,
> and rather than setsecondarykey, arrived at set2key().
> 
> A data.table could have a secondary key without a primary key. If a table
> has secondary keys, and then a setkey() is done (to add or change the
> primary key - the actual table ordering), then the secondary keys need to
> be updated.
> 
> So far so good and I'm clear.    But, what should the syntax be to tell
> data.table which key of x that i should join to?  I'm thinking it could
> use the column names of i to know.  Currently, the column names of i
> aren't used and don't need to match.
> 
>    DT = data.table(a=1:2,b=3:4,c=5:6,v=10:11)
>    setkey(DT,a,b)  # primary key. data can only be sorted one way.
>    set2key(DT,c,a) # 1st secondary key
>    set2key(DT,c,b) # 2nd secondary key
>    DT[J(2)]        # join to 'a' column of primary key, as now
>    DT[J(c=5)]      # still joins to 'a' column of primary key, currently.
> 
> This could be changed so that if DT has a secondary key matching names(i),
> then join to that secondary key. Otherwise, join to the primary key.
> 
>    DT[J(c=5)]      # join to secondary key
> 
> But, which of the two secondary keys that start with column c in this
> example should it join to?  The first secondary key I guess. It does make
> a difference to the ordering returned, though, and maybe joining to the
> 2nd secondary key is desired.
> 
> So, perhaps it should be a new 'usekey' argument. Just a character vector
> subset of names(x).
> 
>    DT[J(5),usekey="c"]         # join to the first key starting with column c
>    DT[J(5),usekey=c("c","a")]  # join to 1st secondary key explicitly.
> names(i), being length 1 here wouldn't be sufficient to control which
> of the 2 secondary keys to join to, but 'usekey' argument handle it.
> 
>    DT[J(5),usekey=c("c","b")]  # join to 2nd secondary key
> 
> The default for 'usekey' would be key(DT), as now.  key() would return the
> primary key.   keys() would return the primary key (if any) followed by
> (s)econdary key(s).  keys(DT)[[1]] = key(DT).   keys(DT)[[2]] = first
> secondary key.  key(DT)[[3]] = 2nd secondary key.
> 
> Further,
> 
>    DT[J(5),usekey=c("b","c")]  # no key exists for (b,c), so create 3rd
> secondary key on the fly and store it; i.e., implicit set2key().
>    DT[J(5),usekey=c("b","c")]  # join to 3rd secondary key as it's now
> cached by previous query.
> 
>    DT[J(5),usekey=keys(DT)[[2]]]  # join to 1st secondary key
> 
> What do you think?
> 
> Thanks.
> 
> 
> >
> >
> > From: datatable-help-bounces at r-forge.wu-wien.ac.at
> > [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Farrel
> > Buchinsky
> > Sent: Thursday, January 19, 2012 8:59 PM
> > To: Steve Lianoglou
> > Cc: datatable-help at r-forge.wu-wien.ac.at
> > Subject: Re: [datatable-help] using J() to select for a value that is in
> > something other than the first key
> >
> > I do not know if that is how all indexes work. I am not really a
> > card-carrying database manager or programmer. I just play one in my spare
> > time. The price I pay is not remembering how to write syntax when I need
> > to do something. That to  me, is a higher price than slow subsetting. If
> > the syntax is not easy I would rather just use the traditional vector scan
> > methods that one sees in conventional data.frame subset commands.
> >
> > Notwithstanding my idiosyncratic needs, I thank you very much for your
> > explanation.
> > Farrel
> >
> >
> >
> > On Thu, Jan 19, 2012 at 18:26, Steve Lianoglou
> > <mailinglist.honeypot at gmail.com<mailto:mailinglist.honeypot at gmail.com>>
> > wrote:
> > On Thu, Jan 19, 2012 at 6:09 PM, Farrel Buchinsky
> > <fjbuch at gmail.com<mailto:fjbuch at gmail.com>> wrote:
> >> Oy gevalt!.Am I correct to believe that the technique is rearranging the
> >> data.table so that J can accept the input as pertaining to a secondary
> >> key?
> >> That seems as if it is too much work for me and my computer. I will
> >> rather
> >> stick to the vector scan methods for now.
> > Not the entire data.table, just the key columns.
> >
> > Depending on how many queries you're going to make against the 2nd key
> > only, the pay off for your troubles could be anywhere from zero to
> > mucho. Of course if you simply don't have the RAM to make the idx
> > data.table in the first place, then that's that.
> >
> > That's how all indexes work though, no? In a database for instance, if
> > you have a compound key/index over two or more columns, the index will
> > only help queries that work any prefix (or whole) part of the key, and
> > not just any subset elements of it (as you want to do here), right?
> >
> > HTH,
> > -steve
> >
> > --
> > Steve Lianoglou
> > Graduate Student: Computational Systems Biology
> >  | Memorial Sloan-Kettering Cancer Center
> >  | Weill Medical College of Cornell University
> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
> 




More information about the datatable-help mailing list