[datatable-help] using J() to select for a value that is in something other than the first key
Matthew Dowle
mdowle at mdowle.plus.com
Thu Feb 9 01:15:05 CET 2012
Agreed on all that. For deleting keys, how about:
rmkey(DT,b,c) # Remove the (b,c) key whether that be primary or
secondary.
rmkey(DT) # Remove all keys
setkey(DT,NULL) # Remove primary key, leave secondary keys
rmkey(DT,2) # Remove 1st secondary key
rmkey(DT,c("b","c")) # Remove (b,c) key.
rm2key(DT) # Remove all secondary keys
Back to 'usekey', ok so we're fine for choosing which key of x to join
to using 'usekey', but what about which columns of i?
DT = data.table(a=1:2,b=3:4,val=10:11)
setkey(DT,a)
set2key(DT,b)
X=data.table(a=2,b=4,key="a")
DT[X] # key(X) joined to key(DT)
DT[X,usekey="b"] # key(X) joined to secondary key
DT[X[,list(b)],usekey="b"] # X.b joined to DT.b, but X.b is
copied here
Maybe new 'useicols' argument, or just 'icols', but it feels like there
should be a neater way.
Or, when j is a subset of DT's columns DT[,j] could be changed to return
a reference to DT's columns in memory, to be copied-on-write if/when
needed.
Hm.
On Wed, 2012-02-08 at 12:52 -0500, Joseph Voelkel wrote:
> I like that idea.
>
> My two cents:
> 1. Instead of DT[J(5),usekey=c("c","a")], can DT[J(5),usekey=list(c,a)] be used? Same syntax as for by= (or at least the syntax I use). Or both, as the character vector one would be easier to use with arguments in a function call.
> 2. I can imagine that the user will most often either have only a primary key, or a primary key and one secondary key. For simplicity (but perhaps offending some others at the same time...), can
> DT[J(5),usekey=2] refer to the first secondary key, usekey=3 to the second secondary key, and so on.
> 3. How about deleting secondary keys? If someone is trying a variety or ideas (or accidentally enters
> DT[J(5),usekey=c("b","c")] and has a new secondary key created on the fly), these might pile up.
>
> Joe
>
> -----Original Message-----
> From: Matthew Dowle [mailto:mdowle at mdowle.plus.com]
> Sent: Wednesday, February 08, 2012 5:41 AM
> To: Joseph Voelkel
> Cc: Farrel Buchinsky; Steve Lianoglou; datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: [datatable-help] using J() to select for a value that is in something other than the first key
>
> > Well, I don't know what's faster, but I'll have to agree with Farrel's
> > philosophy. In addition, I KNOW that if I were to look at the code for $i
> > 'trick' a few weeks later, I would have uneasy feelings about it. I would
> > prefer that I can easily read my code-the less thought, the better.
>
> Fully agree. We'd like proper secondary keys, but just haven't got to it yet.
>
> >
> > Here are the two ways I would think about doing it. No idea of speed
> > issues.
> >
> > dt <- data.table(a=c('a','a','a','a','b','b','b','b'),
> > b=c('a','b','a','b','a','b','b','a'),c=1:8,key=c('a','b'))
> >
> > # method 1. Just include all possible values of the first key in J. To me,
> > this is conceptually the simplest
> > dt[J(unique(a),"b")]
>
> That's nice, didn't think of that.
>
> >
> > # method 2. Swap the keys, twice
> > setkey(dt,b,a)
> > dt<-dt[J("b")]
> > setkey(dt,a,b)
>
> This should be a lot faster than it used to be, now that setkey is working
> fully by reference even when coercing. But it's ugly and inefficient
> relative to what secondary keys could be. Especially when dt has many
> columns.
>
> Ok, let's do secondary keys. The sticking point is what should the syntax
> be? How to write it internally without copying is fairly easy.
>
> So, on the syntax I'm thinking new function set2key(). Same arguments as
> setkey. That would find the ordering of the table by those columns and
> merely store that ordering vector as an attribute of the table. Perhaps
> the name should be setkey2() but that suggests a 2nd version of setkey,
> and rather than setsecondarykey, arrived at set2key().
>
> A data.table could have a secondary key without a primary key. If a table
> has secondary keys, and then a setkey() is done (to add or change the
> primary key - the actual table ordering), then the secondary keys need to
> be updated.
>
> So far so good and I'm clear. But, what should the syntax be to tell
> data.table which key of x that i should join to? I'm thinking it could
> use the column names of i to know. Currently, the column names of i
> aren't used and don't need to match.
>
> DT = data.table(a=1:2,b=3:4,c=5:6,v=10:11)
> setkey(DT,a,b) # primary key. data can only be sorted one way.
> set2key(DT,c,a) # 1st secondary key
> set2key(DT,c,b) # 2nd secondary key
> DT[J(2)] # join to 'a' column of primary key, as now
> DT[J(c=5)] # still joins to 'a' column of primary key, currently.
>
> This could be changed so that if DT has a secondary key matching names(i),
> then join to that secondary key. Otherwise, join to the primary key.
>
> DT[J(c=5)] # join to secondary key
>
> But, which of the two secondary keys that start with column c in this
> example should it join to? The first secondary key I guess. It does make
> a difference to the ordering returned, though, and maybe joining to the
> 2nd secondary key is desired.
>
> So, perhaps it should be a new 'usekey' argument. Just a character vector
> subset of names(x).
>
> DT[J(5),usekey="c"] # join to the first key starting with column c
> DT[J(5),usekey=c("c","a")] # join to 1st secondary key explicitly.
> names(i), being length 1 here wouldn't be sufficient to control which
> of the 2 secondary keys to join to, but 'usekey' argument handle it.
>
> DT[J(5),usekey=c("c","b")] # join to 2nd secondary key
>
> The default for 'usekey' would be key(DT), as now. key() would return the
> primary key. keys() would return the primary key (if any) followed by
> (s)econdary key(s). keys(DT)[[1]] = key(DT). keys(DT)[[2]] = first
> secondary key. key(DT)[[3]] = 2nd secondary key.
>
> Further,
>
> DT[J(5),usekey=c("b","c")] # no key exists for (b,c), so create 3rd
> secondary key on the fly and store it; i.e., implicit set2key().
> DT[J(5),usekey=c("b","c")] # join to 3rd secondary key as it's now
> cached by previous query.
>
> DT[J(5),usekey=keys(DT)[[2]]] # join to 1st secondary key
>
> What do you think?
>
> Thanks.
>
>
> >
> >
> > From: datatable-help-bounces at r-forge.wu-wien.ac.at
> > [mailto:datatable-help-bounces at r-forge.wu-wien.ac.at] On Behalf Of Farrel
> > Buchinsky
> > Sent: Thursday, January 19, 2012 8:59 PM
> > To: Steve Lianoglou
> > Cc: datatable-help at r-forge.wu-wien.ac.at
> > Subject: Re: [datatable-help] using J() to select for a value that is in
> > something other than the first key
> >
> > I do not know if that is how all indexes work. I am not really a
> > card-carrying database manager or programmer. I just play one in my spare
> > time. The price I pay is not remembering how to write syntax when I need
> > to do something. That to me, is a higher price than slow subsetting. If
> > the syntax is not easy I would rather just use the traditional vector scan
> > methods that one sees in conventional data.frame subset commands.
> >
> > Notwithstanding my idiosyncratic needs, I thank you very much for your
> > explanation.
> > Farrel
> >
> >
> >
> > On Thu, Jan 19, 2012 at 18:26, Steve Lianoglou
> > <mailinglist.honeypot at gmail.com<mailto:mailinglist.honeypot at gmail.com>>
> > wrote:
> > On Thu, Jan 19, 2012 at 6:09 PM, Farrel Buchinsky
> > <fjbuch at gmail.com<mailto:fjbuch at gmail.com>> wrote:
> >> Oy gevalt!.Am I correct to believe that the technique is rearranging the
> >> data.table so that J can accept the input as pertaining to a secondary
> >> key?
> >> That seems as if it is too much work for me and my computer. I will
> >> rather
> >> stick to the vector scan methods for now.
> > Not the entire data.table, just the key columns.
> >
> > Depending on how many queries you're going to make against the 2nd key
> > only, the pay off for your troubles could be anywhere from zero to
> > mucho. Of course if you simply don't have the RAM to make the idx
> > data.table in the first place, then that's that.
> >
> > That's how all indexes work though, no? In a database for instance, if
> > you have a compound key/index over two or more columns, the index will
> > only help queries that work any prefix (or whole) part of the key, and
> > not just any subset elements of it (as you want to do here), right?
> >
> > HTH,
> > -steve
> >
> > --
> > Steve Lianoglou
> > Graduate Student: Computational Systems Biology
> > | Memorial Sloan-Kettering Cancer Center
> > | Weill Medical College of Cornell University
> > Contact Info: http://cbio.mskcc.org/~lianos/contact
> >
> > _______________________________________________
> > datatable-help mailing list
> > datatable-help at lists.r-forge.r-project.org
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>
>
More information about the datatable-help
mailing list