[datatable-help] What is the point of SJ?
Matthew Dowle
mdowle at mdowle.plus.com
Mon Jul 1 15:19:36 CEST 2013
Hi,
I don't use SJ very much admittedly. ?SJ says it's for :
DT[SJ(...)]
where ... has :
"Each argument is a vector. Generally each vector is the same length
but if they are not then usual silent repetition is applied."
So it's not really for :
X[ SJ(Y) ]
since
X[Y]
is already that. Or maybe other ways I use sometimes :
X[setkey(Y)]
or
X[setkey(Y,...)]
or
X[setkey(copy(Y),...)]
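A small sketch of the last form, with made-up tables X and Y (the column names and values here are illustrative assumptions only). The copy() matters because setkey() reorders its argument by reference, so without it Y itself would be sorted and keyed:

```r
library(data.table)

# Hypothetical example data: X is keyed by id, Y is an unkeyed lookup table.
X <- data.table(id = 1:4, x = c(10, 20, 30, 40), key = "id")
Y <- data.table(id = c(3L, 1L), y = c("c", "a"))

# Key a copy of Y and join; Y itself is left unsorted and unkeyed.
res <- X[setkey(copy(Y), id)]

key(Y)   # Y still has no key
Y$id     # Y keeps its original row order
res      # joined rows, in the sorted order of the keyed copy
```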
So SJ() is more for constructing a data.table from vectors, in the
spirit of J() originally being a mere alias for data.table().
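As a rough illustration (the vectors are invented for the example), SJ() called on plain vectors gives back a data.table that is sorted and keyed by all of its columns. Only a length-1 item is recycled here, since that is the one form of recycling I'm certain behaves the same across versions:

```r
library(data.table)

# Hypothetical input vectors, deliberately not sorted.
v   <- c(3L, 1L, 2L)
grp <- "a"            # length 1, recycled against v

s <- SJ(v, grp)       # unnamed arguments become columns V1, V2

s        # rows sorted by V1 then V2
key(s)   # keyed by every column
```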
Let's say you have randomly ordered ids in vector 'ids' and X is keyed
by id.
X[J(ids)]   # look up data and return it in the same order as ids is
            # ordered (each lookup is a new binary search)
X[SJ(ids)]  # sort ids first, binary merge (bit faster if i is keyed
            # too), and return data in sorted order, keyed by id too
That's the idea anyway. Sometimes, if I'm not sure whether the input
vector is sorted, I'll use SJ() just to make sure. There may be a shortcut
in there that uses is.unsorted first to save the cost of sorting (and if
not, there probably should be).
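To make the difference in result order concrete, here is a minimal sketch with an invented X and ids (both are assumptions for illustration, not taken from any real data):

```r
library(data.table)

# Hypothetical table keyed by id, and randomly ordered ids to look up.
X   <- data.table(id = 1:5, val = c("a", "b", "c", "d", "e"), key = "id")
ids <- c(4L, 2L, 5L)

r_j  <- X[J(ids)]    # rows come back in the order of ids
r_sj <- X[SJ(ids)]   # ids are sorted first, so rows come back sorted

r_j$id
r_sj$id
key(r_sj)            # inspect whether the result carries X's key
```

is.unsorted(ids) is a cheap base R check if you want to know up front whether the two calls would return rows in a different order.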
X must be keyed. Y having a key is optional, but if Y has a key too it
will take advantage of it. Obviously speed differences will depend on
many factors including the number of rows in Y, the number of columns in
the join, the number of rows in X and the number of rows in the result.
And there is a known potential performance improvement in this area
(i.e. when both X and Y are keyed), although quite a bit was done
already last year in particular for character vector joins. [Types make
a large difference in benchmarks.]
Matthew
On 30.06.2013 15:04, Gabor Grothendieck wrote:
> Consider SJ which I assume was intended to be used like this
> X[ SJ(Y) ]
> where X and Y are two data tables. What is the point of SJ? It seems
> similar to J except it also adds a key to its argument; however, is it
> not the case that the key on Y will not be used since it has to
> do a full scan of Y anyways?
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
>
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help