[datatable-help] What is the point of SJ?

Matthew Dowle mdowle at mdowle.plus.com
Mon Jul 1 15:19:36 CEST 2013


Hi,

I don't use SJ very much admittedly.  ?SJ says it's for :
    DT[SJ(...)]
where ... has :
    "Each argument is a vector. Generally each vector is the same length 
but if they are not then usual silent repetition is applied."
So it's not really for :
    X[ SJ(Y) ]
since
    X[Y]
is already that. Other forms I sometimes use :
    X[setkey(Y)]
or
    X[setkey(Y,...)]
or
    X[setkey(copy(Y),...)]
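
A small sketch of the difference between these forms. The tables X and Y and 
their contents are invented for illustration, and the behaviour noted in the 
comments assumes a reasonably current data.table:

```r
library(data.table)

# Illustrative data only: X is keyed by id, Y has no key.
X <- data.table(id = 1:4, val = c("a", "b", "c", "d"), key = "id")
Y <- data.table(id = c(4L, 2L), w = c(10, 20))

# Y's first column is matched against X's key; rows come back in Y's order.
a <- X[Y]

# Key a copy of Y first, so Y itself is left untouched;
# rows come back in sorted order of id.
b <- X[setkey(copy(Y), id)]

print(a)
print(b)
print(key(Y))   # Y still has no key
```

Using copy() matters here because setkey() reorders its argument by reference.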

So SJ() is more for constructing a data.table from vectors, in the 
spirit of J() originally being a mere alias for data.table().
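
For instance, a minimal sketch of SJ() used on its own (the column names id 
and grp are just placeholders):

```r
library(data.table)

# SJ() builds a data.table from plain vectors and then keys it by all
# of its columns, in the order they were passed.
i <- SJ(id = c(3L, 1L, 2L), grp = c("c", "a", "b"))

print(i)        # rows are now sorted by id, then grp
print(key(i))   # the key covers every column
```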

Let's say you have randomly ordered ids in vector 'ids' and X is keyed 
by id.

X[J(ids)]   # look up data and return it in the same order as ids
            # (each lookup is a new binary search)
X[SJ(ids)]  # sort ids first, then binary merge (a bit faster if i is
            # keyed too), and return data in sorted order, keyed by id too
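
A runnable version of the two lookups, with made-up contents for X and ids:

```r
library(data.table)

# Illustrative data only: X keyed by id, ids in arbitrary order.
X <- data.table(id = 1:4, val = c("a", "b", "c", "d"), key = "id")
ids <- c(3L, 1L, 4L)

r1 <- X[J(ids)]    # rows returned in the order of ids
r2 <- X[SJ(ids)]   # ids sorted first, rows returned in sorted order

print(r1)
print(r2)
print(key(r2))     # inspect whether the result kept the key
```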

That's the idea anyway. Sometimes, if I'm not sure whether the input vector 
is sorted, I'll use SJ() just to make sure.  There may be a shortcut 
in there that uses is.unsorted first to save the cost of sorting (and if 
not, there probably should be).
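
At user level, such a shortcut could be sketched roughly as below. 
lookup_sorted is a hypothetical helper, not part of data.table, and it 
assumes the lookup vector has no NAs (is.unsorted returns NA in that case):

```r
library(data.table)

# Hypothetical helper: skip the sort in SJ() when the input is already sorted.
# Assumes X is keyed by a single column and lookup_ids contains no NAs.
lookup_sorted <- function(X, lookup_ids) {
  if (is.unsorted(lookup_ids)) {
    X[SJ(lookup_ids)]   # sort (and key) the ids first
  } else {
    X[J(lookup_ids)]    # already in order, so no sort is needed
  }
}

X <- data.table(id = 1:4, val = c("a", "b", "c", "d"), key = "id")
res_unsorted <- lookup_sorted(X, c(3L, 1L, 4L))
res_sorted   <- lookup_sorted(X, c(1L, 3L, 4L))
print(res_unsorted)
print(res_sorted)
```

Either branch returns the rows in sorted order of id; only the cost of 
getting there differs.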

X must be keyed. Y having a key is optional, but if Y has a key too it 
will take advantage of it.  Obviously speed differences will depend on 
many factors including the number of rows in Y, the number of columns in 
the join, the number of rows in X and the number of rows in the result.  
And there is a known potential performance improvement in this area 
(i.e. when both X and Y are keyed), although quite a bit was done 
already last year in particular for character vector joins. [Types make 
a large difference in benchmarks.]

Matthew


On 30.06.2013 15:04, Gabor Grothendieck wrote:
> Consider SJ which I assume was intended to be used like this
>    X[ SJ(Y) ]
> where X and Y are two data tables.  What is the point of SJ?  It seems
> similar to J except it also adds a key to its argument; however, is it
> not the case that the key on Y will not be used since it has to
> do a full scan of Y anyways?
>
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com

