[datatable-help] Column names after self join

Matthew Dowle mdowle at mdowle.plus.com
Thu Mar 31 09:26:12 CEST 2011


Hi Andreas,

Agreed, thanks - added bug #1340 re the column names.

The example data seems a little too cut down but if I understand
correctly then this idiom might be better :

> setkey(dt,x1,id)
> dt[J(x1,id-1,id,x2),roll=TRUE,nomatch=0]
     x1 id x2 id.1 x2.1
[1,]  a  1  1    2    2
> 

That has the same column name issue but in the result this time and may
be easier to work around in the meantime. I've assumed you've already
tried grouping by x1 and using .SD.

Matthew


On Wed, 2011-03-30 at 13:55 +0200, Andreas Borg wrote:
> Dear list members,
> 
> I started incorporating data.tabe into the RecordLinkage package for
> speed improvement. Right now I am trying to use a self join on a
> data.table to find from a dataset all record pairs that have equal
> values for a specified column. An example table:
> 
> > dt <- data.table(id=1:4, x1=c("a","a","b","c"), x2=c(1,2,3,3), key="x1")
> > dt
>      id x1 x2
> [1,]  1  a  1
> [2,]  2  a  2
> [3,]  3  b  3
> [4,]  4  c  3
> 
> I do a self join to find all pairs of rows with same value for x1:
> 
> > dt[dt]
>      x1 id x2 id.1 x2.1
> [1,]  a  1  1    1    1
> [2,]  a  2  2    1    1
> [3,]  a  1  1    2    2
> [4,]  a  2  2    2    2
> [5,]  b  3  3    3    3
> [6,]  c  4  3    4    3
> 
> 
> The problem comes now: I want to select the columns "id" and "id.1" and
> let only rows with id < id.1 pass (which means that each pair appears
> only once and a row is not matched to itself). Naturally, this would be:
> 
> dt[dt][id < id.1]
> 
> but I get an error, because "id.1" is really "id" internally:
> 
> > summary.default(dt[dt])
>    Length Class  Mode
> x1 6      factor numeric
> id 6      -none- numeric
> x2 6      -none- numeric
> id 6      -none- numeric
> x2 6      -none- numeric
> 
> and also the other components are ambigiuos, so there seems to be no way
> to discern between the two "id" columns. I would propose to change this
> behaviour to the one of merge, where one gets unambigous column names:
> 
> > summary.default(merge(dt, dt, by="x1"))
>      Length Class  Mode
> id   6      -none- numeric
> x1   6      factor numeric
> x2   6      -none- numeric
> id.1 6      -none- numeric
> x2.1 6      -none- numeric
> 
> Or is there any other possibility to deal with this?
> 
> Anyway, thanks to the developers for creating this useful package!
> 
> Best regards,
> 
> Andreas
> 
> 
> 




More information about the datatable-help mailing list