[datatable-help] Column names after self join

Matthew Dowle mdowle at mdowle.plus.com
Tue Apr 19 21:18:56 CEST 2011


Andreas,
Bug #1340 fixed so that now works as you expected.
That clears the known bug list.
Matthew

On Thu, 2011-03-31 at 08:26 +0100, Matthew Dowle wrote:
> Hi Andreas,
> 
> Agreed, thanks - added bug #1340 re the column names.
> 
> The example data seems a little too cut down but if I understand
> correctly then this idiom might be better :
> 
> > setkey(dt,x1,id)
> > dt[J(x1,id-1,id,x2),roll=TRUE,nomatch=0]
>      x1 id x2 id.1 x2.1
> [1,]  a  1  1    2    2
> > 
> 
> That has the same column name issue but in the result this time and may
> be easier to work around in the meantime. I've assumed you've already
> tried grouping by x1 and using .SD.
> 
> Matthew
> 
> 
> On Wed, 2011-03-30 at 13:55 +0200, Andreas Borg wrote:
> > Dear list members,
> > 
> > I started incorporating data.tabe into the RecordLinkage package for
> > speed improvement. Right now I am trying to use a self join on a
> > data.table to find from a dataset all record pairs that have equal
> > values for a specified column. An example table:
> > 
> > > dt <- data.table(id=1:4, x1=c("a","a","b","c"), x2=c(1,2,3,3), key="x1")
> > > dt
> >      id x1 x2
> > [1,]  1  a  1
> > [2,]  2  a  2
> > [3,]  3  b  3
> > [4,]  4  c  3
> > 
> > I do a self join to find all pairs of rows with same value for x1:
> > 
> > > dt[dt]
> >      x1 id x2 id.1 x2.1
> > [1,]  a  1  1    1    1
> > [2,]  a  2  2    1    1
> > [3,]  a  1  1    2    2
> > [4,]  a  2  2    2    2
> > [5,]  b  3  3    3    3
> > [6,]  c  4  3    4    3
> > 
> > 
> > The problem comes now: I want to select the columns "id" and "id.1" and
> > let only rows with id < id.1 pass (which means that each pair appears
> > only once and a row is not matched to itself). Naturally, this would be:
> > 
> > dt[dt][id < id.1]
> > 
> > but I get an error, because "id.1" is really "id" internally:
> > 
> > > summary.default(dt[dt])
> >    Length Class  Mode
> > x1 6      factor numeric
> > id 6      -none- numeric
> > x2 6      -none- numeric
> > id 6      -none- numeric
> > x2 6      -none- numeric
> > 
> > and also the other components are ambigiuos, so there seems to be no way
> > to discern between the two "id" columns. I would propose to change this
> > behaviour to the one of merge, where one gets unambigous column names:
> > 
> > > summary.default(merge(dt, dt, by="x1"))
> >      Length Class  Mode
> > id   6      -none- numeric
> > x1   6      factor numeric
> > x2   6      -none- numeric
> > id.1 6      -none- numeric
> > x2.1 6      -none- numeric
> > 
> > Or is there any other possibility to deal with this?
> > 
> > Anyway, thanks to the developers for creating this useful package!
> > 
> > Best regards,
> > 
> > Andreas
> > 
> > 
> > 
> 
> 
> _______________________________________________
> datatable-help mailing list
> datatable-help at lists.r-forge.r-project.org
> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help




More information about the datatable-help mailing list