[datatable-help] Column names after self join
Matthew Dowle
mdowle at mdowle.plus.com
Tue Apr 19 21:18:56 CEST 2011
Andreas,
Bug #1340 is fixed, so that now works as you expected.
That clears the known bug list.
Matthew
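
For anyone on a version without the fix, here is a minimal sketch of the pair selection asked about in the original message quoted below. It uses base R's merge() on a plain data.frame rather than a data.table X[Y] join, so it does not depend on the bug fix. The names df, pairs and result, and the suffixes c("", ".1"), are assumptions chosen only to reproduce the id / id.1 column names used in the thread.

```r
# Plain data.frame with the same values as the example table in the thread.
df <- data.frame(id = 1:4,
                 x1 = c("a", "a", "b", "c"),
                 x2 = c(1, 2, 3, 3),
                 stringsAsFactors = FALSE)

# Self merge on x1. The suffixes are an assumption made here so that the
# non-key columns come out as id, x2, id.1, x2.1 (all distinct).
pairs <- merge(df, df, by = "x1", suffixes = c("", ".1"))

# Keep each unordered pair once and drop rows matched to themselves.
result <- pairs[pairs$id < pairs$id.1, c("id", "id.1")]
print(result)
```

With the example data only the pair id = 1, id.1 = 2 remains, since "a" is the only x1 value shared by two rows.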
On Thu, 2011-03-31 at 08:26 +0100, Matthew Dowle wrote:
> Hi Andreas,
>
> Agreed, thanks - added bug #1340 re the column names.
>
> The example data seems a little too cut down but if I understand
> correctly then this idiom might be better :
>
> > setkey(dt,x1,id)
> > dt[J(x1,id-1,id,x2),roll=TRUE,nomatch=0]
>      x1 id x2 id.1 x2.1
> [1,]  a  1  1    2    2
> >
>
> That has the same column name issue but in the result this time and may
> be easier to work around in the meantime. I've assumed you've already
> tried grouping by x1 and using .SD.
>
> Matthew
>
>
> On Wed, 2011-03-30 at 13:55 +0200, Andreas Borg wrote:
> > Dear list members,
> >
> > I started incorporating data.table into the RecordLinkage package for
> > speed improvement. Right now I am trying to use a self join on a
> > data.table to find from a dataset all record pairs that have equal
> > values for a specified column. An example table:
> >
> > > dt <- data.table(id=1:4, x1=c("a","a","b","c"), x2=c(1,2,3,3), key="x1")
> > > dt
> >      id x1 x2
> > [1,]  1  a  1
> > [2,]  2  a  2
> > [3,]  3  b  3
> > [4,]  4  c  3
> >
> > I do a self join to find all pairs of rows with same value for x1:
> >
> > > dt[dt]
> >      x1 id x2 id.1 x2.1
> > [1,]  a  1  1    1    1
> > [2,]  a  2  2    1    1
> > [3,]  a  1  1    2    2
> > [4,]  a  2  2    2    2
> > [5,]  b  3  3    3    3
> > [6,]  c  4  3    4    3
> >
> >
> > The problem comes now: I want to select the columns "id" and "id.1" and
> > let only rows with id < id.1 pass (which means that each pair appears
> > only once and a row is not matched to itself). Naturally, this would be:
> >
> > dt[dt][id < id.1]
> >
> > but I get an error, because "id.1" is really "id" internally:
> >
> > > summary.default(dt[dt])
> >    Length Class  Mode
> > x1 6      factor numeric
> > id 6      -none- numeric
> > x2 6      -none- numeric
> > id 6      -none- numeric
> > x2 6      -none- numeric
> >
> > and also the other components are ambiguous, so there seems to be no way
> > to discern between the two "id" columns. I would propose to change this
> > behaviour to the one of merge, where one gets unambiguous column names:
> >
> > > summary.default(merge(dt, dt, by="x1"))
> >      Length Class  Mode
> > id   6      -none- numeric
> > x1   6      factor numeric
> > x2   6      -none- numeric
> > id.1 6      -none- numeric
> > x2.1 6      -none- numeric
> >
> > Or is there any other possibility to deal with this?
> >
> > Anyway, thanks to the developers for creating this useful package!
> >
> > Best regards,
> >
> > Andreas
> >
> >
> >