[datatable-help] converting sql to data table with subqueriesand exists clauses
Matthew Dowle
mdowle at mdowle.plus.com
Sun Feb 26 18:32:08 CET 2012
Thinking about it, his should be faster by avoiding the two vector scans
(==) :
DT = data.table(user=c(1,1,2,2,3,4,4),
movie=c(2,3,3,4,5,4,7),
rating=c(3,4,4,5,1,1,3))
setkey(DT, rating, movie)
DT[,newcol:=NA_integer_]
DT[ DT[J(1L),unique(movie)][,rating:=5L], newcol:=1L, mult="first"]
DT[ DT[J(5L),unique(movie)][,rating:=1L], newcol:=5L, mult="first"]
DT[order(user,movie)]
user movie rating newcol
[1,] 1 2 3 NA
[2,] 1 3 4 NA
[3,] 2 3 4 NA
[4,] 2 4 5 1
[5,] 3 5 1 NA
[6,] 4 4 1 5
[7,] 4 7 3 NA
On Sun, 2012-02-26 at 17:00 +0000, Matthew Dowle wrote:
> So the new column should be:
>
> 1 when user voted 5 and there's another user who rated the same movie 1.
> 5 when user rated 1 and there's another user who voted the same movie 5.
> NA otherwise
>
> If that's right, then why is avg needed? Avg of a set of 1's is 1, and a
> set of 5's is 5. Anyway, how about this :
>
> DT = data.table(user=c(1,1,2,2,3,4,4),
> movie=c(2,3,3,4,5,4,7),
> rating=c(3,4,4,5,1,1,3))
> setkey(DT, movie, rating)
> DT[,newcol:=NA_integer_]
> DT[ J(DT[rating==1L,unique(movie)],5L), newcol:=1L, mult="first"]
> DT[ J(DT[rating==5L,unique(movie)],1L), newcol:=5L, mult="first"]
> DT[order(user,movie)]
> user movie rating newcol
> [1,] 1 2 3 NA
> [2,] 1 3 4 NA
> [3,] 2 3 4 NA
> [4,] 2 4 5 1
> [5,] 3 5 1 NA
> [6,] 4 4 1 5
> [7,] 4 7 3 NA
>
>
> On Fri, 2012-02-24 at 18:45 -0500, DS wrote:
> > A user could rate multiple times but lets please ignore that fact as that is not the case in this data and i just made up fake data below and made a mistake.
> >
> > The new column is what i want to compute using data table and add to the original data.
> >
> > so the output should be:
> > user,mv,rating,somedate, Avg_disimilar_users
> > 1,2,3,date,NA
> > 1,3,4,date,NA
> > 2,3,4,date,NA
> > 2,4,5,date,1
> > 3,5,1,date,NA
> > 4,4,1,date,5
> > 4,7,3,date,na
> >
> > input:
> >
> > each row is : user, movie, rating, time.
> > user,mv,rating,somedate
> > 1,2,3,date
> > 1,3,4,date,
> > 2,3,4,date,
> > 2,4,5,date,
> > 3,5,1,date
> > 4,4,1,date
> > 4,7,3,date
> >
> > so here for user 2, movie 4, the new column called avg disimilar rating should be 1.
> > for user 4, movie 4, the new column should be 5.
> >
> > the new column Avg_disimilar users which I want is the avg of users who rated differently on the movie in question on that row of the data.
> > when users dont have movies in common 4 apart the result should be NA or empty.
> >
> > Hope that helps.
> >
> > Dhruv
> >
> >
> > -----Original Message-----
> > From: "Matthew Dowle" [mdowle at mdowle.plus.com]
> > Date: 02/24/2012 04:43 AM
> > To: ds5j at excite.com
> > CC: datatable-help at r-forge.wu-wien.ac.at
> > Subject: Re: converting sql to data table with subqueriesand exists clauses
> >
> >
> > Dhruv,
> >
> > But user 4 voted for movie 4 twice (first 1*, then 3*). Which is something
> > I already asked about (but you haven't addressed). Further, you've
> > described a new column. That implies a new value for every row. There are
> > 7 rows but you've only stated what 2 of those 7 values should be.
> >
> > Please read through this in full :
> >
> > http://www.catb.org/~esr/faqs/smart-questions.html
> >
> > Matthew
> >
> > each row is : user, movie, rating, time.
> > user,mv,rating,somedate
> > 1,2,3,date
> > 1,3,4,date,
> > 2,3,4,date,
> > 2,4,5,date,
> > 3,5,1,date
> > 4,4,1,date
> > 4,4,3,date
> >
> > so here for user 2, movie 4, the new column called avg disimilar rating should be 1.
> > for user 4, movie 4, the new column should be 5.
> >
> > this column is a disimilar column.
> >
> > Dhruv
> >
> >
>
More information about the datatable-help
mailing list