[datatable-help] converting sql to data table with subqueriesand exists clauses

Sun Feb 26 18:32:08 CET 2012

Thinking about it, his should be faster by avoiding the two vector scans
(==) :

DT = data.table(user=c(1,1,2,2,3,4,4),
                movie=c(2,3,3,4,5,4,7),
                rating=c(3,4,4,5,1,1,3))
setkey(DT, rating, movie)
DT[,newcol:=NA_integer_]
DT[ DT[J(1L),unique(movie)][,rating:=5L], newcol:=1L, mult="first"]
DT[ DT[J(5L),unique(movie)][,rating:=1L], newcol:=5L, mult="first"]
DT[order(user,movie)]
     user movie rating newcol
[1,]    1     2      3     NA
[2,]    1     3      4     NA
[3,]    2     3      4     NA
[4,]    2     4      5      1
[5,]    3     5      1     NA
[6,]    4     4      1      5
[7,]    4     7      3     NA

On Sun, 2012-02-26 at 17:00 +0000, Matthew Dowle wrote:
> So the new column should be:
> 
> 1 when user voted 5 and there's another user who rated the same movie 1.
> 5 when user rated 1 and there's another user who voted the same movie 5.
> NA otherwise
> 
> If that's right, then why is avg needed? Avg of a set of 1's is 1, and a
> set of 5's is 5. Anyway, how about this :
> 
> DT = data.table(user=c(1,1,2,2,3,4,4),
>                 movie=c(2,3,3,4,5,4,7),
>                 rating=c(3,4,4,5,1,1,3))
> setkey(DT, movie, rating)
> DT[,newcol:=NA_integer_]
> DT[ J(DT[rating==1L,unique(movie)],5L), newcol:=1L, mult="first"]
> DT[ J(DT[rating==5L,unique(movie)],1L), newcol:=5L, mult="first"]
> DT[order(user,movie)]
>      user movie rating newcol
> [1,]    1     2      3     NA
> [2,]    1     3      4     NA
> [3,]    2     3      4     NA
> [4,]    2     4      5      1
> [5,]    3     5      1     NA
> [6,]    4     4      1      5
> [7,]    4     7      3     NA
> 
> 
> On Fri, 2012-02-24 at 18:45 -0500, DS wrote:
> > A user could rate multiple times but lets please ignore that fact  as that is not the case in this data and i just made up fake data below and made a mistake.
> > 
> > The new column is what i want to compute using data table and add to the original data.
> > 
> > so the output should be:
> > user,mv,rating,somedate, Avg_disimilar_users
> > 1,2,3,date,NA
> > 1,3,4,date,NA
> > 2,3,4,date,NA
> > 2,4,5,date,1
> > 3,5,1,date,NA
> > 4,4,1,date,5
> > 4,7,3,date,na
> > 
> > input:
> > 
> > each row is : user, movie, rating, time.
> > user,mv,rating,somedate
> > 1,2,3,date
> > 1,3,4,date,
> > 2,3,4,date,
> > 2,4,5,date,
> > 3,5,1,date
> > 4,4,1,date
> > 4,7,3,date
> > 
> > so here for user 2, movie 4, the new column called avg disimilar rating should be 1.
> > for user 4, movie 4, the new column should be 5.
> > 
> > the new column Avg_disimilar users which I want is the avg of users who rated differently on the movie in question on that row of the data.
> > when users dont have movies in common 4 apart the result should be NA or empty.
> > 
> > Hope that helps.
> > 
> > Dhruv
> > 
> > 
> > -----Original Message-----
> > From: "Matthew Dowle" [mdowle at mdowle.plus.com]
> > Date: 02/24/2012 04:43 AM
> > To: ds5j at excite.com
> > CC: datatable-help at r-forge.wu-wien.ac.at
> > Subject: Re: converting sql to data table with subqueriesand exists clauses
> > 
> > 
> > Dhruv,
> > 
> > But user 4 voted for movie 4 twice (first 1*, then 3*). Which is something
> > I already asked about (but you haven't addressed). Further, you've
> > described a new column. That implies a new value for every row. There are
> > 7 rows but you've only stated what 2 of those 7 values should be.
> > 
> > Please read through this in full :
> > 
> >     http://www.catb.org/~esr/faqs/smart-questions.html
> > 
> > Matthew
> > 
> > each row is : user, movie, rating, time.
> > user,mv,rating,somedate
> > 1,2,3,date
> > 1,3,4,date,
> > 2,3,4,date,
> > 2,4,5,date,
> > 3,5,1,date
> > 4,4,1,date
> > 4,4,3,date
> > 
> > so here for user 2, movie 4, the new column called avg disimilar rating should be 1.
> > for user 4, movie 4, the new column should be 5.
> > 
> > this column is a disimilar column.
> > 
> > Dhruv
> > 
> > 
>