[datatable-help] converting sql to data table with subqueries and exists clauses

Fri Feb 24 01:33:49 CET 2012

each row is : user, movie, rating, time.
user,mv,rating,somedate
1,2,3,date
1,3,4,date,
2,3,4,date,
2,4,5,date,
3,5,1,date
4,4,1,date
4,4,3,date

so here for user 2, movie 4, the new column called avg disimilar rating should be 1.
for user 4, movie 4, the new column should be 5.

this column is a disimilar column.

Dhruv

-----Original Message-----
From: "Matthew Dowle" [mdowle at mdowle.plus.com]
Date: 02/23/2012 05:22 AM
To: "DS" <ds5j at excite.com>
CC: datatable-help at r-forge.wu-wien.ac.at
Subject: Re: converting sql to data table with subqueries and exists clauses

It's an interesting problem. Thanks for keeping it onlist too. What I
don't understand is the essence of the query.  Are ratings 0-10?  If so,
if you find all the ratings more than 4 different, say there are 2 0's and
2 10's, then the average will be 5. Is that what it finds?  Why find the
average of the extremes?  If there was great disagreement amongst users I
think I'd want a measure of dispersion.  So I must have misunderstood.

As a movie renter myself, what I'd like to know is the average rating of
movies I haven't watched, by the users who have watched the same movies as
me, and, agreed with my ratings of those movies.  That gives me an
incentive to rate movies I have watched (because the movie ratings
personal to me improve as the intersect of movies I've watched and rated
intersects with other users).

Looking at the code anyway, I'm confused by the very first line :

# get all users for movie
b<- data.table(dt[dt$itemid==a$itemid & dt$userid!=a$userid,list(userid)])

The comment doesn't seem close to what that does. The users for a movie is
just :

DT[,unique(userid),by=itemid]

which I guess is the same as

DT[,userid,by=itemid]

unless, a user watches the same movie twice?  If they watch it two or more
times, can they rate it differently each time.  So a preliminary step to
filter the data to find the the last rating by user by film may be first
step?

Some example data would really help. Maybe 20-50 rows, 3 films and 5 users?

Matthew

> I took a crack at converting the following sql to data table.
>
> it is working off of the movielens rating data (userid,rating,itemid,
> timestamp)
>
>> (select ave(rating) from data x
>> where x.itemid=a.itemid and x.userid!=a.userid
>> and exists (select 1 from
>> (select * from data y where y.userid=x.userid) y,
>> (select * from data z where z.userid=a.userid) z
>> where y.itemid=z.itemid
>> and y.itemid!=a.itemid
>> and abs(y.rating-z.rating)>=4 )
>> ) pred_rating_disimilar
>> from data1 a')
>>
>
> my data table attempt is below but it takes 13 hours to run:
> #create columns in dataframe to hold results
> dt$meanDisimilar<-NA
> dt$countDisimilar<-NA
>
> system.time(
> for (i in 1:nrow(data1))
> {
>
> 	a<- dt[i,]
>
> 	# get all users for movie
>         b<- data.table(dt[dt$itemid==a$itemid &
> dt$userid!=a$userid,list(userid)])
>
>         # get movies for users of this movie
> 	c<-data.table(dt[dt$userid==a$userid,list(userid,itemid,rating)])
>
> 	# get all movies for other users who rated this movie
> 	d<- data.table(dt[dt$userid==b$userid ,list(userid,itemid,rating)])
>
>         setkey(c,itemid)
> 	 setkey(d,itemid)
>
>
> 	e<-c[d,]
> 	names(e)<-c("itemid", "userid0", "rating0", "userid", "rating")
>
> 	#get all users where share with thi user and where rating >=4
> 	f<-e[ abs(rating0-rating)>=4,list(userid)]
>
>         # get avg rating for these users on this movie
> 	dt[i,]$meanDisimilar <- dt[dt$userid==f$userid &
> dt$itemid==a$itemid,mean(rating)]
> 	dt[i,]$countDisimilar <- dt[dt$userid==f$userid &
> dt$itemid==a$itemid,nrow(rating)]
> 	#print(dt[dt$userid==f$userid & dt$itemid==a$itemid,mean(rating)]);
> }
> )
>
> is there a more elegant compact way to do it.
>
>