[datatable-help] converting sql to data table with subqueries and exists clauses

Matthew Dowle mdowle at mdowle.plus.com
Thu Feb 23 16:03:32 CET 2012


Ok thanks - I'm starting to get it. Thanks for data but we need a small
example, along with expected output.  For example each row is a movie
rating but you seem to be producing a row for every rating rather than
each user. A small example with expected output would help us to
understand _quickly_. Reduce the time for us to understand and you'll
increase the chance of getting good help quicker.

Or maybe someone else on the list understands already, in which case
you'll be in luck!

> thanks Matt.  The ratings go from 1-5.  I am trying to show traditional
> models can do as well as svd++, matrix fact and top line recommender
> algorithms.
>
> So in my white box approach I am trying to find all the meaningful
> variables which work.
>
> this particular variable is crucial .
>
> I have means by movie and item and those were easy to do in data.table and
> much quicker than in sql.
>
> ratings are 1-5 and i want people who are 4 different which should give me
> people who give 1 and 5 I guess.  I tried other deltas and this one was
> most predictive.
>
> the problem with movie data is that it is sparse and thin and biased
> upwards. most ratings are 3-4.
>
> but lets say u and i like the godfather and a lot of people like the
> godfather but that wont tell whether we like the cable guy.
> i think movies with extreme values have more info about people.  I think
> age and other info helps too but I want to try to compare to benchmarks
> using only movie data for a fair comparison.
>
> that said there is also a totally subjective movie randomness to where we
> like movies we saw as kids and for random reasons.  everyone loves  a
> movie which is flawed but somehow speaks to them.
>
> after weeks of careful variable selection this variable somehow makes a
> difference along with frequencies of ratings how many times do u give
> 1,2,3,4,5 and movie frequencies and recent ratings etc.
>
> I am enclosing the movie data im using.
>
> Dhruv
>
> -----Original Message-----
> From: "Matthew Dowle" [mdowle at mdowle.plus.com]
> Date: 02/23/2012 05:22 AM
> To: "DS" <ds5j at excite.com>
> CC: datatable-help at r-forge.wu-wien.ac.at
> Subject: Re: converting sql to data table with subqueries and exists
> clauses
>
>
> It's an interesting problem. Thanks for keeping it onlist too. What I
> don't understand is the essence of the query.  Are ratings 0-10?  If so,
> if you find all the ratings more than 4 different, say there are 2 0's and
> 2 10's, then the average will be 5. Is that what it finds?  Why find the
> average of the extremes?  If there was great disagreement amongst users I
> think I'd want a measure of dispersion.  So I must have misunderstood.
>
> As a movie renter myself, what I'd like to know is the average rating of
> movies I haven't watched, by the users who have watched the same movies as
> me, and, agreed with my ratings of those movies.  That gives me an
> incentive to rate movies I have watched (because the movie ratings
> personal to me improve as the intersect of movies I've watched and rated
> intersects with other users).
>
> Looking at the code anyway, I'm confused by the very first line :
>
> # get all users for movie
> b<- data.table(dt[dt$itemid==a$itemid & dt$userid!=a$userid,list(userid)])
>
> The comment doesn't seem close to what that does. The users for a movie is
> just :
>
> DT[,unique(userid),by=itemid]
>
> which I guess is the same as
>
> DT[,userid,by=itemid]
>
> unless, a user watches the same movie twice?  If they watch it two or more
> times, can they rate it differently each time.  So a preliminary step to
> filter the data to find the the last rating by user by film may be first
> step?
>
> Some example data would really help. Maybe 20-50 rows, 3 films and 5
> users?
>
> Matthew
>
>
>> I took a crack at converting the following sql to data table.
>>
>> it is working off of the movielens rating data (userid,rating,itemid,
>> timestamp)
>>
>>> (select ave(rating) from data x
>>> where x.itemid=a.itemid and x.userid!=a.userid
>>> and exists (select 1 from
>>> (select * from data y where y.userid=x.userid) y,
>>> (select * from data z where z.userid=a.userid) z
>>> where y.itemid=z.itemid
>>> and y.itemid!=a.itemid
>>> and abs(y.rating-z.rating)>=4 )
>>> ) pred_rating_disimilar
>>> from data1 a')
>>>
>>
>> my data table attempt is below but it takes 13 hours to run:
>> #create columns in dataframe to hold results
>> dt$meanDisimilar<-NA
>> dt$countDisimilar<-NA
>>
>> system.time(
>> for (i in 1:nrow(data1))
>> {
>>
>> 	a<- dt[i,]
>>
>> 	# get all users for movie
>>         b<- data.table(dt[dt$itemid==a$itemid &
>> dt$userid!=a$userid,list(userid)])
>>
>>         # get movies for users of this movie
>> 	c<-data.table(dt[dt$userid==a$userid,list(userid,itemid,rating)])
>>
>> 	# get all movies for other users who rated this movie
>> 	d<- data.table(dt[dt$userid==b$userid ,list(userid,itemid,rating)])
>>
>>         setkey(c,itemid)
>> 	 setkey(d,itemid)
>>
>>
>> 	e<-c[d,]
>> 	names(e)<-c("itemid", "userid0", "rating0", "userid", "rating")
>>
>> 	#get all users where share with thi user and where rating >=4
>> 	f<-e[ abs(rating0-rating)>=4,list(userid)]
>>
>>         # get avg rating for these users on this movie
>> 	dt[i,]$meanDisimilar <- dt[dt$userid==f$userid &
>> dt$itemid==a$itemid,mean(rating)]
>> 	dt[i,]$countDisimilar <- dt[dt$userid==f$userid &
>> dt$itemid==a$itemid,nrow(rating)]
>> 	#print(dt[dt$userid==f$userid & dt$itemid==a$itemid,mean(rating)]);
>> }
>> )
>>
>> is there a more elegant compact way to do it.
>>
>>
>
>
>
>




More information about the datatable-help mailing list