Continue with machine learning - building recommendation system
In the last post (https://viblo.asia/p/continue-with-machine-learning-building-recommendation-system-E375zeYjlGW) we talked about general methods used in building recommendation systems, such as content-based filtering and collaborative filtering. In this post we want to get into the implementation of those methods and test their effectiveness.
We'll build a recommendation system on the MovieLens movie data from http://files.grouplens.org/datasets/movielens/ . Below is some information about the data:
MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of 100,000 ratings (1-5) from 943 users on 1682 movies, where each user has rated at least 20 movies, plus simple demographic info for the users (age, gender, occupation, zip).
The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data files follow.
u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC
u.info -- The number of users, items, and ratings in the u data set.
u.item -- Information about the items (movies); this is a pipe-separated list of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western. The last 19 fields are the genres: a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.
u.genre -- A list of the genres.
u.user -- Demographic information about the users; this is a pipe-separated list of user id | age | gender | occupation | zip code. The user ids are the ones used in the u.data data set.
u.occupation -- A list of the occupations.
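The timestamp column of u.data is described above as Unix seconds. As a minimal sketch of what that means in practice, the snippet below parses two sample rows in the u.data layout and converts the timestamps to readable dates. The `ratedAt` column name and the in-memory `sample` string are assumptions made for illustration; they are not part of the dataset.

```python
import io
import pandas as pd

# Two sample rows in the u.data layout (tab-separated, no header row).
sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"

ratings = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["userId", "itemId", "rating", "timestamp"],
)

# Interpret the integer column as seconds since the Unix epoch (UTC).
# "ratedAt" is a hypothetical column name chosen for this sketch.
ratings["ratedAt"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)
print(ratings[["userId", "itemId", "rating", "ratedAt"]])
```

Both dates should fall inside the collection period mentioned above (September 1997 through April 1998).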
Implementation
What we are trying to do here is recommend movies to a user based on the ratings given by similar users (user-based collaborative filtering).
Let's load the MovieLens data after downloading and extracting it locally (the code below expects an ml-100k folder in the working directory).
Here is the rating data:
import os
import numpy as np
import pandas as pd
data_url = os.getcwd() + '/ml-100k'
rating_data = pd.read_csv(data_url + '/u.data', sep='\t', header=None, names=['userId', 'itemId', 'rating', 'timestamp'])
rating_data.head()
Output:
userId itemId rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
Let's get data on movies:
itemData = pd.read_csv(data_url + '/u.item', sep='|', encoding = "ISO-8859-1", usecols=[0,1], header=None, names=['itemId', 'movie'])
itemData.head()
Output:
itemId movie
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)
Let's merge the two data frames above on itemId:
data = pd.merge(rating_data, itemData, right_on='itemId', left_on='itemId')
data.head()
Output:
userId itemId rating timestamp movie
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)
Let's find the number of users and number of movies:
data = data.sort_values(['userId', 'itemId'], ascending=[False, True])
numUsers = max(data.userId)
numItems = max(data.itemId)
print(numUsers)
print(numItems)
Output:
943
1682
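And find out the number of ratings per user and the number of ratings per movie:
numItemsPerUser = data.userId.value_counts()
numUsersPerItem = data.itemId.value_counts()
print(numItemsPerUser.head())
print(numUsersPerItem.head())
Output: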
405 737
655 685
13 636
450 540
276 518
Name: userId, dtype: int64
50 583
258 509
100 508
181 507
294 485
Name: itemId, dtype: int64
Now let's find the user who has given the most ratings and the movie that has received the most ratings:
userData = pd.read_csv(data_url + '/u.user', sep='|', header=None, names=['userId', 'age', 'gender', 'occupation', 'zipcode'])
userData.head()
mostRatingUser = userData[userData.userId == numItemsPerUser.index[0]]
print(mostRatingUser)
mostRatedMovie = itemData[itemData.itemId == numUsersPerItem.index[0]]
print(mostRatedMovie)
Output:
userId age gender occupation zipcode
404 405 22 F healthcare 10019
itemId movie
49 50 Star Wars (1977)
Now we write a function to find the favorite movies for a particular user:
def favoriteMovies(userId, topN):
    # Sort this user's ratings from highest to lowest and keep the first topN.
    topRatedMovies = data[data.userId == userId].sort_values('rating', ascending=False)[:topN]
    return list(topRatedMovies.movie)
print(favoriteMovies(405, 5))
Output:
["Schindler's List (1993)", 'Wizard of Oz, The (1939)', 'Raiders of the Lost Ark (1981)', 'Princess Bride, The (1987)', 'Empire Strikes Back, The (1980)']
It's time to create a pivot table from data, with userId as the index, itemId as the columns, and rating as the values:
userItemsRatingMatrix = pd.pivot_table(data, index=['userId'], columns=['itemId'], values='rating')
print(userItemsRatingMatrix.head())
Output:
itemId 1 2 3 4 5 6 7 8 9 10 ... \
userId ...
1 5.0 3.0 4.0 3.0 3.0 5.0 4.0 1.0 5.0 3.0 ...
2 4.0 NaN NaN NaN NaN NaN NaN NaN NaN 2.0 ...
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
5 4.0 3.0 NaN NaN NaN NaN NaN NaN NaN NaN ...
itemId 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682
userId
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 1682 columns]
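Most cells in the output above are NaN because each user has rated only a small share of the movies. The sketch below measures that on a toy pivot table and then applies the same ratio to the figures quoted in the dataset description (100,000 ratings, 943 users, 1682 movies). The toy ratings and the names `toy`, `matrix` and `density` are made up for illustration.

```python
import pandas as pd

# Toy ratings in the same long format as `data` (hypothetical values).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "itemId": [10, 20, 10, 20, 30],
    "rating": [5, 3, 4, 2, 1],
})

matrix = pd.pivot_table(toy, index="userId", columns="itemId", values="rating")

# Fraction of (user, item) cells that actually hold a rating.
density = matrix.notna().to_numpy().mean()
print(matrix)
print(f"toy density: {density:.2%}")

# Same ratio for the MovieLens 100k figures quoted in the dataset description.
ml100k_density = 100_000 / (943 * 1682)
print(f"ml-100k density: {ml100k_density:.2%}")
```

Assuming every (user, movie) pair appears at most once in u.data, only about 6.3% of the MovieLens matrix is filled, which is why the similarity computation has to cope with missing values.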
Now we reach the core of the implementation. Let's define the similarity between two users:
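from scipy.spatial.distance import correlation
def similarity(user1, user2):
    # Center each user's ratings on that user's own mean (NaN = not rated).
    user1 = np.array(user1) - np.nanmean(user1)
    user2 = np.array(user2) - np.nanmean(user2)
    # NaN > 0 is False, so unrated items drop out here; this test also drops
    # items that either user rated at or below their own mean.
    commonItemIds = [i for i in range(len(user1)) if user1[i] > 0 and user2[i] > 0]
    if len(commonItemIds) == 0:
        return 0
    else:
        user1 = [user1[i] for i in commonItemIds]
        user2 = [user2[i] for i in commonItemIds]
        # scipy's correlation() is the correlation distance, i.e. 1 - Pearson r.
        return correlation(user1, user2)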
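Note that `scipy.spatial.distance.correlation` returns a correlation distance (1 minus the correlation coefficient), so a smaller value means two users are more alike. If a score where larger means more similar is wanted, one possible alternative is to compute Pearson's r directly over the movies both users rated. The sketch below is not the function used in this post: the name `pearson_similarity`, the choice to center on the co-rated items only, and the fallback value 0.0 are all assumptions.

```python
import numpy as np

def pearson_similarity(user1, user2):
    """Pearson correlation over items both users rated (NaN = not rated)."""
    u = np.asarray(user1, dtype=float)
    v = np.asarray(user2, dtype=float)

    # Keep every co-rated item, whether the rating is above or below the mean.
    both = ~np.isnan(u) & ~np.isnan(v)
    if both.sum() < 2:
        return 0.0  # assumption: too little overlap counts as "no similarity"

    # Center on the mean of the co-rated items (an assumption of this sketch).
    u = u[both] - u[both].mean()
    v = v[both] - v[both].mean()
    denom = np.sqrt((u ** 2).sum() * (v ** 2).sum())
    if denom == 0:
        return 0.0  # assumption: a user with constant ratings carries no signal
    return float((u * v).sum() / denom)

# Hypothetical rating vectors over four movies.
a = [5, 3, np.nan, 1]
b = [4, 2, 5, 1]
c = [1, 3, 4, 5]
sim_ab = pearson_similarity(a, b)  # similar taste, close to +1
sim_ac = pearson_similarity(a, c)  # opposite taste, close to -1
print(sim_ab, sim_ac)
```

With a score like this, sorting in descending order puts the most similar users first. Swapping it into the rest of the code would change the recommendations shown in this post.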
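Then let's predict a rating of each movie for a particular user:
First, compute the similarity score between that user and every user, then pick the kTop users with the highest scores (treated as the most similar users).
Select the rows of the user-item rating matrix that belong to those kTop users.
For each movie, collect the ratings those neighbors gave it.
Compute the predicted rating for each movie: start from the active user's mean rating and add each neighbor's deviation from their own mean rating, weighted by that neighbor's similarity score.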
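The steps above can be written compactly as one formula. The symbols are introduced here for illustration and do not appear in the original code: $\hat{r}_{u,i}$ is the predicted rating of movie $i$ for the active user $u$, $\bar{r}_u$ is that user's mean rating, $N_k(u)$ is the set of kTop neighbors, and $s(u,v)$ is the score returned by `similarity`. This is a reading of what `nearestNeighborRating` computes, not a general definition.

```latex
\hat{r}_{u,i} \;=\; \bar{r}_u \;+\;
\sum_{\substack{v \,\in\, N_k(u) \\ r_{v,i}\ \text{known}}}
s(u,v)\,\bigl(r_{v,i} - \bar{r}_v\bigr)
```

A common variant, not used in this post, divides the sum by $\sum_{v} |s(u,v)|$ so that predictions stay on the original rating scale.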
def nearestNeighborRating(userId, kTop):
    # Score every user against the active user.
    similarityMatrix = pd.DataFrame(index=userItemsRatingMatrix.index, columns=['similarity'])
    for i in userItemsRatingMatrix.index:
        similarityMatrix.loc[i] = similarity(userItemsRatingMatrix.loc[userId], userItemsRatingMatrix.loc[i])
    # Keep the kTop highest-scoring users as neighbors.
    similarityMatrix = similarityMatrix.sort_values('similarity', ascending=False)
    nearestNeighbors = similarityMatrix[:kTop]
    neighborItemRatings = userItemsRatingMatrix.loc[nearestNeighbors.index]
    predictItemRatings = pd.DataFrame(index=userItemsRatingMatrix.columns, columns=['rating'])
    activeUserMeanRating = np.nanmean(userItemsRatingMatrix.loc[userId])
    for i in userItemsRatingMatrix.columns:
        # Start from the active user's mean and add weighted neighbor deviations.
        predictRating = activeUserMeanRating
        for j in neighborItemRatings.index:
            # NaN > 0 is False, so neighbors who did not rate movie i are skipped.
            if userItemsRatingMatrix.loc[j,i] > 0:
                predictRating += (userItemsRatingMatrix.loc[j,i] - np.nanmean(userItemsRatingMatrix.loc[j])) * nearestNeighbors.loc[j,'similarity']
        predictItemRatings.loc[i, 'rating'] = predictRating
    return predictItemRatings
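The nested loops in `nearestNeighborRating` look up one cell at a time, which gets slow on the full matrix. As a rough sketch, the same prediction step can be expressed with whole-column pandas operations. The toy matrix, the similarity scores, and the name `predict_from_neighbors` are hypothetical and only illustrate the arithmetic.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (hypothetical ratings); NaN means "not rated".
ratings = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [4.0, np.nan, 2.0],
     [np.nan, 1.0, 5.0]],
    index=pd.Index([1, 2, 3], name="userId"),
    columns=pd.Index([10, 20, 30], name="itemId"),
)

# Hypothetical scores of neighbors 2 and 3 relative to the active user 1.
neighbor_scores = pd.Series({2: 0.8, 3: 0.5})

def predict_from_neighbors(ratings, user_id, neighbor_scores):
    """Active user's mean plus score-weighted neighbor deviations."""
    neighbors = ratings.loc[neighbor_scores.index]
    # Each neighbor's deviation from their own mean; unrated cells contribute 0.
    deviations = neighbors.sub(neighbors.mean(axis=1), axis=0).fillna(0.0)
    # Weight each neighbor's row by its score, then add up per movie.
    adjustment = deviations.mul(neighbor_scores, axis=0).sum(axis=0)
    return ratings.loc[user_id].mean() + adjustment

predicted = predict_from_neighbors(ratings, 1, neighbor_scores)
print(predicted)
```

Here user 1 has a mean rating of 4.0, so the movie that neighbor 3 rated two points below their own mean ends up one point lower (0.5 × -2), at 3.0.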
Let's get the top recommended movies for a user:
def topRecommendation(userId, nTop):
    predictItemRatings = nearestNeighborRating(userId, 10)
    # Drop movies the user has already rated so only unseen movies are recommended.
    watchedMovies = userItemsRatingMatrix.loc[userId].loc[userItemsRatingMatrix.loc[userId] > 0].index
    predictItemRatings = predictItemRatings.drop(watchedMovies)
    topRecommendedMovies = predictItemRatings.sort_values('rating', ascending=False)[:nTop]
    # itemIds start at 1 while itemData's row labels start at 0, hence the -1.
    topRecommendedMovieTitles = itemData.loc[topRecommendedMovies.index - 1].movie
    return list(topRecommendedMovieTitles)
Now let's find the top 3 recommended movies for the user with userId 5:
print(topRecommendation(5, 3))
Output:
['Scream (1996)', 'First Wives Club, The (1996)', 'Truth About Cats & Dogs, The (1996)']
I hope this is useful. We'll take a look at matrix factorization and association rules in the next post.