Continue with machine learning - building recommendation system

In the last post (https://viblo.asia/p/continue-with-machine-learning-building-recommendation-system-E375zeYjlGW) we talked about general methods used to build recommendation systems, such as content-based filtering and collaborative filtering. In this post we get into the implementation of those methods and test their effectiveness.

We'll build a recommendation system on movie data from http://files.grouplens.org/datasets/movielens/. Below is some information about the data:

MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of: 100,000 ratings (1-5) from 943 users on 1682 movies. Each user has rated at least 20 movies. Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. This data has been cleaned up - users who had less than 20 ratings or did not have complete demographic information were removed from this data set. Detailed descriptions of the data file can be found at the end of this file.

u.data -- The full u data set, 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The time stamps are unix seconds since 1/1/1970 UTC

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated list of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab separated list of user id | age | gender | occupation | zip code. The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

Implementation

Our goal is to recommend movies to a user based on the ratings of similar users. First, download the MovieLens data and extract it locally; the code below assumes it sits in an ml-100k folder inside the current working directory.

Here is the rating data:

import os
import numpy as np
import pandas as pd

data_url = os.getcwd() + '/ml-100k'

rating_data = pd.read_csv(data_url + '/u.data', sep='\t', header=None, names=['userId', 'itemId', 'rating', 'timestamp'])
rating_data.head()

Output:

   userId  itemId  rating  timestamp
0     196     242       3  881250949
1     186     302       3  891717742
2      22     377       1  878887116
3     244      51       2  880606923
4     166     346       1  886397596

Let's get data on movies:

itemData = pd.read_csv(data_url + '/u.item', sep='|', encoding = "ISO-8859-1", usecols=[0,1], header=None, names=['itemId', 'movie'])
itemData.head()

Output:

   itemId              movie
0       1   Toy Story (1995)
1       2   GoldenEye (1995)
2       3  Four Rooms (1995)
3       4  Get Shorty (1995)
4       5     Copycat (1995)

Let's merge the two data frames above on itemId:

data = pd.merge(rating_data, itemData, on='itemId')
data.head()

Output:

   userId  itemId  rating  timestamp         movie
0     196     242       3  881250949  Kolya (1996)
1      63     242       3  875747190  Kolya (1996)
2     226     242       5  883888671  Kolya (1996)
3     154     242       3  879138235  Kolya (1996)
4     306     242       5  876503793  Kolya (1996)

Let's find the number of users and number of movies:

data = data.sort_values(['userId', 'itemId'], ascending=[False, True])
numUsers = data.userId.max()
numItems = data.itemId.max()
print(numUsers)
print(numItems)

Output:

943
1682

Next, let's find the number of ratings per user and the number of ratings per movie (the five largest of each are shown):
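The counts printed below can be produced with `value_counts()`. A minimal sketch, assuming that is how `numItemsPerUser` and `numUsersPerItem` (used later in this post) were built; the helper `ratingCounts` and the toy frame `toyData` are hypothetical, added only so the snippet runs on its own:

```python
import pandas as pd

def ratingCounts(ratings):
    """Return (ratings per user, ratings per movie), each sorted from most to fewest."""
    # value_counts() counts how often each id occurs and sorts the counts in descending order.
    return ratings['userId'].value_counts(), ratings['itemId'].value_counts()

# Toy stand-in for the merged `data` frame (hypothetical values).
toyData = pd.DataFrame({
    'userId': [1, 1, 1, 2, 2, 3],
    'itemId': [10, 20, 30, 10, 20, 10],
    'rating': [5, 3, 4, 4, 2, 1],
})

toyItemsPerUser, toyUsersPerItem = ratingCounts(toyData)
print(toyItemsPerUser.head())  # user 1 comes first with 3 ratings
print(toyUsersPerItem.head())  # item 10 comes first with 3 ratings
```

On the real data this would be `numItemsPerUser, numUsersPerItem = ratingCounts(data)`, followed by `print(numItemsPerUser.head())` and `print(numUsersPerItem.head())`.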

405    737
655    685
13     636
450    540
276    518
Name: userId, dtype: int64
50     583
258    509
100    508
181    507
294    485
Name: itemId, dtype: int64

Now let's find the user who has rated the most movies and the movie with the most ratings:

userData = pd.read_csv(data_url + '/u.user', sep='|', header=None, names=['userId', 'age', 'gender', 'occupation', 'zipcode'])
userData.head()
mostRatingUser = userData[userData.userId == numItemsPerUser.index[0]]
print(mostRatingUser)
mostRatedMovie = itemData[itemData.itemId == numUsersPerItem.index[0]]
print(mostRatedMovie)

Output:

     userId  age gender  occupation zipcode
404     405   22      F  healthcare   10019
    itemId             movie
49      50  Star Wars (1977)

Now we write a function to find the favorite movies for a particular user:

def favoriteMovies(userId, topN):
    topRatedMovies = data[data.userId == userId].sort_values('rating', ascending=False)[:topN]
    return list(topRatedMovies.movie)

print(favoriteMovies(405, 5))

Output:

["Schindler's List (1993)", 'Wizard of Oz, The (1939)', 'Raiders of the Lost Ark (1981)', 'Princess Bride, The (1987)', 'Empire Strikes Back, The (1980)']

It's time to create a pivot table from the data, with userId as the index, itemId as the columns, and rating as the values.

userItemsRatingMatrix = pd.pivot_table(data, index=['userId'], columns=['itemId'], values='rating')
print(userItemsRatingMatrix.head())

Output:

itemId  1     2     3     4     5     6     7     8     9     10    ...   \
userId                                                              ...    
1        5.0   3.0   4.0   3.0   3.0   5.0   4.0   1.0   5.0   3.0  ...    
2        4.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   2.0  ...    
3        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    
4        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    
5        4.0   3.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...    

itemId  1673  1674  1675  1676  1677  1678  1679  1680  1681  1682  
userId                                                              
1        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
2        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
3        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
4        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
5        NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  

[5 rows x 1682 columns]
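Most cells in this matrix are NaN because each user rates only a small share of the movies: with 100,000 ratings spread over 943 x 1682 cells, only about 6.3% of the entries are filled. A small sketch of the same pivot on a toy frame, with a density check (the frame `toyRatings` and its values are hypothetical):

```python
import pandas as pd

# Toy ratings (hypothetical values): 3 users, 3 movies, 4 ratings.
toyRatings = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'itemId': [10, 20, 10, 30],
    'rating': [5, 3, 4, 2],
})

# Same call as above: one row per user, one column per movie, NaN where there is no rating.
toyMatrix = pd.pivot_table(toyRatings, index=['userId'], columns=['itemId'], values='rating')
print(toyMatrix)

# Fraction of (user, movie) cells that actually hold a rating.
density = toyMatrix.notna().sum().sum() / toyMatrix.size
print(density)
```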

Now we reach the core of the implementation. Let's compute the similarity between two users:

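from scipy.spatial.distance import correlation

def similarity(user1, user2):
    # Mean-center each user's ratings; unrated movies stay NaN.
    user1 = np.array(user1) - np.nanmean(user1)
    user2 = np.array(user2) - np.nanmean(user2)

    # Keep only the movies that both users have rated.
    commonItemIds = [i for i in range(len(user1)) if not np.isnan(user1[i]) and not np.isnan(user2[i])]
    if len(commonItemIds) == 0:
        return 0
    else:
        user1 = [user1[i] for i in commonItemIds]
        user2 = [user2[i] for i in commonItemIds]

        # correlation() returns a distance (1 - Pearson r), so convert it to a similarity.
        return 1 - correlation(user1, user2)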
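One detail worth checking: `scipy.spatial.distance.correlation` returns the correlation distance, 1 - r, where r is the Pearson correlation. A value of 0 therefore means perfectly correlated and 2 means perfectly anti-correlated. A small sketch with made-up vectors (the arrays `a`, `b`, and `c` are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import correlation

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # rises together with a
c = np.array([4.0, 3.0, 2.0, 1.0])  # falls as a rises

distSame = correlation(a, b)      # close to 0: same taste
distOpposite = correlation(a, c)  # close to 2: opposite taste

# Subtracting the distance from 1 recovers the Pearson correlation.
simSame = 1 - distSame
simOpposite = 1 - distOpposite
print(simSame, simOpposite)
```

Sorting such scores in descending order only ranks the most similar users first when the values are similarities rather than distances.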

Next, let's predict a rating for every movie for a particular user. First, compute the similarity score between that user and every other user, then pick the kTop most similar users. Next, take their rows from the user-item rating matrix. Finally, for each movie, start from the active user's mean rating and add each neighbor's deviation from their own mean rating, weighted by that neighbor's similarity.

def nearestNeighborRating(userId, kTop):
    # Similarity between the active user and every user in the matrix.
    similarityMatrix = pd.DataFrame(index=userItemsRatingMatrix.index, columns=['similarity'])
    for i in userItemsRatingMatrix.index:
        similarityMatrix.loc[i] = similarity(userItemsRatingMatrix.loc[userId], userItemsRatingMatrix.loc[i])

    # Keep the kTop users with the highest similarity scores.
    similarityMatrix = similarityMatrix.sort_values('similarity', ascending=False)
    nearestNeighbors = similarityMatrix[:kTop]

    neighborItemRatings = userItemsRatingMatrix.loc[nearestNeighbors.index]

    predictItemRatings = pd.DataFrame(index=userItemsRatingMatrix.columns, columns=['rating'])

    activeUserMeanRating = np.nanmean(userItemsRatingMatrix.loc[userId])
    for i in userItemsRatingMatrix.columns:
        # Start from the active user's mean rating.
        predictRating = activeUserMeanRating
        for j in neighborItemRatings.index:
            # NaN > 0 is False, so neighbors who have not rated movie i are skipped.
            if userItemsRatingMatrix.loc[j,i] > 0:
                predictRating += (userItemsRatingMatrix.loc[j,i] - np.nanmean(userItemsRatingMatrix.loc[j])) * nearestNeighbors.loc[j,'similarity']

        predictItemRatings.loc[i, 'rating'] = predictRating

    return predictItemRatings
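The function above adds the similarity-weighted deviations without dividing by the total weight, so a prediction can drift outside the 1-5 scale when several neighbors rated the same movie. A common variant divides by the sum of the absolute similarities. A minimal sketch of that variant for a single movie, assuming the neighbors' ratings, mean ratings, and similarities are already available (the function `predictNormalized` and all values are hypothetical, and this is not the code that produced the output in this post):

```python
import numpy as np

def predictNormalized(activeUserMean, neighborRatings, neighborMeans, similarities):
    """Predict one movie's rating from the neighbors' mean-centered ratings.

    neighborRatings may contain NaN for neighbors who did not rate the movie.
    """
    neighborRatings = np.asarray(neighborRatings, dtype=float)
    neighborMeans = np.asarray(neighborMeans, dtype=float)
    similarities = np.asarray(similarities, dtype=float)

    # Only neighbors who rated this movie contribute.
    rated = ~np.isnan(neighborRatings)
    weightSum = np.abs(similarities[rated]).sum()
    if weightSum == 0:
        # No usable neighbor: fall back to the active user's own mean.
        return activeUserMean

    deviations = neighborRatings[rated] - neighborMeans[rated]
    return activeUserMean + (deviations * similarities[rated]).sum() / weightSum

prediction = predictNormalized(
    activeUserMean=3.0,
    neighborRatings=[5.0, np.nan, 2.0],  # the second neighbor has not rated the movie
    neighborMeans=[4.0, 3.0, 3.0],
    similarities=[0.8, 0.9, 0.2],
)
print(prediction)
```

Here the two contributing neighbors have deviations of +1 and -1 with weights 0.8 and 0.2, so the prediction moves 0.6 above the active user's mean.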

Let's get the top recommended movies for a user:

def topRecommendation(userId, nTop):
    predictItemRatings = nearestNeighborRating(userId, 10)
    watchedMovies = userItemsRatingMatrix.loc[userId].loc[userItemsRatingMatrix.loc[userId] > 0].index
    predictItemRatings = predictItemRatings.drop(watchedMovies)
    topRecommendedMovies = pd.DataFrame.sort_values(predictItemRatings, ['rating'], ascending=[0])[:nTop]
    topRecommendedMovieTitles = itemData.loc[topRecommendedMovies.index - 1].movie
    return list(topRecommendedMovieTitles)
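The title lookup above relies on each itemId being exactly one more than its row position in itemData, which holds here because items are numbered consecutively from 1. If that were not guaranteed, a lookup keyed on itemId would be a safer alternative. A small sketch (the helper `titlesFor` and the toy frame `toyItems` are hypothetical):

```python
import pandas as pd

def titlesFor(itemIds, items):
    """Return the movie titles for the given item ids, in the same order."""
    # Index the titles by itemId so the lookup does not depend on row positions.
    titleById = items.set_index('itemId')['movie']
    return list(titleById.loc[list(itemIds)])

# Toy item table with a gap in the ids (hypothetical values).
toyItems = pd.DataFrame({
    'itemId': [1, 2, 5],
    'movie': ['A (1995)', 'B (1995)', 'C (1996)'],
})

print(titlesFor([5, 1], toyItems))
```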

Now let's find the top 3 recommended movies for the user with userId 5:

print(topRecommendation(5, 3))

Output:

['Scream (1996)', 'First Wives Club, The (1996)', 'Truth About Cats & Dogs, The (1996)']

I hope this is useful. We'll take a look at matrix factorization and association rules in the next post.

