Wondered how Google comes up with movies that are similar to the ones you like? After reading this post you will be able to build one such recommendation system for yourself.
It turns out that there are (mostly) three ways to build a recommendation engine:
- Popularity based recommendation engine
- Content based recommendation engine
- Collaborative filtering based recommendation engine
Now you might be thinking “That’s interesting. But, what are the differences between these recommendation engines?”. Let me help you out with that.
Popularity based recommendation engine:
This is perhaps the simplest kind of recommendation engine you will come across. The trending list you see on YouTube or Netflix is based on this algorithm. It keeps track of the view count for each movie/video and then lists them in descending order of views (highest view count to lowest). Pretty simple but effective, right?
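As a rough illustration of the idea, here is a minimal sketch of a popularity ranking with pandas. The titles, the view counts and the helper name top_by_views are made up for this example; a real service would presumably also weigh recency and other signals.

```python
import pandas as pd

# Hypothetical toy data: titles and view counts are invented for illustration.
videos = pd.DataFrame({
    "title": ["Video A", "Video B", "Video C", "Video D"],
    "views": [1200, 54000, 310, 8700],
})

def top_by_views(df, n=3):
    """Return the n most-viewed titles, highest view count first."""
    return df.sort_values("views", ascending=False).head(n)["title"].tolist()

print(top_by_views(videos))  # ['Video B', 'Video D', 'Video A']
```

Note that every user gets the same list here, which is exactly what makes this approach simple.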
Content based recommendation engine:
This type of recommendation system takes a movie that a user currently likes as input. It then analyzes the contents (storyline, genre, cast, director, etc.) of the movie to find other movies with similar content. Finally, it ranks those movies by their similarity scores and recommends the most relevant ones to the user.
Collaborative filtering based recommendation engine:
This algorithm first tries to find similar users based on their activities and preferences (for example, both users watch the same type of movies, or movies directed by the same director). Then, between these users (say, A and B), if user A has seen a movie that user B has not seen yet, that movie gets recommended to user B, and vice versa. In other words, the recommendations are filtered based on the collaboration between similar users’ preferences (thus the name “Collaborative Filtering”). One typical application of this algorithm can be seen on the Amazon e-commerce platform, where you get to see the “Customers who viewed this item also viewed” and “Customers who bought this item also bought” lists.
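The user A / user B example can be sketched in a few lines. This is only a toy version under stated assumptions: the watch histories are invented, and Jaccard overlap of the watched sets is just one possible way to decide which users are "similar"; it is not taken from any particular production system.

```python
# Hypothetical watch histories; user names and titles are invented.
watched = {
    "A": {"Movie 1", "Movie 2", "Movie 3"},
    "B": {"Movie 1", "Movie 2", "Movie 4"},
    "C": {"Movie 5"},
}

def jaccard(s1, s2):
    """Overlap between two sets: size of intersection / size of union."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def recommend_for(user, histories):
    """Recommend what the most similar other user has seen and `user` has not."""
    others = [u for u in histories if u != user]
    nearest = max(others, key=lambda u: jaccard(histories[user], histories[u]))
    return sorted(histories[nearest] - histories[user])

print(recommend_for("B", watched))  # ['Movie 3']
print(recommend_for("A", watched))  # ['Movie 4']
```

Users A and B share two movies, so each one receives the movie that only the other has seen.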
Look at the following picture to get a better intuition over content based and collaborative filtering based recommendation systems-
Another type of recommendation system can be created by mixing properties of two or more types of recommendation systems. These are known as hybrid recommendation systems.
In this article, we are going to implement a Content based recommendation system using the scikit-learn library.
Finding the similarity
We know that our recommendation engine will be content based. So, we need to find similar movies to a given movie and then recommend those similar movies to the user. The logic is pretty straightforward. Right?
But wait… how can we find out which movies are similar to the given movie in the first place? How can we measure how similar (or dissimilar) two movies are?
Let us start with something simple and easy to understand.
Suppose, you are given the following two texts:
Text A: London Paris London
Text B: Paris Paris London
How would you find the similarity between Text A and Text B?
Let’s analyze these texts….
- Text A: Contains the word “London” 2 times and the word “Paris” 1 time.
- Text B: Contains the word “London” 1 time and the word “Paris” 2 times.
Now, what will happen if we try to represent these two texts in a 2D plane (with “London” on the X axis and “Paris” on the Y axis)? Let’s try to do this.
It will look like this-
Here, the red vector represents “Text A” and the blue vector represents “Text B”.
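The coordinates of the two vectors can also be computed by hand. The snippet below is a small sketch using only the standard library; the helper name to_vector and the fixed axis order are choices made for this example.

```python
from collections import Counter

text_a = "London Paris London"
text_b = "Paris Paris London"

# Axis order: "london" is the X axis, "paris" is the Y axis.
axes = ["london", "paris"]

def to_vector(text):
    """Count how often each axis word occurs in the text (case-insensitive)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in axes]

print(to_vector(text_a))  # [2, 1]
print(to_vector(text_b))  # [1, 2]
```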
Now we have graphically represented these two texts. So, now can we find out the similarity between these two texts?
The answer is “Yes, we can”. But, exactly how?
These two texts are represented as vectors, right? So, we can say that two vectors are similar if the distance between them is small. By distance, we mean the angular distance between the two vectors, which is represented by θ (theta). From a machine learning perspective, the value of cos θ is more convenient than the value of θ itself because, for angles in the first quadrant, the cosine (or “cos”) function maps θ to a value between 0 and 1 (remember: cos 90° = 0 and cos 0° = 1). Since word counts are never negative, our vectors always lie in that quadrant, so a score near 1 means “very similar” and a score near 0 means “very different”.
And from high school maths, we can remember that there is actually a formula for finding out cos θ between two vectors. See the picture below-
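For reference, the standard formula for the cosine of the angle between two vectors is written out below, followed by a worked example for our two texts. The component symbols A_i and B_i are notation chosen for this sketch, with Text A = (2, 1) and Text B = (1, 2).

```latex
\cos\theta
  = \frac{\mathbf{A}\cdot\mathbf{B}}{\lVert\mathbf{A}\rVert\,\lVert\mathbf{B}\rVert}
  = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}}\;\sqrt{\sum_{i} B_i^{2}}}

% Worked example with Text A = (2, 1) and Text B = (1, 2):
\cos\theta
  = \frac{2\cdot 1 + 1\cdot 2}{\sqrt{2^{2}+1^{2}}\;\sqrt{1^{2}+2^{2}}}
  = \frac{4}{\sqrt{5}\,\sqrt{5}}
  = \frac{4}{5}
  = 0.8
```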
Don’t get scared, we don’t need to implement the formula from scratch for finding out cos θ. We have our friend Scikit Learn to calculate that for us :)
Let’s see how we can do that.
At first, we need to have text A and B in our program:
text = ["London Paris London","Paris Paris London"]
Now, we need to find a way to represent these texts as vectors. The CountVectorizer() class from the sklearn.feature_extraction.text module can do this for us. We need to import it before we can create a new CountVectorizer() object.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
count_matrix = cv.fit_transform(text)
count_matrix gives us a sparse matrix. To make it human readable, we need to apply the toarray() method to it. And before printing count_matrix, let us first print the feature list (or word list) that was fed to our CountVectorizer() object.
The output of the above code will look like this-
['london', 'paris']
[[2 1]
 [1 2]]
This indicates that the word ‘london’ occurs 2 times in A and 1 time in B. Similarly, the word ‘paris’ occurs 1 time in A and 2 times in B. Makes sense. Right?
Now, we need to find the cosine (or “cos”) similarity between these vectors to find out how similar they are to each other. We can calculate this using the cosine_similarity() function from the sklearn.metrics.pairwise module.
from sklearn.metrics.pairwise import cosine_similarity

similarity_scores = cosine_similarity(count_matrix)
print(similarity_scores)
The above code will output a similarity matrix, which looks like this-
[[1.  0.8]
 [0.8 1. ]]
What does this output indicate?
We can interpret this output like this-
- Each row of the similarity matrix indicates each sentence of our input. So, row 0 = Text A and row 1 = Text B.
- The same thing applies for columns. To get a better understanding over this, we can say that the output given above is same as the following:
          Text A   Text B
Text A   [[1.      0.8]
Text B    [0.8     1. ]]
Interpreting this, we can say that Text A is 100% similar to itself (position [0,0]) and 80% similar to Text B (position [0,1]). From the kind of output it gives, we can also see that this will always be a symmetric matrix: if Text A is 80% similar to Text B, then Text B is also 80% similar to Text A.
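As a sanity check, the 0.8 score and the symmetry can be reproduced with plain NumPy. This is only an illustrative sketch; the helper name cosine is made up here, and in practice scikit-learn does this work for us.

```python
import numpy as np

a = np.array([2, 1])  # Text A: london = 2, paris = 1
b = np.array([1, 2])  # Text B: london = 1, paris = 2

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over product of lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(round(cosine(a, b), 2))        # 0.8
print(cosine(a, b) == cosine(b, a))  # True: swapping the texts changes nothing
print(round(cosine(a, a), 2))        # 1.0: every text matches itself fully
```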
Now we know how to find similarity between contents. So, let’s try to apply this knowledge to build a content based movie recommendation engine.
Building the recommendation engine:
The movie dataset that we are going to use in our recommendation engine can be downloaded from Course Github Repo.
After downloading the dataset, we need to import all the required libraries and then read the CSV file using the read_csv() method.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("movie_dataset.csv")
If you inspect the dataset, you will see that it has a lot of extra information about each movie. We don’t need all of it. So, we choose the keywords, cast, genres and director columns as our feature set (the so-called “content” of the movie).
features = ['keywords','cast','genres','director']
Our next task is to create a function for combining the values of these columns into a single string.
def combine_features(row):
    return row['keywords'] + " " + row['cast'] + " " + row['genres'] + " " + row['director']
Now, we need to call this function over each row of our dataframe. But, before doing that, we need to clean and preprocess the data for our use. We will fill all the NaN values in the dataframe with blank strings.
for feature in features:
    df[feature] = df[feature].fillna('')  # fill all NaNs with a blank string

# apply combine_features() to each row and store the combined string
# in the "combined_features" column
df["combined_features"] = df.apply(combine_features, axis=1)
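To see why the fillna step matters, here is a small sketch on a hypothetical two-row dataset (all values invented) in which one director is missing. The combine_features variant below joins the columns with spaces, like the function above, and the helper try_combine exists only for this demonstration.

```python
import pandas as pd

# Hypothetical two-row dataset; the second row has a missing director (NaN).
toy = pd.DataFrame({
    "keywords": ["space war", "heist"],
    "cast": ["Actor One", "Actor Two"],
    "genres": ["Action", "Crime"],
    "director": ["Director X", float("nan")],
})
features = ["keywords", "cast", "genres", "director"]

def combine_features(row):
    """Join the feature columns of one row into a single space-separated string."""
    return " ".join(row[f] for f in features)

def try_combine(df):
    """Return the combined strings, or None if a non-string value breaks the join."""
    try:
        return df.apply(combine_features, axis=1).tolist()
    except TypeError:
        return None

print(try_combine(toy))             # None: NaN is a float, not a string
print(try_combine(toy.fillna("")))  # works once NaNs are blank strings
```

The missing value simply becomes an empty string at the end of the combined text, which is harmless for word counting.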
Now that we have obtained the combined strings, we can feed them to a CountVectorizer() object to get the count matrix.
cv = CountVectorizer()  # create a new CountVectorizer() object
count_matrix = cv.fit_transform(df["combined_features"])  # feed the combined strings (movie contents) to it
At this point, 60% work is done. Now, we need to obtain the cosine similarity matrix from the count matrix.
cosine_sim = cosine_similarity(count_matrix)
Now, we will define two helper functions to get movie title from movie index and vice-versa.
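def get_title_from_index(index):
    # .values is an array; [0] picks the single matching title
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    # [0] returns a plain index instead of a one-element array
    return df[df.title == title]["index"].values[0]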
Our next step is to get the title of the movie that the user currently likes. Then we will find the index of that movie. After that, we will access the row corresponding to this movie in the similarity matrix. Thus, we will get the similarity scores of all other movies with respect to the current movie. Then we will enumerate through all the similarity scores of that movie to make tuples of movie index and similarity score. This will convert a row of similarity scores like this:
[1 0.5 0.2 0.9]
to this:
[(0, 1) (1, 0.5) (2, 0.2) (3, 0.9)]
Here, each item is in the form (movie index, similarity score).
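The conversion itself is a one-liner with Python's built-in enumerate(). The sketch below uses the example row from the text; the variable names row and pairs are chosen just for this illustration.

```python
# One row of a similarity matrix, using the example values from the text.
row = [1, 0.5, 0.2, 0.9]

# enumerate() pairs each score with its position, i.e. the movie index.
pairs = list(enumerate(row))

print(pairs)  # [(0, 1), (1, 0.5), (2, 0.2), (3, 0.9)]
```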
movie_user_likes = "Avatar"
movie_index = get_index_from_title(movie_user_likes)

# access the row for the given movie to get all of its similarity scores,
# then enumerate over it
similar_movies = list(enumerate(cosine_sim[movie_index]))
Now comes the most vital point. We will sort the list similar_movies according to similarity scores in descending order. Since the most similar movie to a given movie will be itself, we will discard the first element after sorting the movies.
# x[1] is the similarity score in each (movie index, similarity score) tuple
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]
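The effect of the sort key is easy to check on a toy list. In this sketch the tuples are the example values used earlier, and the names ranked and by_index are made up; it contrasts sorting by the score (the second item of each tuple) with sorting the tuples as they are, which compares the movie index first.

```python
# (movie index, similarity score) tuples; movie 0 is the movie the user likes.
pairs = [(0, 1), (1, 0.5), (2, 0.2), (3, 0.9)]

# Sort by the score, highest first, then drop the first entry (the movie itself).
ranked = sorted(pairs, key=lambda x: x[1], reverse=True)[1:]
print(ranked)    # [(3, 0.9), (1, 0.5), (2, 0.2)]

# Without a score-based key, tuples are ordered by movie index instead.
by_index = sorted(pairs, reverse=True)
print(by_index)  # [(3, 0.9), (2, 0.2), (1, 0.5), (0, 1)]
```

Dropping the first entry assumes the liked movie ranks first, which holds as long as no other movie has an identical feature string.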
Now, we will run a loop to print the first 5 entries from sorted_similar_movies.
i = 0
print("Top 5 similar movies to " + movie_user_likes + " are:\n")
for element in sorted_similar_movies:
    print(get_title_from_index(element[0]))  # element[0] is the movie index
    i = i + 1
    if i >= 5:
        break
And we are done here!
You can download the Python script and associated datasets from Course Github Repo.
So, the whole combined code of our movie recommendation engine is:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("movie_dataset.csv")

features = ['keywords', 'cast', 'genres', 'director']

def combine_features(row):
    return row['keywords'] + " " + row['cast'] + " " + row['genres'] + " " + row['director']

for feature in features:
    df[feature] = df[feature].fillna('')

df["combined_features"] = df.apply(combine_features, axis=1)

cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])
cosine_sim = cosine_similarity(count_matrix)

def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]

def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]

movie_user_likes = "Avatar"
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))

# sort by similarity score (x[1]) and drop the movie itself
sorted_similar_movies = sorted(similar_movies, key=lambda x: x[1], reverse=True)[1:]

i = 0
print("Top 5 similar movies to " + movie_user_likes + " are:\n")
for element in sorted_similar_movies:
    print(get_title_from_index(element[0]))
    i = i + 1
    if i >= 5:
        break
Now, it’s time to run our code and see the output. If you run the above code, you will see this output-
Top 5 similar movies to Avatar are:

Guardians of the Galaxy
Aliens
Star Wars: Clone Wars: Volume 1
Star Trek Into Darkness
Star Trek Beyond
After seeing the output, I went one step further to compare it to other recommendation engines.
So, I searched Google for similar movies to “Avatar” and here is what I got-
See the output? Our simple movie recommendation engine works pretty well, right? It’s good as a basic implementation, but it can be improved further by taking many other factors into account. Try to optimize this recommendation engine yourself and let us know your story in the comments.
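One possible starting point for such experiments is to wrap the steps in a reusable function. The sketch below is not part of the original script: the function name recommend, the toy titles and their feature strings are all hypothetical stand-ins for movie_dataset.csv, and the DataFrame is assumed to have a default 0..n-1 index.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the real dataset; titles and features are invented.
toy = pd.DataFrame({
    "title": ["Space One", "Space Two", "Love Story", "Space Love"],
    "combined_features": [
        "space alien war",
        "space alien ship",
        "romance paris love",
        "space romance love",
    ],
})

def recommend(title, df, n=2):
    """Return the n titles whose combined_features are most similar to `title`."""
    counts = CountVectorizer().fit_transform(df["combined_features"])
    sim = cosine_similarity(counts)
    matches = df.index[df["title"] == title]
    if len(matches) == 0:
        raise KeyError(f"title not found: {title!r}")
    idx = matches[0]  # assumes a default 0..n-1 index, so label == row position
    # Skip the movie itself by index instead of dropping the first sorted entry.
    scores = [(i, s) for i, s in enumerate(sim[idx]) if i != idx]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [df["title"].iloc[i] for i, _ in scores[:n]]

print(recommend("Space One", toy))  # ['Space Two', 'Space Love']
```

Filtering out the liked movie by its index, rather than by its position after sorting, keeps the result correct even when another movie ties with a perfect score.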
That’s all for now. Stay tuned for the next article :)