Recommendation System for Movies
Built a data ETL pipeline to analyze a movie rating dataset and conducted online analytical processing (OLAP) with Spark SQL
Implemented an Alternating Least Squares (ALS) model to provide personalized movie recommendations
Tuned model hyperparameters with Spark ML cross-validation; the best model reached a root mean square error (RMSE) of predicted ratings below 0.6 on around 10,000 rating entries
Developed user-based approaches to handle the cold-start problem
Background
Recommendation systems are among the most profitable products at companies whose business relies on video, advertising or customer-merchant matching, such as Google, Facebook, Airbnb and Uber. The ability to design and build a recommendation system is one of the most valuable and attractive skills for a data scientist.
Keywords: Recommendation System, Collaborative Filtering, Matrix Factorization, Spark Machine Learning, Spark ALS Model
Introduction
This project builds a recommendation system on real-world movie rating data, which is used to train a machine learning model. We build a Spark machine learning pipeline, automatically tune a collaborative filtering model, and apply the trained model to predict movie ratings.
Data Exploration and OLAP
Movie Table
The movie table has movie ID, movie title and movie genre attributes. The rating table records which users rated which movies, with the corresponding movie ID, score and timestamp. The link table contains the IMDb (Internet Movie Database) and TMDB (The Movie Database) references for each movie. The tag table lists the tags attached to movies, with time information.
Rating Table
Link Table
Tag Table
According to the dataset statistics, 610 users rated 9,724 of the 9,742 movies in total. Of those 9,724 rated movies, 3,446 were rated by only one user. The remaining 18 movies have no ratings at all.
Unrated Movies
Among users, the minimum and maximum numbers of ratings made by one user are 20 and 2,698 respectively. The most rated movie has 329 ratings.
Rating Counts
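Counts like the ones above can be derived from the rating table with simple grouping. The following is a minimal sketch in plain Python rather than Spark SQL; the toy tuples and movie IDs are hypothetical and only illustrate the logic.

```python
from collections import Counter

# Hypothetical toy ratings as (user_id, movie_id, score) tuples;
# the real rating table is far larger.
ratings = [
    (1, 10, 4.0), (1, 20, 3.5), (1, 30, 5.0),
    (2, 10, 4.5), (2, 20, 2.0),
    (3, 10, 3.0),
]
all_movie_ids = {10, 20, 30, 40}  # movie 40 never received a rating

# Number of ratings made by each user and received by each movie.
ratings_per_user = Counter(user for user, _, _ in ratings)
ratings_per_movie = Counter(movie for _, movie, _ in ratings)

# Movies rated by exactly one user, and movies with no ratings at all.
rated_once = sorted(m for m, n in ratings_per_movie.items() if n == 1)
unrated = sorted(all_movie_ids - set(ratings_per_movie))

# The single most rated movie and its rating count.
most_rated_movie, most_rated_count = ratings_per_movie.most_common(1)[0]

print("min/max ratings per user:",
      min(ratings_per_user.values()), max(ratings_per_user.values()))
print("rated once:", rated_once, "unrated:", unrated)
print("most rated movie:", most_rated_movie, "with", most_rated_count, "ratings")
```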
The top 3 movies with the most ratings are Forrest Gump, The Shawshank Redemption and Pulp Fiction, each rated more than 300 times. They are clearly popular among audiences. Interestingly, all three were released in 1994 and are still famous now. Classic Forever!
There are 20 unique genres in the dataset, and a movie can be labeled with several of them. Drama and comedy are the most common genres, appearing on around 44.8% and 38.5% of all 9,742 movies respectively.
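Because one movie can carry several labels, the shares are computed per label against the total number of movies, so they can add up to more than 100%. Here is a small sketch with made-up movies (the titles and labels below are placeholders, not rows from the dataset):

```python
from collections import Counter

# Hypothetical movies, each with one or more labels.
movie_labels = {
    "Movie A": ["Drama", "Comedy"],
    "Movie B": ["Drama"],
    "Movie C": ["Comedy", "Romance"],
    "Movie D": ["Action"],
}

# Count how many movies carry each label.
label_counts = Counter(label for labels in movie_labels.values() for label in labels)

# Share of all movies that carry each label, as a percentage.
label_share = {
    label: 100 * count / len(movie_labels)
    for label, count in label_counts.items()
}

for label, share in sorted(label_share.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {share:.1f}%")
```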
5-Star Rating System
The rating scores concentrate between 3.0 and 4.0. The most common score is 4.0 and the rarest one is 0.5. Now, using the 5-star rating system, we will see which movies are rated best and worst by audiences and which users are most likely to give high or low ratings.
Worst Movies
The top 3 movies with the lowest average rating, all below 2.8, are Wild Wild West, Coneheads and Ace Ventura: When Nature Calls.
The top 3 movies with the most negative comments, i.e. with the most ratings lower than 2.0, are Ace Ventura: Pet Detective, Ace Ventura: When Nature Calls and Star Wars: Episode I – The Phantom Menace. It looks like the Ace Ventura series failed.
Top 3 Movies with Lowest Average Rating
Top 3 Movies with Most Negative Comments
Best Movies
The top 3 movies with the highest average rating, all above 4.2, are The Shawshank Redemption, Fight Club and Star Wars: Episode IV – A New Hope. We can also see that the Star Wars series is always a hot topic.
The top 3 movies with the most positive comments, i.e. with the most ratings higher than 4.0, are The Shawshank Redemption, Forrest Gump and Pulp Fiction. They are three GOAT (greatest of all time) movies across the world!
Top 3 Movies with Highest Average Rating
Top 3 Movies with Most Positive Comments
Now let us find some interesting users.
We start with the mean ones. Some users tend to give very low ratings, below 2.0, after watching a movie. Maybe they are just strict or have their own rating system. The user with ID 567 is the “meanest”. He/She rated 196 of 385 watched movies under 2.0, a negative comment rate of over 50%. Most of his/her ratings were below 3.5, and over 50 movies received a score of 0.5.
Ratings of User 567
We can also find friendly users. Many viewers would rather encourage the people behind a movie than criticize them. Sometimes, too, viewers just want to relax while watching a movie, so they do not think deeply about what it means. Among these users, the one with ID 105 is the “most friendly”. He/She rated 722 movies and gave 554 of them a score over 4.0. It is hard to believe that a viewer is so lucky that over 75% of the movies he/she watched are worth recommending. Maybe he/she is a very nice person or is easily satisfied by popcorn movies.
Ratings of User 105
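The two rates quoted for users 567 and 105 follow directly from the counts given in the text. As a quick check (the helper name `share` is ours, chosen for illustration):

```python
def share(count, total):
    """Return count / total as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)

# User 567: 196 of 385 rated movies scored under 2.0.
low_share_567 = share(196, 385)

# User 105: 554 of 722 rated movies scored over 4.0.
high_share_105 = share(554, 722)

print(f"user 567 low-rating share: {low_share_567}%")    # 50.9%
print(f"user 105 high-rating share: {high_share_105}%")  # 76.7%
```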
Recommendation System Building
We split the dataset into an 80% training set and a 20% test set, then trained and fine-tuned an Alternating Least Squares (ALS) model to obtain the best prediction model. With the user-based model, we can predict the rating any specific user would give to a target movie. Once we have the predicted rating list of a user, we know which movies he/she is likely to rate high or low. Thus, we can recommend unseen movies he/she might be interested in, taken from the top of his/her personalized rating list. Meanwhile, because the model also gives us the feature matrix of all movies, we can find out which movies are similar by computing similarities between their features. Accordingly, if we are interested in one movie, we can quickly find its “close relatives”.
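To make the idea behind ALS concrete, here is a toy sketch in plain Python. It is not Spark's implementation and not the project's code: it uses a single latent factor per user and per movie, a plain L2 penalty, and hypothetical ratings. The names `als_rank1` and `rmse` are illustrative. Each step holds one side fixed and solves the other side in closed form, which is the "alternating" part.

```python
import math

def als_rank1(ratings, n_users, n_items, reg=0.1, iters=20):
    """Fit rating ~ user_f[u] * item_f[i] with one latent factor."""
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Hold item factors fixed; each user factor has a closed-form
        # ridge-regression solution over that user's ratings.
        for u in range(n_users):
            rows = [(i, r) for uu, i, r in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in rows)
            den = reg + sum(item_f[i] ** 2 for i, _ in rows)
            user_f[u] = num / den
        # Hold user factors fixed; solve each item factor the same way.
        for i in range(n_items):
            cols = [(u, r) for u, ii, r in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in cols)
            den = reg + sum(user_f[u] ** 2 for u, _ in cols)
            item_f[i] = num / den
    return user_f, item_f

def rmse(ratings, user_f, item_f):
    """Root mean square error of predicted ratings on the given entries."""
    sq = sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))

# Hypothetical toy data: (user index, movie index, score).
train = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
         (1, 2, 2.0), (2, 1, 3.0), (2, 2, 1.0)]

rmse_before = rmse(train, [1.0] * 3, [1.0] * 3)
user_f, item_f = als_rank1(train, n_users=3, n_items=3)
rmse_after = rmse(train, user_f, item_f)

# Predicted score for a (user, movie) pair that is not in the training data.
predicted = user_f[0] * item_f[2]
print(f"train RMSE: {rmse_before:.2f} -> {rmse_after:.2f}, prediction: {predicted:.2f}")
```

The learned `item_f` values play the role of the movie feature matrix mentioned above. A real model would use several factors per movie and would be evaluated on the held-out 20% rather than on the training entries.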
Now, let us try to recommend movies to some special users.
We are especially interested in strict users and in users who left little information in the movie database. Offering them movies they will enjoy is a big challenge.
First of all, we are interested in the “meanest” user’s preferences, so we look at the recommended movies for the user with ID 567. It looks like animation or sci-fi movies have a high chance of attracting him/her. Even a strict viewer like him/her is predicted to rate these movies over 4.5.
The user with ID 147 is one of the users we know least about, with only 20 ratings. His/Her predicted ratings for the top 5 recommended movies are over 5.0, beyond the top of the rating scale, because of the lack of data. However, it is easy to tell that he/she is a fan of action and adventure movies.
We are also curious about the interests of friendly or very active users. It is tough, to some degree, to recommend the right movies to them. Friendly users may rate most of the movies they have seen with high scores, which makes it hard to tell their tastes. One strategy for friendly users is to avoid recommending movies similar to those that even they might score low. We face a similar situation with active users, who watched so many movies and left us so much information. Fortunately, active users may have distinct characteristics for us to learn.
Given this rating behavior, it is no wonder that the “most friendly” user, with ID 105, is predicted to give close to full marks to all top 5 recommended movies. We can also conclude that he/she likes comedy movies and is a fan of crime and mystery movies, especially the Sherlock Holmes series.
Surprisingly, the user with ID 414 rated 2,698 movies. Suppose a movie lasts 100 minutes on average and this user spends around 3 hours a day watching movies: it would still take over 4 years to finish all of them. We would regard this account as a shared one. Otherwise, this user must be crazy about movies or be a professional film critic. It looks like one of this user’s favorite topics is romance.
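The viewing-time estimate can be worked out step by step. The runtime and daily viewing time are the assumptions stated in the text; only the 2,698 count comes from the data.

```python
movies_rated = 2698
minutes_per_movie = 100  # assumed average runtime, as in the text
hours_per_day = 3        # assumed daily viewing time, as in the text

total_hours = movies_rated * minutes_per_movie / 60
days_needed = total_hours / hours_per_day
years_needed = days_needed / 365

print(f"{total_hours:.0f} hours, {days_needed:.0f} days, {years_needed:.1f} years")
# prints "4497 hours, 1499 days, 4.1 years"
```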
Sometimes we are interested purely in the movie itself and want to find movies similar to a given movie or topic. In that case we want insights based on the movie’s content rather than on users’ ratings. We have two ways to measure the similarity between movies: Euclidean distance similarity and cosine similarity. We will examine their difference on the same movie, one of the greatest classics in film history, The Shawshank Redemption.
Target Movie
Similar Movies by Euclidean Distance Similarity
Similar Movies by Cosine Similarity
The movies recommended by Euclidean distance similarity are completely different from those recommended by cosine similarity. In the Euclidean distance similarity results, we can at least find Rose Red, a work by Stephen King, who also wrote Rita Hayworth and Shawshank Redemption. The cosine similarity results, however, are confusing and hard to explain.
From a mathematical perspective, the smaller the Euclidean distance between two factor vectors, the more similar the movies. This measure takes the magnitude of the factors into account, e.g. movie 1 with factors [1, 2, 3] and movie 2 with factors [2, 4, 6] are not considered similar enough. With cosine similarity, by contrast, the larger the cosine value, the smaller the angle between the two feature vectors and the more similar the movies. This measure considers direction only, e.g. movie 1 with factors [1, 2, 3] and movie 2 with factors [2, 4, 6] are considered the same.
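The [1, 2, 3] versus [2, 4, 6] example can be checked numerically. This is a minimal sketch with hand-written helpers (the function names are ours): the two vectors point in the same direction, so their cosine similarity is 1, yet they are a clear distance apart.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two factor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two factor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

movie_1 = [1, 2, 3]
movie_2 = [2, 4, 6]

distance = euclidean_distance(movie_1, movie_2)
cosine = cosine_similarity(movie_1, movie_2)

print(f"Euclidean distance: {distance:.3f}")  # 3.742, i.e. sqrt(14)
print(f"Cosine similarity:  {cosine:.3f}")    # 1.000
```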
Intuitively, if we want to cluster similar movies in feature space, it is more reasonable to cluster them by actual distance than by direction. Thus, we prefer Euclidean distance similarity for exploring similar movies.
Similar Movies by Euclidean Distance Similarity
Similar Movies by Cosine Similarity
Now, let us explore two more classic movies, Forrest Gump and Pulp Fiction.
Target Movie
Target Movie
Well, at first glance, it is really hard to explain where their similarities come from. We may need more work to uncover the secrets of the feature space.
Conclusion
More and more e-commerce platforms are tailoring their recommendation systems in order to provide better service. A movie-watching website will be more profitable if more customers can quickly and correctly find their favorite films through its recommendation service. Thus, a good recommendation system is key to its success.