In-Depth Analysis
A Step-by-Step guide to building a recommender system in Python using LightFM
A recommender system, or a recommendation system, can be thought of as a subclass of information filtering systems that seeks to predict the “rating” or “preference” a user would give to an item; in practice, these systems are typically built by optimizing for objectives like total clicks, total revenue, or overall sales.
So, what is the basic principle that underlies the working of recommendation algorithms?
The basic principle underlying recommendations is that there are significant dependencies between user-centric and item-centric activity. For example, a user who is interested in hotels in New York City is more likely to be interested in other NYC hotels than in hotels in Boston.
While the primary business goal of any recommender system is to provide users with a personalized experience, from a problem-setting standpoint we can broadly classify recommenders into two categories:
- Prediction Problem: In the first formulation, we want to predict the rating value of a user-item combination, under the assumption that training data indicating the user’s preferences for other items is available. Imagine an m × n matrix, where m is the number of users and n is the number of items; the goal is to predict the missing (or unobserved) values (see the toy sketch after this list).
- Ranking Problem: Other times, we wish to recommend the top-k items for a particular user or determine the top-k users to target for a specific item. This is also referred to as the top-k recommendation problem, and it is the ranking formulation of the recommendation problem. Think of a search engine where, depending on who is searching, you’d like to surface the top-k items to serve personalized results based on that user’s past preferences and recent activity.
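To make these two formulations concrete, here is a tiny, purely illustrative sketch in Python (the users, items, and ratings are made up): an m × n matrix where NaN marks the unobserved entries a recommender tries to fill in, and where ranking corresponds to ordering those unobserved entries per user.
import numpy as np
import pandas as pd
ratings = pd.DataFrame(
    [[5.0, np.nan, 3.0],
     [np.nan, 4.0, np.nan],
     [2.0, 5.0, np.nan]],
    index=['user_1', 'user_2', 'user_3'],    # m = 3 users
    columns=['item_a', 'item_b', 'item_c'],  # n = 3 items
)
print(ratings)
# the prediction problem: estimate the NaN cells
# the ranking problem: order each user's unseen items by predicted score and keep the top-k
print(ratings.isna())  # mask of the unobserved user-item combinations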
Broadly speaking, most recommender systems leverage two types of data:
- Interaction Data, such as ratings and browsing behavior, and
- Attribute Information about users and items
Modeling approaches that rely on the former are generally known as Collaborative Filtering methods, while approaches that use the latter are referred to as Content-Based Filtering methods. There is also a third category, Knowledge-Based recommender systems, which rely on explicitly specified user requirements.
Of course, each of these methods has its strengths and weaknesses, depending on the application and the amount of data available. Hybrid Systems are then used to combine the advantages of these approaches and build systems that perform robustly across a wide variety of applications.
- Collaborative Filtering Methods: These models use the collaborative power of the ratings provided by multiple users to make recommendations and rely mostly on leveraging either inter-item or inter-user correlations for the prediction process. Intuitively, they rely on the underlying notion that two users who rate items similarly are likely to have comparable preferences for other items.
- There are two types of methods that are commonly used in collaborative filtering:
Memory-based methods, also referred to as neighborhood-based collaborative filtering algorithms, predict the rating of a user-item combination based on its neighborhood. These neighborhoods can be defined in two ways: (1) user-based and (2) item-based (a toy sketch of the user-based flavor follows this list).
In model-based methods, machine learning techniques are used to learn model parameters within the context of a given optimization framework.
- Content-Based Filtering Methods: In these systems, the descriptive attributes of items/users are used to make recommendations; the term “content” refers to these descriptions. Content-based methods combine users’ ratings and interaction behavior with the content information available for the items.
- Hybrid Methods: In many cases, a wider variety of inputs is available; in such cases, many opportunities exist for hybridization, where the various aspects from different types of systems are combined to achieve the best of all worlds. The approach is comparable to the conventional ensemble analysis approach, where the power of multiple types of machine learning algorithms is combined to create a more robust model.
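As a quick intuition for the memory-based (neighborhood) flavor of collaborative filtering, here is a minimal, self-contained sketch on a toy rating matrix; the data and the similarity-weighted averaging are illustrative only, and this is not the method used later in the post.
import numpy as np
# rows = users, columns = items, 0 = unrated (toy data)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)
def cosine_sim(a, b):
    # cosine similarity between two rating vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
target_user, target_item = 0, 2  # predict user 0's rating for item 2
sims = np.array([cosine_sim(R[target_user], R[u]) for u in range(R.shape[0])])
neighbors = [u for u in range(R.shape[0]) if u != target_user and R[u, target_item] > 0]
# similarity-weighted average of the neighbors' ratings for the target item
pred = sum(sims[u] * R[u, target_item] for u in neighbors) / (sum(sims[u] for u in neighbors) + 1e-9)
print(round(pred, 2))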
There is more to different types of approaches, and nuances of each; if you are interested in learning more, I would highly recommend the Statistical Methods for Recommender Systems book by Deepak Agarwal.
There are many open-source recommender system frameworks and libraries available across different languages that can help you get started with your first implementation of the recommendation model.
For this article, we’ll explore one that I have found valuable in my own work and that covers a wide variety of underlying algorithms for varied use cases: LightFM.
LightFM is a Python implementation of several popular recommendation algorithms for both implicit and explicit feedback types.
Importantly, it allows you to incorporate both item and user metadata into the traditional matrix factorization algorithms, making it possible to generalize to new items (via item features) and new users (via user features).
To learn more about it, you can check its documentation here.
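Before diving into the Goodreads data, here is a minimal, self-contained sketch of the LightFM workflow on randomly generated data, assuming LightFM is installed (pip install lightfm): build a sparse interaction matrix, optionally attach an item-feature matrix, fit, and score items for a user.
import numpy as np
from scipy.sparse import csr_matrix
from lightfm import LightFM
rng = np.random.default_rng(42)
n_users, n_items, n_item_features = 50, 100, 20
# binary implicit-feedback interactions (1 = user interacted with the item)
interactions = csr_matrix(rng.integers(0, 2, size=(n_users, n_items)).astype(np.float32))
# item metadata as a sparse [n_items, n_item_features] matrix
item_features = csr_matrix(rng.integers(0, 2, size=(n_items, n_item_features)).astype(np.float32))
model = LightFM(loss='warp', no_components=10, random_state=42)
model.fit(interactions, item_features=item_features, epochs=5, num_threads=2)
# predicted scores of every item for user 0, ready to be ranked
scores = model.predict(0, np.arange(n_items), item_features=item_features)
print(scores.shape)  # (100,)
The same ingredients (a sparse interaction matrix, optional feature matrices, a fitted model, and predict) are what we’ll assemble from the Goodreads data in the rest of this post.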
The complete code is available as a Jupyter Notebook on GitHub
- Loading Data
- Data Inspection and Preparation
- Data Preprocessing
- Model Training
- Top n-recommendations
For this tutorial, we’ll use book review data from Goodreads. The datasets were collected in late 2017 from goodreads.com, scraping only the data from users’ public shelves, i.e., data everyone can see on the web without logging in. User IDs and review IDs are anonymized.
The datasets are for academic use only. Please do not redistribute them or use them for commercial purposes.
There are three groups of datasets:
- meta-data of the books
- user-book interactions
- users’ detailed book reviews
These datasets can be merged together by matching book/user/review IDs. For this tutorial, we’ll be using only the first two.
You can download the datasets used in this article from the links below (a small script to fetch them follows the links):
- Books Metadata: https://drive.google.com/uc?id=1H6xUV48D5sa2uSF_BusW-IBJ7PCQZTS1
- User-Book Interactions: https://drive.google.com/uc?id=17G5_MeSWuhYnD4fGJMvKRSOlBqCCimxJ
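If you prefer fetching the files from a script instead of the browser, one possible approach is sketched below. It assumes the third-party gdown package (pip install gdown), and the output file names simply mirror the paths used in the loading code that follows; adjust them if the files served by the links are, for example, gzipped.
import os
import gdown
os.makedirs('./data', exist_ok=True)
# Google Drive links from above mapped to the local paths used later;
# the exact format served by the links is an assumption, adjust as needed
files = {
    'https://drive.google.com/uc?id=1H6xUV48D5sa2uSF_BusW-IBJ7PCQZTS1': './data/goodreads_books_poetry.json',
    'https://drive.google.com/uc?id=17G5_MeSWuhYnD4fGJMvKRSOlBqCCimxJ': './data/goodreads_interactions_poetry.json',
}
for url, path in files.items():
    if not os.path.exists(path):
        gdown.download(url, path, quiet=False)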
%%time
import pandas as pd
books_metadata = pd.read_json('./data/goodreads_books_poetry.json', lines=True)
interactions = pd.read_json('./data/goodreads_interactions_poetry.json', lines=True)
Books Metadata
Let’s start by inspecting the books’ metadata information. To develop a reliable and robust ML model, it is essential to get a thorough understanding of the available data.
As a first step, let’s take a look at all the available fields and some sample data:
books_metadata.sample(2)
While all of the available information can be valuable for extracting contextual signals and training a better recommendation system, for this example we’ll focus only on a few selected fields that require minimal manipulation.
# Limit the books metadata to selected fields
books_metadata_selected = books_metadata[['book_id',
'average_rating', 'is_ebook', 'num_pages', 'publication_year',
'ratings_count', 'language_code']]
books_metadata_selected.sample(5)
Now that we have the data limited to the selected fields, we’ll run it through the pandas profiler to perform a preliminary exploratory data analysis and better understand the available data.
import pandas_profiling
import numpy as np
# replace blank cells with NaN
books_metadata_selected.replace('', np.nan, inplace=True)
# not taking book_id into the profiler report
profile = \
pandas_profiling.ProfileReport(books_metadata_selected[['average_rating',
'is_ebook', 'num_pages', 'publication_year', 'ratings_count']])
profile.to_file('./results/profiler_books_metadata_1.html')
Considering the results from the profiler, we’ll perform the following transformations on the dataset:
- Replace missing categorical values with a placeholder value, effectively creating a new category
- Bin the numeric variables into discrete intervals
# using pandas cut method to convert fields into discrete intervals
books_metadata_selected['num_pages'].replace(np.nan, -1, inplace=True)
books_metadata_selected['num_pages'] = pd.to_numeric(books_metadata_selected['num_pages'])
books_metadata_selected['num_pages'] = pd.cut(books_metadata_selected['num_pages'], bins=25)
# round average ratings to the nearest 0.5
books_metadata_selected['average_rating'] = books_metadata_selected['average_rating'].apply(lambda x: round(x*2)/2)
# using pandas qcut method to convert fields into quantile-based discrete intervals
books_metadata_selected['ratings_count'] = pd.qcut(books_metadata_selected['ratings_count'], 25)
# replace missing publication years with the placeholder value 2100
books_metadata_selected['publication_year'].replace(np.nan, 2100, inplace=True)
# replace missing language codes with 'unknown'
books_metadata_selected['language_code'].replace(np.nan, 'unknown', inplace=True)
# convert is_ebook column into 1/0 where true=1 and false=0
books_metadata_selected['is_ebook'] = books_metadata_selected.is_ebook.map(
lambda x: 1.0*(x == 'true'))
profile = pandas_profiling.ProfileReport(books_metadata_selected[['average_rating', 'is_ebook', 'num_pages',
'publication_year', 'ratings_count']])
profile.to_file('./results/profiler_books_metadata_2.html')
Interactions Data
Again, let’s start by looking at the available fields and some sample data:
interactions.sample(5)
Limit the data to only the selected fields
# Limit the interactions data to selected fields
interactions_selected = interactions[['user_id', 'book_id', 'is_read', 'rating']]
# mapping boolean to string
booleanDictionary = {True: 'true', False: 'false'}
interactions_selected['is_read'] = interactions_selected['is_read'].replace(booleanDictionary)
profile = pandas_profiling.ProfileReport(interactions_selected[['is_read', 'rating']])
profile.to_file('./results/profiler_interactions.html')
Considering the results from the profiler, we’ll convert the `is_read` column to 1/0:
# convert is_read column into 1/0 where true=1 and false=0
interactions_selected['is_read'] = interactions_selected.is_read.map(
lambda x: 1.0*(x == 'true'))
Since we have two fields denoting the interaction between a user and a book, `is_read` and `rating`, let’s see how many data points we have where the user hasn’t read the book but has still given a rating.
interactions_selected.groupby(['rating', 'is_read']).size().reset_index().pivot(columns='rating', index='is_read', values=0)
From the above results, we can infer that users with a `rating` >= 1 have all read the book. Therefore, we’ll use `rating` as the final score, drop interactions where `is_read` is false, and limit the interactions to a random sample of 5,000 users to keep the data size manageable for further analysis (sketched below).
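The exact filtering code lives in the accompanying notebook; a minimal version of this step could look like the following (the random seed and the way the 5,000 users are sampled are illustrative choices):
import numpy as np
np.random.seed(42)  # for a reproducible user sample
# keep only interactions where the book was actually read
read_interactions = interactions_selected[interactions_selected['is_read'] == 1.0]
# limit to a random sample of 5,000 users
sampled_users = np.random.choice(read_interactions['user_id'].unique(), size=5000, replace=False)
interactions_selected = read_interactions[read_interactions['user_id'].isin(sampled_users)]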
Now, let’s transform the available data into CSR sparse matrices that can be used for matrix operations. We’ll start by creating the books_metadata matrix, an np.float64 csr_matrix of shape [n_books, n_books_features], where each row contains that book’s weights over the features. However, before we create the sparse matrix, we’ll first build an item dictionary for future reference.
item_dict = {}
df = books_metadata[['book_id', 'title']].sort_values('book_id').reset_index()
for i in range(df.shape[0]):
    item_dict[df.loc[i, 'book_id']] = df.loc[i, 'title']
# dummify categorical features
books_metadata_selected_transformed = pd.get_dummies(books_metadata_selected, columns = ['average_rating', 'is_ebook', 'num_pages', 'publication_year', 'ratings_count', 'language_code'])
books_metadata_selected_transformed = books_metadata_selected_transformed.sort_values('book_id').reset_index().drop('index', axis=1)
books_metadata_selected_transformed.head(5)
from scipy.sparse import csr_matrix
# convert to csr matrix
books_metadata_csr = csr_matrix(books_metadata_selected_transformed.drop('book_id', axis=1).values)
Next, we’ll create the interactions matrix, an np.float64 csr_matrix of shape [n_users, n_books]. We’ll also create a user dictionary for future use.
user_book_interaction = pd.pivot_table(interactions_selected, index='user_id', columns='book_id', values='rating')
# fill missing values with 0
user_book_interaction = user_book_interaction.fillna(0)
user_id = list(user_book_interaction.index)
user_dict = {}
counter = 0
for i in user_id:
    user_dict[i] = counter
    counter += 1
# convert to csr matrix
user_book_interaction_csr = csr_matrix(user_book_interaction.values)
user_book_interaction_csr
Ideally, we would build, train, and evaluate several models for our recommender system to determine which model holds the most promise for further optimization (hyper-parameter tuning).
However, for this tutorial, we’ll train a base model with input parameters chosen without any tuning, purely for demonstration.
from lightfm import LightFM
model = LightFM(loss='warp',
                random_state=2016,
                learning_rate=0.90,
                no_components=150,
                user_alpha=0.000005)
model = model.fit(user_book_interaction_csr,
                  epochs=100,
                  num_threads=16, verbose=False)
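Although a proper evaluation (with a train/test split and hyper-parameter tuning) is the subject of the next article, LightFM ships with ranking metrics in lightfm.evaluation that give a quick sanity check on the fitted model. The sketch below evaluates on the training interactions themselves, which is optimistic and is shown only to illustrate the API:
from lightfm.evaluation import precision_at_k, auc_score
# evaluating on the training data is optimistic; use a held-out set in practice
train_precision = precision_at_k(model, user_book_interaction_csr, k=5, num_threads=16).mean()
train_auc = auc_score(model, user_book_interaction_csr, num_threads=16).mean()
print(f'Precision@5 on train: {train_precision:.3f}')
print(f'AUC on train: {train_auc:.3f}')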
To get the top-n recommendations back, we’ll repurpose code by Aayush Agrawal [6].
def sample_recommendation_user(model, interactions, user_id, user_dict,
                               item_dict, threshold=0, nrec_items=5, show=True):
    n_users, n_items = interactions.shape
    user_x = user_dict[user_id]
    # predicted scores for every item for this user
    scores = pd.Series(model.predict(user_x, np.arange(n_items),
                                     item_features=books_metadata_csr))
    scores.index = interactions.columns
    scores = list(pd.Series(scores.sort_values(ascending=False).index))
    # items the user has already interacted with above the threshold
    known_items = list(pd.Series(interactions.loc[user_id, :]
                                 [interactions.loc[user_id, :] > threshold].index)
                       .sort_values(ascending=False))
    # drop known items from the ranked list and keep the top-n
    scores = [x for x in scores if x not in known_items]
    return_score_list = scores[0:nrec_items]
    # map book ids to titles for display
    known_items = list(pd.Series(known_items).apply(lambda x: item_dict[x]))
    scores = list(pd.Series(return_score_list).apply(lambda x: item_dict[x]))
    if show:
        print("User: " + str(user_id))
        print("Known Likes:")
        counter = 1
        for i in known_items:
            print(str(counter) + '- ' + i)
            counter += 1
        print("\n Recommended Items:")
        counter = 1
        for i in scores:
            print(str(counter) + '- ' + i)
            counter += 1
Let’s check the results
sample_recommendation_user(model, user_book_interaction, 'ff52b7331f2ccab0582678644fed9d85', user_dict, item_dict)
Machine learning has become increasingly popular over the past decade, and recent advances in computational power have led to exponential growth in the number of people exploring how new methods can be incorporated to advance the field of recommendation systems.
We often treat recommendation systems as black-box algorithms. Still, hopefully this post shed some light on the underlying math and intuition behind them, along with high-level code to get you started building your first recommendation model.
In the next article, we’ll go one step deeper into how to evaluate the performance of these models and tune their hyper-parameters to get more intuitive and reliable results.
[1] https://en.wikipedia.org/wiki/Recommender_system
[2] Agarwal, Deepak K. Statistical Methods for Recommender Systems. Cambridge University Press (Kindle Edition).
[3] Aggarwal, Charu C. Recommender Systems: The Textbook. Springer, 2016.
[4] Mengting Wan, Julian McAuley, “Item Recommendation on Monotonic Behavior Chains”, in RecSys’18
[5] Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley, “Fine-Grained Spoiler Detection from Large-Scale Review Corpora”, in ACL’19
[6] https://github.com/aayushmnit/cookbook/blob/master/recsys.py
Thanks for reading. If you have any feedback, please feel free to reach out by commenting on this post, messaging me on LinkedIn, or sending me an email (shmkapadia[at]gmail.com).
If you liked this article, read my other articles on NLP