Last Updated : 17 May, 2024
Recommendation engines enhance the user experience in almost every domain, whether it is online shopping, social media, or movie streaming. With millions of pieces of content generated every second, it becomes extremely difficult for businesses to recommend content that matches each customer's interests and behavior. This is where recommendation systems come into play with personalized recommendations.
In this article, we will understand what collaborative filtering is and how we can use it to build our own recommendation system.
Building a Recommendation Engine With Collaborative Filtering in Python
In this implementation, we will build an item-item, memory-based recommendation engine in Python that recommends the top 5 books to a user based on their choice. You can download the datasets from here:
- books.csv
- ratings.csv
- users.csv
Step 1: Importing Necessary libraries
We need to import the following libraries to implement the recommendation engine.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
Step 2: Load the Dataset
We load the three CSV files and get a description of each dataset using the 'info()' method.
# Load datasets
users = pd.read_csv('/kaggle/input/book-recommendation-dataset/Users.csv')
books = pd.read_csv('/kaggle/input/book-recommendation-dataset/Books.csv')
ratings = pd.read_csv('/kaggle/input/book-recommendation-dataset/Ratings.csv')

# Get dataset info
users.info()
books.info()
ratings.info()
Output:
users:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User-ID 278858 non-null int64
1 Location 278858 non-null object
2 Age 168096 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB
books:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271360 entries, 0 to 271359
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ISBN 271360 non-null object
1 Book-Title 271360 non-null object
2 Book-Author 271358 non-null object
3 Year-Of-Publication 271360 non-null object
4 Publisher 271358 non-null object
5 Image-URL-S 271360 non-null object
6 Image-URL-M 271360 non-null object
7 Image-URL-L 271357 non-null object
dtypes: object(8)
memory usage: 16.6+ MB
ratings:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 User-ID 1149780 non-null int64
1 ISBN 1149780 non-null object
2 Book-Rating 1149780 non-null int64
dtypes: int64(2), object(1)
memory usage: 26.3+ MB
Step 3: Data Cleaning and Preparation
In this step, we clean the data and get it ready for model building.
Many records share the same book title but differ in publisher or publication year. So, we drop the rows with duplicate book titles and store the result in the 'new_books' DataFrame.
# Drop rows with duplicate book titles
new_books = books.drop_duplicates('Book-Title')
We then merge the 'ratings' df with the 'new_books' df on 'ISBN' (the unique identifier for each book) and store the result in 'ratings_with_name'. We also drop the columns we do not require, such as 'ISBN' and the image URL columns.
# Merge ratings and new_books df
ratings_with_name = ratings.merge(new_books, on='ISBN')

# Drop non-relevant columns
ratings_with_name.drop(['ISBN', 'Image-URL-S', 'Image-URL-M', 'Image-URL-L'], axis=1, inplace=True)
Now, we merge the 'ratings_with_name' df with the 'users' df to get 'users_ratings_matrix'. Similarly, we drop the non-relevant columns.
# Merge new 'ratings_with_name' df with users df
users_ratings_matrix = ratings_with_name.merge(users, on='User-ID')

# Drop non-relevant columns
users_ratings_matrix.drop(['Location', 'Age'], axis=1, inplace=True)

# Print the first few rows of the new dataframe
users_ratings_matrix.head()
Output:
User-ID Book-Rating Book-Title Book-Author Year-Of-Publication Publisher
0 276725 0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
1 2313 5 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
2 2313 8 In Cold Blood (Vintage International) TRUMAN CAPOTE 1994 Vintage
3 2313 9 Divine Secrets of the Ya-Ya Sisterhood : A Novel Rebecca Wells 1996 HarperCollins
4 2313 5 The Mistress of Spices Chitra Banerjee Divakaruni 1998 Anchor Books/Doubleday
Next, we check for null values and drop them.
# Check for null values
users_ratings_matrix.isna().sum()

# Drop null values
users_ratings_matrix.dropna(inplace=True)
print(users_ratings_matrix.isna().sum())
Output:
User-ID 0
Book-Rating 0
Book-Title 0
Book-Author 0
Year-Of-Publication 0
Publisher 0
dtype: int64
Since 'users_ratings_matrix' still has too many entries, we filter it down to users who gave many ratings and then to the most-rated books. The code filters the DataFrame based on two criteria:
- Users with many book ratings: It groups the DataFrame by 'User-ID' and counts the ratings each user has given, creating a boolean mask 'x' where each entry indicates whether that user has given more than 100 ratings.
- Books with the most ratings: It then filters 'filtered_users_ratings' (which contains only the users with many ratings) down to books that have received at least 50 ratings.
# Filter down 'users_ratings_matrix' on the basis of users who gave many book ratings
x = users_ratings_matrix.groupby('User-ID').count()['Book-Rating'] > 100
knowledgeable_users = x[x].index
filtered_users_ratings = users_ratings_matrix[users_ratings_matrix['User-ID'].isin(knowledgeable_users)]

# Filter down on the basis of books with the most ratings
y = filtered_users_ratings.groupby('Book-Title').count()['Book-Rating'] >= 50
famous_books = y[y].index
final_users_ratings = filtered_users_ratings[filtered_users_ratings['Book-Title'].isin(famous_books)]
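The groupby-and-mask pattern used above can be seen in isolation on a toy DataFrame. This is a minimal sketch with made-up users and a lower threshold of 2 ratings; only the pattern matches the real code.

```python
import pandas as pd

# Hypothetical interactions: user 1 rated three books, users 2 and 3 fewer
df = pd.DataFrame({
    'User-ID':     [1, 1, 1, 2, 2, 3],
    'Book-Title':  ['A', 'B', 'C', 'A', 'B', 'A'],
    'Book-Rating': [5, 4, 3, 2, 5, 4],
})

# Boolean mask: True for users with more than 2 ratings
mask = df.groupby('User-ID').count()['Book-Rating'] > 2

# Indexing the boolean Series by itself keeps only the True entries
active_users = mask[mask].index

# Keep only the rows belonging to those users
filtered = df[df['User-ID'].isin(active_users)]
print(list(active_users))  # [1]
print(len(filtered))       # 3
```

The same two-line idiom (`mask = ... > threshold`, then `mask[mask].index`) is applied twice in the real code, once per 'User-ID' and once per 'Book-Title'.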
Now, we will create the pivot table for 'final_users_ratings' df. It will be a sparse user-rating matrix where each row will contain all the user ratings for a particular item and each column will contain all the item ratings by a particular user.
# Pivot table creation
pivot_table = final_users_ratings.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')

# Fill the NA values with 0
pivot_table.fillna(0, inplace=True)
pivot_table.head()
Output:
User-ID 254 507 882 1424 1435 1733 1903 2033 2110 2276 ... 274549 274808 275020 275970 276680 277427 277478 277639 278188 278418
Book-Title
1984 9.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1st to Die: A Novel 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2nd Chance 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Blondes 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A Beautiful Mind: The Life of Mathematical Genius and Nobel Laureate John Nash 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
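The shape of this pivot can be verified on a tiny example. The titles, user IDs, and ratings below are made up for illustration; the pivot call mirrors the one above.

```python
import pandas as pd

# Hypothetical ratings for 2 books by 2 users
toy = pd.DataFrame({
    'Book-Title':  ['1984', '1984', 'Dune'],
    'User-ID':     [254, 507, 254],
    'Book-Rating': [9, 7, 8],
})

# Rows = books, columns = users, cells = ratings
pt = toy.pivot_table(index='Book-Title', columns='User-ID', values='Book-Rating')
pt = pt.fillna(0)  # a (book, user) pair with no rating becomes 0
print(pt)
# User-ID     254  507
# Book-Title
# 1984        9.0  7.0
# Dune        8.0  0.0
```

Note that unrated cells start as NaN, which is why the real code calls `fillna(0)` before computing similarities.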
scikit-learn does not provide a standalone function for centered cosine similarity. So, we first standardize the pivot table using 'StandardScaler' and then apply cosine similarity to the standardized data.
# Standardize the pivot table
scaler = StandardScaler(with_mean=True, with_std=True)
pivot_table_normalized = scaler.fit_transform(pivot_table)
Step 4: Model Building
First, we calculate the similarity matrix for all the items using 'cosine_similarity'.
# Calculate the similarity matrix for all the books
similarity_score = cosine_similarity(pivot_table_normalized)
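What 'StandardScaler' contributes here can be checked on a toy item-user matrix with made-up values: after `fit_transform`, every user column has zero mean, so the cosine of the transformed rows is a centered cosine similarity rather than a raw one.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

# Toy matrix: 3 items (rows) x 4 users (columns), made-up ratings
toy = np.array([
    [5.0, 3.0, 4.0, 1.0],
    [4.0, 1.0, 3.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
])

scaler = StandardScaler(with_mean=True, with_std=True)
centered = scaler.fit_transform(toy)  # standardizes each column (each user)

# Every user column is now mean-centered
print(np.allclose(centered.mean(axis=0), 0))  # True

# Item-item similarity matrix: one row/column per item
sim = cosine_similarity(centered)
print(sim.shape)                     # (3, 3)
print(np.allclose(np.diag(sim), 1))  # True: each item is maximally similar to itself
```

The diagonal of the similarity matrix is always 1, which is why the `recommend()` function below skips the first entry when ranking neighbors.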
Then, we create a function called 'recommend()' which recommends the top 5 books to the user based on their choice.
- The code finds the numerical index of the given book name in the pivot table.
- It sorts the similarity scores for the given book in descending order.
- It selects the top 5 similar books (excluding the given book itself).
- It retrieves the details (title, author, and image URL) of each similar book from the 'new_books' DataFrame.
- It formats the information and returns it as a list.
def recommend(book_name):
    # Returns the numerical index for the book_name
    index = np.where(pivot_table.index == book_name)[0][0]

    # Sorts the similarities for the book_name in descending order
    similar_books = sorted(list(enumerate(similarity_score[index])), key=lambda x: x[1], reverse=True)[1:6]

    # To return the result in list format
    data = []
    for index, similarity in similar_books:
        item = []
        # Get the book details by index
        temp_df = new_books[new_books['Book-Title'] == pivot_table.index[index]]
        # Only add the title, author, and image URL to the result
        item.extend(temp_df['Book-Title'].values)
        item.extend(temp_df['Book-Author'].values)
        item.extend(temp_df['Image-URL-M'].values)
        data.append(item)
    return data
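The sorting-and-slicing step inside 'recommend()' can be isolated with a hypothetical similarity row (a plain list standing in for one row of 'similarity_score'). Slicing `[1:6]` works because a book's similarity to itself is 1, so it always ranks first.

```python
# Hypothetical similarity scores of one book against 6 books; index 0 is the book itself
scores = [1.0, 0.2, 0.9, 0.1, 0.6, 0.4]

# Pair each score with its index, sort by score descending,
# then drop the book itself and keep the next 5 entries
similar = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)[1:6]
print(similar)  # [(2, 0.9), (4, 0.6), (5, 0.4), (1, 0.2), (3, 0.1)]
```

The indices in the result (2, 4, 5, ...) are then mapped back to book titles via `pivot_table.index`, which is what the loop in 'recommend()' does.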
Step 5: Validating the Model
# Call the recommend method
recommend('1984')
Output:
[["Foucault's Pendulum",
'Umberto Eco',
'http://images.amazon.com/images/P/0345368754.01.MZZZZZZZ.jpg'],
['Tis : A Memoir',
'Frank McCourt',
'http://images.amazon.com/images/P/0684848783.01.MZZZZZZZ.jpg'],
['Animal Farm',
'George Orwell',
'http://images.amazon.com/images/P/0451526341.01.MZZZZZZZ.jpg'],
['The Glass Lake',
'Maeve Binchy',
'http://images.amazon.com/images/P/0440221595.01.MZZZZZZZ.jpg'],
['Summer Pleasures',
'Nora Roberts',
'http://images.amazon.com/images/P/0373218397.01.MZZZZZZZ.jpg']]
Conclusion
Building a recommendation engine with collaborative filtering is a robust way to add personalization to a service. By following the steps above, you can build an effective recommendation system that is sensitive to user preferences and behavior.