Collaborative Filtering in 9 Lines of Code

Toby Segaran's 2007 book Programming Collective Intelligence popularized machine learning methods among computer programmers. In the intervening years, we have also seen powerful data analysis packages such as Pandas mature. Revisiting the book, I realized that its implementations could be greatly simplified by using these packages.

I started with Chapter 2, Making Recommendations. If you are interested, O'Reilly provides a free preview of the chapter on the book's website. This chapter demonstrates the technique of collaborative filtering: by finding other people whose tastes are similar to yours, it looks at other items they like and offers you recommendations, much like what you have seen on amazon.com. As an exercise, I tried to implement the method using Pandas. Sure enough, the code that spread over several pages in the book can now be written in 9 lines. Mind you, I am only just getting familiar with Pandas and am not an expert by any means, but I am rather excited by the power it offers.

This article walks you through the steps. First of all, I have encoded the data, originally a nested dictionary, in the more standard CSV format as movie_rating.csv. This can be loaded with the read_csv() function. Although the data is flattened into a list in the file, it can easily be viewed as a pivot table of critics by titles. (Note that the book has somehow excluded the critic Michael Phillips from its calculation, so I have omitted him from the CSV file to match the book's results.)

In [3]:
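import pandas as pd
# Load the long-format ratings: one row per (critic, title, rating)
rating = pd.read_csv('movie_rating.csv')
# Pivot to a titles x critics table (older Pandas releases spelled these arguments rows= and cols=)
rp = rating.pivot_table(columns=['critic'], index=['title'], values='rating')
rp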
Out[3]:
critic              Claudia Puig  Gene Seymour  Jack Matthews  Lisa Rose  Mick LaSalle  Toby
title
Just My Luck                 3.0           1.5            NaN        3.0             2   NaN
Lady in the Water            NaN           3.0            3.0        2.5             3   NaN
Snakes on a Plane            3.5           3.5            4.0        3.5             4   4.5
Superman Returns             4.0           5.0            5.0        3.5             3   4.0
The Night Listener           4.5           3.0            3.0        3.0             3   NaN
You Me and Dupree            2.5           3.5            3.5        2.5             2   1.0

Pandas has nicely filled in NaN in the cells for movies not reviewed by a critic.
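For readers who want to reproduce the flattening step themselves, here is a minimal sketch of how a nested dictionary could be turned into one CSV row per rating. It only uses two critics, with ratings copied from the pivot table above. The variable and function names are hypothetical, and the column order critic, title, rating is an assumption inferred from the column names the Pandas code refers to.

```python
import csv
import io

# Excerpt of the ratings, copied from the pivot table above (two critics only).
critics = {
    'Lisa Rose': {
        'Just My Luck': 3.0,
        'Lady in the Water': 2.5,
        'Snakes on a Plane': 3.5,
        'Superman Returns': 3.5,
        'The Night Listener': 3.0,
        'You Me and Dupree': 2.5,
    },
    'Toby': {
        'Snakes on a Plane': 4.5,
        'Superman Returns': 4.0,
        'You Me and Dupree': 1.0,
    },
}


def flatten(nested):
    """Yield one (critic, title, rating) row per rating in the nested dict."""
    for critic, movies in nested.items():
        for title, score in movies.items():
            yield critic, title, score


# An in-memory buffer stands in for the file movie_rating.csv.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['critic', 'title', 'rating'])  # assumed column order
writer.writerows(flatten(critics))
csv_text = buffer.getvalue()
print(csv_text)
```

Movies a critic did not review simply have no row in this format, which is why the pivot table shows NaN for them.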

The next step is to find the similarity score between the critics, using the author Toby as an example. The book introduces a somewhat involved formula, the Pearson correlation score. It turns out this is simply the correlation coefficient supported by most statistical packages. In Pandas, you can use corrwith() to calculate it. A score close to 1 means the two critics' tastes are very similar. As you can see in the result below, Lisa Rose's taste is very similar to Toby's, while Gene Seymour's is much less so.

In [10]:
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
sim_toby
Out[10]:
critic
Claudia Puig     0.893405
Gene Seymour     0.381246
Jack Matthews    0.662849
Lisa Rose        0.991241
Mick LaSalle     0.924473
Toby             1.000000
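As a sanity check on these scores, the Pearson correlation can also be worked out by hand. The sketch below uses plain Python and only the three movies Toby has rated (Snakes on a Plane, Superman Returns, You Me and Dupree), with ratings copied from the pivot table above. The function name pearson is my own and is not taken from the book or from Pandas.

```python
from math import sqrt

# Ratings for the three movies Toby has reviewed, in the order:
# Snakes on a Plane, Superman Returns, You Me and Dupree.
toby = [4.5, 4.0, 1.0]
lisa_rose = [3.5, 3.5, 2.5]
gene_seymour = [3.5, 5.0, 3.5]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the mean (unnormalized covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


print(round(pearson(toby, lisa_rose), 6))     # 0.991241
print(round(pearson(toby, gene_seymour), 6))  # 0.381246
```

Both values agree with the corrwith() output, which only compares the titles that both critics have rated.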

To make recommendations for Toby, we calculate the other critics' ratings weighted by their similarity to him. Note that we only need to calculate ratings for movies Toby has not yet seen. The first line below filters out the irrelevant data. The next two lines then assign the similarity score and the weighted rating.

In [14]:
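# Keep other critics' rows for titles Toby has not rated; copy() so new columns can be added safely
rating_c = rating[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')].copy()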
rating_c['similarity'] = rating_c['critic'].map(sim_toby.get)
rating_c['sim_rating'] = rating_c.similarity * rating_c.rating

Lastly, we add up the scores for each title using groupby(). We also normalize each score by dividing it by the sum of the weights. Based on the other critics' similarity and their ratings, we have made movie recommendations for Toby. The numbers match the results in the book.

In [15]:
recommendation = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendation.sort_values(ascending=False)
Out[15]:
title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981

Putting it all together, here are the 9 lines of code that do collaborative filtering.

In [13]:
# Load the ratings and pivot them into a titles x critics table
rating = pd.read_csv('movie_rating.csv')
rp = rating.pivot_table(columns=['critic'], index=['title'], values='rating')

# Similarity of every critic to Toby
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)

# Other critics' ratings of movies Toby has not seen, weighted by similarity
rating_c = rating[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')].copy()
rating_c['similarity'] = rating_c['critic'].map(sim_toby.get)
rating_c['sim_rating'] = rating_c.similarity * rating_c.rating

# Normalized score per title, best first
recommendation = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendation.sort_values(ascending=False)
Out[13]:
title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
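The same steps could also be wrapped in a function that works for any critic, not just Toby. The following is only a sketch under a few assumptions: the function name recommend is hypothetical, the ratings are typed in from the pivot table above rather than read from movie_rating.csv, and it works directly on the pivot table instead of the flat list. Like the code above, it does not filter out critics with zero or negative similarity, which happens not to matter for this data set.

```python
import pandas as pd

# Ratings typed in from the pivot table above (critic -> title -> rating).
ratings = {
    'Claudia Puig': {'Just My Luck': 3.0, 'Snakes on a Plane': 3.5,
                     'Superman Returns': 4.0, 'The Night Listener': 4.5,
                     'You Me and Dupree': 2.5},
    'Gene Seymour': {'Just My Luck': 1.5, 'Lady in the Water': 3.0,
                     'Snakes on a Plane': 3.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You Me and Dupree': 3.5},
    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                      'Superman Returns': 5.0, 'The Night Listener': 3.0,
                      'You Me and Dupree': 3.5},
    'Lisa Rose': {'Just My Luck': 3.0, 'Lady in the Water': 2.5,
                  'Snakes on a Plane': 3.5, 'Superman Returns': 3.5,
                  'The Night Listener': 3.0, 'You Me and Dupree': 2.5},
    'Mick LaSalle': {'Just My Luck': 2.0, 'Lady in the Water': 3.0,
                     'Snakes on a Plane': 4.0, 'Superman Returns': 3.0,
                     'The Night Listener': 3.0, 'You Me and Dupree': 2.0},
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0,
             'You Me and Dupree': 1.0},
}
# Titles become the index, critics the columns, missing ratings NaN.
rp = pd.DataFrame(ratings)


def recommend(table, person):
    """Similarity-weighted average rating of each title `person` has not rated."""
    # Pearson similarity of every other critic to `person`
    sim = table.corrwith(table[person]).drop(person)
    # Other critics' ratings for the titles `person` has not rated
    unseen = table.index[table[person].isna()]
    others = table.loc[unseen].drop(columns=person)
    # Numerator: rating * similarity (NaN where a critic gave no rating)
    weighted = others.mul(sim, axis=1)
    # Denominator: similarity, counted only where a rating exists
    weights = others.notna().astype(float).mul(sim, axis=1)
    scores = weighted.sum(axis=1) / weights.sum(axis=1)
    return scores.sort_values(ascending=False)


print(recommend(rp, 'Toby'))
```

Calling recommend(rp, 'Toby') should give the same three titles in the same order as above. Passing another critic's name works too, although every other critic in this data set has already rated at least five of the six movies.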
