Testing recommender systems in R

Recommender systems are pervasive. You have encountered them while buying a book on barnesandnoble, renting a movie on Netflix, listening to music on Pandora, to finding the bar visit (FourSquare). Saar for Revolution Analytics, had demonstrated how to get started with some techniques for R here.

We will build some using Michael Hahsler’s excellent package – recommenderlab. But to build something we have to learn to recognize when it is good. For this reason we will talk about some metrics quickly –

– RMSE  (Root Mean Squared Error) : Here we measure far were real ratings from the ones we predicted. Mathematically, we can write it out as

RMSE = \sqrt\frac{\sum_{(i,j) \in \kappa}(r_{(i,j)} - \hat {r}_{(i,j)})^2}{|\kappa|}

where \kappa is the set of all user-item pairings (i, j) for which we have a predicted rating \hat r_{(i,j)}  and a known rating r_{(i,j)}  which was not used to learn the recommendation model.

Here at sane.a.lytics, I will talk about when an analysis makes sense and when it doesn’t. RMSE is a great metric if you are measuring how good your predicted ratings are. But if you want to know how many people clicked on your recommendation, I have a different metric for you.

– Precision/Recall/f-value/AUC: Precision tells us how good the predictions are. In other words, how many were a hit.

Recall tells us how many of the hits were accounted for, or the coverage of the desirable outcome.

Precision and recall usually have an inverse relationship. This becomes an even bigger issue for rare issue phenomenon like recommendations. To tackle this problem, we will use f-value. This is nothing but the harmonic mean of precision and recall.

Another popular measure is AUC. This is roughly analogous. We will go ahead and use this for now for our comparisons of recommendation effectiveness.

– ARHR (Hit Rate): Karypis likes this metric.

ARHR = \frac{1}{\#users} \sum_{i=1}^{\#hits} \frac{1}{p_i}

where p is the position of the item in a ranked list.

OK, on to the fun stuff.

They are a few different ways to build a recommender system

Collaborative Filtering : If my friend Jimmy tells me that he liked the movie “Drive”, I might like it too since we have similar tastes. However if Paula tells me she liked “The Notebook”, I might avoid it. This is called UBCF (User-based collaborative filtering). Another way to think about it is that this is soft-clustering. We find Users with similar tastes (neighbourhood) and use their preferences to build yours.

Another flavour of this is IBCF (Item Based Collaborative Filtering). If I watched “Darjeeling Limited”, I might be inclined to watch “The Royal Tannenbaums” but not necessarily “Die Hard”. This is because the first two are more similar in the users who have watched/rated them. This is a rather simple to compute as all we need is the covariance between products to find out what this might be.

Let’s compare both approaches on some real data (thanks R)

# Load required library
library(recommenderlab) # package being evaluated
library(ggplot2) # For plots
# Load the data we are going to work with
data(MovieLense)
MovieLense
# 943 x 1664 rating matrix of class ‘realRatingMatrix’ with 99392 ratings.
# Visualizing a sample of this
image(sample(MovieLense, 500), main = "Raw ratings")

view raw
recommenderlab-1-1.R
hosted with ❤ by GitHub

Distribution of ratings sample

# Visualizing ratings
qplot(getRatings(MovieLense), binwidth = 1,
main = "Histogram of ratings", xlab = "Rating")
summary(getRatings(MovieLense)) # Skewed to the right
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 3.00 4.00 3.53 4.00 5.00

view raw
recommenderlab-1-2.R
hosted with ❤ by GitHub

Histogram of ratings

# How about after normalization?
qplot(getRatings(normalize(MovieLense, method = "Z-score")),
main = "Histogram of normalized ratings", xlab = "Rating")
summary(getRatings(normalize(MovieLense, method = "Z-score"))) # seems better
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -4.8520 -0.6466 0.1084 0.0000 0.7506 4.1280

view raw
recommenderlab-1-3.R
hosted with ❤ by GitHub

Rating histogram, normalized

# How many movies did people rate on average
qplot(rowCounts(MovieLense), binwidth = 10,
main = "Movies Rated on average",
xlab = "# of users",
ylab = "# of movies rated")
# Seems people get tired of rating movies at a logarithmic pace. But most rate some.

view raw
recommenderlab-1-4.R
hosted with ❤ by GitHub

Movies rated

# What is the mean rating of each movie
qplot(colMeans(MovieLense), binwidth = .1,
main = "Mean rating of Movies",
xlab = "Rating",
ylab = "# of movies")
# The big spike on 1 suggests that this could also be intepreted as binary
# In other words, some people don't want to see certain movies at all.
# Same on 5 and on 3.
# We will give it the binary treatment later

view raw
recommenderlab-1-5.R
hosted with ❤ by GitHub

avg movie rating

recommenderRegistry$get_entries(dataType = "realRatingMatrix")
# We have a few options
# Let's check some algorithms against each other
scheme <- evaluationScheme(MovieLense, method = "split", train = .9,
k = 1, given = 10, goodRating = 4)
scheme
algorithms <- list(
"random items" = list(name="RANDOM", param=list(normalize = "Z-score")),
"popular items" = list(name="POPULAR", param=list(normalize = "Z-score")),
"user-based CF" = list(name="UBCF", param=list(normalize = "Z-score",
method="Cosine",
nn=50, minRating=3)),
"item-based CF" = list(name="IBCF2", param=list(normalize = "Z-score"
))
)
# run algorithms, predict next n movies
results <- evaluate(scheme, algorithms, n=c(1, 3, 5, 10, 15, 20))
# Draw ROC curve
plot(results, annotate = 1:4, legend="topleft")
# See precision / recall
plot(results, "prec/rec", annotate=3)

view raw
recommenderlab-1-6.R
hosted with ❤ by GitHub

AUC

prec/recall

It seems like UBCF did better than IBCF. Then why would you use IBCF? The answer lies is when and how are you generating recommendations. UBCF saves the whole matrix and then generates the recommendation at predict by finding the closest user. IBCF saves only k closest items in the matrix and doesn’t have to save everything. It is pre-calculated and predict simply reads off the closest items.

Predictably, RANDOM is the worst but perhaps surprisingly it seems, its hard to beat POPULAR. I guess we are not so different, you and I.

In the next post I will go over some other algorithms that are out there and how to use them in R. I would also recommend reading Michael’s documentation on recommenderlab for more details.

Also added this to r-bloggers. Please check it out for more R goodies.

Testing recommender systems in R