Matrix factorization

staples-easy-button

Or fancy words that mean very simple things.

At the heart of most data mining, we are trying to represent complex things in a simple way. The simpler you can explain the phenomenon, the better you understand. It’s a little zen – compression is the same as understanding.

Warning: Some math ahead.. but stick with it, it’s worth it.

When faced with a matrix of very large number of users and items, we look to some classical ways to explain it. One favorite technique is Singular Value Decomposition, affectionately nicknamed SVD. This says that we can explain any matrix A in the form

A= USV^T

Here

A is of size num_users and num_products

U is of size num_users and num_users

S is a rectangular diagonal matrix of size num_users and num_products

V is of size num_products and num_products

This is cool because the numbers in the diagonal S are decreasing values of variance (roughly speaking). So the first number captures the most variance in A, the second, less so, and so on, till all the numbers put together capture all the variance of A.

You see where this is going. If we are comfortable with explaining only 10% of the variance, we can do so by only taking some of these values. We can then compress A.

The other nice thing about all this is eigenvectors (which those roughly represent) are orthogonal. In other words, they are perpendicular to each other and there aren’t any mixed effects. If you don’t understand this paragraph, don’t worry. Keep on.

Consider that above can be rewritten as

A = U\sqrt{S}\sqrt{S}V^T

or –

Y = X \Theta^T

If X is of size num_users and num_categories and \Theta is of size num_products and num_categories, then we have a decent model that approximates A into Y. Y is now called a low rank approximation of A, or truncated SVD.

It is important to understand what these matrices represent. X is the representation of users in some low dimension space (say romance, action). And \Theta is the representation of products in the very same low dimension space. However, we don’t know exactly what these factors mean. They could be romance/action or they could NYC/Dallas. This is also why this method is sometimes called Latent Factor Matrix Factorization.. wow, quite a mouthful.

In my chapter in the book Data Mining Applications with R, I go over different themes of matrix factorization models (and other animals as well). But for now, I am only going to cover the basic one that works very well in practice. And yes, won the Netflix prize.

There is one problem with our formulation – SVD is only defined for dense matrices. And our matrix is usually sparse.. very sparse. Netflix’s challenge matrix was 1% dense, or 99% sparse. In my job at Rent the Runway, it is only 2% dense. So what will happen to our beautiful formula?

Machine learning is sort of a bastard science. We steal from beautiful formulations, complex biology, probability, Hoefding’s inequality, and derive rules of thumb from it. Or as an artist would say – get inspiration.

So here is what we are going to do. We are going to ignore all the null values when we solve this model. Our cost function now becomes

J = R ( Y - X \Theta^T)^2 + \lambda (||X||^2 + ||\Theta||^2)

Here Y - X\Theta^T is the difference between observed data and our prediction. R is simply a matrix with 1 where we have a value and 0 where we do not. Multiplying these two we are only considering the cost when we observe a value. We are using L2 regularization of magnitude \lambda. We are going to divide by 2 to make all other math easier. The cost is relative so it doesn’t matter.

J = \frac{1}{2} R ( Y - X \Theta^T)^2 +\frac{1}{2} \lambda (||X||^2 + ||\Theta||^2)

Using this our gradients become –

\frac{\partial}{\partial{X}} = R (Y - X \Theta^T) \Theta + \lambda ||X||

\frac{\partial}{\partial{\Theta}} = R (Y - X \Theta^T)^T X+ \lambda ||\Theta||

If we were wanted to minimize the cost function, we can move in the direction opposite to the gradient at each step, getting new estimates for X and \Theta each time.

This looks easy enough. One last thing. We now have what we want to minimize, but how do we do it? R provides many optimization tools. There is a whole CRAN page on it. For our purpose we will use out of the box optim() function. This allows us to access a fast optimization method L-BFGS-B. It’s not only fast, but also doesn’t take too memory intensive which is desirable in our case.

We need to give it the cost function and the gradient that we have above. It also expects one gradient function, so we need to unroll it out to do both our gradients.

This is great, we can iterate a few times to approximate users and items into some small number of categories, then predict using that.

I have coded this into another package recommenderlabrats.

Let’s see how this does in practice against what we already have. I am going to use the same scheme as last post to evaluate these predictions against some general ones. I am not using Item Based Collaborative Filtering because it is very slow

RSVD ROC

RSVD prec recall

It does pretty well. It does better than POPULAR and is equivalent to UBCF. So why use this over UBCF or the other way round?

This is where things get interesting. This algorithm can be sped up quite a lot and more importantly, parallelised. It uses way less memory than UBCF and is more scalable.

Also, if you have already calculated \Theta, i.e. your items in some lower dimensional space, you might get away with just putting the users in that space. Now things become really fast because all you have to do is keep \Theta fixed and figure out X.

I have gone ahead and implemented a version where we just calculate \Theta, I leave it to you as an exercise to modify the code above to test this out as well. The algorithm is called RSVD_SPLIT. Also feel free to try other values of categories, lambda and maxit and see how things change.

On the other hand, the latent categories are very hard to explain. For UBCF you can say this user is similar to these other users. For IBCF, one can say this item that the user picked is similar to these other items. But that not the case for this particular flavor of matrix factorization. We will re-evaluate these limitations later.

The hardest part for a data scientist is to disassociate themselves from their dear models and watch them objectively in the wild. Our simulations and metrics are always imperfect but necessary to optimize. You might see your favorite model crash and burn. And a lot of times simple linear regression will be king. The job is to objectively measure them, tune them and see which one performs better in your scenario.

Good luck.

Matrix factorization

Testing recommender systems in R

Recommender systems are pervasive. You have encountered them while buying a book on barnesandnoble, renting a movie on Netflix, listening to music on Pandora, to finding the bar visit (FourSquare). Saar for Revolution Analytics, had demonstrated how to get started with some techniques for R here.

We will build some using Michael Hahsler’s excellent package – recommenderlab. But to build something we have to learn to recognize when it is good. For this reason we will talk about some metrics quickly –

- RMSE  (Root Mean Squared Error) : Here we measure far were real ratings from the ones we predicted. Mathematically, we can write it out as

RMSE = \sqrt\frac{\sum_{(i,j) \in \kappa}(r_{(i,j)} - \hat {r}_{(i,j)})^2}{|\kappa|}

where \kappa is the set of all user-item pairings (i, j) for which we have a predicted rating \hat r_{(i,j)}  and a known rating r_{(i,j)}  which was not used to learn the recommendation model.

Here at sane.a.lytics, I will talk about when an analysis makes sense and when it doesn’t. RMSE is a great metric if you are measuring how good your predicted ratings are. But if you want to know how many people clicked on your recommendation, I have a different metric for you.

- Precision/Recall/f-value/AUC: Precision tells us how good the predictions are. In other words, how many were a hit.

Recall tells us how many of the hits were accounted for, or the coverage of the desirable outcome.

Precision and recall usually have an inverse relationship. This becomes an even bigger issue for rare issue phenomenon like recommendations. To tackle this problem, we will use f-value. This is nothing but the harmonic mean of precision and recall.

Another popular measure is AUC. This is roughly analogous. We will go ahead and use this for now for our comparisons of recommendation effectiveness.

- ARHR (Hit Rate): Karypis likes this metric.

ARHR = \frac{1}{\#users} \sum_{i=1}^{\#hits} \frac{1}{p_i}

where p is the position of the item in a ranked list.

OK, on to the fun stuff.

They are a few different ways to build a recommender system

Collaborative Filtering : If my friend Jimmy tells me that he liked the movie “Drive”, I might like it too since we have similar tastes. However if Paula tells me she liked “The Notebook”, I might avoid it. This is called UBCF (User-based collaborative filtering). Another way to think about it is that this is soft-clustering. We find Users with similar tastes (neighbourhood) and use their preferences to build yours.

Another flavour of this is IBCF (Item Based Collaborative Filtering). If I watched “Darjeeling Limited”, I might be inclined to watch “The Royal Tannenbaums” but not necessarily “Die Hard”. This is because the first two are more similar in the users who have watched/rated them. This is a rather simple to compute as all we need is the covariance between products to find out what this might be.

Let’s compare both approaches on some real data (thanks R)

Distribution of ratings sample

Histogram of ratings

Rating histogram, normalized

Movies rated

avg movie rating

AUC

prec/recall

It seems like UBCF did better than IBCF. Then why would you use IBCF? The answer lies is when and how are you generating recommendations. UBCF saves the whole matrix and then generates the recommendation at predict by finding the closest user. IBCF saves only k closest items in the matrix and doesn’t have to save everything. It is pre-calculated and predict simply reads off the closest items.

Predictably, RANDOM is the worst but perhaps surprisingly it seems, its hard to beat POPULAR. I guess we are not so different, you and I.

In the next post I will go over some other algorithms that are out there and how to use them in R. I would also recommend reading Michael’s documentation on recommenderlab for more details.

Also added this to r-bloggers. Please check it out for more R goodies.

Testing recommender systems in R