Live Blog – days 4-5 – Dress recommendation models

[Image: 2006 Rolls-Royce Phantom engine]

<Please note that this post is unfinished because ************ even though we delivered better-than-expected results! The Unlimited program went on to be 70% of revenue during my tenure.>

This post is probably what you expected this series to be.

Previously, on building a production-grade, state-of-the-art recommendation system in 30 days:

Live logging building a production recommendation engine in 30 days

Live reco build days 2-3 – Unlimited recommendation models

We had to get that work out of the way before we could play. Let’s talk about the two baseline models we are building:

Bayesian model track

To refresh your memory, this simple Bayesian model is going to estimate

P(s | u) = P(c | u) * P(s | c)

where u is a user, s is a style (product in RTR terminology) and c is a carousel.
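To make that concrete, here is a toy sketch (all numbers invented) of how the chain rule becomes a single matrix multiply once we marginalize over carousels:

import numpy as np

# Toy sizes: 2 users, 3 carousels, 4 styles. All probabilities are made up.
# Rows sum to 1: each user's affinity over carousels, P(c|u).
P_c_given_u = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.2, 0.6]])

# Rows sum to 1: each carousel's mix over styles, P(s|c).
P_s_given_c = np.array([[0.40, 0.30, 0.20, 0.10],
                        [0.10, 0.20, 0.30, 0.40],
                        [0.25, 0.25, 0.25, 0.25]])

# P(s|u) = sum_c P(c|u) * P(s|c), assuming s depends on u only through c.
P_s_given_u = P_c_given_u @ P_s_given_c  # shape: num_users x num_styles
print(P_s_given_u.sum(axis=1))           # sanity check: each row sums to 1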

Well, what is c supposed to be?

Since we want explainable carousels, Sean is chasing down building this model with product attributes and 8 expert style persona tags (minimalist, glamorous, etc.). We have a small merchandising team, and tagging 1000 products with at least 10 tags each took a month. To speed things up, we are going to ask Customer Experience (CX), a much larger team. They can tag all products within the week.

However, their definition of these style personas might differ from merchandising’s. So to start, we asked them to tag a small sample of products (100) that had already been tagged. We got results back within a day. Then Sean compared how the ratings from these new taggers differ from merchandising’s.

He compared them three ways; a code sketch of these computations follows the list. To refresh, ratings are from 1 to 5.

  • Mean absolute difference between the means of CX tags and original tags
    • Overall average difference: 0.45
    • Mean absolute difference by persona:
      • Minimalist: 0.62
      • Romantic: 0.44
      • Glamorous: 0.24
      • Sexy: 0.24
      • Edgy: 0.37
      • Boho: 0.35
      • Preppy: 0.46
      • Casual: 0.78
  • Agreement between taggers on the same team
    • CX tags’ cosine similarity to each other has a median of 0.83, whereas the original tags’ similarity to each other has a median of 0.89. This is to be expected: merchandisers speak the same language because they have to be specific. But CX is within the acceptable range of agreement.
  • Agreement between CX tags and the original merchandising tags, as measured by cosine similarity
    • 99% are above 0.8 and 87% above 0.9. So while they don’t match exactly, they are directionally the same.
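Here is roughly how those three comparisons are computed, sketched with numpy on invented stand-in data (the real inputs are the 100 re-tagged products):

import numpy as np

# Stand-ins: rows = the 100 re-tagged products, columns = the 8 personas,
# values = each team's mean 1-5 rating for that product/persona.
cx = np.random.uniform(1, 5, size=(100, 8))
merch = np.random.uniform(1, 5, size=(100, 8))

# 1. Mean absolute difference, by persona and overall
mad_by_persona = np.abs(cx - merch).mean(axis=0)
print("overall:", mad_by_persona.mean())

# 3. Cross-team agreement: cosine similarity of each product's persona vector
def row_cosine(a, b):
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

sims = row_cosine(cx, merch)
print("share above 0.8:", (sims > 0.8).mean())

# 2. Within-team agreement is the same cosine computation applied to pairs of
# taggers on the same team, summarized as a median.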

This is encouraging. Sean then set out to check the impact of these tags on rentals. He grouped members by whether they showed dominance of some attribute over the previous 3 months, then looked at the probability of them renting that same attribute the next month. He computed chi-squared statistics on these relationships to see if they are statistically significant.

This is what you always want to do. You want to make sure your predictors are correlated with your response variable before you build a castle on top.
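For reference, that significance check looks something like this with scipy (the counts are invented):

import numpy as np
from scipy.stats import chi2_contingency

# Invented 2x2 contingency table.
# Rows: attribute dominant in the member's prior 3 months (yes / no).
# Columns: rented that attribute the following month (yes / no).
table = np.array([[220,  80],
                  [140, 160]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2g}")  # a small p suggests the tag actually predicts rentals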

He found some correlation for things like formality, but not much else. He will continue into next week to solidify this analysis so we know for sure.

Matrix factorization collaborative filtering track

The Bayesian model above is attribute first. It says: given the attributes of the product, see how well you can predict what happened. These sorts of models are called content-based in the literature.


I’m more on the AI/ML end of the data science spectrum, and my kind is particularly distrustful of clever attributes and experts. We have what happened; we should be able to figure out what matters with enough data. I will throw in the hand-crafted attributes to see if they improve things, but that’s not where I’d start.

Please note that I do recommend Bayesian approaches when the problem calls for it, when that’s all I have, or when I have little data.

To be clear, the main difference between the two approaches is that the Bayesian model we are building is merchandise-attribute first, while this ML model is order first. Both have to sail to the other shore, but they are different starting points. We are doing both to minimize the chance of failure.

This has implications for cold start, both for new users and new products. It also has implications for how to name these carousels. So I walked around the block a few times and, after a few coffees, had a head full of ideas.

First things first: if we can’t predict the products, it doesn’t matter what we name the carousels. So let’s do that first. I mentioned earlier that I have my test harness from when I built Rent The Runway recommendations; see here on github.

Matrix factorization is a fill-in-the-blanks algorithm. It is not strictly supervised or unsupervised. It decomposes users and products into the SAME latent space. This means we can now talk in only this much smaller space, and what we say will apply to the large user × product space.

If you need a refresher on Matrix Factorization, please see this old post. Now come back to this and let’s talk about a really important addition -> Implicit Feedback Matrix Factorization.

Unlike Netflix, we are not predicting ratings. We are predicting whether or not someone will pick (order) a style. So instead of a ratings matrix Y, we have Z, which simply says whether someone picked a product or didn’t: it has 1s and 0s.

We will now introduce a new matrix C which indicates our confidence in the user having picked that item. In the case of a user not picking a style, that cell in Z will be 0. The corresponding value in C will be low, but not zero. This is because we don’t know whether she skipped it because the dress wasn’t available in her size, her cart was already full (she can pick 3 items at a time), or myriad other reasons. In general, where Z has a 1 we have high confidence, and low confidence everywhere else.
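The paper parameterizes this as c_ui = 1 + alpha * z_ui, which keeps a baseline confidence on the empty cells. A tiny sketch (the alpha value is an illustrative guess, not what we used):

import numpy as np

Z = np.array([[1, 0, 0],     # picks: 1 = ordered, 0 = didn't
              [0, 1, 1]], dtype=float)

alpha = 40.0                 # illustrative; tuned in practice
C = 1.0 + alpha * Z          # high confidence on picks, low but nonzero elsewhere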

We still have

Z = X \Theta^T

where X is num_users × num_factors and \Theta is num_products × num_factors.

But now our loss becomes

J = \sum_{u,i} C_{ui} (Z_{ui} - X_u \Theta_i^T)^2

Let’s add some regularization

J = \sum_{u,i} C_{ui} (Z_{ui} - X_u \Theta_i^T)^2 + \lambda ({||X||}^2 + {||\Theta||}^2)

There are a lot of cool tricks in the paper, including personalized product-to-product recommendations per user; see the paper in all its glory here.
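My runs used my own harness linked above, but for reference the open-source implicit library fits this same objective. A sketch with the settings I mention below (20 factors, lambda of 0.01), on an invented picks matrix:

import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Invented picks matrix: users x styles, 1.0 where a user ordered a style.
picks = sp.random(10_000, 1_000, density=0.01, format="csr")
picks.data[:] = 1.0

# The library treats the input values as confidences, so scaling the binary
# picks by alpha approximates c_ui = 1 + alpha * z_ui (check your version's
# docs for the exact convention).
alpha = 40.0
model = AlternatingLeastSquares(factors=20, regularization=0.01, iterations=15)
model.fit(picks * alpha)  # recent versions expect users x items; older ones, the transpose

X = model.user_factors      # num_users x 20
Theta = model.item_factors  # num_styles x 20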

So I took all the users who had been in the program for at least 2 months over the past two years. I took only their orders (picks), even though we have hearts, views, and other data available to us. I got a beefy Amazon large-memory box (thanks, DevOps) and ran this for a few epochs. I pretty much left the parameters at what I had used last time around: 20 factors and a small regularizing lambda of 0.01. This came up with \Theta and X in 20-factor space.

This is a good time to gut-check. I asked someone in the office to sit with me. I showed her her previous history that the algorithm considered –

Then predicted her recommendations and asked her to tell me which ones she would wear.

For her, the results were really good. Everything predicted was something she had considered, hearted, or worn. There were some she said she wouldn’t wear, but that was fewer than 10 products in this 100-product sample.

This is great news; we are on track. We still have to validate against all users to reality-check how many we actually predicted in aggregate. But that’s next week.

So now to the second question. How do we put these products into carousels? How do we cold-start?

Carousels go round and round

This is where the coffee came in handy. What are these decomposed matrices? They are the compressed version of what happened. They are telling the story of what the users actually picked/ordered. So \Theta is organized by what folks want, not product attributes.

Let’s say for a second that each factor is simply 1 or 0. With 20 binary factors per product, there are 2^20 (over a million) possible combinations. That’s a large space of possible carousels. We have real numbers, not bits, so this space is far larger still.

Hierarchical clustering is a technique for breaking clusters into sub-clusters. Imagine each product sitting in its own cluster, organized by dissimilarity. Then you repeatedly take the two most similar clusters, join them, and recompute dissimilarities between the merged cluster and the rest using the Lance–Williams dissimilarity update formula. You stop when you have only one cluster containing all the products. You could also have done this the opposite way, top-down. Here is what it looks like when it’s all said and done for these products.

The advantage is that we can stop with the top 3 clusters, or zoom down to any level we want. But how many clusters is an appropriate starting point, and how do we name these clusters?

Naming is hard

Explainability with ML is hard. With the Bayesian model, since we are starting attribute first, it comes naturally (see, I like Bayesian). We have our beautiful \Theta with all products in 20 factors. Let’s get an intuition for this by cutting the tree into 20 clusters.
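Concretely, with scipy (theta here is a random stand-in for the fitted item factors, and Ward linkage is my arbitrary choice of method):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

theta = np.random.randn(1_000, 20)  # stand-in for the real num_products x 20 matrix

# Bottom-up agglomerative clustering; scipy applies the Lance-Williams
# dissimilarity update internally as clusters merge.
tree = linkage(theta, method="ward")

# Cut the dendrogram at any granularity we like, here 20 clusters.
labels = fcluster(tree, t=20, criterion="maxclust")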

I then compare these clusters to the product attributes and notice some trends. For example, cluster 9 has –

Cluster 10 has –

This is a lot, but even if we filter to tops, these are different products.

Cluster 9 tops

Cluster 10 tops

So yes, these clusters can be named via a description of which product attributes show up in them, or via filtering by attribute. I have some ideas on experiments I want to run next week after validation.
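One way to mechanize that naming, sketched with pandas (the frame, clusters, and attribute values are all hypothetical):

import pandas as pd

# Hypothetical frame: one row per product, with its cluster and an attribute.
products = pd.DataFrame({
    "cluster":   [9, 9, 9, 10, 10],
    "attribute": ["sheath", "sheath", "gown", "romper", "blouse"],
})

# Describe each cluster by its most common attributes.
top_attrs = products.groupby("cluster")["attribute"].apply(
    lambda s: s.value_counts().head(3).index.tolist())
print(top_attrs)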

Sidenote: AI

In our previous experiment, we had minimized KL divergence between expert tags and inventory. Although this isn’t the metric we care about, we got 40% top-1 accuracy and 60% top-2 (out of 8 possible classes).
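For context, top-k accuracy over 8 classes just checks whether the true tag lands among the k largest predicted probabilities. A sketch on invented arrays:

import numpy as np

# Invented: predicted probabilities over the 8 personas, one row per style.
probs = np.random.dirichlet(np.ones(8), size=645)
truth = np.random.randint(0, 8, size=645)

top2 = np.argsort(probs, axis=1)[:, -2:]               # indices of the 2 largest probs
top1_acc = (top2[:, -1] == truth).mean()               # last column is the argmax
top2_acc = (top2 == truth[:, None]).any(axis=1).mean()
print(f"top-1: {top1_acc:.0%}, top-2: {top2_acc:.0%}")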

I then scored the evaluation set. We got 41% for top-1 and 64% for top-2. The good news is that this is consistent with our validation set, although not down to per-class accuracy. But are these results good or bad?

There are two kinds of product attributes. I like to define them as facts (red, gown) and opinions (formality, style persona, reviews). Now of course you can get probability estimates for opinions to move them closer to facts. But imagine if you didn’t need them.

If we can do this well to match opinion tags with 645 styles, it means the space captured via Deep Learning is the right visual space.

What’s better, we can do this while buying new products. Currently the team is using DeepDress to match new products to previous ones we had and comparing performance. We can do better if we map these to user clusters and carousels… we can tell the buying team exactly which users would order a product if they bought it right now! This is pretty powerful.

However, we have to prioritize. I always insist that the most important thing is to build the full data pipeline end-to-end. Model improvements can come later.

So right now, the most important thing is to build the validation pipeline. We will use that to validate the ML baseline, and the Bayesian baseline when it’s ready. We will have to build the delivery, serving, and measurement pipelines too. Once we are done with all that, we will come back and explore this idea.

