Foundation Models for Ranking: Challenges, Successes, and Lessons Learned
Transcript
Moumita Bhattacharya: I'm a machine learning manager at Netflix, leading the foundation models team. This talk will cover some use cases from Netflix. First and foremost, I wanted to open with this slide because, hopefully, by the end of the talk I will have convinced you that in ML there is a Harry Potter, and that Harry Potter can do something magical. Let's see if I manage to do that. Search and recommendation is the overarching topic of this talk. As we all know, search and recommendation as an application of machine learning is omnipresent across products: video streaming services like Netflix, Hulu, or Amazon; music streaming services such as Spotify and Pandora; eCommerce platforms such as Etsy and Amazon. All of them leverage machine learning and AI for search and recommendation use cases, and both the user base and the catalog are ever growing. How many folks are aware of search and recommendation as an application of ML? In reality, for B2C (business to consumer) products at a big company like Netflix or Spotify, there are 100 million plus users; Netflix has 300 million plus users. The catalog, the items over which we have to score to show to a user, is usually more than 100 million, and in an eCommerce context it is probably in the billions. It's a very tough task to rank the whole catalog for each user. Imagine joining Netflix, opening Netflix on your TV, and waiting five minutes before anything shows up on the screen; that would be ridiculous. What do we typically do? Because the problem space to score is really large, we usually break it down into two stages. A user comes to Netflix or Spotify or Amazon, and there are millions of listings.
There is usually a first stage, typically referred to as candidate set selection or retrieval, which reduces the number of candidates in the catalog to some hundreds of thousands of items. Then a more complex model, usually referred to as a second pass ranker, ranks them for precision, so that, as a user, you see something that is personalized and useful to you, and not the entire catalog. This is the usual two-stage ranking in any industry setup for search and recommendation tasks. During my talk, I will focus more on the second pass ranker and then generalize to foundation models.

Common Components for ML on Product

Usually, these are the common components of ML or AI for product in the context of search and recommendation. As I mentioned, there is first pass ranking, which could be a very simple lightweight ML model or a heuristic. That's about the last time you will hear about first pass ranking in this talk. Then we have second pass ranking, offline evaluation, the inference setup, and the A/B test and online evaluation before it becomes available to 100% of the users. Second pass ranking has different stages that need to be handled. First and foremost, where is the data? How do we get the data? What are the features? What is the model architecture? What is the objective: should we optimize for click, versus purchase, versus keeping you engaged for more time, versus just showing you something delightful for a few seconds? The objective and reward is where we really try to capture the business need and the user need. Before we can launch anything in front of real users, there is usually a very rigorous offline evaluation setup to understand whether the model is doing what it is supposed to do; those offline evaluations are guardrails. Then, once a model is ready to be shown to a user, there are a lot of inference considerations like latency.
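The two-stage setup described above can be sketched in a few lines. This is a minimal illustration, not Netflix's implementation: the function names, the popularity heuristic, and the candidate count are all hypothetical stand-ins for a real retrieval model and second pass ranker.

```python
def retrieve(user_id, catalog, k=1000):
    # Stand-in for a lightweight first pass: a simple popularity cut
    # that cheaply narrows a huge catalog to a small candidate set.
    return sorted(catalog, key=lambda item: item["popularity"], reverse=True)[:k]

def expensive_score(user_id, item):
    # Placeholder for the second pass ranker's predicted likelihood of play.
    return item["popularity"] * 0.5

def second_pass_rank(user_id, candidates):
    # Score only the retrieved candidates with the expensive model, best first.
    return sorted(candidates, key=lambda item: expensive_score(user_id, item), reverse=True)

catalog = [{"id": i, "popularity": (i * 37) % 101} for i in range(100_000)]
candidates = retrieve("user-42", catalog)          # full catalog -> 1,000 candidates
ranking = second_pass_rank("user-42", candidates)  # 1,000 candidates, precision-ordered
```

The point of the split is cost: the expensive model only ever sees the small candidate set, never the full catalog.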
As I was saying, if a user has to wait five minutes to see a result, that would be a horrible experience. How do we optimize for latency? Those of you who work in the ML infra space know p50 and p90: what is the end-to-end time the model takes to return results? Then there are throughput and compute cost; of course, now with GPUs, cost is a big consideration. Finally, what are some user metrics we leverage to assess whether the new model we showed to all the users is relevant or not? That's where A/B test metrics, score metrics, and so on come in. This is just an overview of the different components. In my talk I will primarily focus on three of them, which I have already mentioned: the second pass ranker, offline evaluation, and inference. In inference, latency, throughput, and compute cost are very important.

A Netflix Ranking Use Case - Unified Contextual Recommender (UniCoRn)

Now let me share a specific ranking use case where, in the Netflix context, we proposed one model to serve both search and recommendation. Just for historical context, academically, search and recommendation have been approached by two different communities: conferences like RecSys tackle recommendation tasks, and conferences like SIGIR tackle search tasks. We basically said that, ultimately, given the right context, this is the same task, which is ranking. The question we asked was, can we build a single model for both search and recommendation tasks? The answer is yes. To repeat the example of what a search task is in the context of Netflix: when you type, let's say, P-A-R-I, which is text, we would expect Netflix to return titles like Emily in Paris, or Cooking with Paris, the actress Paris, and so on. Then, a pure recommendation task is where we don't have any such context, where we do not have a search term.
Then a different kind of recommendation task is the video-to-video recommendation task: when you click Emily in Paris, what are the other titles similar to it? The premise of this part of the talk is that we can build one ML model to jointly serve both search and recommendation tasks. First, let's double-click on the differences between a search task and a recommendation task. The first is the context itself. For search, there's always a query that you type, a user intent provided to the system, so the query is the input context for search, whereas for recommendation it could be a video, or it could be nothing, which is basically just the profile ID, your user ID, as context. Then, because they're usually part of different parts of the product, there are different engagements: when you search something in Amazon, you engage with a different part of the product than when you see the Amazon homepage, which is recommendation. As a result, there are also different candidates retrieved for search and for recommendation. People who work in industry know there are always business requests, based on which some last pass business rules are set up, and those are usually different as well. The goal of this work was to develop a single contextual recommender system, which we named UniCoRn, the Unified Contextual Ranker, or recommender system, that can serve all of these search and recommendation tasks. What's the benefit? When you can have just one model instead of four different models, you need fewer scientists and engineers to develop it. You can bring innovation to multiple places by innovating on just one model, so that's a huge benefit. The different tasks also benefit from each other. And no one really loves taking care of tech debt, so there is lower maintenance cost as well as reduced tech debt. These are some huge benefits.
We were successfully able to build this one model and replace four different models in Netflix production, and it is part of what powers Netflix search, as well as some parts of the recommendations, today. How did we go about doing it? Remember the differences in context that I mentioned? We basically unify the context, or unify the differences. First, unify the context: instead of just having the query and profile ID in the context, we add query, country, language, and task type, whether it's a search or a recommendation task. Then we also combine the data: we take user engagement data from across the product and mix it, so it's data-driven multitask learning. We also add context-specific features: when there's a query, there are query-specific features; when it's just a source title ID, Emily in Paris, more like this, we add those source-title-ID-related entity features, and so on. Here's an example of a task and its context. For a search task, the context looks something like this: query, country, language, and task equals search, whereas for title-title recommendation it's source video, country, language, and task equals title-title recommendation. This is both data-driven and model-parameter-driven multitask learning, which allows the model to learn unique behavior for each task while also benefiting from the other tasks. We also combine all the engagement from all parts of the product instead of training different models on different parts of the data. And ultimately, whatever this model is ranking, it is still optimizing the likelihood of play: when a user comes in, whichever part of the product they come to, Netflix wants them to find something they can play, just as Amazon wants you to find something relevant to you. Here's an example of the model. It's a fully connected deep neural network.
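The unified context idea above can be made concrete with a small sketch. This is purely illustrative: the field names, the `make_context` helper, and the IDs are hypothetical, not Netflix's actual schema; the point is only that every task fits one schema, with missing fields left empty (or imputed) rather than having a separate model per task.

```python
def make_context(task, country, language, query=None, source_video_id=None, profile_id=None):
    # One schema for all tasks; absent fields stay None for tasks that lack them.
    return {
        "task": task,                        # "search", "title_title", "personalization", ...
        "country": country,
        "language": language,
        "query": query,                      # present for search, None (or imputed) otherwise
        "source_video_id": source_video_id,  # present for title-title recommendation
        "profile_id": profile_id,
    }

search_ctx = make_context("search", "US", "en", query="pari", profile_id=123)
title_ctx = make_context("title_title", "US", "en", source_video_id=456, profile_id=123)
```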
I won't go into a lot of detail on the architecture, but there are certain things to highlight in the model. We have entity features, which describe the target, because we are ranking a set of videos for Netflix, or a set of listings for Amazon; those are features about the final output, the target. Then we have context features like the query, the source title ID, for example Emily in Paris or Stranger Things, and the profile: whatever we know from your behavior on the product, we take that context into account. Other context features are things like device type or time of day. Maybe it's Saturday evening and you want to watch something sitting with your friends or partner, versus Monday morning, when you want to take a break during work and you're watching something else on Netflix; it really depends on the context. Then there are context-entity features as well: for a given context and target, what are some joint features? Those are called cross features, and they are very important. We similarly have categorical features across all these different categories. Real-valued features are just numeric features, and for any categorical feature we have embeddings in the model. Then it's a fully connected neural network with skip connections, or residual connections, which usually help the model not forget some of the input context. Ultimately, the model tries to optimize the likelihood of positive engagement; in the context of Netflix, positive engagement could be play. This model then gets deployed in production, and this one model powers search, powers title-title recommendation, powers pure personalization recommendation, and so on. We refer to this model as UniCoRn because it's a unicorn, serving four different canvases with one model. How is this learning happening? What are some unique things about a model like that? Each of these different tasks is actually benefiting from the others.
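A forward pass of the shape just described (embeddings for categorical features, concatenated with numeric features, fed through a fully connected network with a residual connection and a sigmoid output) can be sketched as follows. All dimensions and weights here are made up for illustration; the real model's sizes and training are not described in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real model's dimensions are not public.
VOCAB = {"country": 50, "task": 4, "device": 10}  # categorical vocab sizes
EMB_DIM, NUM_DENSE, HIDDEN = 8, 16, 32

emb = {name: rng.normal(size=(n, EMB_DIM)) for name, n in VOCAB.items()}
W1 = rng.normal(size=(NUM_DENSE + EMB_DIM * len(VOCAB), HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
w_out = rng.normal(size=HIDDEN) * 0.1

def forward(categorical_ids, numeric_features):
    # Embedding lookup per categorical feature, concatenated with numerics.
    x = np.concatenate([numeric_features] + [emb[name][i] for name, i in categorical_ids.items()])
    h = np.maximum(x @ W1, 0.0)      # hidden layer with ReLU
    h = np.maximum(h @ W2, 0.0) + h  # residual (skip) connection
    return 1.0 / (1.0 + np.exp(-(h @ w_out)))  # sigmoid: likelihood of play

p = forward({"country": 3, "task": 1, "device": 7}, rng.normal(size=NUM_DENSE))
```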
The rest of the tasks act as auxiliary tasks. An example: when a user types stranger as a query, and when Stranger Things appears as a source title ID, the model learns that the user is really looking for something similar. Through a query, the intent is, show me Stranger Things; but when I click on Stranger Things as a title, for the rest of the recommendations I want to fetch titles that are similar to Stranger Things. Having the task type as context, along with features specific to the different tasks, lets the model learn tradeoffs between the tasks. When we train an individual model for just one purpose, we are making that model very narrow in its intent, whereas when we train one model for different tasks, it is able to learn from those different tasks and no longer remains as myopic. In some way, I'm trying to motivate the next part of the presentation, the foundation model, which takes this even further, where the model is completely agnostic to any task; UniCoRn is still specific to search and recommendation. Another aspect I can share here is that imputing missing context is very helpful. For some tasks, like the pure personalization task or the title-title recommendation task, you don't have query terms, but imputing those queries through a heuristic or some other model was very helpful. For title-title recommendation, when we don't have a query, some of the things we've done are tokenizing the source title name and treating it as a query, or mapping an entity to a query based on some heuristic, and so on. Also, ML practitioners know that feature crossing can be very helpful here. There is a model architecture called Deep & Cross Network, DCN-V2, and that is what we are using here; it's very useful. With this unification, we were able to achieve either a lift in, or parity of, performance on these different tasks.
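For reference, the cross layer that DCN-V2 uses for feature crossing has the form x_{l+1} = x_0 * (W_l x_l + b_l) + x_l, where * is element-wise. A minimal numpy sketch (sizes and weights are arbitrary, just to show the mechanics):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
x0 = rng.normal(size=d)  # concatenated input features (embeddings + numerics)

def cross_layer(x0, x, W, b):
    # DCN-V2 cross layer: the element-wise product with the original input x0
    # builds explicit feature crosses at each layer, on top of a residual term.
    return x0 * (W @ x + b) + x

x = x0
for _ in range(3):  # stack a few cross layers
    x = cross_layer(x0, x, rng.normal(size=(d, d)) * 0.1, rng.normal(size=d) * 0.1)
```

In the full architecture this cross network runs alongside (or feeds into) the deep fully connected part; here only the crossing mechanism is shown.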
In one go, we were able to replace four different machine learning models in production with this one model.

System Considerations

Some system considerations here. Prior to the UniCoRn model being built, we had a proliferation of ML models across the system. This is in the context of Netflix, but I've worked in other places and it's true there too: you train a model to solve each bespoke problem. Each of these systems had to maintain all the different parts of its pipeline. For example, for email notifications, you needed a label preparation step, a featurization step, model training, and then you serve the model somewhere online, and model hosting and inference have to happen. Each of these also incurs its own cost. Similarly, for title-title recommendation, or related items, the same steps have to be repeated with different labels, different features, and so on. The same goes for search, for category exploration, you name it. Both the offline part and the online part used to be done independently, and each of them requires engineers and scientists to maintain it: the offline pipeline, the online pipeline, failures, and so on. With UniCoRn, we were able to replace that series of siloed pipelines with one ML system, where the only per-task difference is the label data preparation, because that is still connected to the product. For notifications we have a label preparation, for title-title similarity we have a label preparation, and so on, for search, for category exploration, for pure personalization. That's the only difference in the pipeline; everything else becomes common. There is unified label preparation, unified feature generation, multitask model training, and then one way to host the model in different systems. Then as the user comes to the product, a client makes a call to the service, the service makes a call to the ranker, and we get the results.
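The consolidation described above can be sketched as follows. Function names, task names, and the row format are illustrative; the point is that only label preparation stays per-task, while everything downstream (feature generation, multitask training, hosting) is shared.

```python
def prepare_labels(task, raw_logs):
    # The one per-task step: turn product-specific logs into labeled rows
    # in a common format that the shared pipeline can consume.
    return [{"task": task, "item": log["item"], "label": log["played"]} for log in raw_logs]

raw_logs_by_task = {
    "search": [{"item": "Emily in Paris", "played": 1}],
    "title_title": [{"item": "Stranger Things", "played": 0}],
    "notification": [{"item": "Cooking with Paris", "played": 1}],
}

training_rows = []
for task, logs in raw_logs_by_task.items():
    training_rows.extend(prepare_labels(task, logs))
# From here on, one unified feature-generation step and one multitask
# training job would consume training_rows for all tasks at once.
```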
It definitely simplifies both the offline pipeline and the online pipeline. However, for online infrastructure, there are some additional considerations. For example, different parts of the product, different online systems, have different SLA considerations. To mitigate that, we host the model separately for some products if there is a separate SLA, and we have different knobs to optimize things like caching: for some canvases, or some parts of the product, results can be cached, and for others they cannot. Similarly, some canvases are extremely latency sensitive, so throughput and SLA are very important, while for others they are not. We really try to provide those levers so that online inference can continue serving the product as it needs, while under the hood the model has been fully unified. These are some specific inference choices we made: deploy the model in a different system environment per use case, provide knobs to tune the characteristics of model inference, including model latency, data freshness, and caching, and expose a generic, use-case-agnostic API to consuming systems. What this enabled is that any product partner can now come in and say, we have a search use case or a recommendation use case, and they have a model endpoint to consume and can just run with it, with different knobs for whether to enable caching and what kind of context to provide the model. It's much more self-serve for product partners to be able to use ML. This really increases innovation velocity, as you can imagine. To enable this flexibility, the API also accepts heterogeneous context input: a pure personalization use case would have just user and country, maybe; a pure title-title recommendation might have user, or not even user, just the source entity ID, and so on. We also enabled a way to have a separate candidate set for each of these tasks.
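The per-canvas knobs could look something like the following. This is a hypothetical configuration shape invented for illustration; the key names, values, and canvases are not Netflix's actual settings.

```python
# Hypothetical per-canvas inference knobs of the kind described above.
CANVAS_CONFIG = {
    "search":          {"cache_ttl_s": 0,    "latency_budget_ms": 150, "dedicated_deployment": True},
    "title_title":     {"cache_ttl_s": 3600, "latency_budget_ms": 300, "dedicated_deployment": False},
    "personalization": {"cache_ttl_s": 900,  "latency_budget_ms": 250, "dedicated_deployment": False},
}

def should_cache(canvas):
    # A canvas with TTL 0 (e.g. search, where results depend on a live query)
    # is never cached; others can reuse results within their TTL.
    return CANVAS_CONFIG[canvas]["cache_ttl_s"] > 0
```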
Foundation Model (FM)

That was UniCoRn, the model that unifies search and recommendation. Now let me go over some specific aspects of the foundation model, a user-specific foundation model that we built at Netflix, and then hopefully I can bring it together as to why I'm talking about two different models in the same presentation. What is a foundation model, or a user foundation model? Inspired by the effectiveness of large language models, for example GPT and Llama, within Netflix we built a large model that can holistically learn member preferences, both long-term and short-term, and be task-agnostic. The whole idea of these large models is that the model's parameter count is so big, and the capability of the model is so large, that we don't really need to tell it about specific tasks; it can understand and learn many more tasks than a few specific ones. UniCoRn was still specific, because it was for search and recommendation. Here we build a large model just to generically understand what members are doing on Netflix, and that's applicable to other products too. Why even build a foundation model? Here are a few pointers. One: this is one model that can learn users' long-term preferences, short-term preferences, as well as long-tail entity representations. It also reduces maintenance cost, allowing us to operate with small teams, and it is cost-efficient because we now need to train just one model; many of the bespoke models can be deprecated. Innovation applied to one part of the product is immediately applicable to other parts of the product. Now we are going even beyond the specific search and recommendation tasks that UniCoRn was already able to replace. How do we go about building this foundation model? Imagine a user comes to Netflix; in this case, maybe they have been on Netflix for four years. We know all the engagement they have had on the product. First, let's say the user came to Netflix and discovered Stranger Things, a horror thriller show.
Then the user binge-watched this title, then discovered another sci-fi title, and so on. This is the user engagement history, and it's the input sequence: the entire history of the user's engagement on the product. As a side note, Netflix does not use any demographic information about the user, only what they have engaged with on the product. You might already see a similarity between a large language model and this foundation model in the context of Netflix. The similarity is