21. April 2021
Enabling users to find the products they are looking for is the core task of the search team of otto.de. Besides finding the correct articles, putting the most relevant products at the top of the list is one of the main challenges we are facing in our job. In this and the following posts we will show you how we implemented Learning to Rank to improve our ranking, how we meet ecommerce-specific challenges, measure search quality and managed to realize technical implementation.
Since OTTO is transforming into a marketplace, the number and diversity of products on sale is increasing every day. Larger numbers result in greater complexity, particularly for the search system. This increase in complexity and the sheer volume of data at otto.de (see fig. 1) cannot be handled by traditional approaches. Instead, we are leveraging machine learning (ML) models for a data-driven approach that provides the greatest value for our customers. For our ranking, we are implementing a Learning-to-Rank model (LTR). LTR helps us increase recall (given the rise in product numbers we will return an ever larger number of products per query with varying relevance to the customer) without the risk of losing precision on visible ranks.
Ranking models learn abstractions of relevance dependencies for products given a certain query. Known LTR models utilize pointwise, pairwise and listwise approaches to optimize ranking (for further details see: https://medium.com/swlh/pointwise-pairwise-and-listwise-learning-to-rank-baf0ad76203e). We are using LambdaMART which can be considered to combine a pairwise and a listwise approach. LambdaMART calculates gradients based on a pairwise comparison of elements in the list followed by a weighting based on NDCG (a common ranking metric), combining the advantages of both approaches. As we mentioned earlier, these models have huge potential in ecommerce. There are also various sector-specific challenges to consider though, some of which we will address in the next section.
A challenge we are facing due to our ecommerce setting is that we cannot generate relevance labels via crowd sourcing or similar approaches. Estimating the relevance of a product in a crowd sourcing setup differs a lot from having the intention to buy something and estimating its relevance from that perspective. We therefore need to model evaluations automatically based on customer data. How this is done exactly depends on the KPI we want to do some optimizing for.
We started out intending to optimize our model by maximizing clicks only. However, we soon discovered that other KPIs are equally if not more relevant. The availability of products, for instance, plays a big part in a customer’s buying decision just as other significant factors do. Leveraging this knowledge, many different definitions of the target metric are conceivable
The model integration pipeline is shown in figure 3. We use a sampled query set to build the training data. For the queries in this set we calculate judgements (our ranking gold standard) and features per query-product pair. The features are generated in the Solr feature store. Combining judgements and features we assemble the train data used to set up our ranking model. The LambdaMART model training is done with a fork of RankLib called RankyMcRankface (see https://github.com/o19s/RankyMcRankFace). The model is then uploaded into the Solr model store and applied in the Solr for reranking. We can validate model performance with a test query set and our Offline Metrics Analyzer . This tool allows us to calculate NDCGs on different configurations of our search system. Thus, we are able to compare an LTR ranking with the status quo or analyze different model performances.
If you have experience in creating relevance judgements or if you have a different opinion about creating labeled data we would love to discuss this topic with you. So please contact us at JARVIS@otto.de. In the following posts we will discuss which data we used to implement LTR, how we measure search quality in our daily work and how we designed the pipeline for data collection and model training.