13. Januar 2014
Today, most online shop systems offer product recommendations. In addition to page navigation and onsite search, recommendations are another good way to lead customers to suitable products. If recommendations are based on the click streams and purchases of customers, external standard software is usually used for generating them. Otto.de uses the prudsys Realtime Decisioning Engine (prudsys RDE).
This scenario is used in particular when, for example, no initial positioning and configuration of recommendations exists in the shop yet. In this case, the prudsys RDE only computes the products to be recommended, while the concrete presentation of the recommended products is done by the shop software itself. Hence, this simple architecture can compute a comparatively large number of recommendations.
Once product recommendations have been accepted by customers, the need for a deeper technical integration into the shop grows: an uninterrupted delivery of recommendations becomes mandatory.
For this reason, the prudsys RDE ships by default with a load balancer that allows more than one system to run in parallel.
This architecture allows the delivery of product recommendations without any interruptions. Furthermore, load is now distributed across several recommendation servers. However, this combination of components has some disadvantages: each recommendation server learns recommendation rules on its own. Since the load balancer assigns each customer to a specific recommendation server, separate rule sets are generated ('learned') on each server, and these have to be merged from time to time.
On the other hand, the load balancer cannot distribute the requests of a single customer across several servers, because recommendations based on that customer's click-stream history would then have to be computed on multiple servers. This is not feasible in the given system architecture.
To meet Otto.de's functional and non-functional goals, the internal architecture of the prudsys RDE was improved. The following system architecture was developed in collaboration with prudsys.
What is eye-catching is that the recommendation system is now separated into two types of nodes ('node' meaning one server in the recommendation system). The learning node performs all operations necessary to maintain the rule base and informs all other nodes about changes to it. The non-learning nodes only have to deliver product recommendations and do not have to update their rule base. This works because the load balancer sends every customer request both to the learning node and to one non-learning node. The learning node immediately responds with HTTP status code 200 and then starts processing the received request. The non-learning node, in contrast, performs a lookup in its rule base to find appropriate recommendations and sends them back to the load balancer, which can then respond to the customer's (technically, Otto.de's) request.
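The fan-out described above can be sketched as follows. This is a minimal illustration, not the prudsys RDE's actual API: all class names are hypothetical, the rule base is shared by reference instead of being pushed over the network, and HTTP handling is reduced to a return code.

```python
from concurrent.futures import ThreadPoolExecutor

class LearningNode:
    """Maintains the rule base; acknowledges each request immediately
    (HTTP 200 in the real system) and learns from it afterwards."""
    def __init__(self):
        self.rule_base = {}  # customer -> recently seen products

    def handle(self, request):
        self.rule_base.setdefault(request["customer"], []).append(request["product"])
        return 200

class NonLearningNode:
    """Only delivers recommendations; its rule base is kept in sync by
    the learning node (here simply shared by reference)."""
    def __init__(self, rule_base):
        self.rule_base = rule_base

    def handle(self, request):
        return list(self.rule_base.get(request["customer"], []))

class LoadBalancer:
    """Fans every request out to the learning node and to one
    non-learning node; the latter's answer goes back to the shop."""
    def __init__(self, learning_node, delivery_nodes):
        self.learning_node = learning_node
        self.delivery_nodes = delivery_nodes

    def dispatch(self, request):
        with ThreadPoolExecutor(max_workers=2) as pool:
            # Learning node learns in parallel; only the delivery
            # node's answer is returned to the caller.
            pool.submit(self.learning_node.handle, request)
            node = self.delivery_nodes[hash(request["customer"]) % len(self.delivery_nodes)]
            return pool.submit(node.handle, request).result()
```

In the real system the two requests are independent HTTP calls; the sketch only shows the control flow in which learning and delivery are decoupled.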
The fact that the learning node can now handle a larger number of requests is based on a statistical effect: once a certain number of requests is exceeded, the responses stay the same. For this reason, the learning node has a queue for incoming requests. If this queue reaches a certain fill level, the learning node starts to skip requests. Due to the large amount of previously processed data, the computed results remain the same.
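The skip behaviour can be illustrated with a small sketch. The class name, the `offer` method, and the concrete threshold values are hypothetical; in the real RDE the skip level depends on configuration parameters such as hardware sizing.

```python
import queue

class SkippingQueue:
    """Sketch of the learning node's request queue: once the fill level
    reaches a configurable skip level, further requests are dropped,
    because beyond that point they no longer change the computed
    recommendation rules noticeably."""
    def __init__(self, capacity, skip_level):
        self._queue = queue.Queue(maxsize=capacity)
        self.skip_level = skip_level
        self.skipped = 0

    def offer(self, request):
        if self._queue.qsize() >= self.skip_level:
            self.skipped += 1  # request is skipped, not processed
            return False
        self._queue.put(request)
        return True

    def size(self):
        return self._queue.qsize()
```

Skipping instead of blocking keeps the learning node responsive under load peaks, at the price of ignoring statistically redundant requests.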
Of course, there are a number of parameters, e.g. hardware sizing, that determine when the queue is actually used. In practice, we have not needed the queue yet: for a state-of-the-art quad-core system, our current traffic of about 100 requests per second is not a challenge. More important for us is that this system architecture has enough capacity for future load peaks as well as for the use of sophisticated filters for the selection and computation of suitable recommendations.
Another new function helps if a node crashes: every node is now able to synchronize itself with the rules of another node. By default, the non-learning nodes synchronize themselves with the learning node. This ability also allows us to easily set up test environments with the same data as the production system. Finally, to reduce the performance impact of a crashed node on the overall system, every non-learning node has a slave system that takes over if its master crashes.
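The resync and failover behaviour can be sketched like this. All names (`Node`, `sync_from`, `serving_node`) are hypothetical illustrations of the mechanism described above, not prudsys RDE interfaces.

```python
class Node:
    """Minimal node sketch with a rule base, an optional slave,
    and a liveness flag."""
    def __init__(self, name):
        self.name = name
        self.rules = {}
        self.slave = None
        self.alive = True

    def sync_from(self, source):
        # After a crash and restart, copy the rule base from another
        # node -- by default from the learning node.
        self.rules = dict(source.rules)

def serving_node(master):
    """Return the node that should serve requests: the master if it is
    alive, otherwise its slave."""
    return master if master.alive else master.slave
```

The same `sync_from` step is what makes it cheap to bootstrap a test environment: a fresh node simply pulls the production rule base.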
Compared to former versions of the prudsys RDE, the load balancer has also been improved. It can now import updated product data in a full-import mode as well as in an incremental-update mode and distribute it to all system nodes. Additionally, the load balancer can now import new recommendation rules, e.g. for the computation of similar products.
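The difference between the two import modes can be shown with a small sketch; the function names and the dictionary-based catalog are assumptions for illustration only.

```python
def full_import(catalog, products):
    """Full-import mode: replace the entire product catalog."""
    catalog.clear()
    catalog.update({p["id"]: p for p in products})

def incremental_update(catalog, changes):
    """Incremental-update mode: apply only changed or new products,
    leaving the rest of the catalog untouched."""
    for p in changes:
        catalog[p["id"]] = p
```

A full import guarantees consistency after larger catalog changes, while incremental updates keep the nodes current between full imports without redistributing all product data.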
The close collaboration with prudsys has led to a more powerful recommendation system, and requirements for future capabilities have been implemented. The system's performance was improved and new functions were added, such as the continuous provisioning of product data or a cluster-resync function for single systems after a restart.
As a result, OTTO now runs a high-performance recommendation system with fast request processing, high availability and a consistent base of recommendation data.
[…] A huge effort. The challenge is not only to find the optimal positioning of individual articles in the shop by means of various templates, but also to exploit the potential of real-time personalization as part of a holistic strategy across all customer-communication channels, while at the same time preserving the brand-specific characteristics. For example, the vendor prudsys has further developed its recommendation engine to take the architecture-specific characteristics of Otto.de into account, so that, loosely speaking, all customers receive a consistent set of recommendations everywhere and at any time according to the same rules, a set that can of course be extended at any time and as quickly as possible with new products or new rules. Otto describes the effort behind this in a post in its developer forum. […]