With a user base of over 100+ million users, and a database of 200+ million resumes, Zhaopin is one of the largest online recruiting and career platforms in China. As a bilingual job board, it provides one of the largest selections of real-time job vacancies in China.
A key part of the software supporting Zhaopin is our event center system. This system, which is responsible for all of the intra-service messaging within Zhaopin, supports mission-critical services such as resume submission and job search. The event center system needs to handle more than 1 billion messages per day under normal conditions, scaling to multiple billion messages per day during the peak recruiting season.
Our previous technology to support the pub-sub messaging and message queuing needs for our event center consisted of two separate systems. Our work queue use cases were implemented with RabbitMQ, while our pub-sub messaging and streaming for use cases used Apache Kafka.
We use work queues throughout our microservices architecture as an intermediate layer to decouple front-end services and backend systems, allowing our systems to run reliably even in the event of traffic spikes. At Zhaopin, our typical work-queue use cases have the following characteristics:
A tracing mechanism is required for tracing the lifecycle of messages for mission-critical services. For instance, when a user submits a résumé, the service handling resume submission will first enqueue a message into the messaging system. All the other services (such as database updates, sending notifications, and recommendations) will then consume the messages from the messaging system and process them.
Although we did have technology in place to support our event center system, we saw that we faced a number of growing challenges and limitations with that technology as we grew.
One of our primary problems with our previous approach was the hardware cost and administrative burden of deploying and managing multiple messaging technologies side by side just for the sake of serving different use cases. Several additional problems arose from having to publish the same event data into dual data pipelines as well. First, the same data needed to be stored in two separate systems, which doubled our storage requirements. Second, it was very difficult to keep the copies consistent across the two different systems. Lastly, having two different technologies introduced additional complexity because our developers and DevOps team had to be familiar with both messaging platforms.
In addition to the operational overhead of running two messaging systems, there were architectural shortcomings within each of our previous technologies, making the decision to switch even more compelling. While RabbitMQ supports work queue use-cases very well, it does not integrate well with popular computing frameworks. It also has several limitations such as scalability, throughput, persistence and message replay. We attempted to overcome by implementing our own distributed message queue service based on Apache Zookeeper and RabbitMQ. However, we found this decision also came with a large, ongoing maintenance burden that we felt was unsustainable.
During our peak usage periods, we were unable to meet our service level SLA without increasing the number of consumers on some of our Kafka topics. However, we found that the ability to parallelize consumption in Kafka is tightly coupled with the number partitions on a given topic. Therefore, in order to increase the number of consumers on a particular topic required us to increase the number of partitions, which was not an acceptable approach for us for these reasons: The newly created partitions did not contain any data, resulting in no real increase in consumption throughput. In order to remedy this, we had to also force partition reassignment on the topic to distribute the backlogged data onto these new partitions.
Both of these steps were not easy to automate, forcing us to monitor the topics for backlog, and manually correct the issues. Although we had deployed Kafka for our pub-sub messaging use cases, Kafka does not provide the data durability guarantees that were increasingly a must-have requirement for our mission-critical business services. For these reasons, in early 2018 we decided to simplify our entire messaging technology stack by selecting a technology that would support both our work queue and streaming use cases, thereby eliminating all of the cost and complexity of hosting separate systems.
With this in mind, we compiled the following list of capabilities that we required from our messaging platform:
Unable to find a better solution among the technologies we knew about, our initial plan was to build our own platform on top of the strongly-consistent distributed log storage platform Apache BookKeeper, which offers an excellent log storage API and has been deployed at internet scale by Twitter, Yahoo, and Salesforce. However, after getting touch with the BookKeeper community, they pointed us to Apache Pulsar — the next generation pub-sub messaging system built on BookKeeper.
After working closely with the Pulsar community and diving deeper into Pulsar, we decided to adopt Pulsar as the basis for Zhaopin’s event center system for following reasons:
In addition to these key architectural features, Apache Pulsar also offers many different enterprise-grade features that are critical for supporting our business critical applications, such as multi-tenancy, geo-replication, built-in schema support and tiered storage. Recently added features such as serverless compute with Pulsar Functions and Pulsar SQL are essential for building event-driven microservices here at Zhaopin.com
We are very happy with our choice of Pulsar and the performance and reliability it provides, and are committed to contributing many great features back to the Apache Pulsar community, such as a Dead Letter Topic, Client Interceptors, and Delayed Messages just to name a few.
If you are running multiple messaging platforms just for the sake of serving different use cases, you should consider replacing them with Apache Pulsar to consolidate your messaging infrastructure into a single system capable of supporting both queuing and pub-/sub messaging.
We are very pleased to be hosting the next Apache Pulsar meetup on 12/15 at our offices in Shanghai. Please stop by to hear from our engineers at @Zhaopin_com on their experiences and best practices for running Pulsar in production and using Pulsar + @ApacheFlink to power their recommendation system. More info: buff.ly/2BFYrBy
We would like to especially thank all the committers to the Apache Pulsar project, as well as the technical support we received from members of its large and growing community.