GeTui is one of the largest third-party push notification service providers in China. It helps mobile application developers set up and send notifications to users across iOS, Android, and other platforms, by leveraging data-driven analysis on user profiles.
Since 2010, GeTui has supported hundreds of thousands of applications and billions of users, including DiDi, JD.com, Weibo, NetEase, People's Daily, Xinhua News Agency, and CCTV.
Because GeTui is a notification service provider, its message queuing system plays a critical role.
Figure 1 illustrates the overview of GeTui push notification service. When a GeTui customer needs to send push notifications to its end users, it first sends messages to GeTui's push notification service. The push notifications are queued in the service based on their priorities.
However, resource contention increases as the number of push notifications waiting in the message queues grows. This drives the need for a priority-based push notification design, because we have to allocate more resources to customers with higher priorities.
Our first priority-based push notification solution was implemented by using Apache Kafka.
Kafka is a high-performance distributed streaming platform developed by LinkedIn. It is widely used within GeTui for log aggregation, online and offline message distribution, and many other use cases.
In this solution, we divided messages into three priority levels: high, normal, and low. Messages of each priority are stored in a group of topics, and push notification tasks are sent to different topics based on their priorities. Downstream consumers receive messages in priority order, and tasks with the same priority are polled in a round-robin way. This guarantees that push notifications with higher priorities are sent as early as possible, while push notifications with low priority are eventually sent as well.
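The polling order described above can be sketched as follows. This is only an illustration, not GeTui's implementation: the class name `PriorityTopicPoller` and the topic names are hypothetical, in-memory deques stand in for Kafka topics, and serving a lower level only when every higher level is empty is an assumption about how "based on their priorities" was realized.

```python
from collections import deque

# Priority levels in the order they are served (from the article).
PRIORITIES = ("high", "normal", "low")


class PriorityTopicPoller:
    """Polls groups of topics by priority, round-robin within one level.

    Hypothetical helper: each "topic" is an in-memory deque, not a Kafka topic.
    """

    def __init__(self, topics_by_priority):
        # topics_by_priority: {"high": {"topic-name": deque([...])}, ...}
        self._topics = topics_by_priority
        # Index of the topic to try first at each priority level.
        self._next_index = {priority: 0 for priority in PRIORITIES}

    def poll(self):
        """Return (priority, topic, message), or None if everything is empty."""
        for priority in PRIORITIES:
            names = list(self._topics.get(priority, {}))
            if not names:
                continue
            start = self._next_index[priority]
            # Try each topic of this level once, starting after the last one served.
            for offset in range(len(names)):
                index = (start + offset) % len(names)
                queue = self._topics[priority][names[index]]
                if queue:
                    self._next_index[priority] = (index + 1) % len(names)
                    return priority, names[index], queue.popleft()
        return None


if __name__ == "__main__":
    poller = PriorityTopicPoller({
        "high": {"app-a": deque(["a1", "a2"]), "app-b": deque(["b1"])},
        "normal": {"app-c": deque(["c1"])},
        "low": {"app-d": deque(["d1"])},
    })
    while True:
        item = poller.poll()
        if item is None:
            break
        print(item)
```

With strict ordering like this, low-priority tasks are only sent once the higher levels drain; a weighted scheme would be an alternative if low-priority traffic must make progress under sustained load.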
As the business grew and the number of applications using our service increased, the Kafka solution ran into the following problems:
To solve these problems, we needed to evaluate another messaging system that supports a large number of topics while maintaining throughput as high as Kafka's. After some investigation, Apache Pulsar caught our attention.
Apache Pulsar is a next-generation distributed messaging system developed at Yahoo. It was built from the ground up to address several shortcomings of existing open-source messaging systems and has been running in Yahoo's production for three years, powering critical applications like Mail, Finance, Sports, Flickr, the Gemini Ads Platform, and Sherpa (Yahoo's distributed key-value store). Pulsar was open-sourced in 2016 and graduated from the Apache incubator as an Apache top-level project (TLP) in September 2018.
After working closely with the Pulsar community and diving deeper into Pulsar, we decided to adopt Pulsar for the new priority-based push notification solution for the following reasons:
After extensive discussions, we settled on a new solution using Apache Pulsar.
The Pulsar solution is close to the Kafka solution, but it solves the problems we encountered with Kafka by leveraging Pulsar's advantages.
Pulsar has been running successfully in production for months, serving the new priority-based push notification system. While adopting and running Pulsar in production, we have collected some best practices on how to make Pulsar work smoothly and efficiently.
subscriptionName to subscribe. Monitor your backlog when adding new subscriptions. Pulsar uses a subscription-based retention mechanism. If you have an unused subscription, please remove it; otherwise, your backlog will keep growing.
dbStorage_rocksDB_blockCacheSize to prevent slowdowns when reading a large volume of backlog.
stats-internal to retrieve topic statistics when troubleshooting a problem in your production cluster.
backlogQuotaDefaultLimitGB in Pulsar is 10 GB. If you are using Pulsar to store messages for multiple days, it is recommended to increase the amount or set a large quota for your namespaces. Choose a proper backlogQuotaDefaultRetentionPolicy for your use case because the default policy is producer_request_hold, which rejects produce requests when you exhaust the quota.
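To make the backlog-monitoring advice above concrete, here is a minimal sketch that reads topic statistics from Pulsar's admin REST API and flags subscriptions that hold a backlog but have no connected consumers. It rests on several assumptions: the admin service is reachable over plain HTTP without authentication, the statistics JSON exposes a `subscriptions` map with `msgBacklog` and `consumers` fields, and the helper names, the admin address in the usage note, and the threshold are hypothetical. The `internalStats` endpoint is the REST counterpart of `stats-internal`.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def stats_url(admin_base, tenant, namespace, topic, internal=False):
    """Build the admin REST URL for a persistent topic's statistics."""
    endpoint = "internalStats" if internal else "stats"
    path = "/".join(quote(part, safe="") for part in (tenant, namespace, topic))
    return f"{admin_base.rstrip('/')}/admin/v2/persistent/{path}/{endpoint}"


def fetch_stats(admin_base, tenant, namespace, topic, internal=False):
    """Fetch and decode the statistics JSON (assumes no authentication)."""
    url = stats_url(admin_base, tenant, namespace, topic, internal)
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def stale_subscriptions(stats, backlog_threshold):
    """Return (name, backlog) pairs for subscriptions over the threshold
    that have no connected consumers, largest backlog first."""
    flagged = []
    for name, subscription in stats.get("subscriptions", {}).items():
        backlog = subscription.get("msgBacklog", 0)
        has_consumers = bool(subscription.get("consumers"))
        if backlog > backlog_threshold and not has_consumers:
            flagged.append((name, backlog))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A call such as `stale_subscriptions(fetch_stats("http://localhost:8080", "public", "default", "push-high"), 100000)` would then list candidate subscriptions to review before removing them; the tenant, namespace, and topic names here are placeholders.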
We have successfully run the new Pulsar-based solution in production for some use cases for a few months, and Pulsar has shown great stability. We keep watching the news, updates, and activities in the Pulsar community and leverage new features for our use cases.
Graduated from the ASF incubator as a top-level project in 2018, Pulsar has plenty of attractive features and advantages over competitors, such as geo-replication, multi-tenancy, seamless cluster expansion, read-write separation, and so on.
The Pulsar community is still young, but there is already a fast-growing trend of adopting Pulsar to replace legacy messaging systems.
While adopting and running Pulsar, we ran into a few problems. A huge thank you goes to Jia Zhai and Sijie Guo from StreamNative for providing quality support.