How Pulsar 3.0 will help teams run Pulsar faster and more reliably at scale
As one of the more prolific open-source contributors to Apache Pulsar, StreamNative was excited to lead the recent efforts to release Pulsar 3.0. It is now available for StreamNative Cloud customers to test on new clusters or to roll out to existing clusters through an upgrade.
What We’ve Learned from Operating Pulsar at Scale
Many of the improvements in Pulsar 3.0 allow teams to run Pulsar faster and more reliably at scale. A large part of StreamNative’s contribution came from the input and feedback we’ve received from our customers, many of whom are pushing Pulsar to new limits and use cases, as well as from our own experience managing more Pulsar clusters than any organization in the world.
A Big Milestone for the Community…
Pulsar 3.0 reflects how much the Apache Pulsar project has grown and evolved over the last few years.
- The community is getting bigger! Over 140 contributors submitted about 1500 commits to the Pulsar 3.0 release, which is the largest contribution yet for a project that is fast becoming one of the biggest open-source projects.
- The release introduces long-term support (LTS), which delivers the predictability and stability that larger enterprise teams need to run a reliable messaging and streaming service.
What's new in Pulsar 3.0
This release improves performance for teams operating Pulsar at scale and makes Pulsar more stable and predictable for powering messaging and data streaming services in mission-critical use cases.
Here are a few highlights of what's included in 3.0. You can get the full list in the official announcement.
Introducing LTS for Pulsar:
As the Pulsar community has matured, more companies are adopting Pulsar for mission critical workloads and want to minimize the risk around version upgrades in production.
Pulsar 3.0 is the first version with long-term support. The community has committed to releasing long-term versions with feature releases between them, so teams that want a more stable release can stay on versions 3.0.x, while those seeking new features can use the later 3.x versions.
For risk-averse teams running Pulsar for mission-critical workloads, this makes it possible to stay on the latest ‘stable’ major release and upgrade less frequently.
For those of us who contribute to Pulsar, it strikes a balance between moving fast on exciting new features and providing the stable releases that Pulsar's many mission-critical use cases depend on.
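As a rough illustration of this versioning scheme, the sketch below classifies a version string. It assumes, based only on the description above, that `X.0.y` versions form the long-term line and that `X.n.y` versions with `n > 0` are feature releases. The function name `release_track` is hypothetical and not part of any Pulsar tooling.

```python
def release_track(version: str) -> str:
    """Classify a Pulsar version string as 'lts', 'feature', or 'pre-lts'.

    Assumption for illustration: from 3.0 onward, X.0.y is the long-term
    line and X.n.y with n > 0 is a feature release.
    """
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    major, minor = int(parts[0]), int(parts[1])
    if major < 3:
        # Versions before 3.0 predate the long-term support scheme.
        return "pre-lts"
    return "lts" if minor == 0 else "feature"
```

A team pinning production to the long-term line could use a check like this in a deployment script to reject feature-release versions.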
Long-term support brings two main benefits to teams with organization-wide implementations of Pulsar:
Stability in production:
Previously, a new Pulsar version was available every 3-4 months and teams were forced to upgrade to the next version to take advantage of needed bug fixes and security patches, while risking the introduction of new features and functionality into their production environments.
Now, with the long-term support version, teams can remain on a stable version of Pulsar and take only bug fixes and security patches. They can still experiment with the new features and functionality of the latest feature release in a testing environment, keeping in mind that feature releases are supported for a shorter amount of time.
Predictable bug fixes and updates:
Previously, new versions were released on a roughly quarterly schedule, but releases were frequently delayed. This created uncertainty about when new capabilities and improvements would be available and made it hard for teams to plan development.
Going forward, there will be a code freeze three weeks before each release, giving teams certainty about what will be included in the next update and when it will be released.
StreamNative will continue to release a weekly update of our Pulsar distribution with bug fixes and patches so that StreamNative Cloud customers can benefit from these updates in advance of the Pulsar releases.
NEW Updated Load Balancer
The new load balancer delivers on the promise of Pulsar’s horizontal scalability and enables teams with ‘spiky’ workloads to rest easy knowing that traffic will be quickly distributed as they scale up their brokers to meet the demands of their business.
One of Pulsar’s key differentiators from Kafka is its ability to scale horizontally without downtime or performance impact as workloads increase. This is achieved by the Pulsar load balancer, which equalizes traffic across all of the brokers in a cluster. It ensures that no broker becomes overloaded as traffic scales up and that idle brokers take on their fair share of the workload.
However, in situations where traffic was not well distributed, such as higher workloads for certain topics or sudden spikes in traffic, it took anywhere from a few minutes to a few hours for the load balancer to redistribute the traffic, leaving some brokers overloaded while others sat idle. This was difficult to avoid or prevent: the cluster appeared to have ample capacity to handle the traffic, but because of the imbalance in traffic distribution, performance degraded on the brokers where the load was higher.
Note that Pulsar is a stateful system and its data has locality. When traffic changes rapidly, different topics can carry very different amounts of traffic, and any specific topic can spike.
Enter the New and Improved Load Balancer in Pulsar 3.0!
The new load balancer available in Pulsar 3.0 brings traffic to the ideal state (i.e., each broker serving an equal amount of traffic) much faster:
- when traffic spikes and there are idle brokers, or
- when new brokers have been added
This enables teams to manage their excess cluster capacity better and delivers a more uniform and predictable pattern for scaling up. Learn more about it in the Pulsar docs.
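To make the "ideal state" concrete, here is a toy simulation of rebalancing. It is not Pulsar's actual algorithm: it assumes a simple greedy strategy that repeatedly moves one unit of topic traffic (called a bundle here) from the most loaded broker to the least loaded one. The function name `rebalance`, the broker names, and the load numbers are all invented for illustration.

```python
def rebalance(brokers: dict[str, list[int]], max_moves: int = 100) -> int:
    """Greedily move bundles from the busiest broker to the idlest one.

    `brokers` maps a broker name to the loads (for example, msgs/s) of the
    bundles it serves. The mapping is modified in place. Returns the number
    of bundles moved.
    """
    moves = 0
    while moves < max_moves:
        busiest = max(brokers, key=lambda name: sum(brokers[name]))
        idlest = min(brokers, key=lambda name: sum(brokers[name]))
        gap = sum(brokers[busiest]) - sum(brokers[idlest])
        # Moving a bundle only narrows the gap if its load is below the gap.
        candidates = [load for load in brokers[busiest] if 0 < load < gap]
        if not candidates:
            break
        # The best single move is the bundle closest to half of the gap.
        bundle = min(candidates, key=lambda load: abs(gap - 2 * load))
        brokers[busiest].remove(bundle)
        brokers[idlest].append(bundle)
        moves += 1
    return moves
```

For example, starting from `{"broker-1": [50, 40, 30, 20], "broker-2": [10], "broker-3": []}`, where `broker-3` has just been added, this sketch needs two moves to bring every broker to a total load of 50.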
How much faster is it? We are currently running performance tests to show how quickly a system can rebalance when adding more brokers and will publish those in a follow-up blog post.
Performance Improvements
Pulsar 3.0 brings a large number of performance-related improvements. While some of these were introduced in Pulsar itself, the bulk of the changes come from the new Apache BookKeeper 4.16 release. We concentrated our efforts on making BookKeeper handle large numbers of small messages more efficiently. This pattern arises when the load is spread over a large number of topics, or when message batching cannot be applied for some reason. BookKeeper 4.16 also brings a new storage option that uses DirectIO to bypass the OS page cache completely, relying instead on in-process caches. We have seen great improvements in CPU usage, latency, and overall maximum throughput for Pulsar. We will provide an in-depth analysis of all these changes and their impact in future blog posts.
Optimizations for Scheduled Messages
Teams using Pulsar scheduled messages can track hundreds of millions of delayed (i.e., scheduled) messages without having to worry about memory overloads or slow restart and re-indexing times. This reduces delays as well as the resources needed to store delayed messages.
A key messaging feature in Pulsar is the ability to schedule millions of messages to be delivered or retried at a future time. This creates delayed messages that are held in memory until the time comes to deliver them.
Note that this is an important differentiator for Pulsar compared to other messaging services, as Kafka does not support delayed messages at all and RabbitMQ does not easily support delayed messages at high volumes.
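For readers who have not used the feature, the sketch below shows how a message might be scheduled with the Pulsar Python client, whose `send` method accepts a `deliver_after` argument. The helper `schedule_message`, the topic name, and the service URL are placeholders for illustration. According to the Pulsar documentation, delayed delivery takes effect for consumers on shared subscriptions.

```python
from datetime import timedelta


def schedule_message(producer, payload: bytes, delay: timedelta) -> None:
    """Ask the broker to hold `payload` back for `delay` before delivery.

    `producer` is expected to behave like a producer from the Pulsar Python
    client, whose send() accepts a `deliver_after` timedelta.
    """
    if delay <= timedelta(0):
        raise ValueError("delay must be positive")
    producer.send(payload, deliver_after=delay)


# Usage against a real broker (requires `pip install pulsar-client`; the
# URL and topic below are placeholders):
#
#   import pulsar
#   client = pulsar.Client("pulsar://localhost:6650")
#   producer = client.create_producer("persistent://public/default/reminders")
#   schedule_message(producer, b"send invoice", timedelta(minutes=30))
#   client.close()
```

Keeping the helper independent of the client import makes it easy to exercise with a stub producer in unit tests.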
Pulsar 3.0 includes significant improvements in how delayed messages are tracked, so that they take up far less memory, eliminating the possibility of a memory overload, and do not require expensive index rebuilding.
This change makes the indexing of delayed messages more scalable by breaking the index down into micro-segments. These segments do not all need to be held in memory at the same time, and they do not need to be re-indexed when the broker restarts.
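The idea of splitting the delayed-message index into segments can be sketched as follows. This is a simplified model, not the broker's implementation: it assumes fixed-width time buckets and uses a plain dictionary as a stand-in for durable segment storage. `BucketedDelayIndex` and its methods are hypothetical names.

```python
class BucketedDelayIndex:
    """Toy delayed-message index split into fixed-width time buckets."""

    def __init__(self, bucket_width_ms: int = 60_000):
        self.bucket_width_ms = bucket_width_ms
        # Stand-in for durable storage: bucket start -> [(deliver_at, msg_id)].
        # In this model, a restart would find the buckets already built.
        self.stored: dict[int, list[tuple[int, str]]] = {}

    def add(self, deliver_at_ms: int, msg_id: str) -> None:
        """File a message under the bucket covering its delivery time."""
        start = deliver_at_ms - deliver_at_ms % self.bucket_width_ms
        self.stored.setdefault(start, []).append((deliver_at_ms, msg_id))

    def pop_due(self, now_ms: int) -> list[str]:
        """Return ids of messages due at `now_ms`, earliest first.

        Only buckets whose time range has already begun are loaded; later
        buckets are never touched.
        """
        due = []
        for start in sorted(s for s in self.stored if s <= now_ms):
            entries = self.stored.pop(start)
            due.extend(e for e in entries if e[0] <= now_ms)
            remaining = [e for e in entries if e[0] > now_ms]
            if remaining:
                # Put back the part of the bucket that is not due yet.
                self.stored[start] = remaining
        due.sort()
        return [msg_id for _, msg_id in due]
```

The design choice worth noting is that the cost of each `pop_due` call depends on the buckets that have come due, not on the total number of scheduled messages.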
Docker Images for Arm64 Deliver Improved Local Performance
Mac users rejoice! Pulsar is about to become a lot more stable to run in local Mac environments! This small but significant improvement in 3.0 comes from Pulsar now publishing Docker images for both Intel x86-64 and Arm64 architectures.
Previously, Pulsar only published Intel-based images, which meant that on Arm64 hardware (such as M1/M2 Macs) it could run very slowly or even crash. Now, users can run Pulsar standalone or Testcontainers tests on a Mac M1/M2 laptop with improved performance and avoid the issues that arise when the Docker container engine emulates an x86-64 CPU on an Arm64 host. These images also make it possible to run Pulsar in a Docker/Kubernetes production environment on Arm64 machines.
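One way to see which image variant a machine would use is to map the host CPU architecture to a Docker platform string. The sketch below does this and builds a `docker run` command for Pulsar standalone. The helper names are hypothetical, and the `3.0.0` tag and port mappings are assumptions to adjust for your environment. With multi-architecture images Docker normally picks the matching variant on its own, so the explicit `--platform` flag is here only to make the choice visible.

```python
def docker_platform(machine: str) -> str:
    """Map a CPU architecture name to a Docker platform string."""
    arch = machine.lower()
    if arch in ("arm64", "aarch64"):
        return "linux/arm64"
    if arch in ("x86_64", "amd64"):
        return "linux/amd64"
    raise ValueError(f"unsupported architecture: {machine}")


def standalone_command(machine: str, tag: str = "3.0.0") -> list[str]:
    """Build a `docker run` command that starts Pulsar in standalone mode."""
    return [
        "docker", "run", "-it",
        "--platform", docker_platform(machine),
        "-p", "6650:6650",  # Pulsar binary protocol
        "-p", "8080:8080",  # HTTP admin API
        f"apachepulsar/pulsar:{tag}",
        "bin/pulsar", "standalone",
    ]


# Usage on the local machine:
#
#   import platform
#   print(" ".join(standalone_command(platform.machine())))
```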
Ready to Try Pulsar 3.0?
StreamNative Cloud customers can try Pulsar 3.0 on a new cluster or reach out to support about upgrading existing clusters.