Oct 12, 2022
7 min read

Improving Regular Expression-Based Subscriptions in Pulsar Consumers

Andras Beni
Apache Pulsar

A closer look at regular expression-based subscriptions

Currently, you have two options when creating a consumer with a Pulsar client:

  1. Specify the list of topics you want to consume. Often this list consists of only one topic name, but it does not need to.
  2. Specify a regular expression as a topic pattern. Initially, this is equivalent to listing all topics that match the pattern. Over time, however, new topics are created and existing ones may be deleted, so the regular expression-based consumer has an auto-discovery mechanism. Under the hood, the consumer regularly asks a broker for the current list of topics. Whenever it finds a change in the set of topic names that match the pattern, it subscribes to the new topics. When you create the consumer, you can set the period at which it checks for updates and refreshes its list of topics. By default, this happens once every minute.

The following image shows a typical interaction between the consumer and the broker:

Figure 1. Topic discovery
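
To make the polling round concrete, here is a minimal sketch of the client-side bookkeeping, written with only the Java standard library. It is not the Pulsar client's actual code: the class name TopicDiscoverySketch, its methods, and the short topic names are assumptions made for illustration, and forgetting deleted topics is an assumed behavior that the text above only hints at.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical stand-in for a pattern-based consumer. */
final class TopicDiscoverySketch {
    private final Pattern pattern;
    private final Set<String> subscribed = new TreeSet<>();

    TopicDiscoverySketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /**
     * Handles one polling round: filters the complete topic list locally
     * and returns the matching topics that are new since the last round.
     */
    Set<String> onTopicList(List<String> allTopicsInNamespace) {
        Set<String> matching = new TreeSet<>();
        for (String topic : allTopicsInNamespace) {
            if (pattern.matcher(topic).matches()) {
                matching.add(topic);
            }
        }
        Set<String> added = new TreeSet<>(matching);
        added.removeAll(subscribed);
        // Assumption: topics that disappeared are forgotten as well.
        subscribed.retainAll(matching);
        subscribed.addAll(added);
        return added;
    }

    /** Returns a copy of the topics currently tracked as subscribed. */
    Set<String> subscribedTopics() {
        return new TreeSet<>(subscribed);
    }
}
```

Calling onTopicList once per discovery period with the broker's full answer yields only the topics that still need a subscription. Note that the whole list has to arrive at the client before any filtering happens.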

Inefficiencies in the current implementation

As seen in the previous section, the consumer asks for the complete list of topics. Although a pattern cannot cover more than one namespace, the list of topics in a namespace can grow to a few thousand entries. These names travel over the network every minute, even when nothing has changed among the topics the consumer is interested in.

You could set a longer auto-discovery period to reduce the network load. However, new topics would then be discovered more slowly, and the first messages produced to such a topic would be processed with a latency of a minute or more.

Optimizing regular expression-based subscriptions

To address the above issues, we proposed a few modifications to the consumer and the broker:

  • Moving pattern matching to the broker.
  • Letting the broker respond with “Nothing changed”.
  • Sending notifications from the broker to the client for real-time discovery.

Let’s take a look at these changes in more detail.

Broker-side pattern matching

Consumers can add the pattern to their requests when asking for the topic list. If the broker supports this feature, it can filter the list of topics by the pattern and only return those that match. This filtering method dramatically reduces the response size in most cases.

Figure 2. The broker sends a filtered response.
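
The filtering step itself is small. The sketch below shows one way a broker could apply a consumer's pattern before answering. It is an assumption-laden illustration, not the broker's real code: BrokerSideFilterSketch and filter are hypothetical names, and the topic names are shortened.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical broker-side filtering of a topic list. */
final class BrokerSideFilterSketch {
    /** Returns only the topics whose full name matches the consumer's pattern. */
    static List<String> filter(List<String> topicsInNamespace, String consumerRegex) {
        Pattern pattern = Pattern.compile(consumerRegex);
        List<String> matching = new ArrayList<>();
        for (String topic : topicsInNamespace) {
            if (pattern.matcher(topic).matches()) {
                matching.add(topic);
            }
        }
        return matching;
    }
}
```

With a namespace of a few thousand topics and a pattern that matches only a handful, the response shrinks from the full list to those few names.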

Skipping updates when nothing has changed

Often there’s no change in the response from the broker between subsequent requests. This is either because no new topic has been created at all or because the new topics don’t match the pattern. In such cases, there’s no value in responding with the list of topic names; the broker should instead indicate that nothing has changed, so that the consumer can continue without updating its list of topics.

One way to enable such a response is for brokers to track the last response sent to each consumer. However, this would put an unnecessary burden on brokers, so we decided to let consumers keep track of this state. Specifically, brokers calculate a hash from the topic list and include it in the response. The next time the consumer requests the list of topics, it adds the hash to the request. If the broker finds that the current hash is the same as the one in the request, it sends a response with a flag instead of topic names to show that the state the consumer knows about is still current.

Figure 3. The broker indicates that no change happened.
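
The hash exchange can be sketched as follows. This is a simplified illustration under stated assumptions: the class TopicListHashSketch, the Response shape, and the choice of CRC32 over the sorted, comma-joined names are all hypothetical and chosen only to keep the example within the Java standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.CRC32;

/** Illustrative only: a stateless "nothing changed" check based on a hash. */
final class TopicListHashSketch {
    /** Hypothetical response: either a flag or the topic names, plus the hash. */
    static final class Response {
        final boolean unchanged;
        final List<String> topics;
        final String hash;

        Response(boolean unchanged, List<String> topics, String hash) {
            this.unchanged = unchanged;
            this.topics = topics;
            this.hash = hash;
        }
    }

    /** Hashes the sorted names so that list order does not affect the result. */
    static String hashOf(List<String> topics) {
        List<String> sorted = new ArrayList<>(topics);
        Collections.sort(sorted);
        CRC32 crc = new CRC32(); // assumption: any stable hash would do here
        crc.update(String.join(",", sorted).getBytes(StandardCharsets.UTF_8));
        return Long.toHexString(crc.getValue());
    }

    /** The broker compares the hash sent by the consumer with the current one. */
    static Response respond(List<String> matchingTopics, String hashFromRequest) {
        String current = hashOf(matchingTopics);
        if (current.equals(hashFromRequest)) {
            return new Response(true, List.of(), current);
        }
        return new Response(false, List.copyOf(matchingTopics), current);
    }
}
```

The broker keeps no per-consumer state in this design: everything it needs to decide arrives with the request, and a first request without a hash simply falls through to the full answer.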

Notifications for faster discovery

The above features solved the issue of unnecessary network traffic, but they did not help consumers discover topics sooner or avoid lag for the first messages. For that, we introduced topic list watchers.

As shown in Figure 4, consumers register as watchers with brokers. The initial exchange resembles what we’ve discussed in the sections above. The difference is that the broker keeps track of watchers. Brokers get notifications from the metadata store whenever a new topic is created (possibly through another broker) and immediately send a message to the consumers that registered with a pattern that the new topic’s name matches. This way, the consumers can know about newly created topics within seconds.

Figure 4. The life cycle of topic list watchers
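
A broker-side watcher registry might look roughly like the sketch below. It is a single-threaded illustration and not the broker's implementation: TopicListWatcherSketch, register, and onTopicCreated are hypothetical names, and the callback stands in for the message the broker would send over the network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Pattern;

/** Illustrative only: a hypothetical registry of topic list watchers. */
final class TopicListWatcherSketch {
    private static final class Watcher {
        final Pattern pattern;
        final Consumer<String> onNewTopic;

        Watcher(Pattern pattern, Consumer<String> onNewTopic) {
            this.pattern = pattern;
            this.onNewTopic = onNewTopic;
        }
    }

    private final List<Watcher> watchers = new ArrayList<>();

    /** A consumer registers its pattern together with a notification callback. */
    void register(String regex, Consumer<String> onNewTopic) {
        watchers.add(new Watcher(Pattern.compile(regex), onNewTopic));
    }

    /**
     * Called when the metadata store reports a new topic. Notifies every
     * watcher whose pattern matches and returns how many were notified.
     */
    int onTopicCreated(String topic) {
        int notified = 0;
        for (Watcher watcher : watchers) {
            if (watcher.pattern.matcher(topic).matches()) {
                watcher.onNewTopic.accept(topic);
                notified++;
            }
        }
        return notified;
    }
}
```

Because the pattern is stored with the watcher, consumers whose pattern does not match a new topic never hear about it.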

Polling and notifications in parallel

The procedure for topic list updates described above involves multiple services and steps. For example, the metadata store needs to notify every broker, and those brokers need to process the notification and then update every consumer that is interested in the newly created topic. During this process, a broker may experience an issue right after the topic is created. In that case, it will be unable to send notifications to consumers even though the topic was successfully created. Put simply, creating a topic and sending notifications are not a single atomic operation. As a result, consumers can’t rely on notifications exclusively; they must still use the polling mechanism to make up for updates they did not receive. Conveniently, missing notifications do not cause errors or inconsistencies in the consumer; they merely delay message processing in new topics until the consumer polls again for matching topics.
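
One way to picture how the two paths coexist is a consumer-side set that both feed into. The following sketch is an assumption for illustration only (DiscoveryReconcilerSketch and its methods are hypothetical names): notifications add single topics, polls add whatever was missed, and duplicates are harmless.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: merging notifications and polling results. */
final class DiscoveryReconcilerSketch {
    private final Set<String> known = new TreeSet<>();

    /**
     * Fast path: a watcher notification names one new topic.
     * Returns false if the topic was already known.
     */
    boolean onNotification(String topic) {
        return known.add(topic);
    }

    /**
     * Slow path: a periodic poll returns the full list of matching topics.
     * Returns only the topics for which no notification had arrived.
     */
    Set<String> onPoll(Collection<String> matchingTopics) {
        Set<String> missed = new TreeSet<>(matchingTopics);
        missed.removeAll(known);
        known.addAll(missed);
        return missed;
    }
}
```

Since both paths only ever add to the same set, a lost notification costs at most one polling period of delay, and a notification that arrives after a poll has already found the topic changes nothing.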

Conclusion

The enhancements described above will be available in Apache Pulsar release 2.11.0. They will address issues that Pulsar users at scale face in connection with regular expression-based subscriptions. First, applying the topic pattern on the broker side and omitting updates altogether in most cases will reduce network utilization significantly. Second, watchers provide an efficient way of discovering topics right after creation, thus eliminating the lag in processing the first messages produced to those topics.

More on Apache Pulsar

Pulsar has become one of the most active Apache projects over the past few years, with a vibrant community driving innovation and improvements to the project. Check out the following resources to learn more about Pulsar.

Andras Beni
Andras Beni is a Platform Engineer at StreamNative. He focuses on Apache Pulsar and has worked on data platforms based on open source software in his most recent positions.
