June 27, 2025
6 min read

Reliability That Thinks Ahead: How Pulsar Helps Agents Stay Resilient

Matteo Merli
CTO, StreamNative; Co-Creator and PMC Chair, Apache Pulsar

Real-world AI agents operate in unpredictable environments. An agent might encounter transient errors – a large language model (LLM) call that times out, a database that’s briefly down, or a sensor message that fails validation. How your messaging system handles these hiccups is critical. In this post, we focus on message acknowledgments, retries, and dead-letter queues – features that keep your agentic pipelines resilient. We’ll contrast Apache Pulsar’s reliability features with Apache Kafka’s more basic offset model, to show how Pulsar can help your agents recover gracefully from failures.

(If you missed it, in Post 1: Streams vs Queues: Why Your Agents Need Both—and Why Pulsar Protocol Delivers, we covered how Pulsar supports both streaming and queueing patterns natively. Now we build on that foundation by examining what happens after a message is sent – does it get processed successfully? And what if it doesn’t?)

The Challenge of Failure in Message Processing

Consider an AI workflow where each message triggers a sequence of actions. For example, a message might instruct an agent to invoke an external API or run an ML inference. If one of those actions fails for a particular message, we’d like to retry it or handle it specially, without losing the message or blocking the entire pipeline. We also want to avoid duplicating messages or processing them out of order unnecessarily. Traditional message queues (like JMS brokers) have long provided per-message acknowledgment and dead-letter queues (DLQs) to address this – ensuring no message is lost and problematic ones can be set aside. Let’s see how Pulsar and Kafka differ here:

  • Kafka’s offset model: In Kafka, a consumer’s progress is tracked by committing offsets. The consumer periodically records “I have processed up to message X in partition Y.” However, Kafka does not acknowledge individual messages. A commit always implies all prior messages in that partition are handled. This is often called a high-watermark commit model. The implication: if your consumer fails on message 100, it cannot tell Kafka “only message 100 failed” – it either doesn’t commit offset 100 (meaning it will reprocess it and any following messages on restart), or it skips it by committing offset 101 (thereby implicitly acknowledging 100 as well, even though it failed). There’s no built-in concept of NACK (negative acknowledgment) to say “retry this one and don’t advance the offset.” This all-or-nothing batch acknowledgment makes fine-grained error handling tricky. Developers end up implementing workarounds: one pattern is to process messages one-by-one per partition and commit immediately after each, to know exactly which message caused an issue. If a message fails, the consumer can stop without committing that offset, effectively pausing that partition. But this means other messages in that partition (even those already fetched) won’t be processed until the consumer is restarted and the message is either skipped or handled. Alternatively, teams implement custom logic to store offsets externally (e.g. in a database) so they can mark individual messages as processed or failed – but this is complex and outside Kafka’s native support.
  • Kafka and retries/DLQ: Since Kafka doesn’t track individual message acknowledgment, it also doesn’t automatically redirect failed messages to a DLQ. Handling a poison message (one that consistently fails processing) is entirely up to the application. A common approach is: if processing fails, produce that message to a special “error topic” (the DLQ) for later analysis, and then commit the offset to skip it in the main topic (a minimal sketch of this manual pattern follows this list). This approach works, but you have to code it manually and ensure atomicity (you don’t want to lose the message between failing and writing to the DLQ). There are Kafka libraries/patterns to help with this, but again, nothing built-in prior to the newer Kafka Streams API or Kafka Connect error handling (and those are limited to those frameworks). Simply put, Kafka’s design assumes consumers manage their own retries. If a consumer dies, Kafka will allow another consumer in the group to take over the partition, but that new consumer will by default re-read from the last committed offset – meaning it may replay some messages (including the one that caused the crash if it wasn’t committed). This provides at-least-once delivery, but the burden is on you to handle duplicates and failures.
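To make the manual nature of this concrete, here is a minimal sketch (Java, standard Kafka client) of the “commit per record plus error topic” pattern described above. The topic names (`articles`, `articles-errors`) and the `process()` helper are illustrative assumptions, and it assumes auto-commit is disabled; note there is still no atomicity between the send to the error topic and the offset commit.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class KafkaManualDlq {

    // Hypothetical processing step that may throw for a "poison" message.
    static void process(ConsumerRecord<String, String> record) throws Exception { }

    // Assumes enable.auto.commit=false so offsets advance only when we say so.
    static void run(KafkaConsumer<String, String> consumer,
                    KafkaProducer<String, String> producer) {
        consumer.subscribe(Collections.singletonList("articles"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                try {
                    process(record);
                } catch (Exception e) {
                    // Hand-rolled DLQ: publish the failed record to an error topic...
                    producer.send(new ProducerRecord<>("articles-errors", record.key(), record.value()));
                    producer.flush(); // ...and block until it is written before skipping past it.
                }
                // Commit past this record either way so the partition keeps moving;
                // committing per record is what tells us exactly which message failed.
                consumer.commitSync(Map.of(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
            }
        }
    }
}
```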

Now let’s see how Pulsar handles the same scenarios:

  • Pulsar’s per-message ack & NACK: Pulsar consumers explicitly acknowledge each message (or a batch of messages) to the broker when processed. This acknowledgement is tracked per message, not just by offset. If a consumer fails to process a message, it can send a negative acknowledgment (NACK) for that single message. A NACK tells the Pulsar broker “I couldn’t process message X, please redeliver it later.” Crucially, this does not block the acknowledgement of other messages. For example, if messages 100 and 101 were fetched by a Pulsar consumer and 100 fails while 101 succeeds, the consumer can ack 101 and NACK 100. Message 100 will be redelivered (to the same consumer or another, depending on subscription mode) after a configurable delay, while message 101 is not reprocessed since it was acked. This fine-grained control means a slow or problematic message need not stall the pipeline – other messages keep flowing. Pulsar also has an acknowledgment timeout feature: if a consumer forgets to ack a message within a configured time (say the process died mid-task), the broker will automatically consider it failed and redeliver it. This answers the question “how do we detect if a processing instance died?” – the broker’s ack timeout handles it by ensuring unacked messages don’t disappear.

  • Retries and Dead-Letter Topics: Pulsar supports automatic retry and dead-lettering policies at the consumer level. You can configure a subscription such that if a message fails to be processed a certain number of times (i.e., it’s NACKed or times out repeatedly), Pulsar will route it to a Dead Letter Topic (DLQ) associated with your subscription. This is analogous to the “dead-letter queue” concept in traditional message queuing systems. The message is then out of the main flow, so your consumer group isn’t stuck on it, but it’s safely stored for inspection or special handling later. Pulsar’s DLQ feature is built-in and easy to enable, whereas with Kafka, you would have to create and manage the dead-letter topic manually. Additionally, Pulsar can use a retry letter topic alongside the DLQ. The idea is that Pulsar will requeue the message to a retry topic for a certain number of attempts (optionally with some delay between attempts), and only if it still fails after max retries will it go to the DLQ. The original consumer can be set up to automatically consume from the retry topic after a delay, implementing a backoff strategy – all configured declaratively. This kind of baked-in retry mechanism “thinks ahead” for you, simplifying what would otherwise be custom retry loop code (a configuration sketch follows this list).

  • Non-blocking processing: Because of individual acking, Pulsar consumers don’t have to process strictly in sequence if they don’t want to. For example, with a Shared subscription (our queue scenario), one slow message doesn’t prevent other consumers from processing subsequent messages from the topic. Even a single consumer can use multiple threads to process messages in parallel (fetching a batch and acking each as done). In Kafka, parallelizing within a partition is dangerous because you can’t commit offsets out of order – Pulsar doesn’t have that limitation. As an illustration, if our agent receives 100 tasks in a Pulsar queue, it could farm them out to multiple worker threads and acknowledge each as they finish. A Kafka consumer would either have to increase partitions (one thread per partition, effectively) or process sequentially within one partition to avoid offset issues. Pulsar’s design thus yields better utilization and throughput, especially for heterogeneous workloads where some messages take longer than others.
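Here is a minimal sketch of what those pieces look like with Pulsar’s Java client: per-message ack and NACK, an ack timeout, a redelivery delay, and a dead-letter policy, all configured declaratively on the consumer. The service URL, topic and subscription names, the `process()` helper, and the specific delays and retry counts are illustrative assumptions.

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ResilientAgentConsumer {

    // Hypothetical processing step (LLM call, DB write, ...) that may throw.
    static void process(Message<byte[]> msg) throws Exception { }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("articles")
                .subscriptionName("summarizer")
                .subscriptionType(SubscriptionType.Shared)        // queue semantics: any consumer can take any message
                .ackTimeout(60, TimeUnit.SECONDS)                 // unacked messages are redelivered if we die mid-task
                .negativeAckRedeliveryDelay(1, TimeUnit.MINUTES)  // back off before retrying a NACKed message
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)                     // after 3 failed deliveries...
                        .deadLetterTopic("articles-DLQ")          // ...park the message here
                        .build())
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            try {
                process(msg);
                consumer.acknowledge(msg);          // ack just this message
            } catch (Exception e) {
                consumer.negativeAcknowledge(msg);  // NACK just this message; others keep flowing
            }
        }
    }
}
```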

Recovering from Agent Failures: Example

Let’s say we have an AI agent that monitors news articles and, for each article event, the agent must call an LLM to summarize it and then index the summary in a database. Suppose one particular article causes the LLM to hang or produce an error (maybe it’s too long or has problematic content). Here’s how Kafka vs Pulsar would handle it:

  • In Kafka: The agent consumes an event from the “articles” topic. If using auto-commit, it may have already marked prior messages as consumed and is now stuck on this bad one. If using manual commit, it withholds the commit. Either way, that partition’s processing is halted until this is resolved. You have a few choices:
    1. Crash or stop the consumer, log the error, and restart later from the last commit (which will re-read the bad message, and likely fail again unless code changed or external state changed).
    2. Skip the message: catch the exception, produce the event to an “article_errors” topic for later, then commit the offset past it so the main consumer can continue. But you must implement that production + commit carefully to not lose data. Also, you’ve now introduced a secondary flow (the error topic) which you need to monitor.
    3. Move the logic that might fail (LLM call) out-of-band: for example, quickly commit the message, and process the LLM call asynchronously so the consumer isn’t holding up Kafka. But if that async fails, you’d still need to send that info to a separate channel because Kafka already marked it done.
    None of these approaches is impossible, but they all put the responsibility for implementing reliability on the developer.
  • In Pulsar: The agent’s consumer receives the article event. If the LLM call fails, the consumer can simply call `consumer.negativeAcknowledge(message)` for that message. Pulsar will record that as a NACK. The consumer could even continue to process further messages in the meantime (depending on config). Pulsar will redeliver that message after a default delay (say 1 minute), giving the system time to recover or handle temporary issues. If the message keeps failing (e.g., the article is consistently too large for the LLM), after, say, 3 attempts, it will be routed to the dead-letter topic automatically. Your main consumer will never be stuck on it – it can move on to other messages. Meanwhile, your team can have a separate process or a monitoring dashboard consuming from the dead-letter topic “articles-DLQ” to inspect what went wrong with those problematic events. Perhaps the team finds that those DLQ’d articles were in an unsupported format and can take action, but importantly, the agent system as a whole kept chugging along despite the hiccup. No manual offset fiddling or urgent intervention was needed in the moment – Pulsar’s reliability features did the heavy lifting.
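If you prefer an explicit backoff through a retry letter topic rather than plain NACK redelivery, the Java client exposes that as well. Below is a minimal sketch of the news-article agent under those settings; the topic names, delays, and the `summarize()` helper are again illustrative assumptions.

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.DeadLetterPolicy;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class ArticleSummarizer {

    // Hypothetical LLM call that may fail for some articles.
    static void summarize(Message<byte[]> article) throws Exception { }

    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("articles")
                .subscriptionName("summarizer")
                .subscriptionType(SubscriptionType.Shared)
                .enableRetry(true)                           // also consume from the retry letter topic
                .deadLetterPolicy(DeadLetterPolicy.builder()
                        .maxRedeliverCount(3)
                        .retryLetterTopic("articles-RETRY")  // failed messages are requeued here first
                        .deadLetterTopic("articles-DLQ")     // and land here once retries are exhausted
                        .build())
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();
            try {
                summarize(msg);
                consumer.acknowledge(msg);
            } catch (Exception e) {
                // Requeue to the retry topic with a 1-minute delay instead of blocking on this article.
                consumer.reconsumeLater(msg, 1, TimeUnit.MINUTES);
            }
        }
    }
}
```

The difference from the NACK flow above is that the retry attempts pass through a visible topic, so the backoff path can be observed and monitored on its own.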

Another aspect of resilience is how the system behaves when scaling consumers up or down. Kafka users are familiar with the rebalance process: if you add a new consumer to a group or one dies, Kafka will pause consumption briefly to redistribute partition ownership. During this rebalance, no messages are delivered to consumers of that group. In large Kafka deployments with many partitions, rebalances can take quite some time, meaning a scaling event or a single consumer failure causes a delay in processing for the whole group. Pulsar’s shared subscription has a smoother ride here – since any consumer can grab any message, adding or removing consumers doesn’t require an explicit rebalance pause. If a consumer goes away, its unacked messages simply become available for others immediately; if a new consumer joins, it starts receiving a share of messages without broker-side re-partitioning. There’s effectively no downtime when scaling Pulsar consumers in a shared subscription. This “graceful scaling” further boosts resilience for agent systems, which might need to dynamically adjust to load.

Key Takeaways

  • Per-message acking: Pulsar allows acknowledging individual messages, whereas Kafka can only acknowledge by advancing the offset watermark. This means Pulsar consumers can succeed or fail messages independently, preventing one bad message from holding up others.
  • Built-in retry and DLQ: Pulsar has native support for retrying messages and sending them to dead-letter topics after a max retry count. Kafka lacks built-in DLQ; implementing it requires custom logic and managing separate error topics. Pulsar’s approach simplifies error handling and improves reliability in complex pipelines.
  • Negative acknowledgments: Pulsar’s NACK feature lets consumers explicitly signal a failure, triggering message redelivery. Kafka consumers have no native NACK – they must either withhold the commit (stalling that partition and replaying messages on restart) or manually requeue the message elsewhere. Pulsar’s NACK + ackTimeout together ensure that crashed or slow consumers don’t result in lost or stuck messages.
  • Resilience in scaling: Pulsar’s no-stop consumer scaling (no rebalance needed for shared subscriptions) means the system adapts to consumer failures or additions without a processing halt. Kafka consumer group rebalances, in contrast, temporarily stop message processing during partition reassignments.

All these features add up to a messaging foundation that “thinks ahead” about reliability. For AI agents, which may run 24/7 and deal with unpredictable inputs, having the messaging layer automatically handle retries and failures is a game-changer. Your agents can stay focused on what to do with data, while Pulsar ensures the delivery of that data is rock-solid even when things go wrong.

Try out Pulsar!

Matteo Merli
Matteo is the CTO at StreamNative, where he brings rich experience in distributed pub-sub messaging platforms. He was one of the co-creators of Apache Pulsar during his time at Yahoo!, where he worked to build a global, distributed messaging system that would later become Apache Pulsar. Matteo is the PMC Chair of Apache Pulsar, helping to guide the community and ensure the success of the project, and is also a PMC member of Apache BookKeeper. He lives in Menlo Park, California.

Agentic AI
Pulsar
GenAI