Pulsar Newbie Guide for Kafka Engineers (Part 5): Retention, TTL & Compaction

TL;DR
Pulsar offers flexible message retention policies and features like Time-to-Live (TTL) and Topic Compaction, which differ from Kafka’s approach. By default, Pulsar retains messages until they are acknowledged (no time limit) and deletes them immediately once acknowledged. But you can configure retention to keep acknowledged messages for a duration or size (like Kafka’s log retention), as well as TTL to discard unacknowledged messages after a while (prevent infinite backlog). Pulsar also supports log compaction to keep the latest value per key, similar to Kafka’s compaction but implemented via a separate compacted ledger. We’ll explain these settings and how to use them to manage Pulsar topic storage, using Kafka’s behavior as a reference point.
Message Retention in Pulsar vs Kafka
Kafka’s model: In Kafka, retention is typically time-based or size-based per topic. For example, you might retain logs for 7 days or 10 GB. Kafka does not consider whether a message was consumed – it will delete messages older than the retention period regardless of consumer status. This means Kafka brokers can delete old data even if some slow consumer hasn’t processed it yet (that consumer would then miss those messages).
Pulsar’s default model: Pulsar, being a messaging system with acknowledgments, by default behaves differently:
- Pulsar will keep all unacknowledged messages indefinitely (in storage) by default, to ensure consumers can get them whenever they come online.
- Once a message is acknowledged by all subscriptions, Pulsar will immediately mark it for deletion (it can be deleted from storage).
In other words, Pulsar’s out-of-the-box behavior is: “retain data as long as someone still needs it; delete it as soon as nobody needs it.” This is more akin to a traditional messaging queue – messages don’t pile up once consumed.
This is basically opposite to Kafka’s strategy of time-based retention. If you hooked up a Pulsar topic with no special retention config and a consumer, and that consumer always stays caught up (acking messages), the topic would use almost no storage (only very recent unacked messages). In Kafka, the topic would accumulate data up to the retention period regardless of consumption.
Configurable Retention: Pulsar allows you to alter this behavior via retention policies. You can set a retention period (time and/or size) for messages even after acknowledgment. For instance, you might say: “Keep messages for 1 day or 1 GB, whichever comes first, even after consumers ack them.” That way, consumers could potentially reconnect within a day and replay data, or you could attach a new subscription within a day to reprocess history.
This is done at the namespace or topic level using `pulsar-admin namespaces set-retention`. For example:
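A sketch of the command (the namespace `my-tenant/my-ns` is a placeholder; substitute your own):

```shell
# Keep acknowledged messages for up to 24 hours or 1 GB per topic,
# whichever limit is reached first.
pulsar-admin namespaces set-retention my-tenant/my-ns \
  --time 24h \
  --size 1G
```

The same policy can be applied to a single topic with `pulsar-admin topics set-retention`.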

This would keep acknowledged messages for 24 hours or until 1 GB per topic is reached. After that, older acknowledged messages are removed. (Unacknowledged messages are not touched by retention – they are still kept as backlog; more on that next.)
To clarify: Pulsar retention policy applies to acknowledged messages (the ones that normally would be deleted immediately). Unacknowledged messages are governed by TTL (time-to-live) settings, not the retention policy.
So you have two separate concepts:
- Retention (Acknowledged messages): Keep some history of consumed messages.
- TTL (Time-to-Live for Unacknowledged messages): After a certain time, treat unacknowledged messages as acknowledged (essentially drop them).
Time-to-Live (TTL) for Unacked Messages
Why TTL? Consider a scenario where a consumer goes offline or is very slow – by default, Pulsar will keep feeding it its backlog forever. If that backlog grows massive, it could consume a lot of storage. In Kafka, if a consumer falls behind beyond retention, it just misses data (or if using a compacted topic, older state vanishes). Pulsar gives an option to say: “If messages haven’t been acknowledged for X time, we assume they won’t be and we discard them.”
This is message TTL (a per-namespace or per-topic setting). For example, with a TTL of 7 days, any message that remains unacknowledged more than 7 days after being published will be automatically marked as acknowledged (expired) and won't be deliverable to consumers. It essentially protects the system from an infinite backlog due to a stuck consumer.
Using `pulsar-admin`:
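A sketch of the command (namespace name is a placeholder):

```shell
# Expire messages that remain unacknowledged 7 days after publish.
pulsar-admin namespaces set-message-ttl my-tenant/my-ns \
  --messageTTL 604800
```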

(604800 seconds is 7 days). This would mean messages older than 7 days that are still unacked are expired.
From the docs: “If disk space is a concern, you can set a time to live (TTL) that determines how long unacknowledged messages will be retained. The TTL parameter is like a stopwatch attached to each message... when it expires, Pulsar automatically moves the message to the acknowledged state (and thus makes it ready for deletion)”.
That nicely summarizes TTL: after TTL, a message is considered acknowledged (even if the consumer never acked it), so it will be removed like any other acked message.
TTL is somewhat analogous to Kafka’s retention for the tail of the log, but specifically for unconsumed messages. Kafka doesn’t differentiate – it just kills old records. Pulsar, with TTL, gives you a safety net: normally you might not want to lose unconsumed messages, but at some point, you might prefer dropping them than letting them endlessly accumulate.
Backlog Quota: Another related concept is backlog quota. You can set a limit on how large a backlog (unacked messages) can grow (by size or time), and what to do when that limit is reached (e.g., reject producers, or start discarding oldest messages). This is configured separately (set-backlog-quota). For example, you might allow up to 50 GB of backlog; if more, either block producers (to exert backpressure) or throw oldest messages away. Backlog quota policies can complement TTL for robust control.
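As a sketch, the 50 GB example might look like this (namespace name is a placeholder):

```shell
# Allow up to 50 GB of unacknowledged backlog; when exceeded, hold
# producer requests (backpressure). Other policies: producer_exception
# (fail the producer) and consumer_backlog_eviction (drop oldest).
pulsar-admin namespaces set-backlog-quota my-tenant/my-ns \
  --limit 50G \
  --policy producer_request_hold
```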
Kafka-like Retention in Pulsar
If a Kafka engineer wants to emulate Kafka’s log retention (i.e., retain data for X days regardless of consumption), you can do that by:
- Setting a retention period for acknowledged messages (so data sticks around even if consumed).
- Also potentially setting a TTL for unacknowledged messages to that same period (so that if no consumer is present, data is not kept beyond that period).
For example, to mimic “retain messages for 7 days no matter what”:
- Set namespace retention to 7 days (acknowledged messages retained 7 days).
- Set TTL to 7 days (unacknowledged messages expire after 7 days).
Now Pulsar will behave more like Kafka: any message will exist for at most 7 days, whether or not it’s consumed.
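In CLI form, the 7-day setup above might be sketched as (namespace name is a placeholder; `--size -1` means no size limit, so retention is bounded by time only):

```shell
# Emulate Kafka-style "retain for 7 days no matter what":
pulsar-admin namespaces set-retention my-tenant/my-ns --time 7d --size -1
pulsar-admin namespaces set-message-ttl my-tenant/my-ns --messageTTL 604800
```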
However, be careful: If you have TTL=7d and your consumer is down for 8 days, it will lose messages from that gap (similar to Kafka consumer falling behind retention). If you truly never want to lose unconsumed data, you might leave TTL off (infinite) but then you rely on disk capacity or backlog quotas to handle runaway consumers.
By default, Pulsar doesn’t expire unacked messages (TTL off) and doesn’t retain acked messages (retention 0). So default is “only store what’s needed”. Kafka default is typically something like “store for a week”.
Compaction: Maintaining Latest State Per Key
Kafka’s log compaction feature allows topics to retain only the latest value for each key (removing older values, except the latest and maybe some history). This is useful for state change events or last-known-value semantics. Pulsar offers a similar feature: Topic Compaction.
However, the implementation has a twist. In Pulsar, compaction doesn't rewrite the existing data in place (since data is stored in BookKeeper ledgers). Instead, running compaction produces a new compacted ledger that contains the latest value per key. Consumers can then choose to read from the compacted ledger if they want a compressed view of the topic.
In practice:
- You trigger compaction manually via CLI or set it to run periodically. For example:
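A sketch of triggering compaction on one topic (topic name is a placeholder):

```shell
# Kick off compaction for a topic, then check on its progress.
pulsar-admin topics compact persistent://my-tenant/my-ns/my-topic
pulsar-admin topics compaction-status persistent://my-tenant/my-ns/my-topic
```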

- This will initiate compaction. The broker goes through the topic’s backlog and builds a new ledger with only the latest message for each key.
- After compaction, the topic has two sets of data: the full log (uncompacted backlog) and a compacted snapshot. Pulsar retains both. Why? Because some consumers might want to read the full log (e.g., if they’re processing every change), while others might want just the latest state.
- A consumer can choose to read from the compacted view by setting `readCompacted(true)` on the consumer (only allowed for certain subscription types, typically exclusive or failover, since shared subscriptions would break the model). When `readCompacted` is true, the broker serves earlier data from the compacted ledger and then live data for new writes, giving an experience similar to Kafka's compacted topics.
- Compaction respects retention: if retention has removed some messages entirely, those won't be in the compacted log either. Also, compaction doesn't delete the original data immediately; it just provides a compacted copy. The older ledgers remain (and can still be consumed normally or for auditing). You can configure Pulsar to truncate older ledgers once they have been compacted up to a point, but by default you might manage that manually or rely on retention.
One key difference: Kafka’s compacted topics can still optionally have a retention time to delete old tombstones or limit log size. Pulsar’s compaction essentially ensures at least the latest per key is kept, and if you want old data removed beyond that, you’d use retention or TTL.
Tombstones: Pulsar honors the concept of a null message as a deletion marker (tombstone). If a message with key K and null value is published, compaction will remove K from the compacted log (so it won’t appear at all for consumers reading compacted). This is like Kafka’s tombstone mechanic.
One limitation mentioned: “Pulsar is slightly less flexible in this regard. Messages can only be removed from the compact ledger via explicit deletion by key, otherwise you can expect to store at least the latest message for all keys”. This means Pulsar compacted topics always keep the last value for each key until you explicitly delete by sending a null (Kafka allows you to also set a retention on compacted topics to eventually drop even the last values after a time if needed). Pulsar’s approach is “keep last forever (or until explicit tombstone)”.
Use cases: If you want a topic that holds, say, the latest status of each user, you would use compaction. Produce updates with a key (user ID) and value (status). Compaction will ensure only the most recent status per user is kept in the compacted view. A new consumer can read the compacted log from start and quickly get the latest state of all users without going through all historical changes.
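The semantics can be illustrated with a small simulation (plain Python, not the Pulsar client): replaying a keyed event stream and keeping only the latest value per key, with `None` acting as a tombstone, mirrors what a consumer reading the compacted view would see.

```python
def compact(events):
    """Simulate topic compaction semantics: keep only the latest value
    per key; a None value is a tombstone that removes the key from the
    compacted view entirely."""
    latest = {}
    for key, value in events:       # events arrive in publish order
        if value is None:
            latest.pop(key, None)   # tombstone: drop the key
        else:
            latest[key] = value     # newer value wins
    return latest

# A stream of (user_id, status) updates:
events = [
    ("alice", "online"),
    ("bob", "online"),
    ("alice", "away"),
    ("bob", None),        # tombstone: bob is deleted
    ("alice", "offline"),
]
print(compact(events))    # {'alice': 'offline'}
```

A new consumer reading this compacted view gets one record per live key instead of the full five-event history.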
Running compaction doesn’t block the topic – you can run it while publishing is happening. It’s an operation that reads the backlog and writes a new ledger. It might consume resources, so schedule it appropriately (e.g., off-peak).
Putting It All Together
Let’s consider how you might configure a Pulsar namespace for different scenarios, drawing parallels to Kafka:
- Ephemeral stream (like Kafka’s default): If you want data to vanish after some time regardless of consumption (like a Kafka topic with 7-day retention and maybe consumers that are expected to keep up or else miss data), you’d set a retention period (time-based) and maybe TTL the same or slightly larger. For example, retention 7 days, TTL 7 days. This way, acked or not, after 7 days data is gone. Consumers that fall behind by more than 7 days lose data. This is a trade-off for bounded storage.
- Work queue (at-least-once, but not infinite backlog): Perhaps you have a queue that should not grow unbounded if consumers are down. You might set TTL for unacked messages to, say, 2 days. If consumers are down for >2 days, those tasks expire. But if they come back before that, they get everything. You might not bother retaining acknowledged messages at all in this case.
- Durable log (don’t lose anything; like Kafka with infinite or very long retention): Keep TTL off (or extremely high) so you never drop unacked messages. Consumers can always come back and get their backlog. Also, maybe set retention for acked messages to some large value if you want the ability to re-read even after ack (like an audit trail). Or use `pulsar-admin topics terminate` to mark an endpoint and handle archival externally. Keep an eye on storage though – infinite retention needs infinite storage or periodic offloading to cold storage (Pulsar has tiered storage to move old ledger data to, e.g., S3).
- Compacted topic for state: Set the topic to be compacted. Also likely set a retention policy so that even after compaction you keep data (compaction keeps the last value per key by design, but keys that got tombstoned will be removed from the compacted log, while the original ledger entries may still exist until retention kicks in). Usually you combine compaction with an infinite (or very long) retention; since the view is compacted, storage doesn’t blow up with old updates. You may still want to purge the underlying data of tombstoned keys after some time – which retention can do.
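As a rough CLI sketch of the retention-based scenarios above (namespace names are placeholders; `-1` means unlimited):

```shell
# Ephemeral stream: data gone after 7 days, consumed or not.
pulsar-admin namespaces set-retention my-tenant/ephemeral --time 7d --size -1
pulsar-admin namespaces set-message-ttl my-tenant/ephemeral --messageTTL 604800

# Work queue: expire unconsumed tasks after 2 days; no retention of acked messages.
pulsar-admin namespaces set-message-ttl my-tenant/work-queue --messageTTL 172800

# Durable log: keep everything; bound growth with tiered storage or backlog quotas.
pulsar-admin namespaces set-retention my-tenant/durable --time -1 --size -1
```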
How to trigger compaction: Kafka’s compaction runs continuously in the background on brokers. Pulsar’s approach is manual or scheduled. In a production Pulsar cluster, you’d typically run an automatic compaction periodically for the topics that need it (via a scheduler or perhaps using Pulsar Functions or external scripts to call the compact command). There’s also a “threshold” based compaction strategy (for example, compact when backlog reaches a certain size). Check Pulsar docs for auto-compaction configs if needed.
Monitoring and Admin for Retention/TTL
- `pulsar-admin topics stats <topic>` will show retention stats and backlog size. You can see how many messages are stored, the backlog size, etc.
- If a backlog is consuming too much space, as an admin you might decide to set a TTL or remove a subscription (if a subscription is not needed but still has backlog, dropping it will free those messages).
- Pulsar has a concept of inactive subscriptions (subscriptions that have no consumers but still have backlog). If a subscription lingers with backlog and no consumers, those messages will sit forever unless TTL or an admin explicitly expires them. Kafka doesn’t have that scenario because if no consumer reads, data still gets deleted by time. Pulsar’s durability means you should watch for ghost subscriptions. If using Pulsar as a Kafka replacement where you only care about consumer groups that are active, make sure to clean up subscriptions when they are no longer needed (or set a TTL/backlog quota so they don’t live forever).
- Tiered storage: If you need long retention but don’t want to burden hot storage, Pulsar can offload older ledger data to cloud storage. That’s beyond our scope here, but know that infinite retention is possible by pushing old data out to cheaper storage, somewhat analogous to Kafka’s tiered storage solutions.
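A few admin commands that help with the points above (topic and subscription names are placeholders):

```shell
# Inspect backlog and storage for a topic:
pulsar-admin topics stats persistent://my-tenant/my-ns/my-topic

# Drop a ghost subscription so its backlog can be released:
pulsar-admin topics unsubscribe persistent://my-tenant/my-ns/my-topic \
  --subscription old-consumer-group

# Manually expire messages older than 1 hour for one subscription:
pulsar-admin topics expire-messages persistent://my-tenant/my-ns/my-topic \
  --subscription old-consumer-group --expireTime 3600
```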
Key Takeaways
- By default, Pulsar retains unacknowledged messages forever and immediately deletes acknowledged messages. This ensures no data loss for slow consumers by default, unlike Kafka which will eventually delete old messages regardless of consumer progress.
- Pulsar’s retention policy allows you to keep acknowledged messages for a configured time/size. This can make Pulsar topics behave more like Kafka logs, where data is available for reprocessing or late joiners for a window of time after consumption.
- TTL (Time-to-Live) deals with the flip side: unacknowledged messages. It sets a limit on how long a message can remain unconsumed before Pulsar drops it. This prevents unbounded growth of backlog if consumers disappear. Kafka’s equivalent (not direct) is just its retention policy which would also delete data not consumed; Pulsar distinguishes between consumed and not consumed.
- Log Compaction in Pulsar allows keeping the latest value per key, similar to Kafka’s compacted topics. Pulsar’s compaction generates a separate compacted view that consumers can opt into. Use compaction for stateful topics where you only care about the latest update per key (with tombstones to delete keys).
- By combining retention, TTL, and compaction settings, Pulsar gives fine-grained control over data lifespan:
- You can achieve at-least-once delivery with bounded storage (via TTL).
- You can achieve replay of recent history (via retention of acked messages).
- You can maintain a compact state topic for lookup of current values (via compaction).
- For a Kafka engineer, remember that Pulsar does not, by default, throw away data after X days blindly – you must configure it to do so if that’s desired. Conversely, you must monitor and manage backlogs or use TTL to avoid a stuck consumer filling up storage – a scenario Kafka would handle by data expiration, but which Pulsar (with no TTL or quota set) will handle by pausing producers or requiring admin action. Pulsar provides the tools to do this safely and more flexibly.
Next up, in Part 6, we’ll explore Schema Management in Pulsar, where we’ll see how Pulsar’s built-in schema registry compares to Kafka’s schema registry concept and how to enforce schema evolution rules on topics.
---------------------------------------------------------------------------------------------------
Want to go deeper into real-time data and streaming architectures? Join us at the Data Streaming Summit San Francisco 2025 on September 29–30 at the Grand Hyatt at SFO.
30+ sessions | 4 tracks | Real-world insights from OpenAI, Netflix, LinkedIn, Paypal, Uber, AWS, Google, Motorq, Databricks, Ververica, Confluent & more!