Introducing Scalable Topics in Apache Pulsar 5.0

Apache Pulsar 5.0 introduces a new kind of topic: a Scalable Topic. It scales transparently through range splits and merges, preserves key ordering across topology changes, and lives behind a type-safe client API that replaces the decade-old Consumer<T> interface with three purpose-named consumers. It is being delivered across three Pulsar Improvement Proposals — PIP-460, PIP-466, PIP-468 — and it composes on top of the existing Pulsar broker, managed ledger, and BookKeeper stack.

This post is the announcement. The two goals are to explain, concretely, what the scalable topic model is and what it isn't; and to set the scope honestly — what lands in 5.0, what's deferred, and how teams already running Pulsar can think about adoption.

Why a new topic type

The partitioned-log model has shaped streaming infrastructure for the last decade. Apache Kafka popularized it. Apache Pulsar inherited it. Most systems that came after — including the ones running at hyperscale inside the largest companies — built on the same foundation. The model is well understood, broadly deployed, and has carried the industry a long way.

It has also accumulated three structural problems that get worse at scale, and they are not Pulsar-specific. Any system built on partitioned logs inherits them.

Key ordering breaks when partition count changes. Partitioned topics route keyed messages using modulo hashing: hash(key) % numPartitions. When an operator increases the partition count — say from 4 to 8 — the modulo mapping changes completely. A key previously routed to partition 2 may now hash to partition 6. Messages produced after the change land on a different partition than messages produced before. For any application relying on per-key ordering — session aggregation, stateful stream processing, deduplication, anything with keyed state — this silently breaks the ordering contract for all in-flight traffic. The only safe remedy is to drain the old topic before switching, which requires downtime. This is true in Kafka. It is true in Pulsar's partitioned topics today. It is true in every system that uses modulo routing to map keys to partitions.

Partitions never decrease. Once a topic is scaled up to 64 partitions, it stays at 64 partitions forever, even if traffic drops to a level where 4 would suffice. Decreasing the count would require rerouting keys (breaking ordering), migrating unconsumed data from removed partitions, and coordinating the change across every active consumer. No production-grade mechanism for this exists in any major partitioned-log system. The practical result is partition-count drift: topics accumulate more partitions than they need, wasting broker resources, metadata store entries, and consumer connections, across every cluster in the industry.

The cost of getting the partition count wrong is permanent. Because partitions cannot shrink and growing them breaks ordering, operators must predict the right count at topic creation. Over-provisioning wastes resources. Under-provisioning produces write-hot partitions that eventually require re-creating the topic with a larger count — and a synchronized cutover. Neither outcome is acceptable for infrastructure that is supposed to be operationally simple at scale.

Why a new client API

The subscription side of the API has a related structural problem, with Pulsar's existing Consumer<T> being a clear example. A single consumer interface exposes operations that are meaningful for some subscription types and silent no-ops for others. Cumulative acks only make sense for ordered consumption. Negative acks only make sense for shared consumption. Transactions, dead-letter queues, and individual acks behave differently per subscription type. The subscriptionType enum hides these differences at the interface level, and invalid combinations become runtime bugs instead of compile errors. Every team that onboards Pulsar eventually asks the same question in an architecture review: "Which subscription type should we use here, and why?" The answer has never been obvious from the API.

The design, in three parts

Pulsar 5.0 addresses both problems in a coordinated way. The topic-scaling problem is addressed by PIP-460. The subscription-interface problem is addressed by PIP-466. They were designed together because the two concerns compose — a new topic type is the right moment to revisit the client API, and vice versa. And because the partitioned-log limitations are industry-wide, the design Pulsar is shipping in 5.0 is a meaningful step forward for the open-source streaming community as a whole — built in the open, under Apache governance, and reviewable as it lands.

1. Range-based routing and the segment DAG (PIP-460)

A Scalable Topic is a dynamic directed acyclic graph of range segments. Each range segment owns a contiguous slice of the keyspace — a hash range — and is backed by exactly one managed ledger on one broker. At produce time, a keyed message is routed to the active segment whose hash range covers the message's key hash. This is a stable routing decision: when a segment splits into two children, each child inherits a sub-range of the parent's keyspace, and the keys a producer was routing into the parent continue to route into the correct child. Modulo arithmetic disappears.

A range segment can be split either automatically (e.g., when it becomes a write bottleneck) or manually through the admin API. The split proceeds in three steps: the current segment is sealed with a termination marker, two child segments are created with defined sub-ranges, and the updated DAG is pushed to all connected clients through a persistent watch session. Producers begin routing to the appropriate child. The consumer that was reading from the parent is preferentially assigned to one of the children to minimize rebalancing. Merging adjacent segments is supported symmetrically, with cross-broker coordination when the segments live on different brokers.

The DAG itself encodes a strict happens-before relationship: all messages in a parent segment precede messages in any of its children. A consumer that is catching up through a topology change traverses the DAG in that order, so per-key processing order is preserved without any special logic in the application.

2. A type-safe client API (PIP-466)

Alongside the new topic type, Pulsar 5.0 introduces a new client API namespace (org.apache.pulsar.client.api.v5) with three separate, type-safe consumer interfaces:

StreamConsumer registers with the consumer controller, receives exclusive segment assignments, and processes messages in key order within each assigned segment. It uses cumulative acknowledgements. This is the replacement for today's Exclusive and Failover subscriptions.
QueueConsumer subscribes to all active segments with shared dispatch, no controller interaction, individual acknowledgements, negative acks, dead-letter queues, and the full semantics teams have been using via the Shared subscription for years. It is the direct descendant of Pulsar's long-standing queue support.
CheckpointConsumer registers with the controller and receives segment assignments like the StreamConsumer, but tracks read positions externally via a serializable Checkpoint snapshot rather than broker-side cursors. It is designed for stream processing frameworks — Flink, Beam, Spark — that manage their own state and need restorable read positions across topology changes.

Each interface exposes only the operations valid for its pattern. A StreamConsumer builder does not expose dead-letter-queue configuration. A QueueConsumer builder does not expose key-ordered positioning. Invalid combinations become compile errors, not runtime surprises.

The existing Consumer<T> API remains available in Pulsar 5.0 and is not deprecated. Teams running on existing Pulsar topics can continue using it indefinitely. Deprecation is planned for Pulsar 6.0 LTS with removal in Pulsar 7.0 LTS — a multi-year runway.

3. The consumer controller (PIP-468)

StreamConsumer and CheckpointConsumer require coordinated segment assignment across a dynamic topology. A broker is elected as the controller for a given subscription, using leader election in the metadata store. Consumers register with the controller through a persistent bidirectional stream, identified by a stable consumer identity — a name chosen by the client, not a connection-scoped UUID. The controller pushes segment assignments through this stream; the consumer reports back as it finishes segments.

If a consumer disconnects because of a transient network issue or a client restart, its segment assignments are held in reserve for a grace period (default ~1 minute, configurable). If it reconnects with the same identity before the grace period expires, the session resumes with its previous assignments unchanged. If it doesn't reconnect, the segments are redistributed to the remaining consumers. The controller persists the hash range assignment state, so resuming from a failure or broker restart allows the consumer to continue key ordered processing with the same assigned hash.

This is the part of the design that most directly addresses the operational pain of today's rebalance protocols. A blip no longer triggers a redistribution. Graceful restarts don't need to coordinate with the rebalance mechanism. The session is the unit of consumer identity, not the connection.

What this unlocks

The design choices above translate into a concrete set of capabilities that no Apache-governed open-source streaming system has offered before. For teams running Pulsar today, and for teams evaluating their streaming platform more broadly, these are the practical wins:

Elastic capacity in both directions. Scalable Topics can grow and shrink in response to load. Range segments split when a partition becomes a write bottleneck and merge when traffic drops. Operators no longer have to over-provision at topic creation to avoid the permanence of a wrong choice. This is the single biggest operational shift, and it is the one capability the partitioned-log model has never offered.

Per-key ordering is preserved across every topology change. Splits and merges produce children that inherit a sub-range of the parent's keyspace, and the segment DAG encodes a strict happens-before relationship between parent and children. Stateful applications — session aggregation, deduplication, keyed stream processing — get their ordering contract honored across the resize, with no application-level logic and no drain-and-cutover.

Right-size topics dynamically. Partition counts no longer need to be predicted at creation time and lived with forever. Topics adjust to actual load. The cost of getting the initial sizing wrong drops from "permanent" to "self-correcting."

Type-safe consumer interfaces. StreamConsumer, QueueConsumer, and CheckpointConsumer each expose only the operations valid for their pattern. The runtime surprises that come from mismatched subscription types and configuration flags become compile errors. New team members can read the API and know which interface fits their use case.

Session-based consumer identity. The consumer controller treats the session — a stable client-chosen identity — as the unit of consumer membership, not the network connection. Transient disconnects, graceful restarts, and short network blips no longer trigger a full rebalance. The session resumes with its previous assignments. This directly addresses one of the most consistent operational complaints across every streaming system.

First-class support for stream processing. CheckpointConsumer exposes restorable, serializable read positions across topology changes, designed for the way Flink, Beam, and Spark already manage state. Stream processing frameworks no longer have to work around broker-side cursor semantics; they get a primitive built for their access pattern.

Not a rewrite of Pulsar. Scalable Topics layer onto the existing proven Pulsar broker, managed ledger, and BookKeeper stack. Teams adopting them stay on the same storage, the same operational tooling, the same deployment topology. This is what the next section gets into in detail, and it is the property that makes adoption a code change rather than an infrastructure project.

Taken together, these are not incremental improvements on the partitioned-log model. They are the properties partitioned logs have not been able to offer, delivered through Apache governance in an open PIP process, on top of a proven storage layer that thousands of organizations are already running in production.

~98% of the existing Pulsar system is reused

This is the part of the design we want to emphasize, because it shapes how teams should think about adopting the scalable topic model.

BookKeeper is unchanged. Managed ledgers are unchanged. The broker's core storage path is unchanged. The existing subscription cursors, the schema registry, transaction support, tiered storage offloaders, and geo-replication infrastructure all continue to operate on existing topics as they do today. PIP-460 adds a new coordination layer — the segment DAG, the consumer controller, the new client API — but it adds it on top of the same proven stack. The numbers vary by module, but across the client SDK, roughly 98% of the existing code is reused.

The practical consequence matters for adoption. Moving an existing topic to a scalable topic is a short code change for most applications: switch the topic URL scheme, migrate to the new consumer interface, and the application runs against the same broker running on the same BookKeeper cluster. This is not a rewrite. And for teams not ready to migrate yet, existing partitioned and non-partitioned topics keep working exactly as they do today.

Pulsar 5.0 will include migration tooling for converting existing topics to scalable topics (PIP-475). It defines a single atomic, one-way admin command — migrate-to-scalable — that flips an existing partitioned or non-partitioned topic into a scalable topic in place, with no data copy and no cursor migration. The V5 SDK interoperates with un-migrated regular topics from day one, so applications can upgrade at their own pace, and the migration moment stays small and surgical once every client on a topic is on V5. Larger-scale operational concerns — automated fleet-wide rolling migration, and cross-cluster (geo-replication) migration coordination — remain deferred to Phase 4 (Pulsar 5.1 and beyond).

Credit where it's due

The range-segment model in PIP-460 is not a new idea in the industry. Two internal systems pioneered it:

Pravega (Dell/EMC, 2017, CNCF) introduced segment-based streams with SLO-driven automatic splitting, Reader Groups that coordinate segment assignments across readers, and Checkpoints as serializable restore points for stream processing. The Pulsar CheckpointConsumer descends directly from the Pravega Checkpoint + Reader Group model.
LinkedIn Northguard (announced June 2025) replaced Kafka inside LinkedIn at a scale of 32 trillion records per day across 400,000 topics. Northguard's data model — records → segments → ranges → topics — and its buddy-algorithm for splits and merges have impacted the PIP-460's segment DAG design.

PIP-460 credits both explicitly, and adopts specific design choices from each. The buddy-split constraint, in particular, comes from Northguard: a range segment can only be merged with its unique buddy, which makes the parent-child happens-before relationship a strict partial order. That strictness is what makes catch-up reads correct across topology changes.

The difference between PIP-460 and its predecessors is not the model but the implementation path. Pravega and Northguard both built new storage stacks. PIP-460 layers the same pattern on top of the existing Pulsar broker and BookKeeper infrastructure, with the ~98% SDK reuse that comes from composing on a proven system. Different tradeoffs, arriving at a similar design for similar reasons.

Inside Pulsar itself, Scalable Topics builds on two earlier PIPs: PIP-379 (Key_Shared Draining Hashes) contributes the per-key ordering mechanics used inside a single range segment, and PIP-335 (Oxia metadata store) contributes the streaming watch sessions that make server-pushed topology updates practical without polling. Oxia is a robust, scalable metadata store and coordination system designed for large-scale distributed systems, with built-in support for stream index storage to optimize real-time data management. It is fully open source and licensed under Apache 2.0 license. Oxia replaces Zookeeper in Pulsar 5.0 as the default metadata store and coordination system.

What ships, and what's deferred

The Pulsar 5.0 work arrives in phases, so teams can adopt incrementally:

Phase 1 (Pulsar 5.0.0-M1): New client API (PIP-466), range segment abstraction, a basic scalable topic with a single segment, and manual splitting via the admin API. Validates the storage and metadata model end-to-end.
Phase 2 (Pulsar 5.0.0-M2): Consumer controller (PIP-468), stream consumer rebalancing on segment changes, I/O-threshold auto-split, and the QueueConsumer model. Limited geo-replication support.
Phase 3 (Pulsar 5.0.0 GA, targeted for September/October 2026): Range merging, finalized client API, CheckpointConsumer, all three consumer types production-ready.
Phase 4 (Pulsar 5.0.1 and later): Replicated Subscriptions across scalable topics, and full geo-replication.

The deferred items are important and none of them is easy. Replicated Subscriptions require a new model for tracking subscription positions across independently-evolving topology DAGs in different clusters. Geo-replication requires the new message entry format to support broker-level routing without decompressing payloads. Each of these will be a dedicated sub-PIP with its own design document, its own community review, and its own release timing.

We want to be explicit about this up front. A phased delivery lets the community validate the primitives in 5.0.0 milestone releases before the 5.0 LTS commitment. It also means some things that today's Pulsar does well — geo-replicated subscriptions, atomic transactions spanning multiple partitions — will take additional releases to reach the same capability level for scalable topics. Applications that depend on those features today can continue using partitioned topics for as long as they need.

A note on applicability beyond Pulsar

A Pulsar-first innovation, with room to grow

Scalable Topics are landing in Apache Pulsar 5.0, and Pulsar is where the model is being proven. The reason this matters beyond Pulsar is that the coordinator layer in the design — the segment DAG, the leader-elected controller, the persistent session model with grace-period lease — is defined at the protocol level, not bound to Pulsar-specific storage. That is an intentional design property. It means the primitives the community is building and reviewing in the open today have the structural shape to support other surfaces later, with different storage adapters underneath.

We mention this so the door stays open. It is not the focus of this launch. Pulsar 5.0 is where Scalable Topics ship, where they get production-validated, and where the community shapes them between now and GA. Any broader cross-protocol conversation is downstream of that work, not parallel to it.

What's in the rest of the series

We have a few blog posts planned for diving deeper into the implementation of Scalable Topics, including but not limited to Range-Based Routing and the Segment DAG, the new client API, the consumer controller and a discussion on generic architecture and what scalable topic unlocks next.

The three PIPs are the authoritative technical source. Each post in the series will link back to the specific PIP sections it draws from. We are writing these posts because the PIPs are dense, and the community benefits from explanations that meet different readers at different levels of detail. If you find discrepancies between a post and a PIP, trust the PIP.

The work is happening in public, on the Apache Pulsar dev mailing list and in the apache/pulsar repository. Issues, comments, and review on the PIPs are welcome. The best-case outcome of this launch is that the community shapes the implementation between now and the 5.0 GA target, rather than reacting to it after the fact.

Subscribe on streamnative.io/blog to get each post as it ships. The three PIPs: PIP-460, PIP-466, PIP-468.