Beyond the Broker: Standardizing the Streaming API

In the messaging and streaming arena, there has never been a one-size-fits-all protocol. Apache Kafka, RabbitMQ, Apache Pulsar, NATS, MQTT, AMQP – each was created with different assumptions and goals. This diversity is reminiscent of the early database world, with multiple query languages competing before SQL became the standard. But unlike databases, streaming systems have fundamental semantic differences that make a single unified “standard API” challenging. Instead of forcing one protocol to rule them all, the emerging consensus is to embrace multiple protocols (each optimized for certain use cases) and ensure they interoperate at a deeper level. It’s analogous to the data lakehouse philosophy: multiple processing engines can coexist (Spark, Trino, even TensorFlow reading training data), as long as they operate on the same unified data storage. Similarly, multiple streaming protocols can coexist while sharing the same underlying event streams.
First, let’s acknowledge why multiple protocols exist and persist:
- Different messaging semantics: Kafka popularized the idea of a durable log with replayable events and consumer-driven offsets – great for streaming analytics and event sourcing (a short sketch after this list shows what consumer-driven offsets look like in code). RabbitMQ (AMQP) and similar MQs focus on push-based, worker-queue semantics (each message goes to one consumer, often for task processing) with features like acknowledgments, routing keys, and transactions for reliability in business processes. Pulsar was designed to handle both patterns (pub-sub and work queues) in one system, introducing the concept of exclusive vs. shared subscriptions. Meanwhile, systems like MQTT and NATS cater to lightweight, transient messaging (IoT devices, in-memory microservices) where low overhead and simplicity matter more than durability. No single protocol covers all these scenarios perfectly, because optimizing for one can mean trade-offs for another (e.g., a design for ultra-low-latency ephemeral messaging might not guarantee the durability or ordering needed for financial event streams).
- Historical ecosystems: Companies and open-source communities have built rich ecosystems around these protocols. Kafka, for example, has an entire ecosystem of connectors, stream-processing libraries, and a large installed base. JMS (Java Message Service) tried to standardize an API for message queues, but it mainly provided a common abstraction in Java – it didn’t unify wire protocols across vendors. The inertia of existing applications means any “new standard” would have to either seamlessly emulate these protocols or convince everyone to rewrite their systems, which is unlikely.
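To make the first distinction concrete, here is a minimal sketch of what “consumer-driven offsets” mean in practice, using the standard Kafka Java client (the broker address and topic name are placeholders): the consumer, not the broker, decides where in the durable log to read, and can rewind at will.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("group.id", "replay-demo");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // The consumer chooses its own position in the durable log.
            TopicPartition tp = new TopicPartition("orders", 0);
            consumer.assign(List.of(tp));
            consumer.seek(tp, 0L); // rewind to the start: events are replayable

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```

A queue-oriented broker, by contrast, tracks delivery state itself and removes a message once a worker acknowledges it; there is no equivalent of seeking backwards through history.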
That said, we do see a convergence in capabilities. Modern Kafka is adding features that look more like traditional queues: for instance, Kafka 4.0 introduced “Queues for Kafka” (KIP-932), which enables true shared consumption where a group of consumers cooperatively consume from a topic without exclusive partition assignments. This essentially gives Kafka point-to-point queue semantics (multiple consumers dividing up a topic’s messages), similar to JMS or Pulsar’s shared subscriptions. On the flip side, Apache Pulsar offered both queue and pub-sub semantics in one API from day one (a subscription can be exclusive, shared, or failover), and even introduced transaction support to match Kafka’s exactly-once features. RabbitMQ has added streams (a new data structure for persistent logs) to catch up with the high-throughput use cases Kafka handles. Protocols are evolving and borrowing features from each other: the gaps are narrowing.
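Pulsar’s unified model is easiest to see in its client API. In the sketch below (Java, using the org.apache.pulsar:pulsar-client library; service URL, topic, and subscription names are placeholders), the same topic serves as a work queue for one subscription and a classic pub-sub feed for another:

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class SubscriptionModes {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder address
                .build();

        // Shared subscription: queue semantics. Messages are distributed
        // across all consumers attached to this subscription.
        Consumer<byte[]> worker = client.newConsumer()
                .topic("orders")
                .subscriptionName("order-workers")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // Exclusive subscription: classic pub-sub. A single consumer
        // receives every message, in publish order.
        Consumer<byte[]> auditor = client.newConsumer()
                .topic("orders")
                .subscriptionName("audit-log")
                .subscriptionType(SubscriptionType.Exclusive)
                .subscribe();
    }
}
```

KIP-932’s share groups give Kafka applications a similar choice between dividing messages up and fanning them out, without changing the underlying log.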
However, this doesn’t mean they are becoming identical or that one will subsume all the others. Each community prioritizes different aspects – Kafka prioritizes throughput and a simple partition model, Pulsar prioritizes multi-tenancy and infinite retention via tiered storage, RabbitMQ prioritizes flexible routing and ease of use for work queues, and so on. Expecting a single standard API (akin to ODBC or JDBC in databases) to replace these is therefore unrealistic in the near term. The richer the semantics, the harder it is to standardize them without falling back to a lowest common denominator.
So, what’s the path beyond the broker? It’s to look below the broker API – toward the storage and data layer. Instead of standardizing the API that producers and consumers use, standardize how the data is stored and shared so that different APIs can access it. This is exactly how the lakehouse works for batch data: engines don’t need the same API, they just need to agree on the format of the data (Parquet files, Iceberg metadata). In streaming, this could mean agreeing on a common log or table format for messages, and building adapters so that a Kafka client and a Pulsar client, for example, could read from the same stream of events. We already discussed how Ursa writes data to open formats – envision a Kafka application writing to a stream and a Pulsar application reading from that same stream’s storage, each using its own API, with the data interchange happening at the storage layer in an open format like Parquet or JSON. StreamNative’s platform actually moves in this direction: it allows Kafka clients to produce to a Pulsar-managed topic (via KSN, Kafka on StreamNative, which uses Pulsar underneath). In that scenario, Pulsar’s broker translates the Kafka protocol onto the underlying Pulsar log, and because Pulsar offloads data to tiered storage in open formats, any other protocol handler or tool that knows how to interpret those formats can also consume it.
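From the application’s point of view, nothing changes. A minimal sketch, assuming a Pulsar or KSN endpoint with the Kafka protocol handler enabled (the endpoint and topic names are placeholders, and any authentication settings are omitted): the producer below is plain, unmodified Kafka client code.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CrossProtocolProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder endpoint: a Pulsar broker (or KSN cluster) with the
        // Kafka protocol handler enabled looks like any Kafka bootstrap server.
        props.put("bootstrap.servers", "pulsar-broker.example.com:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The application speaks pure Kafka; the broker side translates
            // the protocol onto the Pulsar log and its tiered storage.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"total\": 99.5}"));
        }
    }
}
```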
In essence, multi-protocol streaming is becoming a reality, much like the multi-engine lakehouse. Apache Pulsar’s architecture supports pluggable protocol handlers – there are already implementations for Kafka (so Kafka apps talk to Pulsar as if it were a Kafka broker), for AMQP (Starlight for RabbitMQ), and others. This means one data stream can be accessed via multiple APIs. Another approach works at the ingestion level: for instance, an event could be produced via an HTTP API (e.g., a REST call) and consumed via a WebSocket or the Kafka API – again, multiple interfaces to the same stream (see the sketch below). The data layer unifies them. The parallel to the multi-engine lakehouse is direct: in a lakehouse, you don’t force all queries through one SQL dialect or one engine; you let each engine do what it’s best at (Spark for large ETL, Pandas for small-scale data science, Dremio/Trino for ad-hoc SQL) but ensure they operate on the same single source of truth. For streaming, one protocol might be best for one scenario (say, MQTT for IoT ingestion, because it’s lightweight), another for a second (the Kafka API for connecting to systems that already speak Kafka), and a third for something else (Pulsar’s own API for its rich feature set). If they all write to and read from the same stored stream, we’ve achieved interoperability without forcing a single API.
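As a sketch of the “multiple interfaces to one stream” idea: Pulsar exposes a WebSocket endpoint alongside its binary protocol, so the topic written by the Kafka producer above could be read by a plain WebSocket client with no Pulsar library at all. The host, port, topic, and subscription below are placeholders, and the URL shape follows Pulsar’s documented /ws/v2/… convention.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

public class WsConsumerSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL following Pulsar's WebSocket consumer convention:
        // ws://<host>:8080/ws/v2/consumer/persistent/<tenant>/<namespace>/<topic>/<subscription>
        String url = "ws://pulsar-broker.example.com:8080"
                + "/ws/v2/consumer/persistent/public/default/orders/ws-readers";

        HttpClient.newHttpClient().newWebSocketBuilder()
                .buildAsync(URI.create(url), new WebSocket.Listener() {
                    @Override
                    public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
                        // Each frame is a JSON envelope with a base64-encoded
                        // payload; a real client would parse it and reply with
                        // an ack frame of the form {"messageId": "..."}.
                        System.out.println("frame: " + data);
                        ws.request(1); // ask for the next frame
                        return null;
                    }
                }).join();

        Thread.currentThread().join(); // keep the demo process alive
    }
}
```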
Let’s consider a concrete example: imagine an e-commerce company with a stream of orders. Some internal systems are built with Kafka and use its API to produce and consume order events. Meanwhile, a new microservices team prefers Pulsar for its flexibility and multi-tenancy. In a traditional world, you’d either run two parallel pipelines (duplication) or bridge them with connectors (added complexity). In the emerging world, you could use a unified storage format for the “orders” stream – say, an Iceberg table or a distributed log on S3. The Kafka producers send events, a Pulsar cluster (with a Kafka protocol handler) ingests them into that storage, and Pulsar consumers or even Athena queries can access the data. Both teams see the same events, consistent in storage, even if one team thinks in terms of Kafka topics and the other in Pulsar subscriptions. Cloud offerings already hint at this scenario: Cloudera’s platform offered unified messaging in which multiple interfaces sat on top of the same store, and Azure Event Hubs can speak the Kafka protocol while using its own storage underneath.
It’s worth noting that attempts have been made to define a common messaging API (e.g., AMQP as an open wire protocol, and the OpenMessaging initiative under the Linux Foundation). AMQP is used by many systems (including RabbitMQ, Apache Qpid, and Azure Service Bus) – it provides a standardized wire format for messaging operations. Yet Kafka notably did not adopt AMQP, and neither did most log-based systems, because it didn’t align with their design. OpenMessaging aimed to be a cloud-era abstraction that would let applications be messaging-system-agnostic. It defined some common concepts (Message, Producer, Consumer, Namespace) and even a benchmark suite. However, it never achieved broad adoption as the API – partly because, again, the lowest common feature set is too limiting, and performance optimizations often rely on protocol-specific tweaks.
Given this reality, the industry trend is toward protocol adapters and bridges rather than a new unified protocol. Multi-protocol brokers like Pulsar can natively speak multiple languages to clients. Kafka itself, through community and vendor efforts, might gain similar bridging capabilities (e.g., ingesting MQTT directly). There is also standardization at the message level: CloudEvents (a CNCF specification for describing event data) standardizes the content of messages even when the transport differs.
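To illustrate what CloudEvents standardizes: it fixes a small set of envelope attributes (id, source, type, and so on) that travel with the event over any transport. A minimal sketch using the CloudEvents Java SDK (io.cloudevents:cloudevents-core); the ids, URIs, and payload below are made up for illustration:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.time.OffsetDateTime;
import java.util.UUID;
import io.cloudevents.CloudEvent;
import io.cloudevents.core.builder.CloudEventBuilder;

public class CloudEventSketch {
    public static void main(String[] args) {
        // The envelope attributes below are what the spec standardizes; the
        // event itself can then travel over Kafka, AMQP, MQTT, or plain HTTP.
        CloudEvent event = CloudEventBuilder.v1()
                .withId(UUID.randomUUID().toString())        // unique per event
                .withSource(URI.create("/ecommerce/orders")) // who produced it
                .withType("com.example.order.created")       // what happened
                .withTime(OffsetDateTime.now())
                .withDataContentType("application/json")
                .withData("{\"orderId\": 42}".getBytes(StandardCharsets.UTF_8))
                .build();

        System.out.println(event);
    }
}
```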
The phrase “beyond the broker” implies we should look past the broker-specific APIs to the underlying substrate of streaming. That substrate is the log of events itself. Standardizing that – via open file formats, shared object stores, and common metadata – is more feasible, and arguably more useful, than trying to get everyone to use the same client API. It means, for example, a company could run multiple broker technologies (Kafka for some parts of the workload, Pulsar for others, maybe AWS Kinesis for something else) but decide that all of them will offload their data to a unified storage layer (say, an S3 data lake in Iceberg format). In that unified storage, each topic or stream from any source is just a table or folder. Consumers that don’t need sub-second latency can read directly from that store (batch or micro-batch style), while real-time consumers attach to the brokers. Over time, as brokers themselves separate compute from storage (Pulsar already does this; Kafka is evolving in the same direction with tiered storage), the storage becomes the source of truth, and brokers act more like caching and routing layers.
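For the batch path just described, a hedged sketch: once a topic has been offloaded to an Iceberg table, any Iceberg-aware engine can read it with no broker in the path. The example uses Spark’s Java API and assumes an Iceberg catalog named lake has already been configured to point at the S3 warehouse; the table and column names are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OffloadedStreamRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-offloaded-stream")
                // Assumes an Iceberg catalog "lake" is configured elsewhere
                // (spark.sql.catalog.lake = org.apache.iceberg.spark.SparkCatalog, ...).
                .getOrCreate();

        // The offloaded topic is just a table: batch consumers query it
        // directly from object storage while live consumers stay on the broker.
        Dataset<Row> orders = spark.table("lake.streams.orders");
        orders.groupBy("status").count().show();
    }
}
```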
To sum up, trying to standardize the streaming API is a bit like trying to standardize programming languages – it’s not necessary if you can standardize the ABI or runtime under the hood. Each protocol will continue to serve its niche and play to its strengths – there is no one-size-fits-all protocol, and that’s okay. The focus should instead be on interoperability: ensuring data can flow from one system to another with minimal friction. The unified log/table storage approach is a promising path to achieve this. It decouples the “language” of the streams from the data itself. In practical terms, we’ll see more systems where a single stream of events can be accessed via multiple APIs. It’s already happening with Pulsar’s multi-protocol support and Kafka’s foray into queue semantics.
In the future, we might not need to ask “should I use Kafka or Pulsar or RabbitMQ for this?” as an either-or question. We might publish data, and that data can be consumed by any number of different protocol clients depending on what’s convenient – much like data in a lakehouse can be queried by SQL, or read via Python, or processed with R, all equally. The broker becomes less of a monolith that holds data hostage in its format, and more of a serving layer. Going beyond the broker means designing streaming systems where the value lies in the data and its open accessibility, rather than in proprietary APIs. It’s an exciting convergence of ideas: messaging systems learning from data lakes, and vice versa. By standardizing on storage and embracing multiple protocols, we get the reliability and maturity of existing systems without forcing a single new standard. In short, the future of streaming will be multi-protocol, and that’s not a drawback but a strength – as long as we ensure they can all talk to each other’s data. The lakehouse for streams is on the horizon, and it speaks many languages fluently.