July 3, 2025
8 min read

Anatomy of a Stream: Data vs Metadata vs Protocol

Sijie Guo
Co-Founder and CEO, StreamNative

In our previous post, Why Streams Need Their Iceberg Moment, we introduced the vision of a three-layer architecture for streaming systems, inspired by the Apache Iceberg-fueled lakehouse revolution. Now, let’s dissect each of those layers in detail. How exactly do we separate a streaming system into data, metadata, and protocol, and what does each part do? In this deep dive, we’ll explore the anatomy of a stream through the lens of these three layers. We’ll also draw parallels to the lakehouse model (the separation of table storage, table metadata, and query engine) to solidify the concepts.

Modern streaming platforms can be reimagined as three cooperative components:

1. The Data Layer: Stream Storage Reimagined

What it is: The data layer is responsible for storing the actual events/messages that flow through streams. In a traditional broker, this corresponds to the log files on disk (e.g. Kafka’s segment files or Pulsar’s BookKeeper ledgers). In the new layered model, the data layer is broken out as an independent stream storage service or medium.

Today’s situation: In Kafka’s classic design, each broker stores segments of the log on its local filesystem, and brokers replicate these segments to each other for fault tolerance. Apache Pulsar took a different approach by using Apache BookKeeper bookies as a distributed storage cluster for logs, separate from the broker that handles clients. Pulsar’s architecture was an early step toward decoupling: the broker became stateless for data, and storage responsibilities were offloaded to bookies. However, even BookKeeper clusters have their own disks and replication to manage, and the data is in a proprietary format not directly queryable by external analytics engines. Most streaming stacks still treat stored data as ephemeral: something to keep around for a retention period and then drop or offload elsewhere for analysis.

The new approach: In a truly disaggregated data layer, stream data lives in a scalable, low-cost store that can retain data indefinitely and make it accessible beyond just the streaming consumers. The prime candidate for this is cloud object storage (like Amazon S3, Google Cloud Storage, etc.) or a distributed file system, combined with an open file format. Instead of brokers writing to local disk, imagine that when a producer sends an event, it gets persisted directly into an “open table” format file on S3 (or a similar store). For example, a stream’s data could be stored as a set of Parquet files in a directory, or as append-only files following a format that external readers understand. This is analogous to how Iceberg stores table data as files in object storage. The data layer would handle chunking and writing events in an efficient way (perhaps buffering in memory and writing larger blocks, similar to how columnar file formats work). It would also handle replication or durability – with cloud storage, replication is often built-in (multi-AZ redundancy), so we avoid the cost of triple-replicating at the application level.
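
To make this concrete, here is a minimal sketch of a data-layer writer that buffers incoming events and flushes them as Parquet objects to S3. It assumes pyarrow and boto3; the class name, bucket layout, and flush policy are hypothetical, not Ursa’s actual implementation.

```python
import io
import time
import uuid

import boto3                     # assumed S3 client; any object-store SDK would do
import pyarrow as pa
import pyarrow.parquet as pq


class StreamSegmentWriter:
    """Hypothetical data-layer writer: buffer events, flush Parquet to S3."""

    def __init__(self, bucket: str, stream: str, max_buffer: int = 10_000):
        self.s3 = boto3.client("s3")
        self.bucket = bucket
        self.stream = stream
        self.max_buffer = max_buffer
        self.buffer: list[dict] = []

    def append(self, key: bytes, value: bytes) -> None:
        # Record each event with an ingestion timestamp (nanoseconds).
        self.buffer.append({"key": key, "value": value, "ts": time.time_ns()})
        if len(self.buffer) >= self.max_buffer:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        # Write the buffered events as one columnar, analytics-friendly object.
        table = pa.Table.from_pylist(self.buffer)
        out = io.BytesIO()
        pq.write_table(table, out)
        object_key = f"{self.stream}/data/{uuid.uuid4().hex}.parquet"
        self.s3.put_object(Bucket=self.bucket, Key=object_key, Body=out.getvalue())
        self.buffer.clear()
```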

A key benefit of this approach is “streaming data = lakehouse data”. The moment an event is written, it’s not only available to stream consumers, but it’s also sitting in an analytics-friendly store. Your streaming history is your data lake. This eliminates the need for separate ETL processes to copy data from a message queue into data lake storage. Anyone with access to the storage (and proper permissions) could run SQL queries or AI model training on the live data, using tools like Spark or Flink, because it’s already in an accessible format. In practice, there will be indexes or additional metadata to help readers efficiently find their position in the stream (e.g. an index mapping offsets to file ranges), but those are part of the metadata layer (covered next).
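
Because those segments are plain Parquet files, any engine that can reach the object store can query the stream’s history directly. A hedged illustration with DuckDB, assuming the hypothetical bucket layout from the writer sketch above and S3 credentials already configured:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")   # enables s3:// paths in DuckDB
con.execute("LOAD httpfs")

# Hypothetical bucket/stream path; assumes S3 credentials are configured.
rows = con.execute("""
    SELECT count(*)                     AS events,
           min(to_timestamp(ts / 1e9))  AS first_event,
           max(to_timestamp(ts / 1e9))  AS last_event
    FROM read_parquet('s3://my-bucket/orders/data/*.parquet')
""").fetchall()
print(rows)
```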

Challenges and considerations: Decoupling storage like this raises questions about latency and throughput. Directly writing to object storage can introduce higher latency than writing to a local disk due to network hops. However, there are ways to mitigate this: use caching for hot data, memory buffers, and perhaps a tiered approach (write immediately to a WAL in memory or fast storage, then flush asynchronously to S3). Many systems (including StreamNative Ursa) use such techniques to get the best of both worlds – the durability of object storage with the latency of fast storage. Another consideration is format: should streams be stored as a pure log (a sequence of events) or also translated into columnar formats for efficiency? Projects like Apache Iceberg are exploring streaming ingest, where streams of inserts can continuously produce new table files, so the line between “stream” and “table” blurs. Regardless of implementation details, the data layer’s overarching role is clear: keep all the data safe and accessible, decoupled from the serving compute. This layer can be scaled by simply adding storage capacity (or letting the cloud storage scale automatically), independent of how many clients or how much compute power is needed to serve the data.
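
One way to picture the tiered write path described above: acknowledge producers after a fast WAL write, then flush larger batches to object storage in the background. The `wal` and `object_store` interfaces below are hypothetical placeholders, not any particular system’s API.

```python
import queue
import threading


class TieredLog:
    """Sketch of a tiered write path: fast WAL ack, async object-store flush."""

    def __init__(self, wal, object_store, batch_size: int = 1000):
        self.wal = wal                      # hypothetical low-latency durable log
        self.object_store = object_store    # hypothetical S3-like segment writer
        self.batch_size = batch_size
        self.pending: "queue.Queue[bytes]" = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def append(self, event: bytes) -> None:
        self.wal.append(event)     # durable, low-latency acknowledgment path
        self.pending.put(event)    # queued for the cheaper object-store tier

    def _flush_loop(self) -> None:
        batch: list[bytes] = []
        while True:
            batch.append(self.pending.get())
            if len(batch) >= self.batch_size:
                # One larger write amortizes object-store request costs.
                self.object_store.write_segment(batch)
                self.wal.truncate(count=len(batch))  # fast tier can now drop them
                batch = []
```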

2. The Metadata Layer: The Stream Catalog and Brain

What it is: The metadata layer manages all the information about the streams – their definitions, the location of data, the consumer positions, and any coordination needed. Think of it as the catalog or meta-store for streaming data. In databases, this would be your system tables or Hive Metastore; in Iceberg’s world, it’s the catalog services that track table schema and snapshots.

Today’s situation: In current streaming systems, metadata is often entangled with the brokers or tied to external systems like ZooKeeper. For example, Kafka (pre-KIP-500) stored partition assignments, leader election, and consumer group offsets in ZooKeeper (and later in internal topics on brokers). The metadata about what topics exist, how many partitions they have, who the leader is, and so on was spread across ZooKeeper and the brokers’ memory/state. Pulsar, on the other hand, uses a pluggable metadata store (such as ZooKeeper, etcd, or Oxia) to store metadata about topics, cursors (subscriptions), and so forth. This means scaling or changing how metadata is handled can be as complex as the data scaling problem itself. A lot of the “complexity” of running a Kafka cluster, for instance, has historically been about managing this metadata: ensuring ZooKeeper is healthy, performing controlled leader elections, and, more recently, migrating to the KRaft (Kafka Raft) metadata quorum as ZooKeeper is phased out. The bottom line is that metadata hasn’t been a first-class, independent component – it’s been tightly linked to the runtime of the streaming cluster.

The new approach: In a three-layer design, the metadata layer is a dedicated, standalone service or set of services that act as the source of truth for stream state. We can envision it as a Stream Catalog analogous to an Iceberg catalog. It would hold information like: a list of all streams (topics) and their configurations, schemas for each stream (if using schema registry integration), the mapping of stream partitions to data files or objects in the data layer, and consumer group offsets or positions (i.e., where each consumer has read up to). Essentially, any piece of information needed to coordinate producers and consumers lives here, rather than being implicitly known by a broker process.
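
As a rough mental model, the catalog’s contents could be expressed with data structures like the following; the field names and layout are hypothetical, purely to show what kind of state moves out of the brokers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SegmentRef:
    """Maps a range of a partition's offsets to one object in the data layer."""
    object_key: str    # e.g. "orders/data/<uuid>.parquet"
    base_offset: int   # first offset stored in this object
    last_offset: int   # last offset stored in this object


@dataclass
class StreamMetadata:
    """Hypothetical catalog entry for one stream (topic)."""
    name: str
    partitions: int
    schema: str                                # e.g. an Avro or JSON schema string
    retention_days: Optional[int] = None       # governance policy, if any
    segments: dict[int, list[SegmentRef]] = field(default_factory=dict)         # partition -> files
    committed_offsets: dict[str, dict[int, int]] = field(default_factory=dict)  # group -> partition -> offset
```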

Designing this layer raises questions: do we implement it as a highly available database, or as a set of metadata files on the same object storage (much as Iceberg maintains a metadata JSON file and manifest lists)? The answer could be either, or a mix. One promising pattern is to leverage the same tech as the lakehouse: for instance, one could treat each stream as akin to an Iceberg table internally – with snapshots pointing to new data files, etc. In fact, if the data layer writes Parquet files for a stream, an Iceberg table’s metadata could naturally catalog those files. However, streaming has extra needs (like real-time consumer offsets and perhaps event-time indexes) that might extend beyond a static table definition. The metadata layer might therefore use a lightweight distributed consensus system (e.g., etcd or a Raft-based service) to manage monotonically increasing offset sequences and to track subscribers. The key is that brokers (protocol servers) consult this metadata service rather than owning that knowledge exclusively.
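
For example, a consumer-offset commit could be a compare-and-set against the metadata service, so concurrent protocol servers never clobber each other. This sketch assumes a hypothetical `meta_store` client with `get` and `compare_and_set`; the pattern, not the API, is the point.

```python
class OffsetCommitConflict(Exception):
    """Raised when another protocol server committed the same cursor first."""


def commit_offset(meta_store, stream: str, group: str, partition: int, new_offset: int) -> None:
    # Hypothetical consensus-backed store (etcd, a Raft service, etc.);
    # only the compare-and-set pattern matters here.
    key = f"/streams/{stream}/groups/{group}/offsets/{partition}"
    current = meta_store.get(key)              # last committed offset, or None
    if current is not None and new_offset <= int(current):
        return                                 # never move a cursor backwards
    if not meta_store.compare_and_set(key, expected=current, new=str(new_offset)):
        # Concurrent commit: re-read and retry, much like an Iceberg
        # snapshot-commit conflict in the batch world.
        raise OffsetCommitConflict(key)
```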

Benefits: By isolating metadata, we gain clarity and consistency. Multiple protocol servers can all refer to one canonical source of truth, ensuring they behave consistently. It also improves governance and interoperability: a well-defined catalog of streams could be exposed via standard APIs. Imagine being able to query the streaming metadata layer to discover what streams exist and their schema, just like you’d query an Iceberg Catalog for available tables. This makes it easier to integrate streaming data with other systems – for example, an ETL job could use the catalog to find the latest snapshot of a stream, or an auditor could verify all streams comply with certain retention policies. Another boon is independent scaling and tuning: the metadata store can be optimized for high-write, low-latency operations (like committing a new event sequence or updating a consumer offset) and scaled out with consensus nodes, without touching the data path or client handling logic. If the volume of streams or consumer groups grows, we beef up the metadata service accordingly, without needing to muck with the storage layer.
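
As a small illustration of that governance angle, an auditor script could walk the stream catalog much like listing tables in an Iceberg catalog; `catalog.list_streams()` and the `retention_days` field are hypothetical, mirroring the catalog sketch above.

```python
def audit_retention(catalog, max_retention_days: int = 30) -> list[str]:
    """Return the names of streams whose retention policy exceeds the allowed maximum."""
    violations = []
    for stream in catalog.list_streams():
        # Streams with no policy, or an overly long one, are flagged for review.
        if stream.retention_days is None or stream.retention_days > max_retention_days:
            violations.append(stream.name)
    return violations
```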

Iceberg again provides a guiding analogy: Iceberg’s metadata layer (its catalogs and metadata files) enabled features like time travel, schema evolution, and concurrent writes in the batch world. A robust streaming metadata layer could similarly enable new stream features – think seal/unseal of streams, consistent replay from points in time (since the metadata could mark snapshots of a stream at intervals), or even transactions across streams. It becomes the brain that coordinates complex behaviors that would be very hard to bolt onto a monolithic broker. As a bonus, if the metadata is stored in a standardized way (say, using Iceberg’s format for log segments), external systems might even read it to understand stream contents or hook streaming data into data lineage tools.

3. The Protocol Layer: Pluggable Brokers and Interfaces

What it is: The protocol layer is what the outside world interacts with – it’s the API and the delivery mechanism for streaming data. In simple terms, these are the servers that clients connect to using some protocol (e.g., the Kafka binary protocol, REST, MQTT, etc.). Their job is to accept data from producers, serve data to consumers, and enforce the semantics of the streaming system (ordering guarantees, acknowledgments, subscription management). In a traditional architecture this is tightly bound to the storage – the broker both speaks the protocol and writes to disk. Here, we split it out: the protocol layer’s components speak the language of streaming but delegate actual data persistence and state to the other layers.

Today’s situation: Kafka’s protocol is quite complex but well-understood; clients talk to specific broker hosts which handle both the network IO and disk IO for partitions they lead. Scaling the throughput means scaling brokers since they do all the work. If you want a different interface (say an HTTP API for Kafka topics), you typically need a separate bridge that still ultimately talks to the Kafka brokers. Pulsar took a step here by supporting multiple protocols (via protocol handlers) – Pulsar brokers can natively understand not just the Pulsar protocol but also MQTT or Kafka (with a plugin), translating those calls to the Pulsar core. That hints at what could be possible if the protocol handling was more modular. But even in Pulsar, the broker is still where data lives (in memory until written to bookies) and where subscription cursors update, etc., so it’s not fully isolated.

The new approach: In the three-layer model, the protocol servers are stateless or near-stateless. They can be thought of as edge servers or API gateways to the streaming system. Their responsibilities would include: handling client connections, implementing the nuances of a protocol (e.g. Kafka’s fetch and produce requests, or Pulsar’s subscribe and flow control commands), orchestrating data flow between clients and the data layer, and making calls to the metadata layer for coordination (like finding out where the latest data for a topic is, or updating a consumer’s position). Crucially, these protocol nodes do not store the full stream data on local disk (except perhaps a cache); they don’t have exclusive ownership of a partition’s data. Instead, they might temporarily cache recent messages for speed, but the authoritative storage is the data layer. In case of a failure, any other protocol node can take over serving a given client, because all the state it needs (what data has been written, where to find it, what the consumer offset is) is in the data and metadata layers.
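
A stateless protocol server might look roughly like this: it persists to the data layer, records the result in the catalog, and treats its local cache as disposable. All interfaces here (`catalog`, `data_layer`, `cache`) are hypothetical stand-ins, not the real Kafka or Pulsar protocol machinery.

```python
class ProtocolServer:
    """Stateless front-end sketch: durable state lives in the other two layers."""

    def __init__(self, catalog, data_layer, cache):
        self.catalog = catalog
        self.data_layer = data_layer
        self.cache = cache   # recent messages only; safe to lose on restart

    def handle_produce(self, stream: str, partition: int, records: list[bytes]) -> int:
        # Persist first, then record the new segment (and end offset) in the catalog.
        segment = self.data_layer.append(stream, partition, records)
        last_offset = self.catalog.commit_segment(stream, partition, segment)
        self.cache.put(stream, partition, records, last_offset)
        return last_offset

    def handle_fetch(self, stream: str, partition: int, offset: int, max_bytes: int) -> list[bytes]:
        # Serve hot data from cache; otherwise use the catalog's
        # offset-to-object mapping to read from the data layer.
        cached = self.cache.get(stream, partition, offset, max_bytes)
        if cached is not None:
            return cached
        segment_ref = self.catalog.locate(stream, partition, offset)
        return self.data_layer.read(segment_ref, offset, max_bytes)
```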

This design means you could have multiple different protocol services running in parallel. For example, you might run a set of “Kafka API servers” that let Kafka clients produce/consume to/from the streams, and alongside them a set of “Pulsar API servers” for applications using Pulsar’s features – both accessing the same underlying streams. Because these servers are stateless, you can scale out each type as needed – if you have 1000 Kafka clients and only 10 Pulsar clients, you deploy more of the Kafka protocol instances. The streaming system thus speaks many languages without duplicating the data.

Benefits: The protocol layer being separate yields immediate flexibility. Adopting new protocols or client standards becomes much easier – you don’t need to overhaul how data is stored, you just stand up a new front-end. It’s similar to how in databases, you might add new query endpoints (like a REST API to an SQL database) without changing the storage engine. Another benefit is resilience and elasticity: since these nodes keep no critical data, you can auto-scale them based on traffic patterns (spin up more during peak ingest times, scale down in off hours), all without migrating any stored data. If one crashes or needs maintenance, you remove it from the load balancer and traffic seamlessly flows to others. No more worrying that a broker failure means data might be temporarily unavailable – as long as some protocol node is up, it can retrieve data from storage and serve it.

Ordering and consistency: One might wonder how we preserve ordering guarantees or consistency if any stateless server can serve data. The answer lies in smart coordination via the metadata layer. The system might still elect a “leader” for a partition, but that leader’s role is just to coordinate writes (to ensure ordering) – it could be one of the protocol nodes assigned dynamically, or the data layer itself could enforce ordering (e.g., an object store might allow appends in order via a lock). There are multiple ways to implement it. The key is that even if a particular node is the leader for a partition’s writes, that leadership can be quickly handed off if needed (since no long-lived data lives on the leader). This is, in fact, how Pulsar already works: brokers handle ordering and act as a “cache & coordinator” while the data lives in external log storage (Apache BookKeeper). So we still ensure that each partition’s events are delivered in order – clients don’t all write directly to storage concurrently; they go through a protocol node that sequences them. But unlike the old model, that node doesn’t own the data forever – once written, the data is on durable storage and any node can read it.
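
One hedged sketch of that coordination is a short-lived lease granted by the metadata layer: only the lease holder assigns offsets for a partition, and leadership moves simply by letting the lease lapse. The method names below are hypothetical.

```python
def sequenced_produce(catalog, data_layer, node_id: str,
                      stream: str, partition: int, batch: list[bytes]) -> int:
    """Write a batch while holding a partition lease from the metadata layer."""
    lease = catalog.acquire_lease(stream, partition, owner=node_id, ttl_seconds=10)
    if lease is None:
        raise RuntimeError("another protocol node currently sequences this partition")
    try:
        # Only the lease holder assigns offsets, so per-partition order is preserved.
        base_offset = catalog.next_offset(stream, partition)
        segment = data_layer.append(stream, partition, batch, base_offset)
        catalog.commit_segment(stream, partition, segment)
        return base_offset
    finally:
        # Releasing (or just letting the lease expire) hands leadership to any
        # other stateless node without moving any stored data.
        catalog.release_lease(lease)
```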

To draw an analogy, consider a content delivery network (CDN) for websites: the CDN edge servers don’t store the master copy of the website content; they cache and serve it, while the origin server (storage) holds the source of truth. In our streaming case, the protocol servers are like those edge servers, and the data layer is the origin. It’s a pattern proven to work for scaling web content to millions of users – now we are applying it to event streams.

From Lakehouse to Lakestream: Comparing the Layers

It’s worth explicitly comparing this three-layer streaming model to the lakehouse (Iceberg) model to cement the understanding:

  • Data Layer: In a lakehouse, this is the object store with Parquet/ORC files containing table data. In streaming, it’s a durable log store (ideally also an object store or distributed FS) holding event data. Both serve as the single source of truth for raw data. The difference is that streams are continually appending, whereas tables see batches of appends/updates – but conceptually, it’s analogous.
  • Metadata Layer: In the lakehouse, this is Iceberg’s table metadata (manifest files, snapshots, and the catalog service like Hive Metastore or Glue) which tracks where data files are and what the schema is. In streaming, the metadata layer tracks active topics/partitions, where the latest offset is, who the leader is (if using leaders), and consumer read positions. Both provide transactional metadata that can be updated atomically (commit a new snapshot or commit an event sequence).
  • Protocol/Compute Layer: In the lakehouse world, this is the query engine or processing engine – Spark, Trino, Flink, etc. They read data via the metadata layer and compute results. In streaming, the protocol servers are like a continuously running “query” that pulls data for consumers or ingests from producers. They are the compute layer that interfaces with clients. One could even view a streaming consumer as analogous to a continuous query over the data layer. The protocol layer ensures that the continuous queries (subscriptions) get the right data in the right order, just as a batch query engine ensures a SQL query reads the right snapshot of a table.

The separation of concerns is remarkably parallel. By adopting this layered approach, streaming systems become “lakestreams” – first-class real-time data streaming stores that maintain the openness and reliability of a lakehouse. We can use the term lakestream to denote a streaming system built with the same architectural ideals as a data lakehouse.

"Lakestream" is preferred over "streamhouse" or "streaming lakehouse" to denote an architecture that stores a single copy of data for both streams and tables, unlike many "streaming lakehouse" concepts, which maintain two separate copies.

Conclusion: Embracing the Modular Future of Streaming

Decomposing streams into Data, Metadata, and Protocol layers is more than an academic exercise – it’s a blueprint for the next generation of streaming infrastructure. This approach addresses the core pain points we outlined in part 1: high costs drop when you utilize object storage and stateless scaling; slow evolution flips to rapid innovation when each layer can change independently; operational burdens lighten when state is centralized and immutable in storage, rather than spread across dozens of servers.

We’re already seeing early implementations of this vision. Apache Pulsar’s design validated the benefits of splitting storage from serving, and it’s evolving further to remove even more coupling (e.g., eliminating ZooKeeper, integrating with tiered storage). Newer platforms like StreamNative’s Ursa are pushing the envelope by combining the Kafka API with a lakehouse storage foundation – essentially an embodiment of the three-layer idea: protocol flexibility, a unified stream/table storage, and a separate metadata store. All these efforts, from open-source projects to cloud services, point in the same direction: streams are becoming cloud-native and open.

For CTOs and engineering leaders, the message is clear. To stay ahead in a world of real-time data and AI, it’s time to rethink your streaming architecture. Just as you wouldn’t build a data lake today without an open table format and a separation of compute/storage, soon we’ll consider it equally antiquated to build streaming systems on a 2010s-style monolithic broker. The three-layer “Iceberg moment” for streams will mean your data infrastructure is more interoperable, future-proof, and cost-efficient. It will enable use cases like instant replays of years of event history, on-the-fly stream processing with SQL engines, and streaming analytics that seamlessly blend historical and real-time data. And crucially, this can be achieved in a vendor-neutral way – through open standards for stream storage and metadata, and widely adopted protocols.

In conclusion, the anatomy of a modern stream is one of independence and unity: independent layers each doing one job well, and a unified vision of data that transcends the old batch vs streaming divide. By embracing this architecture, we stand to unlock the full potential of streaming data, much as the lakehouse did for batch data. The iceberg has shown us only the tip of what’s possible – now it’s up to us to complete the picture for streaming.

Sijie Guo
Sijie’s journey with Apache Pulsar began at Yahoo! where he was part of the team working to develop a global messaging platform for the company. He then went to Twitter, where he led the messaging infrastructure group and co-created DistributedLog and Twitter EventBus. In 2017, he co-founded Streamlio, which was acquired by Splunk, and in 2019 he founded StreamNative. He is one of the original creators of Apache Pulsar and Apache BookKeeper, and remains VP of Apache BookKeeper and PMC Member of Apache Pulsar. Sijie lives in the San Francisco Bay Area of California.
