June 26, 2025

Why Streams Need Their Iceberg Moment

Sijie Guo
Co-Founder and CEO, StreamNative
David Kjerrumgaard
Sales Engineer, Author of "Apache Pulsar in Action"

Apache Iceberg and similar lakehouse table formats have revolutionized the data analytics landscape. By completely decoupling the storage and compute layers, these formats have delivered significant improvements in efficiency and flexibility. The swift embrace of open table formats that are entirely separate from the analytical engine underscores the importance of vendor neutrality.

The data streaming landscape has long been dominated by Apache Kafka, a platform that, while revolutionary in its time, is now showing its age. Its tightly-coupled design leads to escalating costs, operational complexity, and sluggish innovation. We believe it's time for a breakthrough in data streaming platforms, similar to advancements seen in the data analytics space – put simply, streaming needs its own Iceberg moment. In this post, we will present an architecture that separates concerns into a three-layer model – and how this vision can slash costs and accelerate evolution while staying vendor-neutral.

The Pain of Tightly-Coupled Streaming

Traditional streaming platforms like Kafka rely on brokers as their all-in-one workhorses. These monolithic server processes are responsible for managing data storage, metadata, and client protocols. While effective initially, this tight coupling now presents several challenges as these platforms scale:

  • High Infrastructure Costs: Since these brokers store data on local disks, scaling the cluster to handle increased throughput necessitates adding more brokers. While replicating data across brokers is vital for durability in the event of a broker failure, it comes with considerable disk and network overhead, including expensive cross-zone replication fees, particularly in cloud environments. Studies have shown that decoupling storage can trim streaming costs by up to 90% by leveraging cheaper object stores. In the current model, however, you’re paying for triple-replicated storage and idle capacity on every broker node (see the back-of-the-envelope sketch after this list).
  • Operational Complexity: Because tightly-coupled brokers are stateful, scaling or upgrading them is challenging. Adding a new broker initiates data rebalancing, a slow and risky process of reshuffling existing partitions and their data across the brokers. Broker failures also trigger a heavyweight partition recovery process. These issues consume countless engineering hours on cluster maintenance (capacity planning, manually reassigning partitions) instead of feature development.
  • Slow Feature Evolution: Kafka's monolithic architecture has historically impeded innovation. Implementing improvements, such as new replication mechanisms or consistency guarantees, requires extensive modifications to the core broker software, impacting the entire system. Efforts within the Kafka community, such as the multi-year initiative to eliminate ZooKeeper for metadata (KIP-500), highlight the challenges posed by tight broker integrations. Similarly, integrating tiered storage (transferring cold data to cloud storage) into Kafka was a significant undertaking, yet it remains an incomplete solution. The tightly coupled architecture means storage, metadata, and protocol are intertwined, making any evolution—like adopting a new storage engine or supporting a new client API—a slow and arduous process.
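
To put rough numbers on the cost claim above, here is a back-of-the-envelope sketch in Python. The workload size, retention window, and unit prices are all illustrative assumptions (loosely in line with published US-region cloud list prices at the time of writing; your rates will differ). It compares triple-replicated, broker-attached block storage plus cross-zone replication traffic against a single logical copy in object storage, and deliberately ignores compute, request, and API costs.

```python
# Back-of-the-envelope cost sketch; all figures are illustrative assumptions.

# Hypothetical workload: 100 MB/s sustained ingest, 7-day retention.
ingest_gb_per_month = 0.1 * 60 * 60 * 24 * 30   # ~259,200 GB written per month
retained_gb = 0.1 * 60 * 60 * 24 * 7            # ~60,480 GB resident at any time

# Rough unit prices in USD (order of magnitude only; check your provider's pricing).
block_storage_per_gb_month = 0.08    # broker-attached SSD volumes
object_storage_per_gb_month = 0.023  # standard object storage
cross_az_per_gb = 0.02               # inter-zone transfer, out + in

# Tightly-coupled brokers: 3 replicas on local volumes, 2 cross-zone copies per write.
coupled = (retained_gb * 3 * block_storage_per_gb_month
           + ingest_gb_per_month * 2 * cross_az_per_gb)

# Decoupled data layer: one logical copy in object storage; redundancy is handled
# internally by the object store, so there is no broker-side replication traffic.
decoupled = retained_gb * object_storage_per_gb_month

print(f"coupled:   ~${coupled:,.0f}/month")
print(f"decoupled: ~${decoupled:,.0f}/month")
print(f"storage + replication saving: ~{100 * (1 - decoupled / coupled):.0f}%")
```

Under these assumptions, the storage-and-replication line item shrinks by roughly an order of magnitude, which is where headline figures like "up to 90%" come from; real savings depend heavily on retention, fan-out, and request patterns.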

High costs, scaling bottlenecks, and stagnant feature velocity are the symptoms of this architectural debt. In short, today’s streaming systems carry the baggage of an earlier era – an era when coupling everything in one broker made sense for simplicity. But in the cloud-native, real-time AI world of 2025, that all-in-one model is creaking under the strain.

Lessons from the Lakehouse Revolution

To find a solution, we can draw parallels with the recent transformation of data lakes. Just a few years ago, data lakes faced a similar predicament: data stored in affordable storage (such as HDFS or cloud blobs) proved challenging to manage and query effectively. The proliferation of engines and pipelines resulted in inconsistency, sluggish queries, and pipeline failures, while duplicate data and redundant work inflated costs. The underlying issue? A flawed architecture – doesn't that sound familiar?

The "Iceberg moment" marked a shift in data management. Apache Iceberg introduced the concept of an open table format that separated data storage from processing engines. It also incorporated a metadata layer for overseeing table states. This innovation, along with similar initiatives like Delta Lake and Apache Hudi, transformed a chaotic data lake into an organized lakehouse. Key aspects of this transformation include:

  • Separation of Concerns: Scalable object storage houses data files written in well-defined formats such as Parquet and ORC, while a separate catalog manages table metadata, including schemas, partitions, and snapshots. Query engines like Spark, Trino, and Flink interact through a standardized table API rather than relying on assumptions about how data is laid out on disk.
  • ACID and Governance: The metadata layer transforms a chaotic blob store into an organized system with transactional integrity (ACID commits) and schema evolution. This enables seamless coordination among multiple writers and readers, ensuring data consistency and reliability.
  • Multi-Engine Interoperability: Iceberg's open and standardized storage format and metadata enable diverse tools to share data seamlessly. This means a single Iceberg table can simultaneously handle streaming ingestion and batch SQL queries (see the sketch after this list). Such unified access to both batch and streaming data facilitates real-time analytics, a capability previously difficult to achieve without intricate ETL pipelines.
  • Rapid Innovation: Independent evolution is now possible for each layer. A new query engine can be implemented by simply integrating the Iceberg API, eliminating the need to rewrite data storage methods. Improved compression or encodings in storage can be immediately leveraged by engines, provided the format aligns with the metadata specification. This modularity has spurred significant innovation within the data ecosystem, all built upon the foundation provided by Apache Iceberg.
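
To make the separation of concerns concrete, here is a minimal sketch using the PyIceberg library (assuming a recent version with write support). It assumes a catalog named "default" is already configured (for example via ~/.pyiceberg.yaml) and that a table analytics.events with id and event_type columns already exists; the identifiers are illustrative, not prescriptive. A writer appends a batch through the table API, and any other engine reads the committed snapshot through the same catalog, with neither side making assumptions about file layout.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# The metadata layer: a catalog configured externally (e.g. in ~/.pyiceberg.yaml).
catalog = load_catalog("default")

# Hypothetical table for this sketch; its schema lives in the catalog, not in the code.
table = catalog.load_table("analytics.events")

# Writer: append a small batch. Data files land in object storage and the catalog
# records a new snapshot as an atomic commit.
batch = pa.table({
    "id": pa.array([1, 2, 3], type=pa.int64()),
    "event_type": pa.array(["click", "view", "click"]),
})
table.append(batch)

# Reader: any engine that speaks the table API sees the same committed snapshot.
clicks = table.scan(row_filter=EqualTo("event_type", "click")).to_arrow()
print(clicks.num_rows)
```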

Adopting a lakehouse approach has delivered dramatic results for companies, leading to significant improvements such as faster queries, reduced costs, and simplified architectures. Crucially, these benefits were achieved through the use of vendor-neutral, community-driven technology. Apache Iceberg exemplifies this by being an open standard, not confined to a single vendor's ecosystem, and widely adopted across the industry. Its broad acceptance has solidified its position as a de facto modern standard for analytic data.

Similar to the evolution of data lakes before Iceberg, current streaming platforms face comparable challenges. Fortunately, the core principles of decoupling, standardization, and opening up the architecture—which proved effective for data lakes—are equally applicable to streaming data. 

Early indicators of this transformation are already evident: Apache Kafka's roadmap now features proposals for "diskless" topics that write directly to object storage, aiming to reduce costs. The community has also modernized metadata management by replacing ZooKeeper with an internal metadata quorum (KRaft).

Apache Pulsar adopted a two-tier architecture, separating compute brokers from BookKeeper storage nodes. This design breaks away from traditional monolithic systems and allows the serving and storage of data to scale independently. While these steps are positive, we can achieve more. We need to fundamentally re-envision streaming systems as three separate layers: data, metadata, and protocol. This approach mirrors how the lakehouse model disaggregated analytics and represents the "Iceberg moment" for streams: a streamlined, layer-centric architecture that frees us from prior compromises.

A Three-Layer Vision for Streaming

Imagine a streaming data platform built from the ground up on three independent layers:

  1. Data Layer – A scalable, durable storage substrate for the raw streaming data (the actual event log).
  2. Metadata Layer – An authoritative repository for stream-related metadata, encompassing details such as existing streams (topics), their schemas, offsets, and retention policies.
  3. Protocol Layer – Stateless services that speak the various streaming protocols (such as Kafka, Pulsar, and MQTT), manage client connections, and orchestrate reads and writes, but hold no long-term data themselves (see the sketch after this list).
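
As a thought experiment (this is not any existing product's API, and every name below is hypothetical), the three layers can be written down as narrow interfaces. The point is what each layer does and does not know: the data layer sees only opaque bytes and offsets, the metadata layer sees only stream definitions and cursors, and the protocol layer composes the two without holding any durable state of its own.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class StreamInfo:
    name: str
    schema_id: str          # reference to a schema in a registry
    retention_seconds: int


class DataLayer(Protocol):
    """Durable log storage (e.g. backed by an object store); knows nothing about client protocols."""
    def append(self, stream: str, payload: bytes) -> int: ...   # returns the assigned offset
    def read(self, stream: str, offset: int, max_bytes: int) -> list[bytes]: ...


class MetadataLayer(Protocol):
    """Authoritative catalog of streams, schemas, cursors, and retention policies."""
    def get_stream(self, name: str) -> StreamInfo: ...
    def commit_cursor(self, group: str, stream: str, offset: int) -> None: ...


class ProtocolServer:
    """Stateless front end: translates one client protocol (Kafka, Pulsar, MQTT, ...)
    into calls against the other two layers. Any instance can serve any client."""

    def __init__(self, data: DataLayer, metadata: MetadataLayer) -> None:
        self.data = data
        self.metadata = metadata

    def produce(self, stream: str, payload: bytes) -> int:
        self.metadata.get_stream(stream)           # validate that the stream exists
        return self.data.append(stream, payload)   # durability comes from the data layer

    def fetch(self, group: str, stream: str, offset: int) -> list[bytes]:
        records = self.data.read(stream, offset, max_bytes=1 << 20)
        self.metadata.commit_cursor(group, stream, offset + len(records))
        return records
```

Because ProtocolServer holds no state beyond its two handles, instances can be added or removed freely, and a Kafka-speaking and an MQTT-speaking front end can sit side by side on the same data and metadata layers.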

In this model, the traditional "broker" — a single, monolithic server — is replaced. Brokers now function as stateless routers or protocol translators. The substantial state (data and metadata) is offloaded to specialized layers, allowing them to scale and evolve independently. Let’s briefly examine the benefits, which closely mirror those seen in the Iceberg/data lakehouse world:

  • Cost Efficiency & Scalability: The data layer can reside on cheap, infinite storage like cloud object stores, rather than on tightly-managed broker disks. This means you only pay for storage once and grow it as needed, instead of over-provisioning every broker. Brokers no longer need large disks, reducing their footprint to mostly CPU and memory for processing. With brokers being stateless, you can scale out or in the computing layer on demand (spin up new protocol servers during traffic spikes, shut them down when not needed) without moving any data – no lengthy rebalances or replication storms. The compute and storage scales independently, just as in a decoupled lakehouse architecture.
  • Faster Evolution of Each Layer: Each component can progress on its own timeline. For example, the protocol layer could support new client features or even entirely new protocols (say an MQTT interface or a new streaming SQL interface) without changing how data is stored. The data layer could adopt better formats or storage engines (imagine switching from segment files to columnar storage, or integrating directly with Apache Iceberg/Delta Lake tables) without affecting client applications – they still talk the same protocol. The metadata layer could introduce stronger consistency, new subscription types, or integration with governance tools, all without touching the other layers. This modularity accelerates innovation since changes are localized and replaceable behind stable interfaces.
  • Multi-Tool and Multi-Use-Case Support: A three-layer streaming system is inherently more open. For instance, if the data is stored in an open format (like Parquet files with Iceberg metadata, or any self-describing log format), then external tools can read streaming data directly for analytics or AI training. Your streaming archive effectively doubles as a live data lake – no more one-way ETL from Kafka into data lakes just to run batch queries (see the example after this list). At the same time, the protocol layer could allow multiple protocols to access the same data. It’s conceivable to have one unified store of events but serve it through Kafka APIs, Pulsar APIs, and other interfaces simultaneously, depending on application needs. This breaks the silos between different streaming technologies and avoids vendor lock-in.
  • Reliability and Simplified Operations: Decoupling improves fault tolerance. The durable data layer (especially if using cloud storage) can offer very high availability and durability – e.g., object stores like S3 automatically replicate data across zones with 11-nines durability. The metadata layer, if built with a proper consensus or using a robust external catalog, ensures the stream definitions and cursors are always preserved. Meanwhile, stateless protocol servers mean failures are far less dramatic: if one goes down, clients can reconnect to another with zero data loss (since no unique data was on the failed node). Upgrades and maintenance become easier – you could even roll out new protocol server versions one at a time (since they don’t hold unique state) or swap out the storage backend without app downtime. Overall, operations begin to look more like managing a stateless microservice plus a database, rather than herding a fragile cluster of pet brokers.
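
Here is a hedged sketch of the "streaming archive doubles as a live data lake" idea, assuming a deployment where the Kafka-compatible protocol layer is reachable at broker.example.com:9092 and the data layer lays the orders stream down as Parquet under s3://streams/orders/ (the endpoint, topic, fields, and layout are all hypothetical). An ordinary Kafka client produces events, and an analytics tool queries the same data straight from object storage, with no ETL hop in between.

```python
import json

import duckdb
from kafka import KafkaProducer  # kafka-python client

# Produce through the protocol layer with a stock Kafka client.
producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",               # hypothetical endpoint
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"order_id": 42, "amount": 99.5})
producer.flush()

# Query the same stream's archive directly from object storage with DuckDB.
con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")   # S3 credentials assumed to be configured externally
con.sql("""
    SELECT count(*) AS orders, sum(amount) AS revenue
    FROM read_parquet('s3://streams/orders/**/*.parquet')      -- hypothetical layout
""").show()
```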

The three-layer vision for streaming offers cloud-native efficiency, flexibility, and openness. This approach applies the principles of the lakehouse to event streams, transforming the streaming pipeline from a closed broker into an extensible data infrastructure. Much like Iceberg elevated "dumb storage" to a smart data platform, this blueprint is vendor-neutral, allowing any project or vendor to implement these layers in their unique way, provided they adhere to open interfaces. We’re already seeing movement in this direction: for example, Apache Pulsar separates serving and storage layers (a step toward stateless brokers), and emerging projects like StreamNative’s Ursa Engine build on this idea by writing streaming data directly to Iceberg/Delta lakehouse tables in object storage (making streaming “lakehouse-native” storage) while providing a Kafka-compatible protocol on top. The industry as a whole is converging on the notion that streams deserve the same architectural reboot that batch data got.

This “Iceberg moment” for streams isn’t just about any single technology – it’s about a change in philosophy. By breaking apart the old broker, we can solve the pain points that have plagued streaming for a decade. The result will be streaming platforms that evolve faster, cost a fraction of today’s setups, and integrate seamlessly with the rest of the data ecosystem.

Up next: In part 2 of this series, we’ll take a deeper dive into each of these three layers – Data, Metadata, and Protocol – to understand their roles and how they compare to the analogous pieces in a lakehouse architecture. Stay tuned for a technical anatomy of a modern stream.

Sijie Guo
Sijie’s journey with Apache Pulsar began at Yahoo! where he was part of the team working to develop a global messaging platform for the company. He then went to Twitter, where he led the messaging infrastructure group and co-created DistributedLog and Twitter EventBus. In 2017, he co-founded Streamlio, which was acquired by Splunk, and in 2019 he founded StreamNative. He is one of the original creators of Apache Pulsar and Apache BookKeeper, and remains VP of Apache BookKeeper and PMC Member of Apache Pulsar. Sijie lives in the San Francisco Bay Area of California.
David Kjerrumgaard
David is a Principal Sales Engineer and former Developer Advocate for StreamNative. He has over 15 years of experience working with open source projects in the Big Data, Stream Processing, and Distributed Computing spaces. David is the author of Apache Pulsar in Action.
