Catalogs for Streams: Lessons from Iceberg’s REST Spec

When you adopt the idea of streams as tables, a new question arises: How do we track and discover all these streaming tables? In traditional streaming platforms, the “catalog” of topics (streams) is often just the broker or cluster itself – for example, Kafka brokers know what topics exist, and clients ask the broker for metadata. There is no global, standardized catalog for streams akin to a Hive Metastore or Glue Catalog in the batch world. However, as streaming data starts living in open table formats, the need for a stream catalog becomes clear. We want a central place to register, enumerate, and manage stream metadata (namespaces, schemas, retention policies, etc.), ideally in a vendor-neutral, interoperable way. This is where lessons from Apache Iceberg’s REST Catalog specification apply.
Iceberg’s REST Catalog spec was introduced to solve a metadata interoperability problem for tables. Previously, each deployment might use a different catalog backend (Hive Metastore, AWS Glue, etc.), making it hard to integrate across systems. The Iceberg REST spec defines a uniform HTTP API for table operations – creating tables, listing tables and namespaces, retrieving table metadata, and committing changes (snapshots) – regardless of the underlying implementation. This standardization brought several benefits that are just as relevant for streaming catalogs:
- Interoperability: A RESTful catalog API means any client, in any language, can manage and query the metadata of data objects using simple HTTP calls (see the sketch after this list). For streams, this means different streaming engines and services could all register their streams in one central catalog service.
- Decoupled Metadata Store: The spec abstracts what the metadata is (tables and schemas) from where it is stored. In Iceberg’s case, a REST catalog can be backed by a relational DB, a NoSQL store, or even a Git repo – clients don’t need to know. Similarly, a stream catalog could be backed by a highly available service (perhaps built on a consensus DB or a cloud service), while clients just see a uniform REST interface.
- Multi-Tenancy and Cloud-Native Design: REST catalogs are designed to be cloud-friendly (HTTP-based, stateless) and support auth tokens for multi-tenant security. A streams catalog should offer the same, since organizations will have many teams registering streams and need access control and auditing at a central point.
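To make this concrete, here is a minimal sketch of catalog calls over plain HTTP. The endpoint paths follow Iceberg’s REST Catalog spec; the catalog URL, token, and the "analytics"/"orders" names are placeholders for a real deployment:

```python
import requests

# Placeholders: point these at a real Iceberg REST catalog deployment.
CATALOG_URL = "https://catalog.example.com/v1"
HEADERS = {"Authorization": "Bearer <token>"}

# Enumerate namespaces, then the tables within one of them.
namespaces = requests.get(f"{CATALOG_URL}/namespaces", headers=HEADERS).json()
tables = requests.get(
    f"{CATALOG_URL}/namespaces/analytics/tables", headers=HEADERS
).json()

# Load one table: the response carries the full metadata document
# (schemas, snapshots, properties) plus its storage location.
table = requests.get(
    f"{CATALOG_URL}/namespaces/analytics/tables/orders", headers=HEADERS
).json()
print(table["metadata-location"])
```

Any HTTP-capable tool can issue these same calls, which is exactly the interoperability argument: no vendor SDK is required to discover what data exists.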
How would a catalog for streams differ from one for tables? The core entities are similar – we have namespaces (or tenants), stream names, and schemas – but streams also have traits like partition counts, replication factors, and retention policies. Operations on a stream (like “create stream” and “delete stream”) are analogous to table operations, and Iceberg’s spec already covers creating and dropping tables and even transactions for commits. One can imagine extending a similar RESTful approach: e.g., "POST /v1/streams" to create a new stream in the catalog (with parameters like the number of partitions), or "GET /v1/streams/{name}" to fetch metadata about a stream (its schema, location, status). The key lesson from Iceberg is to use open, standard APIs for these operations rather than proprietary RPCs tied to one vendor’s platform.
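A hypothetical sketch of what such an Iceberg-style stream API could look like follows. None of these paths or property names come from a published spec; all are invented for illustration:

```python
import requests

# Hypothetical stream-catalog service; every path and field is illustrative.
BASE = "https://stream-catalog.example.com/v1"
HEADERS = {"Authorization": "Bearer <token>"}

# Register a new stream, carrying stream-specific traits beside the schema.
requests.post(f"{BASE}/streams", headers=HEADERS, json={
    "namespace": ["projectX"],
    "name": "orders",
    "schema": {"type": "struct", "fields": [
        {"id": 1, "name": "order_id", "required": True, "type": "long"},
        {"id": 2, "name": "amount", "required": False, "type": "double"},
    ]},
    "partitions": 16,
    "replication-factor": 3,
    "retention-ms": 7 * 24 * 60 * 60 * 1000,  # one week of retention
}).raise_for_status()

# Fetch the stream's metadata: schema, storage location, live/paused status.
meta = requests.get(f"{BASE}/streams/orders", headers=HEADERS).json()
```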
In fact, we’re starting to see this pattern. StreamNative’s Ursa engine, when writing Pulsar streams into Iceberg tables, uses Iceberg’s REST Catalog under the hood. When a new topic is created in Ursa, it calls the Iceberg REST API to create a corresponding table for that stream. The catalog (which could be AWS Glue, Snowflake’s Iceberg catalog, or another implementation) then knows about the table for that stream, and any external tool or analytics service can discover the stream’s data via standard catalog queries. For example, AWS analytics services such as Athena or SageMaker can list and query those Iceberg tables once they are registered, with no special integration with Ursa. The stream metadata (table schemas, partition info) lives in the same catalog as batch tables, breaking down the wall between real-time and batch datasets.
Figure: StreamNative Ursa integrates with an Iceberg REST Catalog to map streaming topics into table metadata on cloud object storage (Amazon S3 in this case). Each Pulsar/Kafka topic (left) gets an Iceberg table in a catalog (center), stored under a namespace corresponding to the topic’s tenant and namespace. This allows external query engines and services (right) to discover and query stream data using the standard table interface, treating streams as just another set of tables.
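To illustrate the flow (a sketch of the pattern, not Ursa’s actual code), a streaming engine could register a new topic as a table with a single create-table call to the Iceberg REST API. The request shape follows Iceberg’s spec; the made-up schema here is a raw key/value/timestamp layout, and multi-level namespace encoding is elided for brevity:

```python
import requests

CATALOG_URL = "https://catalog.example.com/v1"  # placeholder catalog endpoint

def register_topic_as_table(namespace: str, topic: str) -> None:
    """Create an Iceberg table for a newly created topic (illustrative)."""
    resp = requests.post(
        f"{CATALOG_URL}/namespaces/{namespace}/tables",
        json={
            "name": topic,
            # Made-up key/value/timestamp schema for raw events.
            "schema": {
                "type": "struct",
                "schema-id": 0,
                "fields": [
                    {"id": 1, "name": "key", "required": False, "type": "binary"},
                    {"id": 2, "name": "value", "required": False, "type": "binary"},
                    {"id": 3, "name": "event_time", "required": True, "type": "timestamptz"},
                ],
            },
        },
    )
    resp.raise_for_status()

register_topic_as_table("projectX", "orders")
```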
From these lessons, a vision emerges for catalogs for streams:
- Streams should be first-class entries in a unified metadata store. Whether it’s an Iceberg REST catalog or another open standard, we need a place where all data streams are registered just like tables. This makes streams discoverable by data analysts and engineers who might not be familiar with the messaging system details.
- The catalog would store the stream’s schema (much like a table schema), along with stream-specific properties (number of partitions, retention period, etc.). It could also track current status (for instance, whether the stream is live or paused) and the mapping to storage (e.g., the cloud bucket or path where the stream’s table data lives); one possible shape for such an entry is sketched after this list.
- By using a REST API or similar open interface, any tool or platform can integrate to create or query streams. Imagine a CI/CD pipeline calling "DELETE /v1/streams/orders" to clean up a stream, or a data catalog UI listing all streams under a project by calling "GET /v1/namespaces/projectX/streams". This decoupling means your streaming metadata isn’t locked inside a single vendor’s broker – it’s accessible and portable.
- Importantly, a stream catalog can help manage consistency between multiple protocols. If the same underlying stream is accessible via, say, a Pulsar API and a Kafka API (multi-protocol access), a shared catalog entry can represent that one logical stream. Clients of either protocol could then consult the same catalog to understand the stream’s schema and history.
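Pulling these points together, here is one hypothetical shape for a single stream’s catalog entry. Every field name is illustrative, not part of any standard:

```python
from dataclasses import dataclass, field

# One possible shape for a stream's catalog entry, covering the points
# above. All field names here are illustrative, not from a published spec.
@dataclass
class StreamEntry:
    namespace: list[str]      # e.g. ["projectX"]
    name: str                 # e.g. "orders"
    schema: dict              # table-style schema shared by all protocols
    partitions: int
    retention_ms: int
    status: str               # e.g. "live" or "paused"
    storage_location: str     # e.g. "s3://bucket/projectX/orders/"
    protocols: list[str] = field(default_factory=list)  # e.g. ["kafka", "pulsar"]
```

Note how one entry serves every access path: a Kafka client, a Pulsar client, and a SQL engine would all resolve the same schema and storage location from it.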
By looking at Iceberg’s REST spec, we also learn the value of transactions in metadata for streaming. In Iceberg, when data is appended to a table, the commit is a transactional API call to update the table state (with optimistic concurrency control). Ursa leverages this by committing each batch of events as an Iceberg transaction, ensuring no partial or corrupt metadata states. A future streams catalog spec might similarly allow committing offsets or watermarks as part of metadata. For instance, a commit could encapsulate “I’ve added these new files (or log segments) to the stream’s storage, corresponding to events up to timestamp X.” Having a standardized way to commit and track stream progress in the catalog could enable cross-system consistency (imagine a Flink job advancing a streaming query and recording its point of consistency in the catalog).
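A hedged sketch of what such a commit could look like: the requirements/updates structure is borrowed from the shape of Iceberg’s table commit requests, while the stream-progress actions ("add-files", "set-watermark") and the "/commits" endpoint are invented for illustration:

```python
import requests

BASE = "https://stream-catalog.example.com/v1"  # hypothetical service

def commit_progress(stream: str, new_files: list[str],
                    watermark_ms: int, expected_version: int) -> bool:
    """Commit new data files and a watermark, Iceberg-style: the commit
    only succeeds if our view of the metadata is still current."""
    resp = requests.post(f"{BASE}/streams/{stream}/commits", json={
        # Optimistic concurrency: reject the commit if another writer
        # advanced the stream's metadata version since we last read it.
        "requirements": [{"type": "assert-version", "version": expected_version}],
        "updates": [
            {"action": "add-files", "files": new_files},
            {"action": "set-watermark", "timestamp-ms": watermark_ms},
        ],
    })
    # A 409 Conflict would mean we lost the race: reload metadata and retry.
    return resp.status_code == 200
```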
In summary, the world of streaming is borrowing the playbook of data lakehouse metadata. Apache Iceberg’s REST catalog spec teaches us that open, RESTful metadata services can foster interoperability across diverse tools. Applying this to streams means treating streams similarly to tables in our organizational data catalog. It’s a shift from the siloed view (where only the message broker knows about the stream) to a global view where streams are discoverable data assets. The payoff is huge: easier integration of real-time data in analytics, unified governance (one can apply data policies uniformly), and the ability to mix streaming and batch sources seamlessly in data pipelines. As streaming data continues to grow, adopting standard cataloging practices will ensure that real-time datasets don’t become second-class citizens in the data ecosystem. Instead, they will be as easily searched, understood, and integrated as any table – thanks to lessons learned from Iceberg and the lakehouse community.