Inside Stream Format: A Table for Infinite Logs

Streaming data has historically been treated differently from batch data. Streams are often seen as infinite logs – unbounded sequences of events – whereas batch processing uses static tables. Modern stream processing frameworks like Apache Spark or Apache Flink have blurred this line by treating streaming data as an “infinite table” that is processed incrementally. This insight is powerful: if we can model an ever-growing log as a table, we unlock the rich ecosystem of tools and guarantees from the data lake world (SQL queries, schema evolution, ACID transactions, etc.) for real-time data. In essence, a stream can be viewed as a continuously appending table, where each new event becomes another row in a table that never stops growing.
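To make the “infinite table” framing concrete, here is a minimal PySpark Structured Streaming sketch (using the built-in `rate` test source purely for illustration): the streaming DataFrame below is conceptually an unbounded table, and the same relational operations used in batch jobs apply to it as new rows arrive.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-as-infinite-table").getOrCreate()

# A streaming DataFrame is conceptually an unbounded table:
# every event emitted by the source becomes a new row.
events = (
    spark.readStream
    .format("rate")                 # built-in test source emitting (timestamp, value) rows
    .option("rowsPerSecond", 10)
    .load()
)

# Ordinary relational operations apply to the ever-growing table.
per_window = events.groupBy(F.window("timestamp", "10 seconds")).count()

query = (
    per_window.writeStream
    .outputMode("update")
    .format("console")
    .start()
)
query.awaitTermination(30)  # let it run briefly for the demo, then stop
query.stop()
```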
Ursa, StreamNative’s data streaming engine, embodies this concept by storing streams in open table formats. In Ursa’s architecture, each topic (stream) is materialized as a table on object storage. When data flows into Ursa, it is immediately written in a table-friendly way, using a combination of a row-oriented Write-Ahead Log (WAL) for fast appends and columnar Parquet files for efficient long-term storage. In practice, this means incoming events are first captured in small WAL files (ensuring low-latency writes and durability) and then compacted into larger Parquet files for analytics-friendly storage. Ursa’s stream format effectively turns a live log into a table with partitions and snapshots, making it queryable by engines like Spark or Trino without any ETL step.
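As a rough illustration of what “no ETL step” means (the bucket path and column names below are hypothetical, not Ursa’s actual layout), the compacted Parquet files behind a topic can be queried directly with Spark SQL, and any other Parquet-aware engine can read the same files:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-ursa-topic").getOrCreate()

# Hypothetical object-storage prefix where a topic's compacted Parquet files live.
topic_path = "s3a://my-bucket/ursa/topics/orders/"

orders = spark.read.parquet(topic_path)   # no export or ETL job in between
orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT order_status, COUNT(*) AS n
    FROM orders
    GROUP BY order_status
    ORDER BY n DESC
""").show()
```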
What makes this approach powerful is the use of open table standards (such as Apache Iceberg, Delta Lake, or Hudi) as the underlying format. Ursa doesn’t invent a proprietary storage format; it leverages proven table formats that support schema evolution, indexing, and transactional updates. For example, with Delta Lake as the format, the infinite log of events gains features like ACID transactions and time-travel queries (since Delta maintains a transaction log of all changes). In Ursa’s case, the streaming engine writes data in a lakehouse table format, meaning every event appended to the stream is also an insert into an Iceberg/Delta table. This bridge between streaming and table paradigms yields huge benefits:
- Immediate Queryability: As soon as data lands in the stream, it’s part of a table that can be queried with SQL or read by any tool that understands Parquet. There’s no need to wait for a batch ETL job to dump the stream into a database – the stream is the database.
- Unified Storage: Instead of keeping “hot” data in a messaging system and “cold” data in a separate warehouse, Ursa’s format uses a single storage layer (cloud object storage) for both real-time and historical data. A Pulsar or Kafka topic managed by Ursa offloads older segments to cheap storage in an open format, effectively retaining infinite history at low cost.
- Schema and Governance: By treating streams as tables, you can enforce schemas on event data and manage them with the same governance tools used for batch data. Schema registries and table catalogs ensure that as your infinite log evolves, consumers always know the data schema and can handle changes safely.
- Interoperability: Perhaps most importantly, an open table format for logs means you are not locked into one vendor’s tools. Multiple frameworks (Flink, Spark, Pandas, Presto, etc.) can all read from the same streaming table. This is analogous to how many query engines share access to the same Parquet or Delta Lake files on a data lake. In streaming, Ursa’s format makes the “log as a table” accessible to any engine or language, fostering a rich ecosystem instead of a siloed stream (a short example follows this list).
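As a hedged sketch of the queryability, time-travel, and interoperability points above (the table path and version number are hypothetical, and this assumes the topic is materialized as a Delta table), plain Python using the `deltalake` package can read the same table that Spark or Trino would query, including an older committed snapshot:

```python
# pip install deltalake pandas
from deltalake import DeltaTable

# Hypothetical table location; object-storage credentials would normally be
# supplied via storage_options or environment variables.
table_path = "s3://my-bucket/ursa/topics/orders"

# Latest snapshot of the stream-backed table, straight into pandas.
latest = DeltaTable(table_path).to_pandas()

# Time travel: read the table as of an earlier committed version.
earlier = DeltaTable(table_path, version=42).to_pandas()

print(len(latest), len(earlier))
```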
Ursa’s implementation serves as a reference architecture of this general idea. It demonstrates that you can achieve real-time streaming performance while simultaneously structuring the data as a table in cloud storage. Other systems are moving in a similar direction. For instance, Redpanda has introduced an option to automatically write Kafka topic data into Iceberg table format on S3, and cloud providers are enabling streaming inserts into table formats. The big picture is a shift toward treating streaming data as first-class table data, eliminating the divide between “streams” and “tables.” By using a table for infinite logs, organizations get the best of both worlds: the continuous, low-latency updates of messaging systems and the strong consistency, queryability, and openness of data lake tables.
Ursa Stream Format in Action
To cement the concept, let’s walk through how Ursa’s stream-table format works when data arrives:
- Incoming events → WAL: When a producer publishes messages to a topic, Ursa immediately writes these events to a write-ahead log file on object storage. This WAL is a lightweight append-only file that accumulates recent events quickly (much like Kafka’s segment logs, but stored in the cloud). It ensures durability and low latency. Once the WAL reaches a certain size or time threshold, Ursa will rotate it.
- WAL → Columnar Files: Ursa’s engine continuously takes those WAL segments and compacts them into columnar Parquet files. During this compaction, it can also partition the data (e.g., by event time or key) and sort it, which optimizes later queries. The Parquet files constitute the permanent storage of the stream’s data, organized in the directory structure of an Iceberg/Delta table (with partition folders, metadata files, etc.). Each compaction may also create a new snapshot in the table’s metadata, much like a batch job commit (a sketch of this compaction-and-commit step appears after this list).
- Metadata Management: Alongside data files, Ursa updates the table format’s metadata (for example, Iceberg manifest lists or Delta transaction log) to record the new Parquet files and delete the WAL segments that have been compacted. This metadata update is atomic and transactional, thanks to the table format. It’s as if every so often the “infinite table” of the stream gets a new committed batch of rows. These frequent small transactions keep the table up-to-date with the stream.
- Retention & Evolution: Because the data is in a table format, enforcing retention policies (e.g. drop or archive data older than 1 year) becomes a matter of table maintenance (expiring or deleting old partitions) rather than broker-specific cleanup. Likewise, if the schema of the stream changes (new fields, etc.), the table schema can evolve using the format’s schema evolution features. Ursa’s approach handles this seamlessly, syncing Pulsar topic schemas with the table schema so that both stream consumers and batch readers see a consistent view.
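To ground the compaction and commit steps, here is an illustrative sketch only, not Ursa’s actual code: the local path, field names, and the choice of Delta Lake via the `deltalake` package are assumptions. A batch of row-oriented WAL records is converted to a columnar Arrow table, given a partition column derived from the event time, and appended to the table; the commit atomically records the new Parquet files in the transaction log, creating a new snapshot.

```python
# pip install pyarrow deltalake
import pyarrow as pa
import pyarrow.compute as pc
from deltalake import write_deltalake

TABLE_PATH = "/tmp/ursa/topics/orders"  # stands in for an object-storage location

# Pretend these rows were read back from a rotated, row-oriented WAL segment.
wal_records = [
    {"key": "order-1", "event_time": "2024-05-01T10:00:00Z", "payload": "..."},
    {"key": "order-2", "event_time": "2024-05-01T10:00:01Z", "payload": "..."},
]

# Compaction: convert the row batch to a columnar Arrow table
# and derive a partition column (event date) from the event time string.
batch = pa.Table.from_pylist(wal_records)
event_date = pc.utf8_slice_codeunits(batch["event_time"], start=0, stop=10)
batch = batch.append_column("event_date", event_date)

# Commit: appending to the Delta table writes Parquet files and records them
# atomically in the transaction log, producing a new snapshot of the table.
write_deltalake(TABLE_PATH, batch, mode="append", partition_by=["event_date"])
```

Retention can then be expressed as ordinary table maintenance rather than broker-specific cleanup. Again a sketch under the same assumptions, using the table’s delete and vacuum operations to drop old partitions and physically remove data files that are no longer referenced:

```python
from deltalake import DeltaTable

dt = DeltaTable("/tmp/ursa/topics/orders")  # same hypothetical table as above

# Drop rows in partitions older than the retention cutoff.
dt.delete("event_date < '2023-05-01'")

# Physically remove data files no longer referenced by the transaction log.
dt.vacuum(retention_hours=168, dry_run=False)
```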
In summary, inside Ursa’s stream format, a log truly behaves like an ever-growing table. This design meets the needs of high-throughput streaming (via append-optimized logs) and the needs of analytics (columnar storage, indexing, schema management) at once. The concept is vendor-neutral: any streaming system could, in theory, adopt a similar architecture of writing to an open table format. The advantage of Ursa is providing this out-of-the-box, turning Apache Pulsar (or Kafka, via compatibility mode) into a “lakehouse-native” streaming system. The takeaway lesson is that as data platforms evolve, the line between streaming and batch storage is disappearing. By viewing streams as infinite tables, we gain a unified data foundation that simplifies architectures and accelerates data access for all use cases.