What Is a Lakestream?

For decades, the central question in data architecture was where your data should come to rest so you could analyze it. That question produced the data warehouse, then the data lake, and finally the lakehouse, which merged the two. A new generation of workloads has now changed the question entirely.

AI agents and real-time applications do not want to analyze yesterday's data. They need to act on the world as it is right now. An agent deciding whether to approve a transaction, reroute a shipment, or answer a customer needs the latest events --- seconds old --- together with the full history that gives those events meaning, and it needs all of it governed enough to be trusted. That combination has a name: real-time context.

Real-time context is precisely what neither half of the modern data stack was designed to provide. The streaming system has the freshest events but no memory and no governance. The lakehouse has the history and the governance but is hours behind. To give an agent real-time context today, you either stitch the two systems together with pipelines, or you hand the whole problem to a single vendor and let them own it. Neither is good enough.

The Lakestream is the architecture that closes that gap.

A bit of history: how we ended up with two systems

To see why real-time context is so hard to get, it helps to see how the split happened.

Underneath every database is a log --- an ordered, append-only sequence of records, the most fundamental structure we have for capturing change over time. A stream is that log in motion; a table is that same log folded down to its latest state. The stream and the table are not two kinds of data. They are the same data seen at two moments in its life.
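
The duality is easy to see in a few lines of code. What follows is a minimal in-memory sketch, not any particular system's implementation: the `Log` class, its method names, and the order events are all made up for illustration. Reading the log in order gives the stream; folding it down to the last value per key gives the table.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional


@dataclass(frozen=True)
class Record:
    offset: int
    key: str
    value: Optional[str]  # None is a tombstone: the key was deleted


class Log:
    """An ordered, append-only sequence of records (illustrative only)."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, value: Optional[str]) -> int:
        offset = len(self._records)
        self._records.append(Record(offset, key, value))
        return offset

    def stream(self, from_offset: int = 0) -> Iterator[Record]:
        """The log in motion: every record, in order, from an offset."""
        return iter(self._records[from_offset:])

    def table(self) -> Dict[str, str]:
        """The log folded down: the latest value for each key."""
        state: Dict[str, str] = {}
        for record in self._records:
            if record.value is None:
                state.pop(record.key, None)
            else:
                state[record.key] = record.value
        return state


log = Log()
log.append("order-1", "placed")
log.append("order-2", "placed")
log.append("order-1", "shipped")
log.append("order-2", None)  # order-2 is cancelled and removed

events = [(r.offset, r.key, r.value) for r in log.stream()]
latest = log.table()
```

Here `events` holds all four changes in the order they happened, while `latest` holds only `{"order-1": "shipped"}`. Both views come from the same list of records; nothing was copied to produce either one.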

Yet for a decade we ran them as two separate systems. Streaming lived in Apache Kafka and Apache Pulsar: low-latency, append-only, built for events in flight. Analytics lived in the warehouse and then the lakehouse: columnar, governed, built for data at rest. Between them we ran connectors, change-data-capture, and nightly jobs whose only purpose was to copy the same records from one side to the other.

That split was never a law of nature. It was an accident of technology --- we kept streaming and storage apart because the storage we had could not be both fast enough for live events and cheap enough for full history. It was a tolerable arrangement when all anyone wanted was a dashboard refreshed every morning. It falls apart the moment something needs context in real time.

Why the two-tier setup can't deliver real-time context

Three problems make the classic architecture unable to provide real-time context.

Staleness. The system that holds the context --- the lakehouse --- is fed by batch pipelines. By the time an event is queryable alongside its history, it is minutes or hours old. For a morning report that is fine. For an agent acting now, the context is already wrong.

Copies and drift. Bridging the two systems means copying data and maintaining a pipeline that can fail, lag, and silently diverge. The context an agent reads is only as trustworthy as the last successful sync --- and you often cannot tell when it broke.

The captive fix. The obvious response --- and the one the largest platforms are now shipping --- is to pull streaming inside the data platform so events land instantly next to the history. This does solve staleness. But it solves it by making your real-time context a permanent tenant of one vendor: governed by their catalog, processed by their compute, reachable only on their terms. You trade stale context for captive context. In the agent era, where that context becomes the foundation everything else is built on, that is a steep price.
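
The first two problems, staleness and drift, can be made concrete with a toy model. The sketch below is an assumption-laden simplification: `TwoTierPipeline`, `run_sync`, and the event names are hypothetical, and a real pipeline fails in far more varied ways. It only shows how a second copy falls behind and how a failed sync can go unnoticed by whoever reads that copy.

```python
from typing import List


class TwoTierPipeline:
    """Toy model: a stream plus a lakehouse copy kept in step by a batch job."""

    def __init__(self) -> None:
        self.stream: List[str] = []     # system 1: the freshest events
        self.lakehouse: List[str] = []  # system 2: the copy an agent queries

    def produce(self, event: str) -> None:
        self.stream.append(event)

    def run_sync(self, succeeds: bool = True) -> None:
        """Copy everything not yet copied. A failed run changes nothing
        and, in this model, raises no alarm."""
        if succeeds:
            self.lakehouse.extend(self.stream[len(self.lakehouse):])

    def lag(self) -> int:
        """Number of events the lakehouse reader cannot see yet."""
        return len(self.stream) - len(self.lakehouse)


pipeline = TwoTierPipeline()
for event in ["e1", "e2", "e3"]:
    pipeline.produce(event)
pipeline.run_sync()
lag_after_good_sync = pipeline.lag()

pipeline.produce("e4")
pipeline.produce("e5")
pipeline.run_sync(succeeds=False)  # the scheduled job breaks
pipeline.produce("e6")
lag_after_failed_sync = pipeline.lag()
visible_to_agent = list(pipeline.lakehouse)
```

After the good sync the lag is zero. After the failed one, three events are missing from the copy, and an agent reading `visible_to_agent` still sees only `e1` through `e3` with nothing to tell it the view is out of date.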

What agents actually need is context that is fresh, grounded in full history, governed, and open --- owned by you, not rented. That is the requirement the Lakestream is built to meet.

What is a Lakestream?

A Lakestream is an architecture that unifies the stream and the lakehouse into a single open substrate, so that real-time context is available the instant it is created, grounded in full history, governed by open standards, and owned by you.

In one sentence: a single stream, written once, that is at the same time a low-latency event stream you can produce to and consume from, and an open Apache Iceberg table any engine can read --- stored in object storage you own. The log that backs the stream is the table. One copy of the data serves both the real-time view and the historical view at once.

The name and the analogy are deliberate. Several years ago, the lakehouse merged the data lake and the data warehouse into one open tier and freed enterprises from copying data between them. The Lakestream does the same thing one step earlier in the data's life: it merges the stream and the lakehouse. The lake met the warehouse; now the stream meets the lakehouse.

A Lakestream is defined by a specific set of properties:

One copy, no pipeline. The streaming log's system of record is the open table. There is no connector copying data to a second system, and no second copy to drift.
Real-time and historical at once. The same substrate serves sub-second event delivery and full historical query --- the present and the past of the data in one place.
Open formats. Data is stored as open Apache Iceberg (or Delta Lake) tables, not in a proprietary format only one engine can read.
In storage you own. The data lives in your own object storage --- your bucket, your cloud account --- not inside a vendor's platform.
Governed by open catalogs. Access, schema, and lineage are managed through open catalog standards, so the same governed copy is trustworthy everywhere.
Engine- and control-plane-neutral. Any engine --- Snowflake, Databricks, Trino, Spark, or whatever comes next --- can read the same data, because the data was never inside an engine to begin with.
Full streaming and lakehouse semantics. Real replay and consumer groups on the streaming side; transactions and time travel on the table side. Not one bolted onto the other.
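
To show how the first and last properties can coexist, here is a small in-memory sketch. It is not how any real engine or the Iceberg format works internally: `SingleCopyLog` and its methods are hypothetical names, offsets are committed automatically on every poll, and a snapshot is reduced to "the log length at commit time". What it does show is consumer groups, replay, and time travel all served from one list of records.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, int]  # (account, amount); an assumed example schema


class SingleCopyLog:
    """One copy of the data, read as a stream and as a table (toy model)."""

    def __init__(self) -> None:
        self._records: List[Row] = []            # the only copy
        self._group_offsets: Dict[str, int] = {}  # streaming-side state
        self._snapshots: List[int] = []           # table-side state

    def produce(self, key: str, value: int) -> int:
        self._records.append((key, value))
        return len(self._records) - 1

    # Streaming side: consumer groups and replay.
    def poll(self, group: str, max_records: int = 100) -> List[Row]:
        start = self._group_offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._group_offsets[group] = start + len(batch)
        return batch

    def seek(self, group: str, offset: int) -> None:
        if not 0 <= offset <= len(self._records):
            raise ValueError("offset out of range")
        self._group_offsets[group] = offset

    # Table side: snapshots and time travel.
    def commit_snapshot(self) -> int:
        self._snapshots.append(len(self._records))
        return len(self._snapshots) - 1

    def scan(self, snapshot_id: Optional[int] = None) -> List[Row]:
        # With no snapshot given, read everything written so far.
        if snapshot_id is None:
            end = len(self._records)
        else:
            end = self._snapshots[snapshot_id]
        return self._records[:end]


log = SingleCopyLog()
log.produce("acct-1", 50)
log.produce("acct-2", 20)
snap0 = log.commit_snapshot()
log.produce("acct-1", -30)

first_read = log.poll("fraud-agent")  # each group keeps its own position
audit_read = log.poll("audit")
log.seek("fraud-agent", 0)            # replay from the beginning
replayed = log.poll("fraud-agent")

as_of_snapshot = log.scan(snap0)      # time travel: the table as it was
current = log.scan()
acct1_balance = sum(v for k, v in current if k == "acct-1")
```

Both consumer groups receive all three records independently, the replay returns them again, the snapshot scan returns only the first two, and the current balance for `acct-1` is 20. None of those reads required a second copy of the data.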

When those properties hold together, real-time context becomes a first-class capability rather than something you assemble by hand.

How a Lakestream delivers real-time context

Put those properties together and you get exactly what an AI agent needs to act well.

The context is fresh, because the agent reads from a live stream, not a batch table that lags reality. It is grounded, because that same substrate is also the full historical table --- the latest event arrives already sitting next to everything that came before it. It is trustworthy, because it is governed once, through an open catalog, rather than re-secured in every system it is copied into. And it is yours, because it lives in open formats in your own storage, readable by whatever model, engine, or agent framework you choose --- today and in three years.

That last property is the one the captive approaches cannot match. When your real-time context lives inside a single vendor, your agents live there too; the substrate itself becomes the lock-in. A Lakestream keeps the context open, which keeps the agents built on it open. In the AI era, owning your real-time context is the difference between building on a foundation you control and renting one you do not.

What makes it a Lakestream, and not just "export to Iceberg"

This is the part that is easy to claim and hard to build, so it is worth being precise.

If a streaming system keeps its own copy in its own proprietary format and ships a second copy out to an open table on a schedule, nothing has actually converged. You still have two systems, two copies, and a pipeline that can fail, lag, and drift --- the very problem you set out to remove. "Open" as a downstream export is just the old two-system architecture in a nicer coat, and it cannot deliver real-time context, because the open copy is always behind.

A Lakestream means the streaming log's system of record is the open table --- one copy, written once, that is simultaneously a live stream with replay and consumer groups and an open Iceberg table any engine can read, with no connector in between and the data in your own bucket. You cannot get there by bolting an exporter onto a classic broker. It takes rebuilding the engine from the storage layer up: making object storage the primary store, removing the local disks and leader-based replication that chain a broker to specific machines, and laying the log out so the same bytes are valid as both a stream and a table.
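
As a rough illustration of "the same bytes are valid as both a stream and a table", the sketch below stores each segment once in a stand-in object store and serves both kinds of read from it. Everything here is an assumption made for the example: the dictionary standing in for a bucket, the key naming, the JSON-lines encoding, and the manifest shape. A real implementation would more likely use a columnar format such as Parquet and a proper table catalog, and this is not a description of how Ursa lays out its data.

```python
import json
from typing import Callable, Dict, List

object_store: Dict[str, bytes] = {}  # stand-in for a bucket: key -> object
manifest: List[dict] = []            # one entry per segment object


def write_segment(records: List[dict]) -> str:
    """Write a batch once, as an immutable object, and register it."""
    if not records:
        raise ValueError("a segment needs at least one record")
    base = manifest[-1]["end_offset"] if manifest else 0
    key = f"topic/orders/segment-{base:08d}.jsonl"  # hypothetical naming
    object_store[key] = "\n".join(json.dumps(r) for r in records).encode()
    manifest.append(
        {"key": key, "base_offset": base, "end_offset": base + len(records)}
    )
    return key


def _decode(key: str) -> List[dict]:
    return [json.loads(line) for line in object_store[key].decode().splitlines()]


def read_stream(from_offset: int) -> List[dict]:
    """Stream view: records in offset order, starting at from_offset."""
    out: List[dict] = []
    for segment in manifest:
        if segment["end_offset"] <= from_offset:
            continue  # this segment lies entirely before the requested offset
        rows = _decode(segment["key"])
        skip = max(0, from_offset - segment["base_offset"])
        out.extend(rows[skip:])
    return out


def scan_table(predicate: Callable[[dict], bool]) -> List[dict]:
    """Table view: filter over every segment listed in the manifest."""
    return [
        row
        for segment in manifest
        for row in _decode(segment["key"])
        if predicate(row)
    ]


write_segment([{"id": 1, "amount": 10}, {"id": 2, "amount": 250}])
write_segment([{"id": 3, "amount": 75}])

tail = read_stream(from_offset=2)
large = scan_table(lambda row: row["amount"] >= 75)
```

The stream read from offset 2 returns only the record with `id` 3, the table scan returns the records with `id` 2 and 3, and `object_store` still contains exactly two objects. The manifest does double duty: it is the offset index for the stream and the file list for the table.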

This is buildable today. At StreamNative we built Ursa as exactly that kind of engine --- leaderless, object-storage-native, writing a single copy that is at once a live stream and an open Iceberg table in your own account --- and Ursa-for-Kafka (UFK) puts the Kafka API on top, so existing Kafka applications can point at it without changes while every topic lands as an open table underneath. But the Lakestream is the architecture, not any one product, and the properties above are what matter however it is implemented.

For the inner workings of one Lakestream implementation, read the full research paper on Ursa.

Closing thoughts

The lakehouse answered the last era's question: where should data come to rest so we can analyze it? The Lakestream answers this era's question: how do we give agents and real-time applications the live, grounded, governed context they need to act without locking that context inside a single vendor?

Real-time context is becoming the foundation of the AI-native enterprise. The Lakestream is what that foundation looks like when you build it in the open.