Definitive Guide for Streaming Data into Snowflake – Part 2: Lakehouse-Native Data Streaming with Apache Iceberg and Snowflake Open Catalog
Welcome back to our three-part blog series, The Definitive Guide to Streaming Data into Snowflake. In the first part of this series, we explored how to stream data into Snowflake with connector-based approaches. While connectors work well for many scenarios, they can become expensive and complex to manage at large scale.
In this second blog post, we’ll introduce a modern alternative - a zero-copy data streaming approach, which uses the StreamNative Ursa engine to stream data directly into Snowflake via Apache Iceberg and Snowflake Open Catalog. This approach eliminates the need for connectors, simplifies data streaming architecture, and enables real-time AI and analytics at scale. By the end of this post, you will have a clear understanding of how to build a modern, real-time data streaming solution on Snowflake.
Introduction to Apache Iceberg, Snowflake Open Catalog, and StreamNative Ursa
The zero-copy data streaming approach involves 3 major components: Apache Iceberg, Snowflake Open Catalog, and StreamNative Ursa.
Apache Iceberg is a high-performance table format designed for large analytic datasets. It provides consistency and ACID guarantees to data lakes, making it possible to handle petabyte-scale datasets efficiently. By treating data in distributed storage (e.g., S3 or other cloud object stores) as a table with columnar layouts, Iceberg simplifies schema evolution and accelerates queries.
Snowflake Open Catalog is a fully managed service for Apache Polaris, which implements Iceberg’s REST catalog API and provides centralized, secure read and write access to Iceberg tables across different REST-compatible query engines. It allows Snowflake to read directly from external Apache Iceberg tables, providing a unified approach to managing and accessing large analytic datasets without copying data or using additional connectors. This simplifies data ingestion workflows, allowing external Iceberg tables to be treated as native Snowflake tables.
StreamNative Ursa is a Kafka-compatible data streaming engine built for the Lakehouse architecture, storing data in object storage in Apache Iceberg format. With Ursa, there is no need to deploy additional connectors; you can produce and consume data using the Kafka protocol, or reconfigure existing Kafka applications to point at a StreamNative Ursa cluster. Data produced to a Kafka topic is continuously stored in an Iceberg table in real time. Kafka topic schemas are automatically mapped to Lakehouse table schemas, and data is written to Lakehouse tables using open standards like Apache Iceberg. The engine also commits table metadata to Snowflake Open Catalog, so the catalog can expose those tables for querying without duplicating data.
Why Choose This Approach?
By leveraging Ursa and Snowflake Open Catalog together, this approach creates a reliable and scalable zero-copy data streaming architecture. It provides the following benefits:
- Lakehouse-Native Architecture: Ursa stores streaming data directly as Iceberg tables, which Snowflake can discover via Open Catalog and query without duplicating data.
- Optimized for Data Streaming: Iceberg’s structured data management, combined with Ursa’s native data streaming capabilities, ensures the Iceberg data lakehouse remains up to date with minimal operational overhead.
- Scalability: Using Apache Iceberg and object storage enables handling growing data volumes more efficiently than connector-based ingestion.
- Cost Efficiency: Data is directly written to Iceberg tables for optimized reads in Snowflake, eliminating redundant storage and excessive data transfer.
- Consistency and ACID Guarantees: Iceberg ensures atomic commits, snapshot isolation, and schema evolution, eliminating many data consistency headaches.
- Open Ecosystem: Avoid vendor lock-in by utilizing open table formats and object storage.
How It Works
Below is an overview of how the components interact:
- Data Streams: Data is published to Kafka topics in a StreamNative Ursa cluster.
- StreamNative Ursa: Ursa continuously transforms the streaming data and writes it to Iceberg tables in object storage.
- Snowflake Open Catalog: Iceberg tables are registered in Snowflake Open Catalog, allowing Snowflake to access them directly.
- Query in Snowflake: Data practitioners can write SQL queries against these Iceberg tables as if they were native to Snowflake.

Step-by-Step Guide
Follow the step-by-step guide below to set up a modern approach for streaming data into Snowflake using StreamNative Ursa and Snowflake Open Catalog. You can watch this playlist for more details.
Prerequisites
Before you get started, ensure you have the following three resources:
- AWS Account: An AWS account to create an S3 storage bucket for storing Iceberg tables.
- Snowflake Account: A Snowflake account to create a Snowflake Open Catalog and run Snowflake queries.
- StreamNative Cloud Account: A StreamNative Cloud account to install and run Ursa clusters.
Step 0: Prepare a Cloud Storage Bucket
Before setting up this modern approach, you need to create a cloud storage bucket, which will store Iceberg tables. This S3 bucket must be accessible by both StreamNative and Snowflake.
Important:
- The Snowflake Open Catalog, S3 bucket, and StreamNative Ursa cluster must be in the same AWS region to avoid excessive cross-region traffic.
- Snowflake Open Catalog does not support cross-region buckets.
Assuming you use the following bucket and path:
s3://<your-bucket-name>/<your-bucket-path>
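If the bucket does not exist yet, here is a minimal sketch for creating it (using boto3; the bucket name and region below are hypothetical). Keep the bucket in the same AWS region as your Snowflake Open Catalog and Ursa cluster, per the note above.

import boto3

REGION = "us-west-2"          # hypothetical region; match your Open Catalog and Ursa cluster
BUCKET = "your-bucket-name"   # hypothetical bucket name

s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(
    Bucket=BUCKET,
    # Required for any region other than us-east-1.
    CreateBucketConfiguration={"LocationConstraint": REGION},
)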
First, you must grant StreamNative access to this storage bucket so that the Ursa cluster can write data to it.
Grant StreamNative access to the storage bucket
StreamNative provides a Terraform module that grants its control plane access to the storage bucket when setting up Ursa clusters. Use the following Terraform script to grant access:
module "sn_managed_cloud" {
source = "github.com/streamnative/terraform-managed-cloud//modules/aws/volume-access?ref=v3.18.0"
external_id = "<your-organization-name>"
role = "<your-role-name>"
buckets = [
"<your-bucket-name>/<your-bucket-path>",
]
account_ids = [
"<your-aws-account-id>"
]
}
Replace the placeholders with your actual values:
- `<your-organization-name>`: Your StreamNative Cloud organization ID.
- `<your-bucket-name>/<your-bucket-path>`: Your AWS S3 storage bucket.
- `<your-aws-account-id>`: Your AWS account ID hosting the storage bucket.
- `<your-role-name>`: The IAM role name that will be created for storage bucket access.
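With the placeholders filled in, apply the module using the standard Terraform workflow: run `terraform init` followed by `terraform apply` from the directory containing the module definition.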
Once you execute the Terraform script, it will grant StreamNative’s control plane access to the storage bucket. This allows the StreamNative Ursa cluster to write data to the storage bucket.
As Ursa continuously writes produced data to the storage bucket, it automatically compacts the data into Iceberg tables. These tables are stored under the following path:
s3://<your-bucket-name>/<your-bucket-path>/compaction
In the next step, you will need to configure Snowflake Open Catalog to grant access to this path.
Step 1: Configure Snowflake Open Catalog
Before setting up a StreamNative Ursa cluster, you must grant Snowflake Open Catalog access to the storage bucket with the following IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::<your-bucket-name>/<your-bucket-path>/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::<your-bucket-name>",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "*"
          ]
        }
      }
    }
  ]
}
Follow this documentation to set up the IAM policy and role for accessing the storage bucket.
Note: This IAM policy and role will be used for creating Snowflake Open Catalogs.
Create a Snowflake Open Catalog
Use the following settings when creating a Snowflake Open Catalog:
- Name: The name of the Open Catalog.
- External: Keep this disabled.
- Storage Provider: Select "S3".
- Default Base Location: Use the storage bucket path created in Step 0, which should look like: s3://<your-bucket-name>/<your-bucket-path>/compaction. Here, the compaction folder stores all the compacted lakehouse tables.
- S3 Role ARN: The ARN of the IAM role created above.
- External ID: The External ID used in the IAM policy setup.
After creating the Snowflake Open Catalog, retrieve its IAM user ARN from the Open Catalog details page. Next, update the trust policy of the IAM role created above, adding this IAM user ARN to the Principal.AWS field, so that the Open Catalog can access your S3 bucket.
At this point:
✔️ Your storage bucket is accessible by StreamNative and Snowflake Open Catalog.
✔️ A Snowflake Open Catalog is ready to use.
Create a Service Connection in Snowflake Open Catalog
To allow StreamNative access to Open Catalog, create a Service Connection in Snowflake Open Catalog.
Follow this documentation to complete this step and record the Client ID and Client Secret for configuring the StreamNative Ursa cluster in the next step.
Step 2: Set Up a StreamNative Ursa Cluster
Once StreamNative has permission to access the storage bucket and you have the Client ID and Secret for Snowflake Open Catalog, you can proceed to create a StreamNative Ursa cluster. Refer to this documentation for detailed step-by-step instructions.
Once the Ursa cluster is up and running, it exposes the Kafka API via its Kafka endpoints. You can configure your application to produce Kafka messages to a topic.
StreamNative provides tutorials on using Kafka clients to interact with StreamNative Cloud.
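For example, here is a minimal producer sketch using the confluent-kafka Python client. The bootstrap endpoint, credentials, and topic name are placeholders; take the actual values from your Ursa cluster details page and the StreamNative client tutorials.

import json
from confluent_kafka import Producer

conf = {
    # Copy the full Kafka bootstrap address from the Ursa cluster details page.
    "bootstrap.servers": "<your-ursa-kafka-endpoint>:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    # Placeholder credentials; use the API key/token mechanism described in the StreamNative docs.
    "sasl.username": "<your-username>",
    "sasl.password": "<your-api-key-or-token>",
}

producer = Producer(conf)

def on_delivery(err, msg):
    # Called once per message after the broker acknowledges (or rejects) it.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

for i in range(10):
    event = {"event_id": i, "status": "ok"}
    producer.produce("my-topic", value=json.dumps(event).encode("utf-8"),
                     on_delivery=on_delivery)

producer.flush()  # Block until all in-flight messages are acknowledged.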
Data Storage Structure
When messages are produced to Kafka topics:
- They are immediately written to the storage bucket configured for the cluster.
- The compaction folder stores all compacted Iceberg tables.
- The storage folder stores write-ahead logs for raw data.
Example file structure:
s3://<your-bucket-name>/<your-bucket-path>/
├── compaction/ # Stores compacted Iceberg tables
├── storage/ # Stores write-ahead logs

All Iceberg tables are automatically registered in Snowflake Open Catalog.
You can navigate to the Snowflake Open Catalog console to view tables and schemas.

Step 3: Query Iceberg Tables in Snowflake AI Data Cloud
Once the tables are available in Snowflake Open Catalog, you can use Snowflake AI Data Cloud to query them.
For more details:
- Refer to the Snowflake documentation.
- Watch this video, which provides a detailed query walkthrough.
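As a rough illustration, the sketch below uses the snowflake-connector-python package to wire things up end to end: it creates a catalog integration pointing at your Open Catalog, maps one of the auto-registered Iceberg tables into Snowflake as an externally managed Iceberg table, and queries it. All identifiers, URIs, and credentials are placeholders, the external volume is assumed to already exist and point at the compaction path from Step 0, and the SQL follows Snowflake's documented pattern for Open Catalog (Polaris) integrations; verify the exact options against the current Snowflake documentation.

import snowflake.connector

conn = snowflake.connector.connect(
    account="<your-snowflake-account>",
    user="<your-user>",
    password="<your-password>",
    warehouse="<your-warehouse>",
)
cur = conn.cursor()

# 1. Catalog integration pointing at your Snowflake Open Catalog (placeholder names and URI).
cur.execute("""
    CREATE CATALOG INTEGRATION ursa_open_catalog_int
      CATALOG_SOURCE = POLARIS
      TABLE_FORMAT = ICEBERG
      CATALOG_NAMESPACE = '<your-namespace>'
      REST_CONFIG = (
        CATALOG_URI = 'https://<your-open-catalog-account>.snowflakecomputing.com/polaris/api/catalog'
        WAREHOUSE = '<your-open-catalog-name>'
      )
      REST_AUTHENTICATION = (
        TYPE = OAUTH
        OAUTH_CLIENT_ID = '<client-id>'
        OAUTH_CLIENT_SECRET = '<client-secret>'
        OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL')
      )
      ENABLED = TRUE
""")

# 2. Externally managed Iceberg table backed by a table that Ursa registered in Open Catalog.
#    The external volume is assumed to exist and point at s3://<bucket>/<path>/compaction.
cur.execute("""
    CREATE ICEBERG TABLE my_topic_table
      EXTERNAL_VOLUME = '<your-external-volume>'
      CATALOG = 'ursa_open_catalog_int'
      CATALOG_TABLE_NAME = '<your-topic-table-name>'
""")

# 3. Query it like any other Snowflake table.
cur.execute("SELECT COUNT(*) FROM my_topic_table")
print(cur.fetchone())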
Choose the Right Lakehouse Table Mode
The Ursa engine supports two storage modes for writing streaming data into Iceberg tables. Currently, this setting is configured at the cluster level, with per-topic support coming soon. The two storage modes are:
- SBT Mode (Stream Backed by Table): Also known as Ursa Managed Table, where Ursa Engine manages, compacts, and preserves metadata such as offsets.
- SDT Mode (Stream Delivered to Table): Also known as Ursa External Table, where Ursa Engine does not manage the table’s lifecycle but instead appends or upserts records to it. The Iceberg Catalog provider manages the table’s lifecycle.
SBT Mode: Stream Backed by Table
Ursa's default lakehouse-native storage mode follows the "stream backed by table" concept. This approach, as described earlier, compacts all streaming data into columnar Parquet files, organizing them into Iceberg tables.
With this mode:
- Only one copy of the streaming data is stored.
- All streaming-related metadata (such as offsets and ordering) is preserved.
- You can replay the entire stream by reading the Parquet files from the backing table.
- You achieve "stream-table duality" while maintaining a single copy of data governed by a catalog service.
This mode is also known as “Ursa Managed Table” because Ursa manages the entire data lifecycle based on retention policies and automatically registers the table in a data catalog for easy discovery.
Best Use Cases for SBT Mode
- Storing raw data in a Medallion Architecture as bronze tables, which retain all historical data for replay and auditing purposes.
SDT Mode: Stream Delivered to Table
In contrast, SDT tables do not preserve all streaming-related metadata (such as offsets). Instead, the data is delivered to an Iceberg table through append or upsert operations. However, this table is managed externally, outside of Ursa Engine.
Key Differences in SDT Mode
- SDT tables do not back the stream, meaning streaming reads via the Kafka protocol are not possible.
- Since the stream and table lifecycles are decoupled, this mode is better suited for storing compacted data using upsert operations.
- Ursa can either append or upsert changes into the external table, offering flexibility in partitioning strategies.
- SDT tables are often referred to as "External Tables" because Ursa does not manage the table’s lifecycle. Instead, it is typically managed by a data catalog service provider, which may also optimize tables through maintenance services.
Best Use Cases for SDT Mode
- Storing compacted, curated, and transformed data—such as silver and gold tables in a Medallion Architecture.
- Aggregated data optimized for production analytics.
SBT Mode vs. SDT Mode: Decide Which Mode Fits Your Use Case
Following is a table summarizing the differences between SBT mode and SDT mode. You can use it as a guide to determine which mode fits your use cases better.

Choosing the Right Mode
✔ If you need a single data copy that can be replayed using Kafka protocol at any time → Choose SBT Mode.
✔ If you prefer a decoupled approach where the streaming engine just delivers changes (e.g., upserts) to a lakehouse table → Opt for SDT Mode.
Other Best Practices
Optimize SDT Tables for Snowflake Queries
If you are using SDT tables, ensure you align Iceberg partitioning with your primary Snowflake query patterns. This will help:
- Reduce query latency
- Minimize data-scanning costs
Retention and Lifecycle Policies
- In SBT Mode, configure Ursa’s data retention settings to automatically remove or compact older data based on compliance and cost constraints.
- In SDT Mode, schedule periodic compactions or optimizations via your data lakehouse service (e.g., file merges, vacuuming, or upserts).
Monitor Stream Lag and Table Snapshots
- Keep track of how frequently new data commits land in Iceberg.
- Balance commit frequency with ingestion throughput to:
- Avoid tiny files that slow down queries.
- Prevent stale data caused by infrequent commits.
By understanding the nuances of SBT vs. SDT tables, you can design your data streaming architecture to meet specific business and analytical needs.
Use Ursa with Your Existing Kafka Clusters
If you already have an existing Kafka cluster, you can transition to this modern architecture without major operational changes by using Universal Linking (UniLink). UniLink allows you to link any Kafka cluster—whether it is MSK, Confluent, RedPanda, or a self-managed Apache Kafka deployment—to StreamNative Ursa.
What Is Universal Linking?
Universal Linking is a cost-effective solution for Kafka data replication and migration. With UniLink, you can seamlessly mirror data from any Kafka-compatible source cluster into Ursa while preserving offsets, consumer groups, schemas, ACLs, and configurations.
Unlike traditional topic mirroring mechanisms, UniLink does not replicate data over a network between brokers. Instead, it utilizes object storage as both the networking and replication layer, eliminating expensive inter-AZ network transfers and reducing infrastructure overhead. This approach significantly lowers costs while maintaining high data fidelity across multiple environments.
Once UniLink is configured, data from your source Kafka cluster is seamlessly written to the storage bucket and compacted into Iceberg tables, making it immediately available for querying in Snowflake.

By leveraging UniLink, you can bridge your existing Kafka clusters with Iceberg tables efficiently, enabling a cost-effective solution to migrate and stream data into Snowflake without modifying your existing Kafka setup.
Summary
By combining StreamNative’s Ursa, Apache Iceberg, and Snowflake Open Catalog, you can build a scalable, zero-copy data streaming solution for Snowflake. This approach offers several benefits:
- No duplicate data storage or transfer
- Simplified architecture
- Direct access to fresher data in Snowflake
- Centralized data governance
Key Advantages
✔ Apache Iceberg + Snowflake Open Catalog eliminates the need for a dedicated connector cluster, simplifying the overall architecture.
✔ StreamNative Ursa automatically writes streaming data to Iceberg tables, ensuring your data is always fresh.
✔ Snowflake queries Iceberg tables in near real-time, delivering the best of both worlds—flexible data lake storage and powerful data warehouse analytics.
Additionally, Universal Linking allows you to connect any existing Kafka clusters with Ursa, enabling you to enjoy the same architectural benefits without re-programming your applications.
What’s Next?
In our third and final blog post, we’ll compare the connector-based approach from the first blog post with the Zero-Copy (Iceberg/Open Catalog) method covered in this post. We’ll explore the trade-offs, performance considerations, cost implications, and operational complexity of each approach to help you determine which best fits your organization’s needs. Stay tuned for Part 3—coming soon! 🚀