February 10, 2025
20 min

Definitive Guide for Streaming Data into Snowflake: Part 1 - with Connectors

Neng Lu
Director of Platform, StreamNative

Modern data architectures often rely on distributed messaging and data streaming systems like Apache Kafka and Apache Pulsar to handle real-time data streams. These systems excel at ingesting, processing, and delivering high volumes of streaming data with low latency (milliseconds to sub-second). However, to derive value from this data, it often needs to be integrated into data warehouses like Snowflake or data lakehouses for storage, analysis, and further processing.

Snowflake is a cloud-based data platform designed for data warehousing, data lakes, and data engineering. It provides a fully managed, scalable, and secure environment for storing and analyzing structured and semi-structured data. Snowflake's architecture separates compute and storage, allowing users to scale resources independently and pay only for what they use. Its support for real-time data ingestion, combined with powerful analytics capabilities, makes it an ideal destination for streaming data from systems like Kafka and Pulsar.

Organizations increasingly demand real-time data ingestion into Snowflake for analytics, AI, and other data-intensive applications. However, efficiently streaming data into Snowflake remains a key challenge, in part because of the many available ingestion methods, each with different trade-offs. To address this, we’ve created a definitive guide, Streaming Data into Snowflake, to explore various approaches for ingesting and streaming data from data streaming engines into Snowflake.

This is the first post in a three-part blog series. In this first blog post, we will explore the connector-based approach, using different connectors to send data from Kafka, Pulsar, or their respective fully managed services to Snowflake, enabling seamless and efficient data integration. We’ll dive into the capabilities of these connectors, their setup processes, and how they ensure reliable and scalable data transfer. By the end, you’ll have a clear understanding of how to leverage these tools to unlock the full potential of your real-time data in Snowflake.

Connectors Framework

Before diving into the specifics of streaming data from Apache Pulsar or Apache Kafka to Snowflake, it’s important to understand the foundational frameworks that make these integrations possible: Kafka Connect and Pulsar IO. These frameworks are designed to simplify data movement between distributed streaming platforms and external data systems such as Snowflake, databases, and cloud storage.

Kafka Connect

Kafka Connect is a scalable, fault-tolerant framework designed to simplify data integration between Kafka-compatible systems and external data sources. It provides a standardized method for ingesting and exporting data, reducing the need for custom development. With a rich ecosystem of pre-built connectors, Kafka Connect enables seamless integration with databases, cloud storage, and data warehouses like Snowflake. Optimized for high-throughput, real-time data streams, it is ideal for organizations that require efficient and reliable data movement.

Pulsar IO

Pulsar IO is Apache Pulsar’s equivalent of Kafka Connect, natively built for the Pulsar ecosystem to enable data engineers to build and manage data pipelines between Pulsar and external systems. Designed for scalability and flexibility, Pulsar IO simplifies data ingestion and export through a collection of pre-built connectors for databases, cloud services, and data warehouses like Snowflake. By leveraging Pulsar’s distributed architecture, it efficiently handles high-throughput, real-time data streams. Pulsar IO also supports extensibility, allowing users to develop custom connectors for specialized use cases. Additionally, its deep integration with Pulsar’s messaging capabilities ensures reliable and efficient data movement.

StreamNative Cloud for Kafka Connect and Pulsar IO

StreamNative Cloud provides cloud-native support for both Kafka Connect and Pulsar IO through Universal Connect, enabling seamless integration with external systems like Snowflake, databases, and cloud storage. This unified approach eliminates the complexity of setup and maintenance, allowing organizations to leverage both connector frameworks effortlessly.

Snowflake APIs

Now that we’ve explored the popular connector frameworks—Kafka Connect and Pulsar IO—let’s examine the APIs Snowflake provides for data ingestion.

Snowflake offers two powerful mechanisms for streaming data into its platform: Snowpipe and Snowpipe Streaming. Each serves a different role in ingesting data into Snowflake, catering to varying performance, cost, and complexity requirements.

Snowpipe: Micro-Batch File-Based Ingestion

Snowpipe is a continuous, micro-batch ingestion service that loads staged files into Snowflake automatically and asynchronously. It is best suited for near real-time ingestion where data arrives frequently but does not require immediate processing.

How Snowpipe Works:

  • Data is staged in an external cloud storage location (Amazon S3, Google Cloud Storage, or Azure Blob Storage).
  • Snowpipe detects new files in the storage location using event notifications or polling.
  • Data is ingested into Snowflake via an automated COPY command, making it available for querying.
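
To make this concrete, here is a minimal sketch (not a production setup) of creating the stage and pipe objects Snowpipe relies on, issued through the Snowflake JDBC driver. The account URL, credentials, bucket, target table, and object names are all placeholders, and in practice you would also configure the cloud event notifications that trigger auto-ingest.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class CreateSnowpipeObjects {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "PIPE_ADMIN");       // placeholder user
        props.put("password", "<PASSWORD>");   // use key pair or OAuth in production
        props.put("db", "ANALYTICS");
        props.put("schema", "PUBLIC");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://xyz.snowflakecomputing.com", props);
             Statement stmt = conn.createStatement()) {

            // External stage pointing at the bucket where staged files land.
            stmt.execute("CREATE STAGE IF NOT EXISTS events_stage "
                    + "URL = 's3://my-bucket/events/' "
                    + "CREDENTIALS = (AWS_KEY_ID = '<KEY>' AWS_SECRET_KEY = '<SECRET>') "
                    + "FILE_FORMAT = (TYPE = 'JSON')");

            // Pipe that runs the COPY command automatically as new files arrive.
            // Assumes the target table events_raw already exists.
            stmt.execute("CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS "
                    + "COPY INTO events_raw FROM @events_stage");
        }
    }
}
```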

Key Characteristics of Snowpipe:

  • Latency: Typically minutes, as it depends on file arrival and ingestion scheduling.
  • Storage Cost: Requires an external storage layer (S3/GCS/Azure), which may add cost and complexity.
  • Best Use Case: Suitable for high-throughput workloads where ultra-low latency is not a strict requirement.

Snowpipe Streaming: Record-level Streaming Ingestion

Snowpipe Streaming is Snowflake’s native real-time ingestion API, allowing for continuous, record-by-record ingestion without requiring intermediate file staging. It is designed for low-latency, high-frequency data ingestion, making it ideal for real-time analytics.

How Snowpipe Streaming Works:

  • Applications or streaming frameworks send individual records directly to Snowflake via the Snowpipe Streaming API.
  • Data is ingested in near real-time, bypassing file staging.
  • Records become queryable with sub-second latency, enabling immediate analytics.
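
For illustration, here is a minimal sketch of this record-by-record flow using Snowflake's Java Ingest SDK, which the connectors discussed below build on. The account URL, credentials, table name, and row contents are placeholders, and the exact builder and method names should be verified against the SDK version you use.

```java
import net.snowflake.ingest.streaming.OpenChannelRequest;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestChannel;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClient;
import net.snowflake.ingest.streaming.SnowflakeStreamingIngestClientFactory;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class StreamingIngestSketch {
    public static void main(String[] args) throws Exception {
        // Connection properties for key pair authentication (placeholder values).
        Properties props = new Properties();
        props.put("url", "https://xyz.snowflakecomputing.com");
        props.put("user", "STREAMING_USER");
        props.put("private_key", "<PKCS8_PRIVATE_KEY>");

        SnowflakeStreamingIngestClient client =
                SnowflakeStreamingIngestClientFactory.builder("demo_client")
                        .setProperties(props)
                        .build();

        // A channel is an ordered stream of rows into a single table.
        OpenChannelRequest request = OpenChannelRequest.builder("demo_channel")
                .setDBName("ANALYTICS")
                .setSchemaName("PUBLIC")
                .setTableName("EVENTS")
                .setOnErrorOption(OpenChannelRequest.OnErrorOption.CONTINUE)
                .build();
        SnowflakeStreamingIngestChannel channel = client.openChannel(request);

        // Insert a single record; the offset token lets the writer resume after a restart.
        Map<String, Object> row = new HashMap<>();
        row.put("EVENT_ID", 42);
        row.put("PAYLOAD", "{\"status\":\"ok\"}");
        channel.insertRow(row, "offset-42");

        channel.close();
        client.close();
    }
}
```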

Key Characteristics of Snowpipe Streaming:

  • Latency: Sub-second, ideal for real-time applications.
  • Storage Cost: No need for external file staging, reducing overall cost.
  • Complexity: Simpler integration since data flows directly into Snowflake.
  • Best Use Case: Ideal for low-latency, event-driven use cases, such as real-time dashboards, anomaly detection, or fraud detection.

Here’s a detailed comparison between the two APIs to help you understand their differences and best use cases:

Snowpipe vs. Snowpipe Streaming

  • Ingestion model: Snowpipe loads micro-batches of staged files, while Snowpipe Streaming writes individual records directly into tables.
  • Latency: minutes for Snowpipe versus sub-second for Snowpipe Streaming.
  • Staging and cost: Snowpipe requires an external storage layer (S3, GCS, or Azure Blob Storage), which adds cost and complexity; Snowpipe Streaming needs no file staging.
  • Best use case: Snowpipe suits high-throughput workloads that tolerate some delay; Snowpipe Streaming suits real-time, event-driven workloads such as dashboards, anomaly detection, and fraud detection.

Together, Snowpipe and Snowpipe Streaming provide flexible and efficient ingestion options for integrating streaming data from Apache Kafka, Apache Pulsar, or other data sources into Snowflake. This enables organizations to unlock the full potential of real-time data for analytics, AI/ML pipelines, and operational intelligence.

Now that we have explored Kafka Connect, Pulsar IO, and Snowflake’s data ingestion APIs, we are ready to dive into the available connectors that enable seamless, reliable, and efficient data ingestion into Snowflake.

Streaming Data into Snowflake using Kafka Connect

For organizations that already use Apache Kafka or Kafka-compatible data streaming platforms, the Kafka Connect Snowflake Sink Connector is the most efficient way to stream data into Snowflake. Developed and maintained by Snowflake, this connector provides a seamless integration between Kafka topics and Snowflake tables, leveraging both Snowpipe and Snowpipe Streaming for ingestion.

How the Kafka Connect Snowflake Sink Connector Works

The Kafka Connect Snowflake Sink Connector enables reliable, scalable, and automated data transfer from Kafka into Snowflake. It uses Kafka offsets to track message delivery and handle failure recovery, ensuring exactly-once or at-least-once semantics depending on the configuration.

Messages are ingested into Snowflake in JSON, Avro, or Protobuf formats. If using Protobuf, a Protobuf converter must be configured for proper schema handling.

For authentication and security, users can choose between:

  • Key Pair Authentication (recommended for production deployments)
  • External OAuth Authentication (required when using Snowpipe Streaming APIs)

Key Configuration Settings

Setting up the Kafka Connect Snowflake Sink Connector requires several essential parameters, and a sample configuration sketch follows the parameter groups below:

Basic Connection Settings

  • snowflake.url.name – The Snowflake account URL (e.g., https://xyz.snowflakecomputing.com).
  • snowflake.user.name – The Snowflake username used for authentication.
  • snowflake.private.key / snowflake.oauth.token – The credential used for authentication, depending on whether key pair or OAuth token-based authentication is configured.

Data Format and Conversion

  • value.converter – Specifies the format of Kafka messages (JSON, Avro, or Protobuf).
  • value.converter.schema.registry.url – If using Avro or Protobuf, this setting points to the Schema Registry for schema validation.

Buffering and Performance Tuning

  • buffer.count.records – Controls the number of records buffered before ingestion (default: 10000).
  • buffer.flush.time – Defines the maximum time (in seconds) before buffered records are sent to Snowflake.
  • buffer.size.bytes – Specifies the maximum buffer size in bytes before flushing.

Failure Handling and Retention

  • behavior.on.error – Determines how errors are handled (FAIL, LOG, IGNORE).
  • snowflake.ingestion.method – Defines the ingestion mode (Snowpipe or Snowpipe Streaming).
  • snowflake.schema.evolution – Enables or disables automatic schema evolution.
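
Putting these settings together, here is a minimal sketch that registers the connector against a self-managed Kafka Connect cluster's REST API; on StreamNative Cloud you would instead use the UI, API, CLI, Terraform, or Kubernetes CRDs described below. The topic, table, database, credential values, and the localhost endpoint are placeholders, and the exact set of properties may vary by connector version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSnowflakeSink {
    public static void main(String[] args) throws Exception {
        // Connector configuration as JSON; all values below are placeholders.
        String payload = """
            {
              "name": "snowflake-sink",
              "config": {
                "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
                "tasks.max": "2",
                "topics": "orders",
                "snowflake.topic2table.map": "orders:ORDERS",
                "snowflake.url.name": "https://xyz.snowflakecomputing.com",
                "snowflake.user.name": "KAFKA_CONNECT_USER",
                "snowflake.private.key": "<PKCS8_PRIVATE_KEY>",
                "snowflake.database.name": "ANALYTICS",
                "snowflake.schema.name": "PUBLIC",
                "snowflake.ingestion.method": "SNOWPIPE_STREAMING",
                "value.converter": "org.apache.kafka.connect.json.JsonConverter",
                "value.converter.schemas.enable": "false",
                "buffer.count.records": "10000",
                "buffer.flush.time": "10",
                "buffer.size.bytes": "5000000"
              }
            }""";

        // POST the configuration to the Kafka Connect REST API (assumed at localhost:8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```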

Best Practices for Deploying the Kafka Connect Snowflake Sink Connector

To ensure a robust, efficient, and scalable deployment, consider the following best practices:

  1. Use Snowpipe Streaming for Low-Latency Needs
    • If your workload demands real-time ingestion, configure the connector to use Snowpipe Streaming instead of traditional Snowpipe.
  2. Optimize Buffer Settings for Performance
    • Tuning buffer.count.records and buffer.flush.time helps balance latency and throughput.
  3. Secure Authentication
    • Use Key Pair Authentication in production for stronger security.
    • For OAuth authentication, ensure proper permissions are assigned in Snowflake.
  4. Enable Schema Evolution (if using Avro/Protobuf)
    • Activate schema evolution to handle data structure changes automatically.
  5. Monitor and Scale the Connector
    • Set up monitoring for connector health and tune parallelism based on ingestion needs.

Deploying the Connector in StreamNative Cloud

StreamNative Cloud supports the Kafka Connect Snowflake Sink Connector as a fully managed connector. Once the Kafka protocol support and the Kafka Connect feature are turned on, users can deploy it via the UI, API, CLI, Terraform, or Kubernetes CRDs, making it easy to integrate into existing workflows. Below is a diagram showing the UI configuration of a Kafka Connect Snowflake Sink Connector.

Once configured, the Kafka Connect Snowflake Sink Connector can be deployed to StreamNative Cloud to start streaming data into Snowflake over the Kafka protocol.

For detailed instructions on deploying and managing Kafka Connect connectors in StreamNative Cloud, refer to the StreamNative Cloud documentation.

Streaming Data into Snowflake using Pulsar IO

For Apache Pulsar users, you can use the Pulsar IO Snowflake Streaming Sink Connector or the Pulsar IO Snowflake Sink Connector to ingest data from Pulsar into Snowflake. StreamNative Cloud natively supports both connectors, making them easy to deploy and manage.

How the Snowflake Streaming Sink Connector Works

The Snowflake Streaming Sink Connector was announced recently. It pulls data from Pulsar topics and persists it to Snowflake using the Snowpipe Streaming API, with sub-second latency. The connector supports exactly-once semantics to ensure data is processed without duplication. If an error occurs during message processing, the connector simply restarts and reprocesses messages from the last committed position.

Messages are ingested into Snowflake in JSON, Avro, or Primitive formats. The connector also supports sinking data into Iceberg tables with the proper configuration. For authentication and security, this connector supports Key Pair Authentication (recommended for production deployments).

Key Configuration Settings

Setting up the Pulsar IO Snowflake Streaming Sink Connector requires several essential parameters, and a sample submission sketch follows the parameter groups below:

Basic Connection Settings

  • url – The Snowflake account URL (e.g., https://xyz.snowflakecomputing.com).
  • user – The Snowflake username used for authentication.
  • privateKey – The private key of the user. This is sensitive information used for authentication.
  • database – The database in Snowflake where the connector will sink data.
  • schema – The schema in Snowflake where the connector will sink data.
  • role – Access control role to use when inserting rows into the table.

Table Format and Schema

  • icebergEnabled – Enable the Iceberg table format. Defaults to false.
  • enableSchematization – Enable schema detection and evolution. Defaults to true.

Performance Tuning

  • maxClientLag – Specifies how often Snowflake Ingest SDK flushes data to Snowflake, in seconds.
  • checkCommittedMessageIntervalMs – Specifies how often the connector checks for committed messages, in milliseconds.
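
As one way to wire these parameters up programmatically, outside the StreamNative Cloud UI shown later, the sketch below uses Pulsar's Java admin client to create the sink. The admin endpoint, tenant, namespace, input topic, connector archive location, and all Snowflake values are placeholders; treat the StreamNative Hub documentation as the authoritative reference for the configuration keys.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.io.SinkConfig;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CreateSnowflakeStreamingSink {
    public static void main(String[] args) throws Exception {
        // Connector-specific settings, mirroring the parameters listed above (placeholder values).
        Map<String, Object> configs = new HashMap<>();
        configs.put("url", "https://xyz.snowflakecomputing.com");
        configs.put("user", "STREAMING_USER");
        configs.put("privateKey", "<PKCS8_PRIVATE_KEY>");
        configs.put("database", "ANALYTICS");
        configs.put("schema", "PUBLIC");
        configs.put("role", "INGEST_ROLE");
        configs.put("icebergEnabled", false);
        configs.put("enableSchematization", true);

        SinkConfig sinkConfig = SinkConfig.builder()
                .tenant("public")
                .namespace("default")
                .name("snowflake-streaming-sink")
                .inputs(List.of("persistent://public/default/events"))
                .archive("https://example.com/pulsar-io-snowflake-streaming.nar") // connector package (placeholder)
                .parallelism(1)
                .configs(configs)
                .build();

        // Create the sink through the Pulsar admin endpoint (assumed local).
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            admin.sinks().createSinkWithUrl(sinkConfig, sinkConfig.getArchive());
        }
    }
}
```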

For more details on the connector's design, setup steps, and configuration options, visit the StreamNative Hub documentation: Pulsar IO Snowflake Snowpipe Streaming Connector Guide.

How the Snowflake Sink Connector Works

The Snowflake Sink Connector receives messages from input topics and converts them into JSON format. The data is buffered in memory until a threshold is reached and then written to temporary files in the internal stage. Snowpipe objects are then created to ingest the staged files on a per-partition basis. Once ingestion succeeds, the temporary files are deleted; otherwise, the files are moved to the table stage and error messages are produced. This connector supports effectively-once delivery semantics.

Messages are ingested into Snowflake in JSON, Avro, or Primitive formats. This connector currently supports Key Pair Authentication: users need to generate a key pair, set the private key in the connector configuration, and assign the public key to a user account in Snowflake.
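
As a rough sketch of that key-pair step using only standard JDK classes, the snippet below generates an RSA key pair and prints both keys in Base64: the private key goes into the connector configuration, and the public key is assigned to the Snowflake user (for example with ALTER USER ... SET RSA_PUBLIC_KEY). Many teams generate the pair with openssl instead; this is just one option.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.util.Base64;

public class GenerateSnowflakeKeyPair {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        KeyPair keyPair = generator.generateKeyPair();

        // PKCS#8-encoded private key: set this in the sink connector configuration.
        String privateKey = Base64.getEncoder()
                .encodeToString(keyPair.getPrivate().getEncoded());
        // X.509-encoded public key: assign this to the Snowflake user account.
        String publicKey = Base64.getEncoder()
                .encodeToString(keyPair.getPublic().getEncoded());

        System.out.println("Private key (PKCS#8, Base64): " + privateKey);
        System.out.println("Public key  (X.509, Base64):  " + publicKey);
    }
}
```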

This connector was developed before the Snowpipe Streaming API existed. Because it relies on batch loading, its ingestion latency is higher, so we recommend using the newer Snowpipe Streaming Sink Connector whenever possible.

Key Configuration Settings

Setting up the Pulsar IO Snowflake Sink Connector requires several essential parameters:

Basic Connection Settings

  • host – The host URL of the Snowflake service.
  • user – The Snowflake username used for authentication.
  • privateKey – The private key of the user. This is sensitive information used for authentication.
  • database – The database in Snowflake where the connector will sink data.
  • schema – The schema in Snowflake where the connector will sink data.
  • role – Access control role to use when inserting rows into the table.

Buffering and Performance Tuning

  • bufferCountRecords – The number of records that are buffered in memory before they are ingested into Snowflake. By default, it is set to 10_000.
  • bufferSizeBytes – The cumulative size (in bytes) of the records that are buffered in memory before they are ingested into Snowflake as data files. By default, it is set to 5_000_000 (5 MB).
  • bufferFlushTimeInSeconds – The number of seconds between buffer flushes, where a flush moves data from Pulsar’s memory cache to the internal stage. By default, it is set to 60 seconds.

For more details on the connector’s design, setup steps, and configuration options, visit the StreamNative Hub documentation: Pulsar IO Snowflake Sink Connector. StreamNative Academy also provides a tutorial video on how to submit it with the pulsarctl CLI tool.

Best Practices for Deploying the Pulsar Snowflake Sink Connectors

To ensure a robust, efficient, and scalable deployment, consider the following best practices:

1. Choose the Right Connector Based on Latency Needs

  • Use the Snowpipe Streaming Sink Connector for real-time ingestion with sub-second latency.
  • Use the Snowflake Sink Connector if batch processing with higher latency is acceptable.

2. Optimize Buffering for Performance

  • Low-latency use cases: Reduce bufferCountRecords and bufferFlushTimeInSeconds to speed up data ingestion.
  • High-throughput use cases: Increase these settings to optimize efficiency and reduce processing overhead.

3. Secure Authentication and Access Control

  • Use Key Pair Authentication instead of username-password authentication, especially in production environments.
  • Restrict permissions to only the necessary databases, schemas, and roles.

4. Monitor and Scale the Connector

  • Enable Pulsar and Snowflake monitoring to track connector health and ingestion performance.
  • Scale horizontally by increasing the parallelism factor of the connector for high-ingestion workloads.

5. Handle Errors and Recovery Gracefully

  • Configure error-handling policies to retry failed messages instead of dropping them.
  • Use dead-letter topics (DLTs) for better troubleshooting and recovery.

6. Deploy Using Infrastructure-as-Code (IaC)

  • Use Terraform, Kubernetes CRDs, or StreamNative API to deploy and manage connectors consistently across environments.

Deploying the Pulsar IO Connectors in StreamNative Cloud

StreamNative Cloud supports both Pulsar IO Snowflake connectors in a fully managed way. Users can deploy them via the UI, API, CLI, Terraform, or Kubernetes CRDs, making it easy to integrate into existing workflows. Below are diagrams showing the UI configuration of Snowflake Sink Connectors.

  • Submit Snowpipe Streaming Sink Connector via StreamNative Cloud UI
  • Submit Snowpipe Sink Connector via StreamNative Cloud UI

Once the connector is configured, users can click the `Submit` button and start sending data into Snowflake from Pulsar.

Load Data into Snowflake with the S3 Storage Connector

In addition to the end-to-end connector approach above, which sends data into Snowflake directly, we have also seen some of our customers use the Cloud Storage S3 sink connector to load data into Snowflake. You can find the talk recording from the 2024 Data Streaming Summit.

Users will first use the S3 Sink connector to sink data from Pulsar topics into S3 buckets, and then utilize the COPY INTO <table> command to load these blob files into Snowflake tables.
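
As a simplified sketch of that second step, the load can be driven by a COPY INTO statement issued through the Snowflake JDBC driver; in practice it is often scheduled as a Snowflake task or run by an orchestration tool. The stage, table, and connection values below are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class LoadS3FilesIntoSnowflake {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("user", "LOADER_USER");       // placeholder credentials
        props.put("password", "<PASSWORD>");
        props.put("db", "ANALYTICS");
        props.put("schema", "PUBLIC");
        props.put("warehouse", "LOAD_WH");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:snowflake://xyz.snowflakecomputing.com", props);
             Statement stmt = conn.createStatement()) {
            // Bulk-load the files written by the S3 sink connector; @s3_events_stage is an
            // external stage pointing at the bucket and prefix the connector writes to.
            stmt.execute("COPY INTO events_raw FROM @s3_events_stage "
                    + "FILE_FORMAT = (TYPE = 'JSON') ON_ERROR = 'CONTINUE'");
        }
    }
}
```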

This approach allows users to batch a large chunk of data into a single file and then load it in a separate, later step. It is desirable for large organizations where different teams rely on the same set of data but have different processing needs: each team can load the data it is interested in into its own tables for its own processing. Latency is typically at the minute level, given the buffering used to generate large files.

More detailed information regarding the S3 Sink connector can be found here.

Conclusion

The connector-based approach is a powerful way to stream real-time data into Snowflake, leveraging Kafka Connect and Pulsar IO for seamless integration. Whether using the Kafka or Pulsar protocol, and sending data via Snowpipe or the Snowpipe Streaming API, each approach offers different trade-offs in terms of latency, scalability, and operational complexity.

When to Use the Connector Approach

The connector approach is particularly well-suited when:

  • You are already using Kafka or Pulsar for data streaming.
  • You need a managed solution that abstracts ingestion complexity.
  • Your use case involves structured data ingestion into Snowflake.

StreamNative Cloud supports both Kafka Connect and Pulsar IO connectors via Universal Connect, simplifying real-time data movement between Pulsar, Kafka, and Snowflake. The choice of connector should be guided by business requirements, data velocity, and system constraints, balancing ease of use, cost, latency, throughput, and manageability.

  • If your workload demands low-latency ingestion, use the Snowpipe Streaming Sink Connector for sub-second ingestion.
  • If batch processing and cost efficiency are the priority, the Snowflake Sink Connector is a viable option.
  • If you have a large number of topics and tables, the connector approach may introduce management overhead, requiring careful scaling, monitoring, and cost optimization.

Beyond Connectors: Exploring Alternative Approaches

While connectors provide an easy-to-use and managed solution, they may introduce challenges at scale, particularly for organizations managing a high number of topics and tables. These challenges include:

  • Scaling ingestion infrastructure as data volume grows.
  • Managing multiple connectors across different streaming pipelines.
  • Handling potential storage costs associated with Snowpipe-based ingestion.

For organizations looking for greater flexibility and deeper integration with Snowflake, an alternative approach involves using Iceberg tables and Open Catalog integration for direct data streaming.

In the next post, we’ll explore how Iceberg and Open Catalog can be leveraged to stream data into Snowflake without connectors, offering a more scalable and efficient ingestion strategy.

By understanding these different ingestion methods, businesses can make informed decisions to optimize their data streaming architecture based on their unique needs. Stay tuned for Part 2!

Neng Lu
Neng Lu is currently the Director of Platform at StreamNative, where he leads the engineering team in developing the StreamNative ONE Platform and the next-generation Ursa engine. As an Apache Pulsar Committer, he specializes in advancing Pulsar Functions and Pulsar IO Connectors, contributing to the evolution of real-time data streaming technologies. Prior to joining StreamNative, Neng was a Senior Software Engineer at Twitter, where he focused on the Heron project, a cutting-edge real-time computing framework. He holds a Master's degree in Computer Science from the University of California, Los Angeles (UCLA) and a Bachelor's degree from Zhejiang University.
