Written by

Hang Chen

Director of Storage, StreamNative & Apache Pulsar PMC Member

Blog | Dec 14, 2022 | 3 min read

Announcing the Iceberg Sink Connector for Apache Pulsar

Topics

Connectors, Announcements

We’re excited to announce the general availability of the Iceberg Sink connector for Apache Pulsar. The connector integrates Apache Iceberg with Apache Pulsar and broadens the Apache Pulsar ecosystem. It offers a convenient, efficient, and flexible way to move data from Pulsar to Iceberg without requiring user code.

For more information on why lakehouse technologies are growing in popularity, check out this blog.

What is the Iceberg Sink connector?

The Iceberg Sink connector is a Pulsar IO connector that pulls data from Apache Pulsar topics and persists it to Iceberg tables.

Why develop the Iceberg Sink connector?

Over the last five years, lakehouse technologies such as Apache Iceberg have seen rapid adoption. Lakehouse architectures provide streaming ingest of data, tools for dealing with schema and schema evolution, improved metadata management, and open standards that ease integration across a range of data processing systems.

Apache Pulsar, a distributed, open-source pub-sub messaging and streaming platform for real-time workloads, is a natural fit for lakehouse architectures. Apache Pulsar provides a unified platform that enables queueing data, analytics, and streaming in one underlying system. As a result, integrating Apache Pulsar with lakehouse systems streamlines data lifecycle management and data analysis.

StreamNative built the Iceberg Sink connector to give Iceberg users a way to consume the flow of messages from Pulsar and use more powerful features, while avoiding the connectivity problems that can appear when systems differ intrinsically or have privacy requirements. The connector addresses these problems by integrating fully with the rest of Pulsar’s system (including serverless functions, per-message processing, and event-stream processing). It presents a low-code solution with out-of-the-box capabilities such as multi-tenant connectivity, geo-replication, protocols for direct connection to end-user mobile clients or IoT clients, and more.

What are the benefits of using the Iceberg Sink connector?

The integration between Iceberg and Apache Pulsar provides three key benefits:

Simplicity: Quickly move data from Apache Pulsar to Apache Iceberg without any user code.
Efficiency: Spend less time configuring the data layer and more time extracting business value from real-time data.
Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet business and operational needs in real time.

How do I get started with the Iceberg Sink connector?

Prerequisites

First, you must run an Apache Pulsar cluster.

Prepare the Pulsar service. You can quickly run a Pulsar cluster anywhere by running $PULSAR_HOME/bin/pulsar standalone. See Getting Started with Pulsar for details. Alternatively, get started with StreamNative Cloud, which provides an easy-to-use and fully managed Pulsar service in the public cloud.
Set up the Iceberg Sink connector. Download the connector from the Releases page, and then move it to $PULSAR_HOME/connectors.

Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to quickly get the connector up and running.

Configure the sink connector

Create a configuration file named iceberg-sink-config.json that sends messages from the public/default/test-iceberg-pulsar topic in Apache Pulsar to the Iceberg table located at s3a://test-dev-us-west-2/lakehouse/iceberg_sink:

{
    "tenant": "public",
    "namespace": "default",
    "name": "iceberg_sink",
    "parallelism": 1,
    "inputs": [
        "test-iceberg-pulsar"
    ],
    "archive": "connectors/pulsar-io-lakehouse-{{connector:version}}-cloud.nar",
    "processingGuarantees": "EFFECTIVELY_ONCE",
    "configs": {
        "type": "iceberg",
        "maxCommitInterval": 120,
        "maxRecordsPerCommit": 10000000,
        "catalogName": "test_v1",
        "tableNamespace": "iceberg_sink_test",
        "tableName": "ice_sink_person",
        "hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
        "catalogProperties": {
            "warehouse": "s3a://test-dev-us-west-2/lakehouse/iceberg_sink",
            "catalog-impl": "hadoopCatalog"
        }
    }
}
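
Before starting the connector, it can help to sanity-check the configuration file. The following is a minimal Python sketch, not part of the connector: the function names are hypothetical, and treating the keys from the example above as required is an assumption of this sketch rather than a rule taken from the connector documentation.

```python
import json

# Keys that the example configuration sets. Treating them as required is an
# assumption of this sketch, not a rule from the connector documentation.
TOP_LEVEL_KEYS = ("tenant", "namespace", "name", "inputs", "archive", "configs")
ICEBERG_KEYS = ("type", "catalogName", "tableNamespace", "tableName", "catalogProperties")


def check_sink_config(config):
    """Return a list of readable problems; an empty list means none were found."""
    problems = [f"missing top-level key: {k}" for k in TOP_LEVEL_KEYS if k not in config]
    configs = config.get("configs", {})
    problems += [f"missing configs key: {k}" for k in ICEBERG_KEYS if k not in configs]
    if configs.get("type") != "iceberg":
        problems.append('configs.type should be "iceberg"')
    if not config.get("inputs"):
        problems.append("inputs must list at least one topic")
    if "warehouse" not in configs.get("catalogProperties", {}):
        problems.append("catalogProperties.warehouse is not set")
    return problems


def load_and_check(path):
    """Parse the JSON file at `path` and run the checks above on it."""
    with open(path, encoding="utf-8") as f:
        return check_sink_config(json.load(f))
```

Calling load_and_check("/path/to/iceberg-sink-config.json") on the example file should return an empty list; a malformed file raises json.JSONDecodeError before any check runs.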

Run the sink connector:

$PULSAR_HOME/bin/pulsar-admin sinks localrun --sink-config-file /path/to/iceberg-sink-config.json

When you send a message to the public/default/test-iceberg-pulsar topic in Apache Pulsar, the connector persists it to the Iceberg table located at s3a://test-dev-us-west-2/lakehouse/iceberg_sink.
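
To try this out, a test record can be produced with the Pulsar Python client. The sketch below assumes the pulsar-client package is installed with Avro support, that a standalone cluster is listening on pulsar://localhost:6650, and that the connector accepts Avro-encoded records. The Person schema and its fields are hypothetical, chosen only to match the ice_sink_person table name.

```python
def qualified_topic(tenant, namespace, topic):
    """Build the fully qualified name of a persistent Pulsar topic."""
    return f"persistent://{tenant}/{namespace}/{topic}"


def send_test_record(service_url="pulsar://localhost:6650"):
    """Send one Avro-encoded record to the topic the sink reads from."""
    # Imported lazily so qualified_topic() works without the client installed.
    import pulsar
    from pulsar.schema import AvroSchema, Integer, Record, String

    class Person(Record):  # hypothetical schema, not taken from the connector docs
        name = String()
        age = Integer()

    client = pulsar.Client(service_url)
    try:
        producer = client.create_producer(
            qualified_topic("public", "default", "test-iceberg-pulsar"),
            schema=AvroSchema(Person),
        )
        producer.send(Person(name="alice", age=30))
    finally:
        # Closing the client also closes the producer created from it.
        client.close()
```

Calling send_test_record() while the sink is running publishes a single record. The record may not show up in the table right away, since the connector commits in batches governed by maxCommitInterval and maxRecordsPerCommit.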

How can I get involved?

The Iceberg Sink connector is a major step in the journey of integrating lakehouse systems into the Pulsar ecosystem. To get involved with the Iceberg Sink connector for Apache Pulsar, check out the following featured resources:

Try out the Iceberg Sink connector. To get started, download the connector and refer to the README, which walks you through the whole process.
Make a contribution. The Iceberg Sink connector is a community-driven project whose source code is hosted in the StreamNative GitHub repository. If you have feature requests or bug reports, share your feedback and ideas or submit a pull request.
Contact us. Feel free to create an issue on GitHub, send an email to the Pulsar mailing list, or message us on Twitter to get answers from Pulsar experts.

About author

Hang Chen, an Apache Pulsar and BookKeeper PMC member, is Director of Storage at StreamNative, where he leads the design of next-generation storage architectures and Lakehouse integrations. His work delivers scalable, high-performance infrastructure powering modern cloud-native event streaming platforms.
