Dec 14, 2022
3 min read

Announcing the Iceberg Sink Connector for Apache Pulsar

Hang Chen
Software Engineer, StreamNative & Apache Pulsar PMC Member
Iceberg Sink Connector

We’re excited to announce the general availability of the Iceberg Sink connector for Apache Pulsar. The connector enables seamless integration between Iceberg and Apache Pulsar, enriching the Apache Pulsar ecosystem. The Iceberg + Pulsar connector offers a convenient, efficient, and flexible way to move data from Pulsar to Iceberg without requiring any user code.

For more information on why lakehouse technologies are growing in popularity, check out this blog.

What is the Iceberg Sink connector?

The Iceberg Sink connector is a Pulsar IO connector that pulls data from Apache Pulsar topics and persists data to Iceberg tables.

Figure 1. Iceberg sink

Why develop the Iceberg Sink connector?

Over the last five years, lakehouse technologies such as Apache Iceberg have seen rapid adoption. Lakehouse architectures provide streaming data ingestion, tools for handling schema and schema evolution, improved metadata management, and open standards that ease integration across a range of data processing systems.

Apache Pulsar, a distributed, open-source pub-sub messaging and streaming platform for real-time workloads, is a natural fit for lakehouse architectures. Apache Pulsar provides a unified platform that supports queuing, streaming, and analytics workloads in one underlying system. As a result, integrating Apache Pulsar with a lakehouse streamlines data lifecycle management and data analysis.

StreamNative built the Iceberg Sink connector to give Iceberg users a way to connect message flows from Pulsar and take advantage of its more powerful features, while avoiding the connectivity problems that can arise from intrinsic differences between systems or from privacy requirements. The connector solves this by fully integrating with the rest of the Pulsar ecosystem (including serverless functions, per-message processing, and event-stream processing). It presents a low-code solution with out-of-the-box capabilities such as multi-tenant connectivity, geo-replication, protocols for direct connection to end-user mobile or IoT clients, and more.

What are the benefits of using the Iceberg Sink connector?

The integration between Iceberg and Apache Pulsar provides three key benefits:

  • Simplicity: Quickly move data from Apache Pulsar to Apache Iceberg without any user code.
  • Efficiency: Reduce the time spent configuring the data layer, leaving more time to extract business value from your real-time data.
  • Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet business and operational needs in real time.
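To illustrate the two modes, the same sink configuration file can be launched either as a local process or on the cluster with `pulsar-admin` (this is a sketch: it assumes a running Pulsar cluster and uses the config file name from the example later in this post):

```shell
# Sketch only: assumes a Pulsar install and a running cluster.
PULSAR_HOME=${PULSAR_HOME:-$HOME/pulsar}   # adjust to your install location
CONF=iceberg-sink-config.json              # the sink config from this post

if [ -x "$PULSAR_HOME/bin/pulsar-admin" ]; then
  # Standalone mode: the sink runs as a local process (handy for development).
  "$PULSAR_HOME/bin/pulsar-admin" sinks localrun --sink-config-file "$CONF"
  # Distributed mode: submit the sink to the cluster's function workers,
  # which schedule and supervise it (uncomment to use instead of localrun):
  # "$PULSAR_HOME/bin/pulsar-admin" sinks create --sink-config-file "$CONF"
else
  echo "pulsar-admin not found under $PULSAR_HOME/bin; install Pulsar first"
fi
```

`sinks localrun` blocks until interrupted, which is convenient for testing; `sinks create` hands the sink over to the cluster for supervised, distributed execution.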

How do I get started with the Iceberg Sink connector?

Prerequisites

First, you must run an Apache Pulsar cluster.

  1. Prepare the Pulsar service. You can quickly run a Pulsar cluster anywhere by running $PULSAR_HOME/bin/pulsar standalone. See Getting Started with Pulsar for details. Alternatively, get started with StreamNative Cloud, which provides an easy-to-use and fully managed Pulsar service in the public cloud.
  2. Set up the Iceberg Sink connector. Download the connector from the Releases page, and then move it to $PULSAR_HOME/connectors.
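The setup step above can be sketched as a short shell sequence (the `.nar` file name and `PULSAR_HOME` path are placeholders; substitute the release you actually downloaded and your install location):

```shell
# Sketch of installing the connector archive; names are placeholders.
PULSAR_HOME=${PULSAR_HOME:-$HOME/pulsar}   # adjust to your Pulsar install
NAR=pulsar-io-lakehouse-cloud.nar          # the .nar from the Releases page

# Pulsar IO discovers connector archives under $PULSAR_HOME/connectors.
mkdir -p "$PULSAR_HOME/connectors"
if [ -f "$NAR" ]; then
  mv "$NAR" "$PULSAR_HOME/connectors/"
fi
ls "$PULSAR_HOME/connectors"
```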

Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to quickly get the connector up and running.

Configure the sink connector

  1. Create a configuration file named iceberg-sink-config.json to send messages from the public/default/test-iceberg-pulsar topic in Apache Pulsar to the Iceberg table at s3a://test-dev-us-west-2/lakehouse/iceberg_sink:
{
    "tenant": "public",
    "namespace": "default",
    "name": "iceberg_sink",
    "parallelism": 1,
    "inputs": [
        "test-iceberg-pulsar"
    ],
    "archive": "connectors/pulsar-io-lakehouse-{{connector:version}}-cloud.nar",
    "processingGuarantees": "EFFECTIVELY_ONCE",
    "configs": {
        "type": "iceberg",
        "maxCommitInterval": 120,
        "maxRecordsPerCommit": 10000000,
        "catalogName": "test_v1",
        "tableNamespace": "iceberg_sink_test",
        "tableName": "ice_sink_person",
        "hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
        "catalogProperties": {
            "warehouse": "s3a://test-dev-us-west-2/lakehouse/iceberg_sink",
            "catalog-impl": "hadoopCatalog"
        }
    }
}
  2. Run the sink connector:
$PULSAR_HOME/bin/pulsar-admin sinks localrun --sink-config-file /path/to/iceberg-sink-config.json

When you send a message to the public/default/test-iceberg-pulsar topic in Apache Pulsar, the message is persisted to the Iceberg table at s3a://test-dev-us-west-2/lakehouse/iceberg_sink.
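To try this end to end, you can publish a test message to the input topic with the pulsar-client CLI. This is a sketch: the JSON payload is a made-up example (the sink persists whatever records arrive on the topic), and a running cluster is assumed:

```shell
# Publish one test message to the sink's input topic (sketch; assumes a
# running Pulsar cluster reachable from this machine).
PULSAR_HOME=${PULSAR_HOME:-$HOME/pulsar}
TOPIC=persistent://public/default/test-iceberg-pulsar
MSG='{"name":"alice","age":30}'   # example payload, not a required schema

if [ -x "$PULSAR_HOME/bin/pulsar-client" ]; then
  "$PULSAR_HOME/bin/pulsar-client" produce "$TOPIC" --messages "$MSG"
else
  echo "pulsar-client not found under $PULSAR_HOME/bin; start a cluster first"
fi
```

Judging by the config keys above, buffered records are committed to the Iceberg table when either the maxCommitInterval (120 seconds in the example) or the maxRecordsPerCommit threshold is reached, so a freshly produced message may take up to that interval to appear in the table.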

How can I get involved?

The Iceberg Sink connector is a major step in the journey of integrating Lakehouse systems into the Pulsar ecosystem. To get involved with the Iceberg Sink connector for Apache Pulsar, check out the following featured resources:

  • Try out the Iceberg Sink connector. To get started, download the connector and refer to the README, which walks you through the whole process.
  • Make a contribution. The Iceberg Sink connector is a community-driven project whose source code is hosted in the StreamNative GitHub repository. If you have any feature requests or bug reports, do not hesitate to share your feedback and ideas and submit a pull request.
  • Contact us. Feel free to create an issue on GitHub, send emails to the Pulsar mailing list, or message us on Twitter to get answers from Pulsar experts.

Hang Chen
Hang Chen is an Apache Pulsar PMC member and a software engineer at StreamNative. He previously worked at BIGO, a Singapore-based technology company that provides video-based social media products. He mainly focuses on Pulsar stability, performance, Flink integration, and KoP.
