Apache Pulsar™ and lakehouse technologies are a natural fit: both scale well and are accessible across a wide range of data sets and use cases. Today we’re introducing a new Apache Pulsar + Delta Lake connector that provides one API for real-time and lakehouse systems. The Pulsar + Delta Lake connector enables organizations to build real-time engineering solutions for analytics and ML/AI that are simple, open, and multi-cloud.
Before we dive into the new connector and why you should use it, let’s look at why lakehouse technology adoption is on the rise.
Why Lakehouse Technologies + Pulsar
Lakehouse technologies enable companies to turn data into actionable insights by making application data and events easy to process. A lakehouse combines data lake capabilities with transactions and high-level data management utilities, which you can integrate with existing systems to power traditional BI, batch, and AI/ML use cases in one platform. However, a lakehouse needs the ability to ingest and activate data in real time.
Pulsar is a real-time data platform designed both to handle complex messaging workloads and to simplify building end-to-end data pipelines. Its out-of-the-box connectors and serverless functions in Python, Java, and Go make it a good fit for lakehouse technologies.
We’re excited to introduce the Delta Lake connector to connect these two powerful technologies. The Delta Lake connector allows companies to minimize data latency and deliver real-time engineering to lakehouses through a single-API experience. This connector is part of our plan to create a Pulsar ecosystem that can serve as the universal and sustainable hub of computing for events, enabling new productivity and innovation.
What is the Delta Lake Sink connector?
The Delta Lake Sink connector is a Pulsar IO connector that pulls data from Apache Pulsar topics and persists data to Delta Lake.
Why develop the Delta Lake Sink connector?
In the last 5 years, the rise of streaming data and the need for lower data latency have pushed data lakes to their limits. As a result, lakehouse architectures, a term coined by Databricks and implemented via Delta Lake as well as other technologies such as Apache Hudi and Apache Iceberg, have seen rapid adoption. Lakehouse architectures provide streaming ingest of data, tools for dealing with schema and schema evolution, improved metadata management and open standards to ease integration across a range of data processing systems.
Apache Pulsar, a distributed, open-source pub-sub messaging and streaming platform for real-time workloads, is a natural fit for lakehouse architectures. Apache Pulsar provides a unified platform that enables queueing data, analytics, and streaming in one underlying system. As a result, integrating Apache Pulsar with a lakehouse streamlines data lifecycle management and data analysis.
StreamNative, a company that provides a unified messaging and streaming platform powered by Apache Pulsar, built the Delta Lake Sink connector to give Delta Lake users a way to connect to the flow of messages from Pulsar and use more powerful features, while avoiding the connectivity problems that can appear when there are intrinsic differences (see note 1) between systems or privacy requirements.
The connector solves this problem by fully integrating with Pulsar, including its serverless functions, per-message processing, and event-stream processing. The connector presents a low-code solution with out-of-the-box capabilities such as multi-tenant connectivity, geo-replication, protocols for direct connection to end-user mobile or IoT clients, and more.
What are the benefits of using the Delta Lake Sink connector?
The integration between Delta Lake and Apache Pulsar provides three key benefits.
Simplicity: Quickly move data from Apache Pulsar to Delta Lake without any user code.
Efficiency: Reduce the time you spend configuring the data layer, leaving more time to extract business value from real-time data.
Flexibility: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet the business and operational needs in real time.
How do I get started with the Delta Lake Sink connector?
First, you must run an Apache Pulsar cluster.
Prepare the Pulsar service. You can quickly run a Pulsar cluster anywhere by running $PULSAR_HOME/bin/pulsar standalone. See Getting Started with Pulsar for details. Alternatively, get started with StreamNative Cloud, which provides an easy-to-use and fully-managed Pulsar service in the public cloud.
Set up the Delta Lake Sink connector. Download the connector from the Releases page, and then move it to $PULSAR_HOME/connectors.
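The install step above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper name install_connector is hypothetical, the connector is assumed to be distributed as a single .nar archive (the usual packaging for Pulsar IO connectors), and the real file name comes from the Releases page.

```python
import shutil
from pathlib import Path


def install_connector(nar_file: Path, pulsar_home: Path) -> Path:
    """Move a downloaded connector archive into <pulsar_home>/connectors.

    Hypothetical helper for illustration; it is not part of Pulsar.
    """
    if nar_file.suffix != ".nar":
        raise ValueError(f"expected a .nar archive, got {nar_file.name}")

    # Pulsar IO looks for connector archives in this directory.
    connectors_dir = pulsar_home / "connectors"
    connectors_dir.mkdir(parents=True, exist_ok=True)

    target = connectors_dir / nar_file.name
    shutil.move(str(nar_file), str(target))
    return target
```

A call such as install_connector(Path("downloaded-connector.nar"), Path(os.environ["PULSAR_HOME"])) would mirror the manual move, with the file name replaced by the archive you actually downloaded.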
Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to quickly get the connector up and running.
Configure the sink connector
Create a configuration file named delta-lake-sink-config.json that sends messages from the public/default/test-delta-pulsar topic in Apache Pulsar to the Delta Lake table located at s3a://test-dev-us-west-2/lakehouse/delta_sink.
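As a rough sketch of this step and of starting the sink afterwards, the Python below writes such a file and assembles a pulsar-admin command line. The top-level fields (tenant, namespace, name, inputs, archive, parallelism, configs) are standard Pulsar IO sink settings. Everything else is an assumption for illustration and should be checked against the connector ReadMe: the keys inside configs (type, tablePath), the sink name delta_sink, and the archive path passed in by the caller. Any credentials needed to reach the S3 bucket are not shown.

```python
import json
from pathlib import Path

# Topic and table location taken from the text above.
TOPIC = "public/default/test-delta-pulsar"
TABLE_PATH = "s3a://test-dev-us-west-2/lakehouse/delta_sink"


def build_sink_config(archive: str) -> dict:
    """Return a Pulsar IO sink configuration as a plain dict."""
    tenant, namespace, _ = TOPIC.split("/")
    return {
        # Standard Pulsar IO sink fields.
        "tenant": tenant,
        "namespace": namespace,
        "name": "delta_sink",  # hypothetical sink name
        "inputs": [f"persistent://{TOPIC}"],
        "archive": archive,  # path to the downloaded connector archive
        "parallelism": 1,
        # Connector-specific settings. These key names are assumptions;
        # verify them against the connector ReadMe.
        "configs": {
            "type": "delta",
            "tablePath": TABLE_PATH,
        },
    }


def write_sink_config(path: Path, archive: str) -> Path:
    """Serialize the configuration to a JSON file and return its path."""
    path.write_text(json.dumps(build_sink_config(archive), indent=2) + "\n")
    return path


def sink_command(config_file: Path, mode: str = "localrun") -> list[str]:
    """Build the pulsar-admin argument list without executing it.

    "localrun" runs the sink in a local process; "create" submits it
    to a running cluster.
    """
    if mode not in ("localrun", "create"):
        raise ValueError(f"unsupported mode: {mode}")
    return ["bin/pulsar-admin", "sinks", mode,
            "--sink-config-file", str(config_file)]
```

After write_sink_config(Path("delta-lake-sink-config.json"), archive) has produced the file, the list returned by sink_command can be passed to subprocess.run from $PULSAR_HOME, or joined with spaces and pasted into a shell.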
Once the sink is running, any message you send to the public/default/test-delta-pulsar topic in Apache Pulsar is persisted to the Delta Lake table located at s3a://test-dev-us-west-2/lakehouse/delta_sink.
The Delta Lake Sink connector is a major step in the journey of integrating Lakehouse systems into the Pulsar ecosystem. To get involved with the Delta Lake Sink connector for Apache Pulsar, check out the following featured resources:
Try out the Delta Lake Sink connector. To get started, download the connector and refer to the ReadMe that walks you through the whole process.
Make a contribution. The Delta Lake Sink connector is a community-driven project whose source code is hosted in the StreamNative GitHub repository. If you have feature requests or bug reports, share your feedback and ideas or submit a pull request.
Note 1: Intrinsic differences exist between platforms that have no notion of schema and the ones that have sophisticated schema capabilities because there is no simple way to translate between them. These platform differences range from traditional messaging like Amazon SQS to multi-level hierarchical Avro schema written to a data lake. Distinctions also exist between platforms relying on different data representations, such as Pandas DataFrames and simple messages.
Hang Chen is an Apache Pulsar PMC member and a software engineer at StreamNative. He once worked at BIGO, a Singapore-based technology company that provides video-based social media products. He mainly focuses on Pulsar stability, performance, Flink integration, and KoP.