We’re excited to announce the general availability of the Hudi Sink connector for Apache Pulsar. The connector enables seamless integration between Apache Hudi and Apache Pulsar, improving the diversity of the Apache Pulsar ecosystem. The Hudi + Pulsar connector offers a convenient, efficient, and flexible approach to moving data from Pulsar to Hudi without requiring user code.
For more information on why lakehouse technologies are growing in popularity, check out this blog.
The Hudi Sink connector is a Pulsar IO connector that pulls data from Apache Pulsar topics and persists data to Hudi tables.
Why develop the Hudi Sink connector?
In the last five years, the rise of streaming data and the need for lower data latency have pushed data lakes to their limits. As a result, lakehouse technologies such as Apache Hudi have seen rapid adoption. Apache Pulsar, a distributed, open-source pub-sub messaging and streaming platform for real-time workloads, is a natural fit for lakehouse architectures. Integrating Apache Pulsar with a lakehouse streamlines data lifecycle management and data analysis.
StreamNative built the Hudi Sink connector to give Hudi users a way to consume the flow of messages from Pulsar and use more powerful features, while avoiding the connectivity problems that can appear when there are intrinsic differences [1] between systems or privacy requirements.
The connector solves this problem by fully integrating with Pulsar (including its serverless functions, per-message processing, and event-stream processing). The connector presents a low-code solution with out-of-the-box capabilities such as multi-tenant connectivity, geo-replication, protocols for direct connection to end-user mobile clients or IoT clients, and more.
What are the benefits of using the Hudi Sink connector?
The integration between Hudi and Apache Pulsar provides three key benefits:
Simplicity: Quickly move data from Apache Pulsar to Hudi without any user code.
Efficiency: Reduce the time you spend configuring the data layer. This means you have more time to extract business value from real-time data.
Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet business and operational needs in real time.
How do I get started with the Hudi Sink connector?
The following example shows how to configure the connector running in a standalone Pulsar service.
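val tripsSnapshotDF = spark.read.format("hudi").load(basepath)
// Register the DataFrame as a temporary view so the SQL query below can reference it as "pulsar"
tripsSnapshotDF.createOrReplaceTempView("pulsar")
spark.sql("select id from pulsar").show()
The query prints the id column of the hudi-connector-test table, which the connector populates from the Pulsar topic test-hudi-pulsar.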
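The query above assumes an interactive Spark shell in which spark and basepath already exist. For readers who want to run the same check as a standalone job, here is a minimal sketch. It assumes a Hudi Spark bundle matching your Spark version is on the classpath, that the table has an id column (as the query above implies), and that the base path passed in matches the one configured for the sink. The object name ReadHudiTable and the default path are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone job: reads the Hudi table written by the sink
// and prints its id column. Not part of the connector itself.
object ReadHudiTable {
  def main(args: Array[String]): Unit = {
    // Assumed location; replace with the base path configured for the sink.
    val basePath = args.headOption.getOrElse("file:///tmp/hudi-connector-test")

    val spark = SparkSession.builder()
      .appName("read-hudi-connector-test")
      .master("local[*]") // local mode, suitable for a quick check
      // Hudi's Spark integration expects Kryo serialization.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    try {
      // Snapshot read of the Hudi table at basePath.
      val tripsSnapshotDF = spark.read.format("hudi").load(basePath)
      // Expose the DataFrame to Spark SQL under the name used in the query.
      tripsSnapshotDF.createOrReplaceTempView("pulsar")
      spark.sql("select id from pulsar").show()
    } finally {
      spark.stop()
    }
  }
}
```

If the table is empty or the path is wrong, the load step fails before any rows are shown, which makes this a quick way to confirm that the sink has committed data.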
The Hudi Sink connector is a major step in the journey of integrating lakehouse systems into the Pulsar ecosystem. To get involved with the Hudi Sink connector for Apache Pulsar, check out the following featured resources:
Try out the Hudi Sink connector. To get started, download the connector and refer to the README, which walks you through the whole process.
Make a contribution. The Hudi Sink connector is a community-driven project, and its source code is hosted in the StreamNative GitHub repository. If you have feature requests or bug reports, do not hesitate to share your feedback and ideas or submit a pull request.
1. Intrinsic differences exist between platforms that have no notion of schema and those with sophisticated schema capabilities, because there is no simple way to translate between them. These platform differences range from traditional messaging systems like Amazon SQS to multi-level hierarchical Avro schemas written to a data lake. Distinctions also exist between platforms that rely on different data representations, such as Pandas DataFrames and simple messages.
Yong Zhang is an Apache Pulsar committer. He works as a software engineer at StreamNative.